Steven M. Boker
Department of Psychology
The University of Notre Dame
Notre Dame, Indiana 46556
August 16, 1997
The Internet began its life as a network serving research scientists and has since grown to encompass most of the networked computers on the planet today. Services on the Internet were primarily limited to File Transfer Protocol (FTP), remote terminal (Telnet) and email until the advent of the World Wide Web (WWW or the Web) in 1991. The Web grew slowly for the first few years of its existence, but has grown at an exponential rate in the past few years, doubling every six months or so and eclipsing FTP as the largest user of the Internet's capacity around March of 1995 (Matt Gray's Web Growth Summary).
The Web appears as pages that have hot buttons or hypertext links to other pages when viewed in a Web Browser, a software program running on the user's local machine. These pages are connected through a mechanism called a Uniform Resource Locator (URL) that allows the links on one page to call pages on the local machine or on remote computers around the world without the user needing to know where the data physically resides.
The Web's pages are written in an ASCII text-based markup language called Hypertext Markup Language (HTML). This language inserts tags into the ASCII text in order to create text formatting as well as to insert graphics and to create hypertext links. Hypertext links can not only send the user to another page, but can also perform calls to programs that run on the Web Server or provide a simple mechanism for the user to download information via FTP. Recent advances include the portable language Java, which is designed to run inside the user's Web Browser and thus allows programs to execute locally on the user's machine.
The result of this easy-to-use hypertext linking is that the Web has redefined how distance is represented in our access of information. Pieces of information used to be near one another only if they were near in physical space. For instance, the Dewey decimal system attempts to place books with similar subjects near to each other on a library shelf. However, this is a one-dimensional structure; that is, a book can only be next to two other books on the shelf. Hypertext linking is multidimensional, so that many pages may be adjacent to one another. This means that the Web has given us the ability to represent associational or semantic distance directly as part of the textual information itself.
So-called Intelligent Agents (see The BotSpot or The Web Robots Index) are software programs that are designed to search through the Web and bring back and index material that they find. Agents (also called Spiders, WebBots or KnowBots) have contributed greatly to the usefulness of the Web as a place where information can be found. As these Agents have become more sophisticated (for instance, AltaVista's Scooter), all non-password-protected pages on the Web can be searched and indexed on a regular basis. Recently, meta-indexing has become popular, in which the indexes themselves are searched and cross-referenced by agents such as MetaCrawler or the World Wide Web Worm.
The Web has changed the definitions by which we consider computing to be limited. The combined storage capacity and computing power of all of the machines connected to the Web is truly enormous. Of the estimated 16 million plus computers attached to the Internet (Net Wizards), over 400,000 are Web Servers with more than 60,000,000 total Web pages comprising over 2,400,000,000,000 bytes (WebTechniques, May 1997). A group calling themselves The Internet Archive is planning to archive this massive multi-terabyte distributed network so that a rolling backup of the entire Web is continuously performed.
There are two main perspectives from which to view the problems associated with data archiving: the data analyst's and the data archivist's. These perspectives are for the most part complementary, although there are some competitive aspects as well. This section presents the main questions that must be addressed if the Web is to be used as a data archiving tool. These points will be covered in more detail in later sections.
The field of molecular genetics has produced a spectacular success story in data archiving, GenBank. GenBank is the NIH-sponsored, publicly accessible record of almost all known DNA sequences, maintained by the National Center for Biotechnology Information. There are about 1.5 million DNA sequences recorded in GenBank, and the system has been tailored to be accessible over the Web.
GenBank has been remarkably successful in archiving and indexing data from the molecular genetics community. This success is due to two factors: (1) the major molecular genetics journals agreed that authors of all articles submitted for publication must first deposit their sequence data in GenBank and receive a tracking number, and (2) highly effective indexing and similarity-matching software has made the archive an important part of a great many secondary analyses.
Figure 1. The organization of the three parts of GenBank.
GenBank has created a unique system in which the bibliographic information, keyword indexing and DNA sequence data are all interlinked as in Figure 1. Both the DNA sequence data and the keyword data are organized using an algorithm that calculates a distance between each data point and its neighbors. In addition, the bibliographic, keyword and DNA sequence data are all cross-referenced to neighborhoods in one another.
This means that a secondary analyst can search in a variety of ways. For instance, one researcher might look up a keyword and find that it produces several bibliographic matches, each of which points to a DNA sequence. Those DNA sequences might closely match another sequence that in turn points back to new keywords and bibliographic references. To illustrate, in the August 15th issue of Science, Kimura et al. report a surprising finding in which a gene associated with longevity in the worm C. elegans was found to be similar to the gene for the insulin receptor in humans. This finding may have major consequences for the study of diabetes and would have been highly unlikely to have been discovered without the aid of GenBank.
Unfortunately, psychological data are not as neat and well constrained as the base-pair sequences that comprise DNA. Psychological data can take a variety of forms including: categorical or continuous numerical tables or matrices; mixed alphanumeric and numeric tables; tables with relations among them; natural language text; video and audio data. Each of these data types can have many possible storage formats, and each of the storage formats can have many possible query formats, leading to a scientific tower of Babel phenomenon.
In order to assure that the data in an archive are readable by the maximum number of users for the maximum time, it is important to choose a data storage and query format that is (a) simple to implement, (b) an open, non-proprietary format, and (c) in wide use for a relatively long period of time. For data in table or matrix format, we recommend that the data be archived as ASCII formatted text files with fixed-width columns separated by space characters. Although this format is not the most convenient for use on any one computer platform, it is the most likely to be readable on all computer platforms and the most likely to still be readable 20 years from now, in 2017.
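The recommended format can be produced by a few lines of code on any platform. The following is a minimal sketch, with illustrative function and variable names, of writing a data matrix as fixed-width, space-separated ASCII text:

```python
# Sketch: writing a data matrix as fixed-width ASCII text with columns
# separated by spaces (the archival format recommended above). Column
# widths are chosen to fit the widest value in each column; the variable
# names and data are illustrative only.
def write_fixed_width(header, rows):
    """Return the table as lines of fixed-width, space-separated columns."""
    table = [header] + [[str(v) for v in row] for row in rows]
    widths = [max(len(r[c]) for r in table) for c in range(len(header))]
    return [" ".join(val.ljust(w) for val, w in zip(r, widths)).rstrip()
            for r in table]

lines = write_fixed_width(["id", "age", "score"],
                          [[1, 23, 98.5], [2, 31, 101.0]])
print("\n".join(lines))
```

Because every value sits in a known column position, the file can be read back by any statistical package, or by a plain text editor, without special software.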
Video and sound files present different problems. Digital audio has been very popular on audio CDs for over 10 years now, and the 16-bit linear format used on CDs seems like the best bet for digitizing stereo audio.
On the other hand, digital video is in flux. There are several competing formats, and digital video requires so much storage space that it is nearly always stored in a compressed format. Recently the DVD format has appeared as a promising storage system, but it is too early to tell whether this format will dominate the market. On the Web, formats such as Apple's QuickTime and Macromedia's ShockWave have become popular, but there is no clear single standard format for motion video.
Once the data format is chosen, the data archivist must decide how the metadata is to be stored. The metadata describes both the format of the data and the experimental methodology used to gather the data. The metadata should be stored as ASCII text embedded in an HTML file that comprises an introductory page describing each particular dataset in the data archive. In this way the metadata can be searched by Web Robots and thus be automatically cross-referenced with other data archives on the Web.
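Such an introductory page can be generated mechanically from the metadata. The following sketch, with a made-up dataset title and description, produces a minimal HTML page whose keywords are visible to Web Robots:

```python
# Sketch: generating an introductory HTML page that embeds the ASCII
# metadata so that Web Robots can index and cross-reference the archive.
# The title, keywords, and description below are placeholders.
def metadata_page(title, keywords, description):
    """Return a minimal HTML metadata page as a string."""
    return (
        "<HTML><HEAD>\n"
        "<TITLE>%s</TITLE>\n"
        '<META NAME="keywords" CONTENT="%s">\n'
        "</HEAD><BODY>\n"
        "<PRE>\n%s\n</PRE>\n"
        "</BODY></HTML>\n" % (title, ", ".join(keywords), description)
    )

page = metadata_page("Reaction Time Study, 1996",
                     ["PsychologyDataArchive", "reaction time"],
                     "N = 40 subjects; variables: id, age, rt_ms "
                     "(fixed-width ASCII, one record per line).")
print(page)
```

Keeping the metadata itself in a preformatted text block preserves its readability as plain ASCII while still making the page indexable.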
To review, the demands of heterogeneous distributed data on the Web require that the data archivist (a) choose a storage format that is simple, open, non-proprietary and in wide use, (b) archive tabular data as ASCII text with fixed-width, space-separated columns, and (c) store the metadata as ASCII text embedded in an HTML page describing each dataset.
After the data format has been chosen, there still remains the question of how delivery of the data is to be accomplished. The interested secondary data analyst will most likely only need a subset of the data in the archive. Two main possibilities exist for addressing this need for querying and subsetting. Either the secondary analyst can transfer the entire dataset to his local machine and then perform the query, or a query mechanism can be provided by the data archive so that the query is performed prior to delivery of the data.
Each system has its advantages and disadvantages. Retrieving the entire database and querying locally has the disadvantages that (a) the data storage is duplicated on both the local and remote machines, (b) network bandwidth is required to transfer the full data set, and (c) increased effort on the part of the data analyst is required in order to perform the query locally. However, retrieving the entire database has the advantage that the data archivist does not need to provide a method for remote query, which can be a considerable expense of time and computing resources.
Providing a query mechanism requires that the archivist use one of the few systems available for managing Web queries of databases, such as Apple's WebObjects. Other sites, such as The University of Virginia's Social Sciences Data Center, have used a combination of Perl and CGI scripts to perform queries on SAS databases. These methods have the advantage of being easier for the data analyst to use and of minimizing the amount of network traffic. These considerations can be substantial when the archived database is very large (such as the U.S. Census data).
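The core of such a server-side query mechanism is simple: a script parses the analyst's request and returns only the matching subset of the archived table. The following is a sketch of that subsetting step, with a hypothetical dataset and field layout; in practice a CGI wrapper would read the request and emit the result over HTTP:

```python
# Sketch of the server-side subsetting behind a CGI query script.
# The dataset, column widths, and field names are hypothetical.
def parse_fixed_width(lines, widths):
    """Split each line of a fixed-width table into a list of stripped fields."""
    rows = []
    for line in lines:
        fields, pos = [], 0
        for w in widths:
            fields.append(line[pos:pos + w].strip())
            pos += w
        rows.append(fields)
    return rows

def subset(rows, column, value):
    """Return only the rows whose given column equals the requested value."""
    return [r for r in rows if r[column] == value]

# Example: subjects measured at two sites; the analyst requests one site only.
data = ["001  ND   23",
        "002  UVA  31",
        "003  ND   27"]
rows = parse_fixed_width(data, [5, 5, 3])
matching = subset(rows, 1, "ND")
print(matching)
```

Only the matching records travel over the network, which is what makes this approach attractive for very large archives.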
When the data to be archived are less than 10 megabytes in size, or are likely to be used by only a few secondary analysts, we recommend that the data archive deliver the entire dataset. However, when there is likely to be a large number of users and the dataset is large, we recommend that the archivist consider budgeting for the programming required to provide a remote query mechanism.
Some secondary analysis projects require joining records from different databases. For instance, individual records might be linked to community-level variables from the U.S. Census, or two sets of individual records might be linked together to provide longitudinal measurement information. There is a special set of problems inherent in data joining; some of these problems are technical and some concern issues of subject confidentiality.
In order to join two databases one must have an acceptable index key that will identify which record is to be matched to which. This index key may be a variable such as ZIP code when linking community level variables, or may be a subject identifier when linking records by subject. Of course, if two databases use different randomized subject IDs this variable becomes useless as an index key to link the two databases. For this reason, there must be some coordination between the data archivists whose databases are involved in a data joining project.
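The mechanics of a join on an index key can be sketched in a few lines. The records and field names below are hypothetical; the example links individual records to community-level variables using ZIP code as the key:

```python
# Sketch: joining individual records to community-level variables on a
# shared index key (here, ZIP code). Records and values are made up.
def join_on_key(left, right, key):
    """Join two lists of record dicts on a shared key field."""
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

subjects = [{"id": "S1", "zip": "46556"},
            {"id": "S2", "zip": "22903"}]
census = [{"zip": "46556", "median_income": 41000}]
joined = join_on_key(subjects, census, "zip")
print(joined)
```

Note that records whose key has no match in the other database drop out of the join, so the two archivists must agree on exactly how the key is coded before the join is attempted.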
Typically, individual subjects are given a randomized ID number and their identifying data such as name and address are kept in a separate data file from the experimental data. If two data archivists are planning to allow a third party to perform a secondary analysis of an individual subject level join of their two databases, some method must be established to create a common randomized key.
One method that could be used for creating a common randomized key is private key encryption. The two data archivists could decide on a private encryption password. The Social Security Number (SSN) of each subject could then be encrypted using the encryption password. This process would create a unique subject identifier for each subject such that if the same subject appeared in each database, the unique identifier would be the same in each database. By using this method, the two data archivists would not need to reveal the identities of subjects to the secondary data analyst.
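The essential property is that the same SSN and the same shared password always yield the same identifier, while the identifier itself reveals nothing about the subject. The sketch below uses a keyed hash (a modern stand-in for the private-key encryption described above); the password and SSNs are made up:

```python
# Sketch of a common randomized key: both archivists agree on a private
# password and derive each subject identifier from the SSN with a keyed
# hash. A keyed hash is used here as a stand-in for private-key
# encryption; the password and SSN below are fabricated examples.
import hashlib
import hmac

def common_key(ssn, password):
    """Derive a shared, non-reversible subject ID from an SSN."""
    return hmac.new(password.encode(), ssn.encode(),
                    hashlib.sha256).hexdigest()

id_in_archive_a = common_key("123-45-6789", "shared-secret")
id_in_archive_b = common_key("123-45-6789", "shared-secret")
print(id_in_archive_a == id_in_archive_b)  # same subject, same ID
```

Because neither archivist ever sends SSNs or names to the secondary analyst, the join can be performed on the derived identifiers alone.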
Another problem crops up as data are joined at the level of individual records. As more and more variables become present in a record, there is a greater and greater chance that a record is unique. Unique records that can be matched against publicly available data on individuals may be identifiable even though names and addresses have been removed from the file. This is not a problem with experimental data alone, but when experimental data become linked to enough demographic data, there is the chance that an individual can be identified.
There are methods to overcome this problem of the creation of unique records through joining databases. One such method is called jittering. Jittering involves adding a small normally distributed random value with a mean of zero to all fields that might be used to identify an individual by matching against publicly accessible records. This method adds only a small amount of uncertainty to summary statistics and variance/covariance analyses, but prevents direct matching of records to identify individuals.
As the information from more and more data archives becomes available on the Web, searching and indexing programs such as Web Robots will become increasingly useful as a method for organizing and cross-referencing the metadata from individual data archives. The indexes that are created by these search engines will be able to be processed in a fashion similar to that which GenBank uses to measure similarities and associations in keywords and DNA sequences. One project that is using this sort of Self-Organizing Map (SOM) technique is the WEBSOM project at the Helsinki University of Technology. This project organizes natural language text from Internet discussion groups into associational maps in which major topics and minor threads become self-organized into an associational structure.
By maintaining the metadata from scientific data archives on the Web in HTML format, it can be expected that a variety of strategies for the use and organization of these metadata will develop in the free-form, constantly changing environment of the Web. The potential for advance in the area of self-organizing data structures on natural language is enormous, and is developing rapidly given the extraordinary amount of readily available raw material and the great need for its organization. In ten years, the current methods for searching for and obtaining data from the Web will seem antiquated and painfully labor intensive.
Here is a practical set of steps to put your data archive onto the Web. These steps assume that you have written some text that carefully describes the format of the data, the meaning of the variables and the methods by which the experiment was performed. This set of steps has not been approved by the American Psychological Association, but is offered as a suggestion to the APA Science Directorate as a method for allowing individual researchers to make their data available and searchable over the Web until such time as APA guidelines for Web Data Archive publishing can be established.
If these suggestions are followed, secondary analysts will be able to search on the keyword "PsychologyDataArchive" and locate all of the participating data archives with psychological data without being distracted by the millions of other pages on the Web. This one keyword is enough to bind all of the participating web sites into a single unit, a meta-archive that can be operated on by specialized Web Robots. As this meta-archive grows and Web Robots create self-organized maps of the keywords on the participating archives, a searchable, cross-referenced virtual archive will be the result; one in which each of the participating scientists will still be able to maintain control over their own data.
This report has presented an overview of techniques that can be used to create usable data archives and store them in a distributed meta-archive on the Web. Several specific recommendations were made concerning some of the issues faced by the data archivist hoping to use the Web.
The recommendations for making the data usable by the maximum number of secondary analysts are the following: archive tabular data as fixed-width, space-separated ASCII text; digitize audio in the 16-bit linear format used on audio CDs; store the metadata as ASCII text embedded in an HTML page so that it can be indexed by Web Robots; and deliver the entire dataset when it is small or lightly used, but budget for a remote query mechanism when the dataset is large and heavily used.
In addition, the recommendation was made that when a secondary analyst requests data to be joined at an individual record level with data from another archive, the two archivists may wish to use a password-encrypted individual ID code so that the data can be joined without revealing the identity of subjects to the secondary analyst.
The future of data archiving certainly will include the World Wide Web. The more rapidly the scientific community embraces this form of data sharing, the sooner the benefits of large-scale heterogeneous databases will be enjoyed. Like the surprising effectiveness of GenBank, we expect the benefits of Web archiving of data to be greater than can be foreseen at this early stage in the development of the distributed computing resource that the Internet is becoming.