Steven M. Boker
Department of Psychology
The University of Notre Dame
Notre Dame, Indiana 46556
J. J. McArdle
Department of Psychology
The University of Virginia
Charlottesville, Virginia 22903
August 19, 1997
The specific battery we created included a 38--item Power Letter Series task which has been previously used in studies of Fluid Intelligence (see Cattell, 1971; Horn & Cattell, 1966; Horn & Cattell, 1982; Horn, 1988). The exact keystroke responses and the response time were stored in individual data files. Participation in this pilot experiment was requested by posting a news article on the Internet UseNet news. From approximately 40,000 readers of this newsgroup, we received ``requests for testing'' from N=170 individuals, and data from N=47 individual subjects were received electronically.
Statistical analyses of the resulting test data focused on the response times for the letter series items. Overall and individual letter series functions were also compared using a multi-level analysis. These analyses show consistency in individual response patterns and small but significant effects of age on the speed of response to complex reasoning problems. This experiment demonstrates the feasibility of using psychotelemetry to collect psychological test data.
The collection of individually administered mental ability tests is often a long and tedious task. Data are often collected on readily available volunteer samples of persons, such as college students, hospital patients, or elderly persons. It is extremely difficult to measure individuals who are middle aged, occupationally successful, or otherwise actively engaged in work or social events. It follows that subject selection biases are a persistent problem (e.g. Berk, 1983; Heckman, et al., 1986).
Specific kinds of measurement bias are created by difficulties in testing. Some logistical problems can be managed with multiple-choice or time-limited tests, but such problems are exacerbated in difficult tasks of mental power. Difficult tasks are precisely the kind needed to measure individual differences in cognitive abilities, such as the cognitive ability termed general fluid intelligence (after Horn & Cattell, 1982; Horn, 1988). By definition, fluid intelligence requires ``the adduction of relationships which are not previously defined by the culture'' [Horn, 1988, p. 660]. There is growing evidence that computer assisted testing procedures may be helpful in these kinds of psychometric measurements [Embretson, 1992].
Over the past two decades, the computer network now known as the Internet has exhibited remarkable growth [Krol, 1992]. The Internet connects computers used by researchers, government installations and technical businesses around the world. The experiment we describe here makes use of the Internet by using electronic mail (email) to transfer our testing program and the data generated by a subject's interaction with the program. This means that a subject can participate in a psychometric experiment at a location convenient to the subject, without the need for a researcher to be physically present.
One premise of this experiment is that a large proportion of individuals using the Internet are the kinds of people who are not typically sampled in standard psychological research: successful, active and middle--aged. It is further supposed that individuals using the Internet would be attracted by the novelty of participating in a new type of experiment using the Internet technology. Some empirical data suggested that such an effect due to self--selection in computerized surveys might be observed [Walsh, Kiesler, Sproull & Hesse, 1992; Synodinos, Papacostas & Okimoto, 1994].
We propose the general term psychotelemetry to refer to the remote collection of psychometric data. This choice of terminology follows the accepted use of biotelemetry, which refers to the remote collection of biometric data. In this pilot experiment we examined the possibility of testing individuals remotely using a computer testing program called PsyLog [Boker & McArdle, 1982; Boker & McArdle, 1992]. The responses of individuals to a set of power letter series items (after Horn, 1988) were collected and analyzed using several forms of data analysis to examine the individual response patterns and their relation to age.
Participation in this pilot experiment was requested by posting a news item on a NeXT computer users' newsgroup (comp.sys.next.misc) of the Usenet News. The Usenet News is an electronic information forum which is similar to a computer bulletin board and which is available to users of the Internet [Krol, 1992]. To the estimated 40,000 readers of this newsgroup, we sent out a ``request for testing''. We received email replies from N=170 individuals. The testing program was mailed to all of these persons, and data from N=47 individual NeXT users were received (a response rate of about 28%). In all, a period of 10 days elapsed from the date of the original call for volunteers to the date when the last set of data was received.
Subjects represented an unusually narrow set of selection criteria, since in order to participate they needed to a) use a NeXT computer which was directly connected to the Internet, b) be active readers of the Usenet news, c) be willing to volunteer for an experiment in measurement of abilities, and d) be persistent enough to finish the test in spite of minor technical difficulties with the program which presented the experiment.
A number of problems prevented willing volunteers from being able to participate. Some volunteers were not directly connected to the Internet. Roughly 30 volunteers (30/170) were prevented from participating by their employer who felt that their participation might occur during working hours. Anecdotally we also know that an unknown number of volunteers were simply unable to run the software which presented the experiment due to unforeseen and unknown circumstances. These problems are likely to be encountered in future psychotelemetric experiments.
We used a version of the PsyLog software [Boker & McArdle, 1992] to present the experiment and collect the data. PsyLog is software which was written specifically for the automatic presentation of psychometric instruments and the automatic collection and archiving of the resulting data. PsyLog reads a command file containing the psychometric instrument and then presents a series of computer screens one at a time to the subject. The subject is asked to respond to these screens and PsyLog stores the responses into a subject record within the software. This subject record is then either stored to disk if PsyLog is being used on a machine located at the experimenter's facility, or if PsyLog is being used at a remote location, the subject record is automatically returned electronically via email to the experimenter.
In this experiment, we distributed the PsyLog program to each volunteer subject electronically via email over the internet. Each subject then ran the program on their local computer. Once the instrument had been completed, the subject was informed that the experiment was over and asked to give consent for the data to be used. If the subject responded positively, the results were automatically emailed back to our laboratory for aggregation into a central database. In order to preserve the privacy of a subject, the PsyLog program encrypted the subject's record before emailing the result of the experiment.
One advantage of computer--presented testing is that response time data can be gathered for each item in the experiment [Rafaeli & Tractinsky, 1991]. PsyLog captures, with millisecond accuracy, the stimulus presentation time and response time of every event which occurs during the experiment, relative to the beginning of the experiment. These data gave us the opportunity to conduct an item--level analysis of the response times.
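The timing scheme described above can be sketched in Python. This is only an illustration of logging events relative to the start of a session; the class name `EventLog`, the event kinds `"stimulus"` and `"response"`, and the record layout are assumptions of this sketch and are not taken from the PsyLog implementation.

```python
import time


class EventLog:
    """Record events with times relative to the start of the session.

    Hypothetical sketch; not the PsyLog data format.
    """

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()  # time zero for the whole session
        self.events = []       # (elapsed_ms, kind, item_id, payload)

    def record(self, kind, item_id, payload=None):
        """Append one event and return its elapsed time in milliseconds."""
        elapsed_ms = (self._clock() - self._start) * 1000.0
        self.events.append((elapsed_ms, kind, item_id, payload))
        return elapsed_ms

    def response_times(self):
        """Return {item_id: seconds between stimulus and response}."""
        shown = {}
        rts = {}
        for elapsed_ms, kind, item_id, _ in self.events:
            if kind == "stimulus":
                shown[item_id] = elapsed_ms
            elif kind == "response" and item_id in shown:
                rts[item_id] = (elapsed_ms - shown[item_id]) / 1000.0
        return rts
```

Because every event carries a time relative to the same origin, the per-item response time is recovered afterwards by differencing, and the full sequence of events remains available for other analyses.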
The specific battery we presented here included a subset of items previously used in our studies of Fluid Intelligence (see Cattell, 1941; Horn & Cattell, 1967, 1984; Horn, 1988; McArdle, 1991). The Power Letter Series task we used was developed by John L. Horn for use in aging research. In this task subjects were required to examine a series of five to ten letters and then enter the next best letter in the sequence. The subjects were told to take their time, answer each item in order, and use a ``no--answer'' option if they thought there was no best letter. The panel which presented the instructions to the subjects is shown in Figure 1.
Figure 1: The panel which presented the instructions to the subject.
In earlier experiments (see Horn, 1988) three practice items were self--administered and then 35 items were ordered into seven sets of five items. The five items in each set were placed in theoretically increasing levels of difficulty, so a subject would get a few easy items, then some very hard items, and then go back to some easy items. This ordering was designed to have individuals continue to work on several items at their highest possible level without complete frustration. This electronic administration followed exactly the same format. The overall battery presented by PsyLog included:
The exact keystroke responses and the time-of-response were stored in individual data files. No data were emailed unless the subject responded positively to the informed consent agreement.
In our initial planning we had hoped to analyze the pattern of correct and incorrect responses for the email subjects, relate these patterns to the age of the subjects, and compare these responses to the three other groups listed above. However, the great majority of the email subjects responded correctly to all the items so these comparisons will not be pursued further here. The response time Rt to each response (correct or incorrect) was available for all email subjects and it will be the main dependent variable used here.
In a first set of analyses we will describe the ``within--subjects'' variation in individual response times. There are a wide variety of ways to describe response time functions (see Cerrella, 1990; Luce, 1986; Link, 1992) but we will only use a simple polynomial approach here. In this case we write
$$Rt_{i,n} = \sum_{p=0}^{P} b_{p,n} W_{i}^{p} + e_{i,n}$$
where, for each individual n, $Rt_{i,n}$ is the response time (in seconds) to item i, $b_{p,n}$ is the pth polynomial regression coefficient, $W_{i}^{p}$ is the pth power of the item's W-scale score, and $e_{i,n}$ is a random error component. We fit these first--level models using standard least squares regression.
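The first--level fit can be sketched in Python for a single subject. This is an illustration only: the function names, the pure-Python normal-equations solver, and the centering of item difficulty at W=500 (suggested by the later reading of the intercept as the estimated Rt for an item with W=500) are assumptions of this sketch, not a description of the software used in the original analysis.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_polynomial(x, y, degree=2):
    """Least squares fit of y = sum_p b_p * x**p; returns [b_0, ..., b_degree]."""
    k = degree + 1
    xtx = [[sum(xi ** (r + c) for xi in x) for c in range(k)] for r in range(k)]
    xty = [sum((xi ** r) * yi for xi, yi in zip(x, y)) for r in range(k)]
    return solve_linear(xtx, xty)


def fit_subject(w_scores, rts, degree=2, center=500.0):
    """Fit one subject's response times against centered item difficulty.

    Returns the coefficients and the proportion of variance explained.
    """
    x = [w - center for w in w_scores]
    coefs = fit_polynomial(x, rts, degree)
    fitted = [sum(b * xi ** p for p, b in enumerate(coefs)) for xi in x]
    mean_rt = sum(rts) / len(rts)
    ss_res = sum((y - f) ** 2 for y, f in zip(rts, fitted))
    ss_tot = sum((y - mean_rt) ** 2 for y in rts)
    return coefs, 1.0 - ss_res / ss_tot
```

Calling `fit_subject` once per subject yields one set of coefficients per person, which is the form of data the between--subject analyses take as input.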
In a second set of analyses we examine the ``between--subject'' differences associated with age. Here we write another polynomial model as
$$b_{p,n} = \sum_{q=0}^{Q} a_{q,p} Age_{n}^{q} + d_{p,n}$$
where $b_{p,n}$ is the individual regression coefficient for individual n, $Age_{n}^{q}$ is the qth power of the self reported age of the nth subject, $a_{q,p}$ is the qth regression coefficient for the polynomial prediction of the individual regression parameter, and $d_{p,n}$ is the corresponding error component. We also fit these second--level models using least squares regression.
In a third set of analyses we will present a more formal simultaneous equations model of both analyses described above. We fit this model using a variation on the Multilevel model approach (as in Aitken & Longford, 1986; Goldstein & McDonald, 1988; Raudenbush, 1988; Bock, 1989; Bryk & Raudenbush, 1993). This kind of simultaneous equations model introduces latent variables at different levels; we will fit these models using several new but available computer programs such as ML3 [Prosser, Rabash & Goldstein, 1991] and VARCLUS [Longford, 1987]. In these models we will assume that (1) the first--level error component $e_{i,n}$ is normally distributed with mean zero and constant variance, so that $e_{i,n} \sim N(0, \sigma_{e}^{2})$, and (2) the second--level error component $d_{p,n}$ is normally distributed with mean zero and constant variance, so that $d_{p,n} \sim N(0, \sigma_{d_p}^{2})$ (see Braun in Bock, 1989). Maximum--likelihood estimates and standard errors will be calculated for all model parameters, and goodness-of-fit can be assessed with a likelihood ratio test statistic.
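To make the link between the two levels explicit, the second--level model can be substituted into the first--level model. The symbols below ($Rt_{i,n}$, $W_i$, $Age_n$, $b_{p,n}$, $a_{q,p}$, $e_{i,n}$, $d_{p,n}$) are notation assumed for this sketch, chosen to match the verbal definitions of the two polynomial models; they should not be read as the original typeset notation.

```latex
\begin{align}
Rt_{i,n} &= \sum_{p=0}^{P} b_{p,n}\, W_{i}^{p} + e_{i,n}, \\
b_{p,n}  &= \sum_{q=0}^{Q} a_{q,p}\, Age_{n}^{q} + d_{p,n}, \\
Rt_{i,n} &= \sum_{p=0}^{P} \sum_{q=0}^{Q} a_{q,p}\, Age_{n}^{q}\, W_{i}^{p}
            + \sum_{p=0}^{P} d_{p,n}\, W_{i}^{p} + e_{i,n}.
\end{align}
```

In the combined form the $a_{q,p}$ are fixed effects, while the $d_{p,n}$ and $e_{i,n}$ are the random components at the between--person and within--person levels respectively.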
In all statistical analyses to follow we evaluate significance at the standard test, and we also assume that a change in variance explained of over 5% is noteworthy.
The first column of Table 1 lists the demographic and psychometric characteristics of the current sample. We obtained N=46 valid responses from email subjects. These subjects reported a mean age of 29.6 (sd=6.8), and most were male (N=44 or 93.5%). The next three columns of Table 1 give a summary of three recent studies in which this Power Letter Series task was administered using the standard paper-and-pencil booklets: (1) UVa College Students, (2) UVa Aging Subjects, and (3) USC Aging Subjects.
Table 1: Some Demographic Characteristics of Several Studies on Power Letter Series
In Table 1 we also present a new scaling of the Power Letter Series task. All data in this table are presented in a special transformation of the Rasch ability scale termed the W-scale (see Woodcock, 1978; Woodcock, 1990). The W ability scale uses a log(9) transformation of a ``Rasch'' raw ability scale (including additive and multiplicative constants). This is theoretically an equal interval metric and is important for further change interpretation.
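As a rough illustration of such a transformation, the following Python sketch converts between Rasch logits and W units. The constants used here (a center of 500 and a multiplier of 20/ln 9, about 9.1024) are the values commonly cited for the Woodcock W scale and are an assumption of this sketch; the text above states only that additive and multiplicative constants are involved.

```python
import math

# Assumed constants for the W scale; not stated in the text above.
W_CENTER = 500.0
W_UNITS_PER_LOGIT = 20.0 / math.log(9.0)  # about 9.1024


def logit_to_w(theta):
    """Convert a Rasch measure in logits to W units."""
    return W_CENTER + W_UNITS_PER_LOGIT * theta


def w_to_logit(w):
    """Convert a W-scale value back to logits."""
    return (w - W_CENTER) / W_UNITS_PER_LOGIT


def p_correct(person_w, item_w):
    """Rasch probability of a correct response, both measures in W units."""
    difference = w_to_logit(person_w) - w_to_logit(item_w)
    return 1.0 / (1.0 + math.exp(-difference))
```

Under these assumed constants, a person whose ability is 20 W units above an item's difficulty has odds of 9 to 1 of answering it correctly, which is one way to read differences on an equal interval metric.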
The specific W-values presented here were calculated by aggregating the information from the correct and incorrect response patterns across all items and groups listed above using the BIGSTEPS program (by Wright & Linacre, 1992). This analysis suggested the 35 items had estimated W-scales that ranged from a low of 470 (item 3) to a high of 543 (item 8). Table 1 shows the first four moments of these W-scores for all four groups. Using this scaling we find that the email subjects are far superior and much narrower in correct/incorrect Letter Series performance. The UVa aging subjects have the lowest W-scores (W-mean = 504) with the largest range (W-sd = 15), and the email subjects have the highest W-scores (W-mean = 534) with the smallest range (W-sd = 7).
Figure [2a] is a plot of the data obtained from a randomly selected subject. These data points (circles) show the response time (in seconds) as a function of the theoretical difficulty level (in W-units) of the 38 individual items. The location of these data on the X-axis is fixed by our experimental design, but the Rt was limited only by the subject's willingness to continue to work towards an answer. For this specific subject, the average Rt recorded was just under 60 seconds, and the subject clearly responded more quickly to the easier items (under 500). The solid line is the fitted line from a quadratic polynomial model (the within--subjects polynomial model with P=2). This single curve was estimated with three coefficients: the intercept, which is the estimated Rt for an item with W=500; the linear coefficient, which is the linear change in the Rt for a one--unit change in item W; and the quadratic coefficient, which is the quadratic change in the Rt for a one--unit change in item W. This polynomial curve captures approximately % of the variance in the Rt scores for this individual.
Figure 2: (a) Response time as a quadratic function of item difficulty level for subject 1. (b) Response time quadratic functions for all subjects. (c) Response speed as a quadratic function of item difficulty level for subject 1. (d) Response speed quadratic functions for all subjects.
Polynomial regression models were fit to the Rt of each subject. In sequence, we fitted separate models to each subject:
Figure [2b] is a plot of the estimated quadratic curves for all N=47 subjects. The similarity of these curves is notable: all subjects start out with very short Rt and then require longer times to complete the more difficult items (i.e., over W=500). Notice that the items in these plots are ordered with respect to difficulty rather than in order of presentation. One subject seemed to be an outlier because this subject's mean Rt was greater than three minutes per item (upon debriefing, this subject volunteered that a household emergency had occurred in the middle of the test). The other N=46 subjects were used for all further analyses.
Figure [2c] is a plot of the same data for the first subject with the reciprocal of Rt as the Y-axis (see Tukey, 1977). This transformation of the Rt variable has the substantive interpretation of ``items per second,'' and it may have a more symmetric distribution. This figure shows decreasing scores as a function of item difficulty. Figure [2d] is a plot of the quadratic curves for all subjects using this inverse Rt as the dependent variable. As it turns out, this inverse Rt has good behavior in all further regression analyses, and it could be used in place of the simple Rt fitted above.
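The reciprocal transformation and its effect on symmetry can be sketched in Python. The function names and the small set of example times are hypothetical and serve only to illustrate the idea; they are not data from this experiment.

```python
def response_speed(rt_seconds):
    """Items per second: the reciprocal of a positive response time."""
    if rt_seconds <= 0:
        raise ValueError("response time must be positive")
    return 1.0 / rt_seconds


def skewness(values):
    """Moment-based sample skewness (third central moment over sd cubed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical response times in seconds, with one very slow item.
example_times = [10.0, 20.0, 40.0, 160.0]
example_speeds = [response_speed(t) for t in example_times]
```

For right-skewed times such as these, `skewness(example_speeds)` is closer to zero than `skewness(example_times)`, which is the sense in which the reciprocal scale may be more symmetric.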
In our second set of analyses we examined the systematic variation in the individual Rt parameters as a function of the subject's age.
Figure 3: (a) Response time intercepts as a function of age. (b) Response time slopes as a function of age. (c) Response time curvatures as a function of age. (d) Standard deviation of residuals as a function of age.
Figure [3a] is a plot of the individual intercept parameters as a function of self--reported age. The average here is approximately 60 seconds. The quadratic age polynomial curve plotted in this figure accounts for about 21% of the total variance. These intercept terms show a systematic increase with increasing age, followed by a rapid decline for the two oldest subjects. Since this variable represents the speed of response, the average Rt to all items appears to be increasing with age. That is, except for the two oldest subjects, we find a general slowing with increasing age (up to age 40).
Figure [3b] is a plot of the individual slope parameters as a function of self--reported age. The average here is approximately 3 seconds per W-unit change. The quadratic age polynomial curve plotted in this figure also accounts for about 21% of the total variance. Again, these slopes show a systematic increase with increasing age, followed by a rapid decline for the two oldest subjects. Since this variable represents the changing speed of response to more difficult items, we see progressive slowing of response to more difficult items. That is, except for the two oldest subjects, we find another kind of general slowing with increasing age (up to age 40).
Figure [3c] is a plot of the individual curvature parameters as a function of self--reported age. The average here is approximately .1 seconds per W-unit change. The quadratic age polynomial curve plotted in this figure accounts for only 5% of the total variance. Here we find great variation among some of the younger subjects, and the overall age pattern is not very clear.
Figure [3d] is a plot of the individual error parameters as a function of self--reported age. The average here is approximately 60 seconds per W-unit change. The quadratic age polynomial curve plotted in this figure accounts for about 18% of the total variance. Again, these error terms show some increase with increasing age, but there is little evidence for a systematic effect.
In order to examine the statistical characteristics of this model we also calculated several simultaneous equation versions of the previous analyses fitted by the ML3 computer program [Prosser, Rabash & Goldstein, 1991]. Table 2 presents results from four ``multilevel models'' applied to the Rt from these psychotelemetry data.
Table 2: Multilevel Modeling Results from the Psychotelemetry Experiment
The first column (model 0) gives results for the fit of a standard two-level (i.e., 1 within and 1 between) variance components model of the raw scores on Rt. This model includes significant components for the intercept or grand mean (60.8), the within persons or level--1 variance (8963), and the between person or level--2 variance (535). The overall fit is given by the log likelihood (of -2LL=20,417), and this will be used as a standard of comparison for the other three models.
The second column (model 1) lists parameters for a multilevel model where we have added a quadratic model to the within person variance terms. All three fixed parameters are significant and the random (within person) error variance is reduced by about 50% (to 4612). Each of these three parameters also has significant between persons (level 2) variance. This model changes the overall likelihood, yielding a significant improvement in fit of dLRT=1100 for 4 extra parameters.
The third column (model 2) lists results where we added three terms to allow for the correlation of the three first order terms, and three parameters estimating the linear effects of age (between persons) on the three quadratic model parameters (within persons). These results show significant effects for all parameters except the linear effect of age on the curvature; the net improvement in fit (dLRT=6 on dDF=3) is small but significant.
The fourth column (model 3) lists results where a quadratic model for between group ages was fitted. Here we find the addition of a significant quadratic age effect for two of the coefficients, with an important reduction in second level error variance. In contrast to model 2, this model shows a nonsignificant improvement in fit (dLRT=3 on dDF=3). Thus, model 2 is chosen here as the best model.
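The nested model comparisons above rest on the likelihood ratio test, in which the difference in -2LL between two nested models is referred to a chi-square distribution with degrees of freedom equal to the number of added parameters. A minimal Python sketch follows; the function names are hypothetical, and the tail probability is computed from the standard recurrence for integer degrees of freedom rather than from any routine in ML3.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        k = 2
        q = math.exp(-half)              # exact tail for df = 2
    else:
        k = 1
        q = math.erfc(math.sqrt(half))   # exact tail for df = 1
    # Recurrence: Q(x; k+2) = Q(x; k) + half**(k/2) * exp(-half) / Gamma(k/2 + 1)
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q


def likelihood_ratio_test(deviance_restricted, deviance_full, extra_params):
    """Return (difference in -2LL, p-value) for two nested models."""
    diff = deviance_restricted - deviance_full
    return diff, chi2_sf(diff, extra_params)
```

For example, `likelihood_ratio_test(d0, d1, k)` takes the -2LL of the simpler model as `d0`, the -2LL of the model with `k` additional parameters as `d1`, and returns the test statistic together with its p-value.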
In summary, these multilevel models suggest that (1) about 47% of the variance at the first level can be explained by the quadratic model, and (2) that a much smaller amount (1% to 30%) of the variance between people in these correlated coefficients can be explained by quadratic age effects. However, the changes in overall fit are not large and we expect other between persons effects to be likely. The simultaneous multilevel models are consistent with the separate analyses of Figures 2 and 3.
This research initially demonstrates that psychotelemetry, the remote collection of psychometric information, is both possible and practical. The potential benefits of psychotelemetric measurement include rapid automated gathering of large samples, measurement of populations which have been traditionally difficult to sample, and the inclusion of response time data along with the item answers. Another benefit of psychotelemetric measurement is that instruments can be administered extremely quickly over a wide geographic area. This experiment gathered data from individuals in widely scattered locations around the United States and the time from the call for participation until the beginning of analysis was relatively short (ten days).
The substantive results of this experiment are informative as well. Initially, we found that subjects who were willing to participate in this kind of experiment were high scorers but varied a great deal in age (and presumably other demographic characteristics). This meant we could not obtain useful data on correct/incorrect response patterns as we had hoped; but, thanks to our data collection recordings, we could obtain accurate information on response times (Rt) for correct answers. These Rt data proved to be highly related to the pattern of difficulty levels of the items for almost all subjects. Using a quadratic model we were able to reliably recast the repeated Rt scores for each individual into an individual response function, and this function showed differences in the way different persons slow down with difficult items. Furthermore, the demographic age variable was linearly related to the intercept and slope of this function; i.e., the older the subject, the longer the Rt and the steeper the slope. Taken together, these results both between and within individuals are consistent with a great deal of previous research in cognitive aging [Horn, 1988; Salthouse, 1991b; Salthouse, 1991a; Salthouse, 1988; Hoyer & Rybash, 1994].
Psychotelemetry is feasible and has benefits not afforded by traditional methods and other researchers are likely to attempt psychotelemetric experiments. It is therefore particularly important that the unanticipated problems which were encountered with this experiment be clearly stated so that they can be successfully avoided by others.
Since the subject may be using an arbitrary computer hardware and software environment, the psychotelemetry software must be thoroughly tested for compatibility with the widest available range of hardware before it is used in the field. Problems with the first version of our software made it impossible for some people to finish the test. We cannot overstress the importance of this point; if the successful completion of the experiment covaries with the psychometric variables in question, a potentially serious selection bias will be introduced into the data.
Our debriefings allowed subjects to enter comments and these showed that a few people attempted to use the program in unanticipated ways. The instructions must be made exceptionally precise and clear to the widest possible range of subjects. Since a psychotelemetric instrument by definition must be self--administered by the subject, there must be tight software controls restricting the subject's behavior to the range of experimental interest. This problem is difficult, but is not insoluble. Careful instrument and software design can overcome the problem of self--administration.
The physical environment of the subject is not as well controlled during psychotelemetry as it would be in a laboratory experiment. Arbitrary distractors may occur during the presentation of the instrument. A concerted effort to control or at least measure these distractors should be made, and debriefing should include an opportunity for the subject to relate unusual circumstances during testing. For example, the debriefing data for this experiment revealed that the one outlier was caused by an unavoidable household emergency which occurred during the administration of the test. These kinds of problems are likely to occur in any experiment, but they may be special threats to the validity and reliability of psychotelemetric data.
Overall, this experiment effectively demonstrated the feasibility of psychotelemetric measurement. The future of this form of data collection seems to be extremely promising. As the Internet grows and more individuals have access to its services, the potential subject pool will grow proportionately. The recent development of the World Wide Web also has potential for gathering psychotelemetric data quickly and efficiently. Further experiments along these lines are being planned.
The PsyLog program has undergone considerable change since its original form which was coded in 1982. A number of design criteria have become apparent as we have applied this software technology to several psychometric and psychophysical experiments.
A PsyLog experiment is composed of some or all of the following components. These components are specified and defined by their inclusion in a PsyLog command file.
A consent panel is invoked by the following line of code in a PsyLog command file.
This command line refers to a text file named ``priorConsent.txt'' which is stored in the same directory as the command file. The result of this command is the following panel:
A demographics panel is invoked by the following lines of code in a PsyLog command file.
item demoItem 3
stimulusText " We would like to start by asking a little about yourself.
Please fill in all of the blanks, but please feel free to skip any question
which you don't want to answer."
responseText "Street Address"
responseText "Years of Formal Education"
responseText "How many years have you used a NeXT?"
commentText "Do you have any comments?"
This section of PsyLog commands builds the following panel which is presented to the user. The subject can fill in the blanks in this panel and uses a mouse to click on the button labeled ``OK'' when the subject is finished answering the questions.
An item panel is invoked by the following lines of code in a PsyLog command file.
item LI01 1
stimulusText " Z Y X W V U T "
responseText "What is the next letter in this series? "
The subject is presented with the following panel. The subject has an opportunity to answer the questions and then uses a mouse to click on the button labeled ``OK'' when the subject is finished with the item.
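To illustrate how command file fragments like those shown above might be read by a program, the following Python sketch parses `item` blocks and their quoted directives. It is a guess at the syntax based only on the two fragments above and is not the PsyLog parser: the function name `parse_command_file` is hypothetical, the meaning of the number following the item name is not stated in the text and is simply stored, and quoted strings are assumed to contain no embedded double quotes.

```python
import re

# Either an item header ("item NAME NUMBER") or a directive with quoted text,
# which may span several lines. Syntax inferred from the fragments above.
TOKEN_RE = re.compile(
    r'\bitem\s+(?P<name>\w+)\s+(?P<count>\d+)'
    r'|(?P<key>\w+)\s+"(?P<text>[^"]*)"'
)


def parse_command_file(source):
    """Return a list of items, each with its name, number, and directives."""
    items = []
    for match in TOKEN_RE.finditer(source):
        if match.group("name") is not None:
            items.append({
                "name": match.group("name"),
                "count": int(match.group("count")),  # meaning not documented here
                "fields": [],
            })
        elif items:
            # Collapse line breaks and surrounding blanks inside the quotes.
            text = " ".join(match.group("text").split())
            items[-1]["fields"].append((match.group("key"), text))
    return items
```

Applied to the letter series fragment above, this returns one item named `LI01` holding a `stimulusText` field and a `responseText` field, in the order in which they appear in the file.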