Developing a Biodata Measure and Situational Judgment Inventory



Journal of Applied Psychology 2004, Vol. 89, No. 2, 187–207

Copyright 2004 by the American Psychological Association 0021-9010/04/$12.00 DOI: 10.1037/0021-9010.89.2.187

Developing a Biodata Measure and Situational Judgment Inventory as Predictors of College Student Performance
Frederick L. Oswald, Neal Schmitt, Brian H. Kim, Lauren J. Ramsay, and Michael A. Gillespie
Michigan State University
This article describes the development and validation of a biographical data (biodata) measure and situational judgment inventory (SJI) as useful predictors of broadly defined college student performance outcomes. These measures provided incremental validity when considered in combination with standardized college-entrance tests (i.e., SAT/ACT) and a measure of Big Five personality constructs. Racial subgroup mean differences were much smaller on the biodata and SJI measures than on the standardized tests and college grade point average. Female students tended to outperform male students on most predictors and outcomes with the exception of the SAT/ACT. The biodata and SJI measures show promise for student development contexts and for selecting students on a wide range of outcomes with reduced adverse impact.

Within the complex and competitive admissions process, colleges and universities seek out the best students possible for their institutions, in which “best” can be defined in many ways. Traditionally, selection systems for college admissions have used standardized tests of verbal and mathematical skills, and possibly records of achievement in specific subject matter areas. Such systems have worked well for decades, especially in comparison with alternatives that have been used or considered. Generally speaking, standardized cognitive ability tests are efficient for mass distribution and application, provide a standard of comparison across differing educational backgrounds, and demonstrate largely unparalleled criterion-related validities of approximately r = .45 with cumulative college grade point average (GPA), in addition to smaller but practically significant relationships with study habits, persistence, and degree attainment (Hezlett et al., 2001). However, critics argue there is substantial room for improvement with respect to the validity and practical utility of current selection tools (Breland, 1998; Payne, Rapley, & Wells, 1973).
In fact, some individuals such as Atkinson (2001), head of the University of California system, have called for abandoning the SAT-I (a test of general verbal reasoning, reading, and math problem-solving skills) and replacing it with a test or tests more closely related to school curricula, like the SAT-II (measures of English, history, science, social studies, math, and languages). Various stakeholders in admissions testing have been more strident in demanding new selection tools with adequate criterion-related validity, less adverse impact, and greater relevance to a broader conceptualization of performance in college. If new selection tools are to improve on the demonstrated criterion-related validity of the current knowledge and cognitively based predictors, it is likely that these tools will need to be based on a clear identification, definition, and measurement of a broader set of performance outcomes (Sackett, Schmitt, Ellingson, & Kabin, 2001) than the GPA and graduation outcomes usually used to evaluate instruments such as the SAT and the American College Testing (ACT) battery. As evidenced in mission statements and other promotional materials in print and on the Internet, colleges clearly want students who will succeed in the college environment, whether that means succeeding academically, interpersonally, psychologically, or otherwise. If one takes seriously what colleges claim to want in their students, then we argue that it is appropriate to reconsider traditional GPA and graduation criteria on several counts. First, traditional academic outcomes are useful for what they are intended to measure but insufficient when one considers the entire experience contributing to students’ performance and success in college. Furthermore, GPA as a composite measure is not standardized and may represent the outcome of some very different student behaviors, as reflected in different types of courses taught by different instructors. For instance, it is likely that students self-select into classes of differing difficulty levels or content domains on the basis of their ability and interest (Goldman & Slaughter, 1976), and thus student GPAs are not directly comparable.

Frederick L. Oswald, Neal Schmitt, Brian H. Kim, Lauren J. Ramsay, and Michael A. Gillespie, Department of Psychology, Michigan State University.

We thank Michele Boogaart, Michael Jenneman, Elizabeth Livorine, and Justin Walker for their data collection efforts, as well as the College Board for their support of this project.

Correspondence concerning this article should be addressed to Frederick L. Oswald, Michigan State University, Department of Psychology, 129 Psychology Research Building, East Lansing, MI 48824-1117. E-mail: [email protected]
Borrowing from current theories of job performance (Campbell, Gasser, & Oswald, 1996; Campbell, McCloy, Oppler, & Sager, 1993; Rotundo & Sackett, 2002), we reexamined the domain of college performance. Although traditional criteria for college student performance have tended to fall under the broad category of task performance (e.g., grades on specific assignments, measures of technical knowledge, GPA), the converging themes in college mission statements and other information about higher education encouraged us to expand into a criterion space that captures alternative dimensions such as social responsibility (Borman & Motowidlo, 1993) and adaptability and life skills (Pulakos, Arad, Donovan, & Plamondon, 2000). If the criterion domain of college performance is in fact broader and more complex than traditionally conceived and measured, then this in turn implicates broader and more complex combinations of individual abilities and characteristics that predict performance. Specifically, with a broader college performance domain against which admissions decisions are validated, we should find that measures of noncognitive constructs, such as social skills, interests, and personality, are also valid predictors of performance in college. Using a broad set of predictors that capture noncognitive as well as cognitive individual characteristics may reduce the level of adverse impact that typically results from the large subgroup mean differences on cognitively based tests (Hough, Oswald, & Ployhart, 2001; Sackett et al., 2001). Combining cognitive predictors with less cognitive alternative predictors in a compensatory model (such as a linear regression model) should then allow for selecting individuals who have somewhat lower levels of some cognitive abilities yet are still desirable college applicants overall.
Most of today’s colleges base applicant selection decisions on some combination of academic records, high school GPA, class rank, and SAT or ACT score (Breland, 1998; McGinty, 1997). Like standardized tests, high school GPA and rank appear to have relatively high criterion-related validities with college GPA, with correlations between .44 and .62 once corrections for measurement unreliability and range restriction are made (Hezlett et al., 2001). The use of other selection tools varies greatly because colleges can choose the “educational criteria, including racial diversity, that they wish to consider in admissions, so long as they do not apply different standards to different groups” (Regents of the University of California v. Bakke, 1978). Depending on their selectivity and demographic characteristics (e.g., public or private), colleges often request additional information about applicants’ prior achievements, background experiences, nonacademic talents, and interpersonal skills, all of which are intended to provide a holistic view of applicants and indicate the likelihood of their success in or contribution to a college. Popular methods of obtaining such information include achievement test scores, letters of recommendation, personal statements, lists of extracurricular activities, interviews, and peer references. There is some support for the incremental validity and practical usefulness of such measures over the more common predictors mentioned above (Cress, Astin, Zimmer-Oster, & Burkhardt, 2001; Ra, 1989; Willingham, 1985). However, these supplementary measures are problematic to the extent that (a) admissions personnel pay attention to, interpret, and weight this information in different ways; (b) admissions personnel rely on information about students’ past experiences that is to some extent idiosyncratic and not in a standardized format; (c) collecting and evaluating this information requires extra cost in time and resources; and (d) information is self-reported and may be difficult to verify (Willingham, 1998). Not implementing, scoring, or weighting such measures in a systematic manner across colleges, and not tying these measures to a relatively broad domain of college performance where supplementary measures may be more useful, preclude a solid conceptual understanding and a consistent and practical level of incremental validity above standardized test scores and high school GPA.
This article describes the development and validation of a situational judgment inventory (SJI) and biographical data (biodata) measure intended to evaluate students’ noncognitive attributes and to predict multiple dimensions of college student performance. We also determine the incremental validity of these measures above the validity of the SAT/ACT and existing Big Five personality measures (Digman, 1990). Incremental validity above personality measures is important because personality is a major component of the noncognitive domain, and because there is some evidence that biodata and SJI measures correlate with personality measures (e.g., Clevenger, Pereira, Wiechmann, Schmitt, & Harvey-Schmidt, 2001; Stokes & Cooper, 2001). If existing measures of general personality constructs account for the same variance as newly constructed biodata and SJI measures, then it would be much more economical and theoretically parsimonious to use only personality measures. Additionally, we extend the usual validation study conducted in academic situations by considering not only college GPA but also class attendance and peer and self-ratings across a broad set of performance dimensions reflected in the goal or mission statements of a representative cross-section of American universities. Because these latter outcomes are likely to be more highly related to noncognitive determinants than is GPA, consideration of a broad array of less cognitively loaded predictors should be informative.
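Incremental validity of this kind is typically quantified as the gain in squared multiple correlation (ΔR²) when the new measures are added to the traditional predictors in a hierarchical regression. The sketch below illustrates that computation; it is not the authors’ analysis code, and all array names are hypothetical:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS regression of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def incremental_r2(sat_act, big5, biodata_sji, gpa):
    """Delta-R^2 of biodata/SJI over SAT/ACT plus the Big Five."""
    base = np.column_stack([sat_act, big5])      # Step 1: traditional predictors
    full = np.column_stack([base, biodata_sji])  # Step 2: add the new measures
    return r_squared(full, gpa) - r_squared(base, gpa)
```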
Biodata measures provide a structured and systematic method for collecting and scoring information on an individual’s background and experience (Mael, 1991; Nickels, 1994). SJIs are multiple-choice tests intended to appraise how an applicant (for a job, or in this case for college) might react in different relevant contexts (Motowidlo & Tippins, 1993). In the college context, both measures have the potential for increased criterion-related validity over traditional measures, because item content can be tailored to specific dimensions of student performance in college and to the goals of a particular college or college-admissions process. Biodata and SJIs may also have greater practical utility than alternative subjective evaluations, such as the essays or reference letters commonly used in college admissions, because biodata and SJIs provide a fair and standardized method to obtain and score information about the broad range of prior educational and social experiences that applicants may have had. In developing any measure, however, the cost and time liabilities (e.g., developing substantively and psychometrically appropriate items, carefully developing empirical scoring keys) must be weighed against the potential advantages. In this context, it is likely that objectively scored biodata and SJI instruments will collect such information more efficiently than reading and scoring essays and application blanks.
Expanding the Criterion Space of College Student Performance
Conceptualizing and evaluating the successful development of college students should reflect some function of the multiple goals and outcomes desired by students, the school administration, legislators, and others with a vested interest (Willingham, 1985). Theoretically, the concern in the educational literature for multiple dimensions of college performance parallels the development of multidimensional models of job performance in the industrial/organizational (I/O) psychology literature (e.g., Borman & Motowidlo, 1997; Campbell et al., 1993). In an early attempt to understand multiple dimensions of college performance systematically, Taber and Hackman (1976) identified 17 academic and nonacademic dimensions to be important in classifying successful and unsuccessful college students. Examples of these dimensions are intellectual perspective and curiosity, communication proficiency, and ethical behavior. Furthermore, college students actively engaged across numerous domains have tended to achieve greater success in their overall college experience as reflected in their scholastic involvement, accumulated achievement record, and their graduation (Astin, 1984; Willingham, 1985).
Our own effort to identify the number and nature of dimensions of college student performance was an exploratory information-gathering process that followed two primary guidelines. First, the number of dimensions should not be so many that the information is unwieldy, yet not so few that the domain of college performance is oversimplified and not appropriately represented. Second, we wanted to understand how a variety of stakeholders in the process and outcomes of college education define student success in college, because relying on one source alone could lead to biased or deficient definitions and representations of the college performance domain. The 12 dimensions that resulted stem from themes contained within the mission statements and stated educational objectives we sampled across a range of colleges and universities (see the Method section for procedural details). These dimensions are defined in Table 1, and they are referred to in the text in abbreviated form. They deal with intellectual behaviors (Knowledge, Learning, and Artistic), interpersonal behaviors (Multicultural, Leadership, Interpersonal, and Citizenship), and intrapersonal behaviors (Health, Career, Adaptability, Perseverance, and Ethics).
Biodata
Biographical data, or biodata, contain information about one’s background and life history (Clifton, Mumford, & Baughman, 1999; Mael, 1991; Nickels, 1994). Despite the informal use of similar information in college applications (e.g., extracurricular activity lists and resumes), we undertook the development of a biodata inventory with standard multiple-choice responses to questions about one’s previous experiences, in a manner similar to that of biodata tests used in employee selection.

Table 1
Twelve Dimensions of College Performance

Intellectual behaviors
  Knowledge, learning, and mastery of general principles (Knowledge): Gaining knowledge and mastering facts, ideas, and theories and how they interrelate, and understanding the relevant contexts in which knowledge is developed and applied. Grades or grade point average can indicate, but not guarantee, success on this dimension.
  Continuous learning, and intellectual interest and curiosity (Learning): Being intellectually curious and interested in continuous learning. Actively seeking new ideas and new skills, both in core areas of study and in peripheral or novel areas.
  Artistic and cultural appreciation and curiosity (Artistic): Appreciating art and culture, either at an expert level or simply at the level of one who is interested.

Interpersonal behaviors
  Multicultural tolerance and appreciation (Multicultural): Showing openness, tolerance, and interest in a diversity of individuals (e.g., by culture, ethnicity, or gender). Actively participating in, contributing to, and influencing a multicultural environment.
  Leadership (Leadership): Demonstrating skills in a group, such as motivating others, coordinating groups and tasks, serving as a representative for the group, or otherwise performing a managing role in a group.
  Interpersonal skills (Interpersonal): Communicating and dealing well with others, whether in informal social situations or more formal school-related situations. Being aware of the social dynamics of a situation and responding appropriately.
  Social responsibility, citizenship, and involvement (Citizenship): Being responsible to society and the community and demonstrating good citizenship. Being actively involved in the events in one’s surrounding community, which can be at the neighborhood, town/city, state, national, or college/university level. Activities may include volunteer work for the community, attending city council meetings, and voting.

Intrapersonal behaviors
  Physical and psychological health (Health): Possessing the physical and psychological health required to engage actively in a scholastic environment. This would include participating in healthy behaviors, such as eating properly, exercising regularly, and maintaining healthy personal and academic relations with others, as well as avoiding unhealthy behaviors, such as alcohol/drug abuse, unprotected sex, and ineffective or counterproductive coping behaviors.
  Career orientation (Career): Having a clear sense of the career one aspires to enter, which may happen before entry into college or at any time while in college. Establishing, prioritizing, and following a set of general and specific career-related goals.
  Adaptability and life skills (Adaptability): Adapting to a changing environment (at school or home), dealing well with gradual or sudden and expected or unexpected changes. Being effective in planning one’s everyday activities and dealing with novel problems and challenges in life.
  Perseverance (Perseverance): Committing oneself to goals and priorities set, regardless of the difficulties that stand in the way. Goals range from long-term goals (e.g., graduating from college) to short-term goals (e.g., showing up for class every day even when the class is not interesting).
  Ethics and integrity (Ethics): Having a well-developed set of values, and behaving in ways consistent with those values. In everyday life, this probably means being honest, not cheating (on exams or in committed relationships), and having respect for others.

Note. Summary label for each dimension is in parentheses. These labels are used in subsequent tables.


Several decades of research in the employment arena have indicated that biodata instruments are usefully related to job performance measures (see Hunter & Hunter, 1984; Mumford & Stokes, 1992; Schmidt & Hunter, 1998; Schmitt, Gooding, Noe, & Kirsch, 1984). Further studies (Brown, 1981; Rothstein, Schmidt, Erwin, Owens, & Sparks, 1990) have explored the degree to which such instruments generalize across companies and industries, and Stokes and Cooper (2001) have also demonstrated that biodata items can be written to reflect meaningful psychological constructs. Perhaps most relevant to the research reported in this article is work reported by Owens and his colleagues (Mumford, Stokes, & Owens, 1990; Owens & Schoenfeldt, 1979; Stokes, Mumford, & Owens, 1989) in which they developed measures of developmental patterns of life history experiences, collected data from large groups of college students, and reported meaningful relationships with a variety of subsequent academic and life outcomes. They also found evidence of considerable stability in these life history patterns over time.
The present study was undertaken for several reasons. First, we conceptualized and modeled the college student performance domain broadly, consistent with the stated objectives of a broad cross-section of U.S. universities. This led to the development of outcome measures corresponding to these performance dimensions, while still including GPA as a traditional measure of student performance. We also used this performance domain as the blueprint by which we developed biodata and situational judgment measures as predictors. Second, we evaluated the psychometric adequacy of these measures. Third, we assessed the relationship between the biodata and SJI measures and college performance outcomes as evidence of their validity. Fourth, in multivariate analyses, we assessed the degree to which the biodata and SJI provided incremental validity over the SAT or ACT and a structured and widely available Big Five personality measure. The latter was important because several of the dimensions measured with the biodata and SJI were similar to personality constructs. Fifth, we used item-level empirical keying methods to develop a subset of biodata and SJI items with high criterion-related validity across samples. Finally, because of the concern with the adverse impact resulting from standardized cognitive ability and achievement tests, we examined mean differences in our instruments across racial and gender subgroups.
Situational Judgment Inventories
Situational judgment inventories (SJIs) are measures in which respondents choose or rate possible actions in response to hypothetical situations or problems. SJIs tend to be less costly to construct and administer than more complex simulations like work samples and assessment centers (Motowidlo, Dunnette, & Carter, 1990). Although SJIs have been in and out of favor in employment contexts for more than 80 years, there has been a renewed interest because of their validity as employment tests designed to predict job performance. A meta-analysis by McDaniel, Morgeson, Finnegan, Campion, and Braverman (2001) estimated that SJIs have an overall criterion-related validity of ρ = .34, though there appears to be substantial variability associated with that value (σρ = .14, with a 90% credibility interval of .09 to .69), with job complexity as a potential moderator (Huffcutt, Weekley, Wiesner, DeGroot, & Jones, 2001). In the employment context, the use of SJIs usually reduces adverse impact for minorities compared with that of cognitive tests (Pulakos & Schmitt, 1996; Sackett et al., 2001), and the SJI produces favorable test-taker reactions (Hedlund et al., 2001) as well as high perceptions of face validity (Clevenger et al., 2001; McDaniel et al., 2001). Such support for SJIs in employment settings suggests that they may be a viable supplement or alternative to traditional cognitive ability testing in college admissions as well, although we are aware of only one previous application in academic prediction. Hedlund et al. (2001) found an SJI to have rather small incremental validity above GMAT scores for MBA students (ΔR² = .03). Our SJI development is based on a different set of considerations and methods than was the Hedlund et al. effort, including a broader definition of student performance.
Although the research on SJIs so far indicates that they hold promise as valid predictors of job performance, the construct validity of SJIs remains elusive (Clevenger et al., 2001). Unlike “purer” measures of ability or personality, SJIs reflect complex, contextualized situations and events. It is therefore reasonable to think that the constructs measured by SJIs are related to yet somewhat distinct from cognitive ability (Sternberg et al., 2000), and that they have much in common with personality constructs as well, because SJIs rely on individuals’ subjective judgments of response-option appropriateness. SJIs are merely measurement methods with content tailored to a particular context, though, and therefore their correlations with personality and cognitive ability may vary widely across the situations in which SJIs are developed.
Method
Sample
Six hundred fifty-four first-year undergraduate students at a large midwestern university volunteered for this study and received $40 for their participation. Of these, 644 provided usable data after screening for careless responses. Students were recruited through their classes, housing units, and through the student newspaper. Mean age was 18.5 years (SD = 0.69). Seventy-two percent were female. Seventy-eight percent were White, 9.5% were African American, 1.9% were Hispanic American, 5.3% were Asian American, and 4.5% were from other racial/ethnic groups. This sample very nearly matched the university in racial/ethnic composition: 77.3% of the university's students were White, 9.8% were African American, 1.9% were Hispanic American, 5.4% were Asian American, and 5.6% were from other groups. Our sample overrepresented female students, as 55% of the university’s freshmen were women. In the admissions process at this university, students completed an application that required the usual admissions materials including the ACT, high school transcript, basic demographic data, previous schools attended, extracurricular activities, and high school honors and activities. Typically, neither the educational literature nor colleges themselves indicate clearly how and to what extent information such as activities, awards, and past experiences is used in making actual admissions decisions.
Measures
Dimensions of College Performance
Several of our measures (biodata, SJI, self-rated and peer-rated behaviorally anchored rating scales) were developed on the basis of 12 dimensions of college performance. The process of establishing these dimensions first involved examining the Web pages of colleges and universities, selecting institutions of differing levels of prestige as indicated by U.S. News and World Report rankings. Specifically, we read the content of the home pages for 35 colleges and universities, searching for explicitly stated educational objectives or mission statements; if the Web page had a search engine, we also entered relevant search terms. These colleges and universities varied on characteristics such as public/private status and large/small enrollment, and 23 institutions provided usable information. Institutions not providing usable information did not explicitly state their educational objectives or provide a university mission statement. There were no apparent systematic differences between those institutions providing usable information and those that did not. The information gathered from the Web pages was parsed into smaller discrete sentence fragments, retaining the original wording as much as possible. For example, the sentence fragment “promote a commitment to learning, freedom, and truth” was decomposed into “promote a commitment to learning,” “promote a commitment to freedom,” and “promote a commitment to truth.” Decomposing these fragments resulted in 174 separate goal statements (including content overlap across institutions). Independently, three of the present authors rationally sorted the statements into as many or as few clusters as they liked; then, in a subsequent group meeting, they agreed on 12 dimensions through joint discussion of their independent sorts. It is clear that our sampling from colleges and universities was far from exhaustive; however, it was representative enough that the college performance domain emerged as truly multidimensional, representing a wide domain of the college experience. It would be difficult to imagine adding many more dimensions to the framework while remaining at this level of construct generality. To be sure, however, we concurrently interviewed a lead administrator at the Michigan State University Department of Residence Life, who provided us with University Residence Life materials that we content analyzed. Finally, criteria identified through our Web search and from university resources were compared against college performance criteria identified in other related educational research (Beatty, Greenwood, & Linn, 1999; Patelis & Camara, 1999; Sackett et al., 2001; Taber & Hackman, 1976; University of Pennsylvania, 2000; Wightman & Jaeger, 1998).
At this point we proceeded with the 12 performance dimensions, and the same three raters then independently re-sorted the goal statements, now reduced to 134 statements because of content redundancies. Of those statements, 85 (62%) were agreed upon by all three raters, and 129 (96%) were agreed upon by at least two of the three raters. After this re-sorting task, each of the identified dimensions was compared with similar dimensions in the industrial and organizational, educational, and vocational psychology literature involving college populations. In some instances, the dimension labels and definitions were modified to be more consistent with the language of the current research literature. Rationally and systematically combining information from these diverse sources resulted in the 12 dimensions of Table 1.
Predictor Measures
Big Five. The Big Five personality traits, also known as the Five-Factor Model (FFM), represent the most commonly, although not universally, accepted personality framework in the current psychological literature (Goldberg, 1993; McCrae & Costa, 1999; Wiggins & Trapnell, 1997). FFM personality traits were measured using a 50-item personality measure from the International Personality Item Pool (Goldberg, 1999). This measure is psychometrically comparable with other commonly used measures of the FFM of personality, such as the NEO–Personality Inventory (Costa & McCrae, 1992). Goldberg (1999) reported the mean coefficient alpha for each of the five scales (10 items each) to be .84, indicating an acceptable degree of internal consistency. Our data were consistent with this, with alphas of .88, .81, .83, .84, and .76, respectively, for the scales of Extraversion (E), Agreeableness (A), Conscientiousness (C), Emotional Stability (ES, essentially the opposite of Neuroticism), and Openness (O).

Social desirability. The tendency for respondents to give socially desirable responses on noncognitive measures such as the biodata and personality measures is well documented in the social and personality psychology literature (Paulhus, 1988). To assess the degree to which our measures might be susceptible to social desirability, we administered the Paulhus measures of self-deception and impression management. Each measure contained 19 items, as we removed 2 items that seemed too intrusive to use in the present context (“I never read sexy books or magazines” and “I have sometimes doubted my ability as a lover”). Paulhus (1991) presented evidence that these measures possess adequate reliability; in our study, coefficient alphas were .62 for self-deception and .80 for impression management, indicating marginal and acceptable levels of internal-consistency reliability, respectively.
SAT/ACT. Authorization to obtain SAT or ACT scores was secured as part of the informed-consent procedures used in the data collection. Out of 644 participants, we obtained 151 SAT scores and 610 ACT scores. All participants had taken one of these tests, and many had taken both as part of their application to different universities. SAT and ACT composite scores were correlated .85; thus, these variables were standardized on national norms within each test and, if necessary, combined, resulting in a single index of the participants’ ability or preparation to do college work.
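A minimal sketch of this standardize-then-combine step appears below. The national norm means and standard deviations shown are placeholders for illustration, not the actual norm values used in the study:

```python
import numpy as np

SAT_NORM = (1020.0, 210.0)  # hypothetical national mean and SD for the SAT composite
ACT_NORM = (21.0, 4.7)      # hypothetical national mean and SD for the ACT composite

def ability_index(sat=None, act=None):
    """Standardize each available score on its national norms; average if both exist."""
    zs = []
    if sat is not None:
        zs.append((sat - SAT_NORM[0]) / SAT_NORM[1])
    if act is not None:
        zs.append((act - ACT_NORM[0]) / ACT_NORM[1])
    return sum(zs) / len(zs) if zs else np.nan
```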
Biodata. Multiple sources were searched for preexisting biodata items that would relate to the aforementioned 12 performance dimensions (see Table 1). This search identified 197 items whose content was judged to be relevant to one of our dimensions and to the college context. Most of our biodata items were adapted from Pulakos, Schmitt, and Keenan (1994) and Mumford (2001). However, we also reviewed the content of the University of Georgia Biographical Questionnaire (Owens, Albright, & Glennon, 1966), the Assessment of Background and Life Experiences (Hough, Eaton, Dunnette, Kamp, & McCloy, 1990), the Personnel Reaction Blank (Gough & Arvey, 1998), a biographical questionnaire by Russell, Green, and Griggs (n.d.), and a biodata measure developed by Schmitt and Kunce (2002). Items varied in the type of response scale (frequency of behavior, Likert scale) and also in the nature of the constructs measured (past beliefs and attitudes, behaviorally based experiences). All item stems were modified to be appropriate for the college context. After this process, several of our 12 dimensions still lacked a sufficient number of items, so we rationally generated additional items for those dimensions.
Many if not most of the preexisting items had response options that did not apply to the college student population or were otherwise inappropriate, so item content and response options were rewritten accordingly. Revised items were pilot tested on six paid college students who supplied open-ended responses; response options were subsequently modified to reflect a reasonable range of answers, and items showing little variance or content redundancy with other items were dropped.
The stability of the structure of the rational or theoretical inventory was established by assessing interrater agreement on a rational sort of the items. Specifically, six researchers resorted all items back into the 12 dimensions. Items on which five of six raters agreed with the originally assigned dimension were retained; those on which four of six agreed were discussed and rewritten or dropped, and those with less agreement were discarded. When all six raters assigned an item to one dimension other than the one to which it was originally assigned, it was reassigned to that new dimension. Using these criteria, we discarded 5 items and reassigned several to a new dimension. The final biodata inventory then consisted of 115 items representing our 12 dimensions, each scored on a 4- or 5-point scale. Sample biographical data items can be found in Appendix A.
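The retention rules in the preceding paragraph amount to a simple decision function over the six raters’ assignments. A sketch of that logic follows, assuming each vote is a dimension label; the function name and return strings are ours, not the authors’:

```python
def item_decision(votes, assigned_dim):
    """Apply the six-rater retention rules to one biodata item.

    votes: list of the six raters' dimension assignments for the item.
    assigned_dim: the dimension the item was originally written for.
    """
    n_agree = votes.count(assigned_dim)
    if n_agree >= 5:
        return "retain"
    if n_agree == 4:
        return "discuss, then rewrite or drop"
    if len(set(votes)) == 1:          # all six raters chose one other dimension
        return f"reassign to {votes[0]}"
    return "discard"
```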
SJI. A search of existing SJI measures led to item stems that were adapted to our 12 dimensions of college student performance (see Table 1). We recruited and paid undergraduate students at a large midwestern university to participate in developing our SJI further. First, students generated critical incidents for use as additional item stems for dimensions underrepresented by existing SJI items. Next, an independent set of students created multiple response options for each item stem. Finally, we developed a scoring key based on the responses of advanced (junior and senior) college students in an undergraduate course in psychological measurement, who responded to the SJI items as part of a course project (N = 42). Each item presents a situation about which students made two judgments, indicating which responses would be the “best” and “worst.” The scoring key was then developed from these responses. Item response options became part of the scoring key if their means showed statistically significant differences from each other in the frequency with which they were endorsed as “best” or “worst” (details of the empirical scoring procedure are in Motowidlo et al., 1990, and Motowidlo, Russell, Carter, & Dunnette, 1988). We then sorted all items for which the scoring key was developed back into our 12 performance dimensions. Items with less than 75% agreement were discarded from the inventory; items with greater than 75% agreement that were nevertheless sorted into a different category were discussed. This resulted in a total of 57 items for the final SJI instrument, in which each scale consisted of 3 to 6 items. Individuals could receive a score on each item ranging from +2 (if they agreed with both the “best” and “worst” response keys) to −2 (if they indicated that the “best” option was the worst and the “worst” option was the best). Refer to Appendix B for sample SJI items.
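One plausible reading of the −2 to +2 item scoring just described is sketched below: each of the respondent’s two judgments earns +1 for matching the keyed option and −1 for choosing the opposite keyed option. The article does not spell out intermediate cases, so the zero scores here are an assumption:

```python
def score_sji_item(resp_best, resp_worst, key_best, key_worst):
    """Score one SJI item from -2 to +2 under the interpretation above."""
    score = 0
    if resp_best == key_best:      # chose the keyed "best" as best
        score += 1
    elif resp_best == key_worst:   # chose the keyed "worst" as best
        score -= 1
    if resp_worst == key_worst:    # chose the keyed "worst" as worst
        score += 1
    elif resp_worst == key_best:   # chose the keyed "best" as worst
        score -= 1
    return score
```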
Outcome Measures
GPA. With university authorization, we obtained study participants’ GPAs from the registrar’s office as part of the informed-consent process. First- and second-semester GPAs were obtained for 621 of the respondents; these two values were averaged to yield a first-year college GPA measure. First-year GPA was judged to be a useful outcome within the domain of college performance: although it may be less related to long-term outcomes than other criteria, it is a critical criterion for staying in college during the early years (vs. being put on probation or being expelled). Having longitudinal college-grade data would also be of interest in future research.
Absenteeism. Absenteeism was assessed with a single self-report measure whereby participants responded by selecting the approximate number of classes they had missed in the past 6 months on a 5-point scale ranging from less than 5 to more than 30.
Behaviorally anchored rating scale for multiple dimensions of college performance. The 12 dimensions served as a guide in developing a behaviorally anchored rating scale (BARS). For each of the 12 BARS items, a dimension name and its definition were presented along with two examples of college-related critical incidents and various behavioral anchors that reflected three levels of performance on a 7-point scale, which ranged from unsatisfactory to exceptional. Both critical incidents and anchors were taken from the incidents and response options generated during SJI development. Also, we collected data on a peer- and self-rated version of these BARS, both of which referred to a student’s performance in college. See Appendix C for sample BARS items.
Self-rated BARS items were administered after the biodata and SJI questions in a larger test administration described below. Dimensionality in these ratings would provide some evidence for distinct dimensions of performance, as was found in past studies of task and contextual dimensions of job performance (LePine, Erez, & Johnson, 2002; Rotundo & Sackett, 2002). However, a principal-axis exploratory factor analysis (EFA) yielded a large first factor that accounted for 32% of the variance and four times as much variance as the second factor. Multiple-factor models did not provide a readily interpretable solution. A confirmatory factor analysis (CFA) of these ratings using LISREL 8.51 (Jöreskog & Sörbom, 2001) yielded support for a single-factor model, χ²(54, N = 641) = 122.71, p < .01; root-mean-square error of approximation (RMSEA) = .05; comparative fit index (CFI) = .95; and nonnormed fit index (NNFI) = .93. Coefficient alpha for the 12 BARS ratings was .80. Thus, subsequent data analyses used a composite rating based on the mean of the 12 BARS items.
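The composite and its internal consistency can be reproduced mechanically from an n × 12 matrix of ratings. A sketch using the standard coefficient alpha formula, with `bars` as a hypothetical NumPy array of the 12 self-rated BARS items:

```python
import numpy as np

def coefficient_alpha(items):
    """Cronbach's alpha for an (n_people, k_items) matrix of item scores."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_var_sum / total_var)

# Hypothetical usage:
# alpha = coefficient_alpha(bars)   # reported as .80 for the 12 BARS ratings
# composite = bars.mean(axis=1)     # the mean composite used in later analyses
```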

The 12 peer-rated BARS items were identical to the self-rated BARS except for appropriate wording changes from first to third person. During the test administration, study participants were asked to nominate a peer who knew their work well and could provide ratings of the participant on the same 12 dimensions. Follow-up contacts of these peers by e-mail led to ratings of 145 participants, and raters were compensated $5 for their participation. Note that most of the peers were friends or roommates (83%); other categories of peers included resident assistants, teaching assistants, or professors. Peers had largely known the participant for 6 months to a year (40%) or for 3 years or more (40%). Subgroups, both by peer-rater type and by length of acquaintanceship, were too small to allow for meaningful post hoc statistical tests (e.g., for differences in correlations); however, different types of peer raters may certainly provide different sources of insight into the student being rated. Similar to the self-ratings, EFA and CFA analyses of these peer ratings using LISREL 8.51 yielded support for a single-factor model, χ²(54, N = 145) = 84.38, p < .01, RMSEA = .06, CFI = .93, NNFI = .91, and coefficient alpha of the composite of the peer ratings was .83. The fourth measure of student performance was this composite peer rating, which also was an average across BARS dimension ratings; the composite peer rating was available only for these 145 respondents.
Administration of the Paper-and-Pencil Tests
All of the measures were administered with a series of four booklets, in small group administrations (M = 15.19 participants, SD = 8.12). Participants were provided with test booklets and machine-scannable answer sheets. Trained proctors adhered to a script and read test instructions verbatim, similar to standardized test procedures. Sessions were scheduled to last 4 hr, allowing participants sufficient time to complete the various measures. Breaks were held after the administration of the first and second booklets.
Two forms of the test were randomly assigned, Form A and Form B. The two forms were identical, except that some of the biodata questions on Form B required participants to elaborate in writing in support of their multiple-choice responses. Written responses were not scored; they were requested as part of an effort to control for the impact of social desirability, the results of which are reported in another study (Schmitt et al., 2003). Random assignment of test forms was done by group so that the test-taking experience would be similar within each group and participants would not notice some people doing substantially more or less writing than others. Written elaboration did have an impact on the descriptive statistics for some of the biodata responses, so it was appropriate to standardize all items within form before creating composites or conducting correlational analyses.
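Standardizing within form removes the mean and variance differences the elaboration requirement introduced before scores are pooled. A sketch using pandas, where `df` is a hypothetical data frame with a 'form' column ('A' or 'B') and one column per biodata item:

```python
import pandas as pd

def standardize_within_form(df, item_cols):
    """z-score each item separately within Form A and Form B."""
    out = df.copy()
    out[item_cols] = df.groupby("form")[item_cols].transform(
        lambda x: (x - x.mean()) / x.std(ddof=1)
    )
    return out
```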
Results
Scale-level descriptive statistics, reliability coefficients, and intercorrelations for our 12 college performance dimensions are shown for the biodata and SJI scales in Tables 2 and 3, respectively. For the biodata scales, most coefficient alphas are acceptable (above .70) or marginal (Perseverance α = .63), and although the Interpersonal, Career, and Adaptability alpha reliabilities are poor, the latter two scales had fewer than the usual 10 items because of subsequent item and scale refinement. Intercorrelations between biodata scales did not approach the reliability of the scales in almost all cases, indicating reasonable levels of discriminant validity (see Table 2, above the diagonal). Correlations with self-deception and impression management are modest in most cases, though the correlation between the Ethics scale and impression management is quite high (r = −.54). In general, these correlations indicate that the two social desirability dimensions covary with these biodata measures. This covariation may be a problem for the use of biodata measures as the basis for admission decisions. However, recent assessments of the correlation of social desirability (Ones, Viswesvaran, & Reiss, 1996) and impression management (Viswesvaran, Ones, & Hough, 2001) with job performance indicate these correlations are near zero. Also, across four employment data sets, Ellingson, Smith, and Sackett (2001) found consistent and fairly large mean differences on personality measures between high scorers and low scorers on a scale of socially desirable responding, though the factor structure among the personality predictors remained stable. The implication of these studies is that, at least in the employment context, social desirability may correlate with various predictors and influence their means in a socially desirable direction, but neither the correlation nor the mean increase appears to impact criterion-related validity greatly. Other strategies for detecting faking or socially desirable responding (Stark, Chernyshenko, Chan, & Lee, 2001) may have different implications for criterion-related validity.
Table 2
Biodata Scales: Descriptive Statistics, Intercorrelations Between Scales, and Correlations With Self-Deception and Impression Management

Scale             k    SD(a)   1    2    3    4    5    6    7    8    9    10   11   12   S-D   IM
1. Knowledge      10   5.31   .72  .75  .47  .56  .43  .42  .51  .40  .37  .29  .80  .47   .29  −.28
2. Learning        9   4.72   .52  .67  .78  .88  .56  .38  .63  .26  .22  .40  .35  .25   .29  −.18
3. Artistic        9   5.97   .37  .58  .84  .89  .41  .24  .57  .03  .09  .15  .23  .09   .18  −.11
4. Multicultural  10   5.66   .41  .63  .71  .76  .56  .31  .63  .13  .12  .29  .37  .13   .16  −.12
5. Leadership     10   5.92   .32  .41  .33  .44  .79  .50  .76  .39  .23  .33  .58  .10   .18  −.11
6. Interpersonal  10   4.16   .25  .21  .15  .19  .30  .47  .32  .57  .22  .76  .57  .30   .32  −.26
7. Citizenship    10   5.26   .37  .44  .44  .46  .57  .19  .71  .21  .40  .15  .53  .30   .19  −.22
8. Health         10   5.25   .29  .18  .02  .10  .29  .33  .15  .71  .13  .60  .57  .21   .25  −.16
9. Career          3   2.15   .23  .13  .06  .07  .15  .11  .24  .08  .53  .11  .58  .35   .21  −.15
10. Adaptability   7   3.74   .32  .25  .11  .19  .22  .40  .09  .39  .06  .58  .60  .24   .36  −.18
11. Perseverance   8   4.22   .54  .36  .17  .25  .41  .31  .36  .38  .34  .36  .63  .48   .35  −.29
12. Ethics         6   3.86   .34  .17  .09  .10  .08  .18  .22  .15  .22  .15  .32  .72   .21  −.54

Note. N = 638. Coefficient alpha reliability coefficients are on the main diagonal; observed correlations are in the lower triangle, and correlations corrected for attenuation due to measurement unreliability are in the upper triangle. Refer to Table 1 for definitions of scales. k = number of items; S-D = self-deception; IM = impression management.
(a) Means of the biodata scales are all near zero because Forms A and B were standardized before computing composites; Form B had some items requiring written elaboration to multiple-choice responses.
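The corrected correlations in the upper triangles of Tables 2 and 3 follow the standard correction for attenuation, dividing each observed correlation by the square root of the product of the two scales' alpha reliabilities:

\hat{\rho}_{xy} = \frac{r_{xy}}{\sqrt{\alpha_x \, \alpha_y}}

Because this correction yields a point estimate, low reliabilities (as for several SJI scales) can push corrected values above 1.0, as the note to Table 3 observes.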


Table 3
SJI Scales: Descriptive Statistics, Intercorrelations Between Scales, and Correlations With Self-Deception and Impression Management

Scale             k     M     SD    1    2    3    4    5    6    7    8    9    10   11   12   S-D   IM
1. Knowledge      3   2.726  2.01  .37  .79  .73  .53  .69  .60  .65 1.02  .72  .38 1.05  .60   .14  −.24
2. Learning       5   2.849  2.50  .23  .23  .98  .99  .78 1.00  .76 1.59  .95 1.14  .26  .85   .08  −.13
3. Artistic       5   1.981  2.92  .28  .29  .39 1.04  .84  .83  .96  .96  .58  .97  .70  .21   .17  −.14
4. Multicultural  5   3.609  2.69  .20  .30  .41  .39  .71  .90  .84  .97  .67  .76  .55  .58   .13  −.19
5. Leadership     5   3.244  3.00  .28  .25  .35  .29  .44  .94  .83  .99  .60  .90  .78  .54   .19  −.16
6. Interpersonal  4   2.273  2.29  .21  .28  .30  .33  .36  .34 1.00 1.10  .65  .87  .70  .71   .09  −.21
7. Citizenship    5   2.423  2.53  .22  .21  .34  .30  .31  .33  .32 1.14  .45  .75  .70  .68   .16  −.24
8. Health         4   2.845  2.15  .29  .36  .28  .28  .31  .30  .30  .22 1.06 1.37 1.17  .86   .06  −.22
9. Career         5   4.978  2.71  .28  .29  .23  .27  .25  .24  .16  .31  .40  .95  .81  .54   .03  −.10
10. Adaptability  5   5.207  2.86  .40  .31  .35  .27  .34  .29  .24  .37  .35  .33 1.05  .66   .14  −.18
11. Perseverance  5   1.495  2.84  .41  .19  .28  .22  .34  .26  .26  .36  .33  .39  .42  .48   .15  −.22
12. Ethics        6   3.622  3.15  .27  .30  .18  .27  .26  .31  .29  .30  .25  .28  .23  .55   .07  −.30

Note. N = 634–642. Coefficient alpha reliability coefficients are on the main diagonal; observed correlations are in the lower triangle, and correlations corrected for attenuation due to measurement unreliability are in the upper triangle. Corrected correlations are point estimates, and some exceed 1.0. Refer to Table 1 for definitions of scales. SJI = situational judgment inventory; k = number of items; S-D = self-deception; IM = impression management.


In contrast with Table 2, the data in Table 3 regarding the psychometric adequacy of the SJI are much less encouraging. The coefficient alphas are low, and intercorrelations between the scales indicate a lack of evidence for discriminant validity. This may be a function of the small number of items in each scale and the general complexity of situational-judgment items, and it is also consistent with previous research on situational judgment measures that usually treats items together as a single construct (McDaniel et al., 2001). Correlations between the SJI scales and self-deception are low relative to the biodata scales, and the SJI Ethics scale correlates with impression management lower than the biodata Ethics scale does, but these findings may be partly a function of the lower internal-consistency reliability of the SJI measures, which would systematically attenuate observed correlations. An EFA of the SJI items revealed the presence of a relatively large general factor accounting for three times the variance of the second factor. Additional factors accounted for small portions of variance, however, and were also difficult to interpret substantively. Because of the lack of discriminant validity and internal consistency of the SJI subscales, we computed a composite SJI index. This measure had high internal consistency reliability (α = .85), suggesting that although it was clearly appropriate to sample content representatively across the 12 dimensions of college performance, the corresponding SJI scales are best used as an overall composite reflecting judgment across a variety of situations relevant to college life. The nature of the constructs being measured with this SJI composite can be understood by examining its correlates in the tables. Table 4 also presents the observed correlations between the SJI composite and the biodata scales, showing that the SJI and biodata are correlated yet distinct, with each having the potential for incremental validity in the prediction of student performance outcomes.
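The dominance of a general factor can be gauged from the eigenvalues of the item correlation matrix. The sketch below uses a PCA-style eigen decomposition as a rough stand-in for the principal-axis EFA actually reported, so the ratio it returns will only approximate the factor-variance ratio in the text (variable names hypothetical):

```python
import numpy as np

def first_to_second_eigenvalue_ratio(items):
    """Eigenvalue ratio for an (n_people, k_items) score matrix."""
    R = np.corrcoef(items, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending order
    return eigvals[0] / eigvals[1]

# A ratio near 3, as reported above for the SJI items, signals a strong
# general factor relative to the next factor.
```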
Correlations of Biodata and SJI With SAT/ACT and Big Five Measures
Table 5 reports correlations of the SJI composite and the biodata scales with the Big Five (Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness) and the SAT/ACT measure. These correlations are important for two reasons. First, they provide some evidence for the construct validity of the biodata and SJI measures (i.e., the scales are part of a theoretically sensible nomological net). Second, if these correlations are too high, these measures provide redundant information already available through standardized instruments. Correlations between personality measures and the biodata measures were, in fact, relatively high, yet not so high as to preclude the possibility that biodata and the SJI composite will add incrementally to the prediction of student performance outcomes. The strongest positive correlations were found between Extraversion and biodata Leadership (r = .48) and biodata Interpersonal (r = .45); Agreeableness and biodata Interpersonal (r = .33); Conscientiousness and biodata Knowledge (r = .40), biodata Perseverance (r = .47), and biodata Ethics (r = .30); Emotional Stability and biodata Interpersonal (r = .36), biodata Health (r = .33), and biodata Adaptability (r = .35); and Openness and biodata Knowledge (r = .34), biodata Learning (r = .52), and biodata Artistic (r = .50). None of the correlations involving the SAT/ACT measure were higher than .12. In general, the magnitude of the correlations is consistent with the constructs being correlated.

Table 5
Biodata and SJI: Correlations With the IPIP (Big Five) and SAT/ACT

Scale                 E     A     C     ES    O   SAT/ACT

Biodata
  Knowledge         .14   .26   .40   .13   .34    .06
  Learning          .20   .21   .15   .18   .52    .07
  Artistic          .19   .23   .02   .11   .50    .07
  Multicultural     .23   .23   .03   .12   .38    .10
  Leadership        .48   .23   .15   .10   .24    .05
  Interpersonal     .45   .33   .16   .36   .16   −.06
  Citizenship       .22   .28   .16   .11   .26   −.01
  Health            .17   .09   .26   .33   .06    .01
  Career            .06   .13   .23  −.01   .08   −.12
  Adaptability      .26   .17   .27   .35   .20    .05
  Perseverance      .24   .27   .47   .16   .23   −.10
  Ethics           −.12   .25   .30   .12   .11    .00

SJI
  SJI composite     .17   .38   .28   .17   .21   −.03

Note. All correlations with magnitudes above .06 are statistically significant at p < .05; all correlations with magnitudes above .09 are statistically significant at p < .01. Refer to Table 1 for definitions of scales. SJI = situational judgment inventory; IPIP = International Personality Item Pool; E = Extraversion; A = Agreeableness; C = Conscientiousness; ES = Emotional Stability (the reverse of Neuroticism); O = Openness to Experience.


Table 4
Correlations Between Biodata and SJI Scales

                        SJI scale
Biodata scale       1    2    3    4    5    6    7    8    9   10   11   12
1. Knowledge       .27  .11  .03  .09  .10  .10  .16  .20  .19  .21  .31  .30
2. Learning        .25  .20  .16  .20  .17  .18  .16  .10  .07  .14  .19  .20
3. Artistic        .28  .31  .35  .33  .14  .17  .24  .07  .13  .20  .20  .18
4. Multicultural   .25  .30  .40  .35  .22  .20  .32  .08  .12  .12  .19  .26
5. Leadership      .24  .21  .12  .17  .26  .24  .24  .14  .15  .20  .28  .22
6. Interpersonal   .19  .10  .11  .19  .17  .17  .22  .08  .17  .16  .18  .26
7. Citizenship     .27  .18  .14  .21  .18  .18  .28  .10  .16  .10  .22  .25
8. Health          .24  .13  .07  .10  .09  .17  .13  .22  .13  .11  .26  .22
9. Career          .16  .13  .13  .14  .09  .13  .11  .09  .10  .15  .17  .19
10. Adaptability   .23  .17  .12  .13  .11  .14  .12  .14  .16  .16  .25  .25
11. Perseverance   .24  .11  .01  .10  .19  .15  .20  .13  .13  .15  .31  .24
12. Ethics         .28  .18  .13  .19  .17  .13  .23  .15  .06  .14  .21  .44
SJI composite      .41  .30  .25  .31  .27  .28  .34  .21  .23  .27  .39  .43

Note. N = 635. All correlations with magnitudes above .06 are statistically significant at p < .05; all correlations with magnitudes above .09 are statistically significant at p < .01. Refer to Table 1 for definitions of scales. SJI = situational judgment inventory. The SJI composite row reports correlations of the overall SJI composite with the 12 biodata scales.


Correlations With Outcome Measures
Table 6 presents correlations between the outcome measures, which reveal that all four outcomes provide distinct but related information regarding student performance. The highest correlation in magnitude, −.53, is between class absences and first-year GPA. The table also reports correlations of the biodata scales, the composite SJI measure, the SAT/ACT, and the Big Five with the primary outcome measures in our study. These outcomes include first-year GPA as reported by the university registrar’s office, a self-reported index of absenteeism from class, composite self-ratings of performance on the BARS measure, and composite peer ratings on the same BARS measure.
Several biodata scales, such as Knowledge, Health, and Adaptability, do correlate reasonably well with GPA. Several, including the Knowledge, Health, Adaptability, Perseverance, and Ethics scales, also predict class absences (r = −.15 to −.31). Their best correlations, however, are with the composite self-ratings. These higher correlations are probably due to a combination of two factors. First, the biodata measures were designed to reflect the same dimensions that were rated. Second, both the BARS and the biodata measures were completed by the participants. This is not true of the BARS peer ratings on the same dimensions, and correlations between the biodata scales and corresponding BARS peer ratings are lower than similar correlations with BARS self-ratings. Even so, several of the validity coefficients with the peer-rating BARS composite are above r = .20 (i.e., Knowledge, Leadership, Interpersonal, Citizenship, and Perseverance). As for the SJI composite, it is significantly and relatively highly correlated with the self-rating measures of student performance and absenteeism; correlations with GPA and the peer-rating BARS were relatively low. The SAT/ACT measure was correlated with GPA (r = .33), comparable with the uncorrected validity usually reported in the research summarized in the introduction. Conscientiousness was the only Big Five measure that demonstrated consistent criterion-related validity, with correlations ranging from .21 to .30 in absolute magnitude. All five personality scales were significantly related (p < .05) to the self-ratings measure, but again, these validities are likely inflated to some degree because both the predictors and the ratings come from the same source.

Table 6
College Performance Outcomes: Intercorrelations and Correlations With Predictors

Intercorrelations^a

Variable              M      SD     GPA   Absenteeism   Self-rating BARS
GPA                  3.02   0.69     —
Absenteeism          1.98   1.08   –.53       —
Self-rating BARS     4.88   0.71    .22     –.22              —
Peer-rating BARS     4.96   0.80    .29     –.16            –.10

Correlations between predictors and outcomes^b

Variable               GPA   Absenteeism   Self-rating BARS   Peer-rating BARS
Biodata
  Knowledge            .22      –.19             .47                .21
  Learning             .06       .00             .40                .06
  Artistic             .01       .07             .37                .09
  Multicultural        .08       .01             .38                .10
  Leadership           .14      –.04             .41                .20
  Interpersonal        .04      –.09             .25                .21
  Citizenship          .08      –.08             .39                .22
  Health               .24      –.23             .22                .11
  Career              –.02      –.07             .17               –.01
  Adaptability         .21      –.15             .24                .13
  Perseverance         .16      –.21             .45                .21
  Ethics               .14      –.31             .35                .02
SJI composite          .16      –.27             .53                .16
SAT/ACT                .33       .11            –.01                .09
Big Five
  Extraversion        –.03       .10             .24                .12
  Agreeableness        .10      –.05             .37                .06
  Conscientiousness    .21      –.27             .30                .22
  Emotional stability  .07      –.05             .15               –.08
  Openness             .03       .04             .35                .13

Note. GPA = grade point average; BARS = behaviorally anchored rating scale; SJI = situational judgment inventory.
^a N = 136. |r| > .17 are statistically significant (p < .05). ^b N = 611–636 for correlations with the first three criteria (GPA, absenteeism, and self-rating BARS), where |r| > .08 is statistically significant (p < .05). N = 136 for correlations with the peer-rating BARS composite, where |r| > .16 is statistically significant (p < .05).
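The significance thresholds reported in the table note follow directly from the critical value of a Pearson correlation at a given sample size. As a quick check (a minimal Python sketch for illustration, not part of the original analyses), the cutoffs can be reproduced from the Ns in the note:

```python
from math import sqrt
from scipy.stats import t

def critical_r(n, alpha=0.05):
    """Smallest |r| that reaches two-tailed significance at alpha for sample size n."""
    df = n - 2
    t_crit = t.ppf(1 - alpha / 2, df)
    return t_crit / sqrt(t_crit**2 + df)

for n in (136, 636):
    print(f"N = {n}: critical |r| ~ {critical_r(n):.3f}")
# N = 136 yields ~.17 (the intercorrelation and peer-rating thresholds);
# N = 636 yields ~.08 (the GPA, absenteeism, and self-rating thresholds).
```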


Development and Double Cross-Validation of Empirically Keyed Items
Identical empirical-keying procedures were carried out on both the SJI and the biodata measures, each at the item level. First, all cases were randomly split into two samples, such that the developmental sample had an N of 314 and the holdout sample had an N of 330 (these sample labels are arbitrary). Second, using the cases within each subgroup, all biodata and SJI items were correlated with three criteria: first-year GPA, absenteeism, and the summed composite of the self-report BARS. The peer-rating composite was not used in these analyses because the sample size was considered too small to allow for meaningful cross-validation (i.e., N = 147 for all items). Third, applying a minimum cutoff value to the distribution of these item–criterion correlations produced a single set of “empirically best” biodata and SJI items for each of the three criteria, resulting in three sets of items per sample. In all these instances, the cutoff values resulted in the selection of 20 to 40 items.
Empirical keys capitalize on chance if they are validated against criteria in the same sample used to derive them. To avoid this possibility, we applied double cross-validation procedures to our data. Each of the GPA-, absenteeism-, and BARS-based biodata and SJI item sets derived from the developmental sample was used in the holdout sample to form the corresponding item composite scores, which were then correlated with the GPA, absenteeism, and composite self-rating criteria. Note that these holdout validity estimates are shrunken estimates; they are lower because the data come from the holdout sample, whereas the items were empirically selected on the basis of data in the developmental sample. This cross-validation procedure was also applied in reverse, yielding two cross-validated estimates for each criterion: one based on the best items chosen from the developmental sample and one based on items chosen from the holdout sample. The two validities were then averaged to provide a single cross-validated estimate for each criterion.
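A minimal Python sketch of this double cross-validation logic appears below. It is illustrative only: the item responses and criterion are simulated, the cutoff value and the reverse-scoring of negatively correlated items are our assumptions, and the function names are ours rather than anything from the original scoring procedure.

```python
import numpy as np

def item_criterion_r(items, criterion):
    """Column-wise Pearson correlations of each item with the criterion."""
    return np.array([np.corrcoef(items[:, j], criterion)[0, 1]
                     for j in range(items.shape[1])])

def cross_validity(dev_items, dev_crit, hold_items, hold_crit, cutoff=0.10):
    """Key items in one sample, then validate the keyed composite in the other.
    Items are retained if |r| >= cutoff; negatively keyed items are
    reverse-scored (an assumption, not a detail from the article)."""
    r = item_criterion_r(dev_items, dev_crit)
    keyed = np.where(np.abs(r) >= cutoff)[0]
    composite = hold_items[:, keyed] @ np.sign(r[keyed])
    return np.corrcoef(composite, hold_crit)[0, 1]

# Simulated data standing in for the participants' item responses.
rng = np.random.default_rng(0)
X = rng.normal(size=(644, 60))
y = X[:, :10].sum(axis=1) + 3.0 * rng.normal(size=644)
order = rng.permutation(644)
dev, hold = order[:314], order[314:]

# Double cross-validation: estimate in both directions, then average
# the two shrunken validity estimates.
r1 = cross_validity(X[dev], y[dev], X[hold], y[hold])
r2 = cross_validity(X[hold], y[hold], X[dev], y[dev])
print(f"averaged double cross-validity: {(r1 + r2) / 2:.2f}")
```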
Procedures for selecting the empirically best items for each criterion are separate from those described in the double cross-validation, although they build on the sample-specific item–criterion correlations derived as part of the double cross-validation effort. The final sets of empirically keyed items for each criterion were chosen on the basis of the compound probabilities associated with the Pearson correlation of each item in both the developmental and the holdout samples. For each item, the item–criterion p value in the developmental sample was combined with the p value in the holdout sample using the compound probability formula provided by Guilford (1954, p. 440). The formula produces chi-square values representing the probability that the item–criterion correlation could occur by chance in both samples (a high chi-square indicates a low probability). Because a large number of items had compound probabilities that were highly significant, cutoffs were made more stringent so that a manageable number of best items would be retained. For biodata items keyed to GPA, the cutoff was a chi-square associated with p < .001.
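Assuming Guilford’s compound-probability formula corresponds to the familiar Fisher method (chi-square = –2 Σ ln p, with 2 degrees of freedom per combined p value), the computation can be sketched as follows; the example p values are hypothetical.

```python
from math import log
from scipy.stats import chi2

def compound_probability(p_dev, p_hold):
    """Combine two independent p values into a single chi-square test.
    chi-square = -2 * (ln p1 + ln p2), df = 4; a large chi-square means
    the item-criterion relationship is unlikely to be chance in BOTH samples."""
    chi_sq = -2.0 * (log(p_dev) + log(p_hold))
    return chi_sq, chi2.sf(chi_sq, df=4)

# Hypothetical item: p = .04 in the developmental sample, p = .02 in the holdout.
chi_sq, p_combined = compound_probability(0.04, 0.02)
print(f"chi-square = {chi_sq:.2f}, combined p = {p_combined:.4f}")
# An item would be retained only if its combined p cleared the stringent
# cutoff (e.g., p < .001 for biodata items keyed to GPA).
```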
Table 7 provides the average validities for each of the empirically derived scales for each of our three major outcomes. As can be seen, the validities were quite respectable, rivaling those of high school GPA and SAT/ACT scores in predicting the first-year GPA outcome.

Table 7
Criterion-Related Validities of Empirically Keyed Scales

                              Outcome measure
Variable          GPA   Absenteeism   Self-rating BARS (total)
SJI keys
  GPA             .23      –.31               .50
  Absenteeism     .20      –.33               .47
  BARS            .14      –.20               .51
Biodata keys
  GPA             .37      –.27               .47
  Absenteeism     .26      –.30               .50
  BARS            .15      –.13               .57

Note. N = 302–328. The validities are averaged double cross-validities as described in the text. All correlations are significant at p < .05. GPA = grade point average; BARS = behaviorally anchored rating scale; SJI = situational judgment inventory.

The empirically keyed biodata and SJI scales were also highly correlated with absenteeism. Finally, as expected, the best correlations were with the self-rating composite. Given that we used double cross-validation to develop these scales, it should be emphasized that these validity estimates do not capitalize on sampling error variance and are likely representative of what one would achieve with similar samples from the same population of students.
Multivariate Analyses
To this point, we have provided descriptive data and correlations describing the relationships among the experimental biodata and SJI measures, the outcome variables, and standard measures of scholastic competence (SAT/ACT) and personality. This section describes multivariate analyses of the relationships between these variables, indicating which of the new measures carry the most criterion-related validity and how they work in combination with traditional predictors of college student performance.
A series of hierarchical regressions tested the incremental validity of the biodata scales and the SJI composite. We did not use the best empirical composites of the biodata and SJI items developed and described in the previous section, because those composites were derived from the same sample used here, and their incremental relationships with the four outcomes would therefore capitalize on chance. In these regressions, SAT/ACT and personality were entered in Steps 1 and 2, respectively; in Step 3, the biodata and SJI measures were entered. The same hierarchical regressions were run for each of the four outcomes. The results of these regression analyses are presented in Table 8, including the R² for each step and the ΔR² for Steps 2 and 3.
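The hierarchical regression logic can be sketched as follows using statsmodels. The column names are hypothetical placeholders for the study’s measures, and the step structure simply mirrors the SAT/ACT, Big Five, and biodata/SJI entry order described above.

```python
import pandas as pd
import statsmodels.api as sm

def hierarchical_r2(data, outcome, steps):
    """Fit nested OLS models, adding each step's predictors cumulatively,
    and return (label, R-squared, R-squared increment) for each step."""
    results, predictors, prev_r2 = [], [], 0.0
    for label, cols in steps:
        predictors = predictors + cols
        model = sm.OLS(data[outcome], sm.add_constant(data[predictors])).fit()
        results.append((label, model.rsquared, model.rsquared - prev_r2))
        prev_r2 = model.rsquared
    return results

# Hypothetical column names standing in for the study's measures.
steps = [
    ("Step 1: SAT/ACT", ["sat_act"]),
    ("Step 2: Big Five", ["extraversion", "agreeableness", "conscientiousness",
                          "emotional_stability", "openness"]),
    ("Step 3: Biodata + SJI", ["knowledge", "health", "adaptability",
                               "perseverance", "ethics", "sji_composite"]),
]
# Given a DataFrame `data` containing these columns:
# for label, r2, delta in hierarchical_r2(data, "gpa", steps):
#     print(f"{label}: R2 = {r2:.3f}, delta R2 = {delta:.3f}")
```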
As expected, the SAT/ACT score predicted first-year GPA; it was also positively related to absenteeism, though substantially more weakly; however, it did not significantly predict either set of ratings. The Big Five measures added significantly to the prediction of all four outcomes, including GPA. The largest increment in prediction was observed for the self-ratings measure, part of which may be due to the fact that both sets of measures came from the same source. Of the personality scales, Conscientiousness was the most consistent predictor; only in the case of the self-rating was its regression weight nonsignificant. Because ab-
