Pearson

Assessments for Specialized Education Needs

 




 


Alphabetical Listing:





Ability Testing:

    The use of standardized tests to evaluate the current performance of a person in some defined domain of cognitive, psychomotor, or physical functioning.

Achievement Testing:

    A test to evaluate the extent of knowledge or skill attained by a test taker in a content domain in which the test taker had received instruction.

Age Equivalent:

    The chronological age in a defined population for which a given score is the median (middle) score. Thus, if children 10 years and 6 months of age have a median score of 17 on a test, the score 17 is said to have an age equivalent of 10-6 for that population.

Alternate Forms:

    Two or more versions of a test that are considered interchangeable, in that they measure the same constructs in the same ways, are intended for the same purposes, and are administered using the same directions. Alternate forms is a generic term used to refer to any of three categories. Parallel forms have equal raw score means, equal standard deviations, equal error structures, and equal correlations with other measures for any given population. Equivalent forms do not have the statistical similarity of parallel forms, but the dissimilarities in raw score statistics are compensated for in the conversions to derived scores or in form-specific norm tables. Comparable forms are highly similar in content, but the degree of statistical similarity has not been demonstrated.

Analytic Scoring:

    A method of scoring in which each critical dimension of performance is judged and scored separately, and the resultant values are combined for an overall score. In some instances, scores on the separate dimensions may also be used in interpreting performance. See holistic scoring.

Aptitude Test:

    A test that estimates future performance on other tasks not necessarily having evident similarity to the test tasks. Aptitude tests are often aimed at indicating an individual's readiness to learn or to develop proficiency in some particular area if education or training is provided. Aptitude tests sometimes do not differ in form or substance from achievement tests, but may differ in use and interpretation. See also ability test and achievement test.

Arithmetic Mean:

    A kind of average usually referred to as the mean. It is obtained by dividing the sum of a set of scores by their number.

Average:

    A general term applied to the various measures of central tendency. The three most widely used averages are the arithmetic mean (mean), the median, and the mode. When the term "average" is used without designation as to type, the most likely assumption is that it is the arithmetic mean.
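As a small illustration of the three averages named above, the following sketch uses Python's standard `statistics` module. The score list is hypothetical and chosen only so that the results are easy to check by hand.

```python
import statistics

# Hypothetical set of test scores, chosen only for illustration.
scores = [12, 15, 15, 17, 21]

mean_score = statistics.mean(scores)      # sum of the scores divided by their number
median_score = statistics.median(scores)  # middle score of the ranked set
mode_score = statistics.mode(scores)      # most frequently occurring score

print(mean_score, median_score, mode_score)
```

Here the single high score of 21 pulls the mean above the median, which is one reason the type of average should be stated when it matters.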

Battery:

    A group of several tests standardized on the same sample population so that results on the several tests are comparable. (Sometimes loosely applied to any group of tests administered together, even though not standardized on the same subjects.) The most common test batteries are those of school achievement, which include subtests in the separate learning areas.

Central Tendency:

    A measure of central tendency provides a single most typical score as representative of a group of scores; the "trend" of a group of measures as indicated by some type of average, usually the mean or the median.

Cognitive Assessment:

    The process of systematically gathering test scores and related data in order to make judgments about an individual's ability to perform various mental activities involved in the processing, acquisition, retention, conceptualization, and organization of sensory, perceptual, verbal, spatial, and psychomotor information.

Composite Score:

    A score that combines several scores according to a specified formula.

Confidence Interval:

    A sample-based estimate as an interval or range of values within which the true or target population value is expected to be located (with a specified level of confidence given as a percentage).
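A minimal sketch of one common way to build such an interval, here for a population mean. It assumes a normal approximation (sample mean plus or minus a critical value times the standard error); the function name and the sample scores are illustrative, and for very small samples a t-based interval would usually be preferred.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval for the population mean (illustrative)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean, using the sample standard deviation.
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value from the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error

sample = [10, 12, 14, 16, 18]  # hypothetical scores
low, high = mean_confidence_interval(sample)
print(f"95% interval for the mean: {low:.2f} to {high:.2f}")
```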

Construct:

    The concept or the characteristic that a test is designed to measure.

Construct Domain:

    The set of interrelated attributes (e.g., behaviors, attitudes, values) that are included under a construct's label. A test typically samples from this construct domain.

Construct Validity:

    A term used to indicate that the test scores are to be interpreted as indicating the test taker's standing on the psychological construct measured by the test. A construct is a theoretical variable inferred from multiple types of evidence, which might include the interrelations of the test scores with other variables, internal test structure, observations of response processes, as well as the content of the test. In the current standards, all test scores are viewed as measures of some construct, so the phrase is redundant with validity. The validity argument establishes the construct validity of a test.

Constructed Response Item:

    An exercise for which examinees must create their own responses or products rather than choose a response from an enumerated set. Short-answer items require a few words or a number as an answer, whereas extended-response items require at least a few sentences.

Content Validity:

    A term used in the 1974 Standards to refer to a kind or aspect of validity that was "required when the test user wishes to estimate how an individual performs in the universe of situations the test is intended to represent" (p. 28). In the 1985 Standards, the term was changed to content-related evidence emphasizing that it referred to one type of evidence within a unitary conception of validity. In the current Standards, this type of evidence is characterized as "evidence based on test content."

Correlation:

    The tendency for certain values or levels of one variable to occur with particular values or levels of another variable.

Correlation Coefficient:

    A measure of association between two variables that can range from -1.00 (perfect negative relationship) to 0 (no relationship) to +1.00 (perfect positive relationship).
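The sketch below computes the most widely used coefficient of this kind, the Pearson product-moment correlation, directly from its definition. The function name and the data are illustrative; the function assumes both variables actually vary (a constant variable would make the denominator zero).

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the two means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: y rises (or falls) in exact proportion to x.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_positive, r_negative)
```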

Criterion-Referenced Test:

    A test that allows its users to make score interpretations in relation to a functional performance level, as distinguished from those interpretations that are made in relation to the performance of others. Examples of criterion-referenced interpretations include comparison to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations.

Derived Score:

    A score to which raw scores are converted by numerical transformation (e.g., conversion of raw scores to percentile ranks or standard scores).

Decile:

    Any one of the nine points (scores) that divide a distribution into ten parts, each containing one-tenth of all the scores or cases; every tenth percentile. The first decile is the 10th percentile, the eighth decile the 80th percentile, etc.
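To make the "nine points" concrete, this sketch asks Python's `statistics.quantiles` for ten equal parts, which returns the nine cut points. The data set (the whole numbers 1 to 99) is hypothetical, and the exact values depend on the interpolation method the library uses by default.

```python
import statistics

# Hypothetical distribution: the scores 1 through 99, one case each.
scores = list(range(1, 100))

# n=10 splits the distribution into ten parts and returns the nine cut points.
deciles = statistics.quantiles(scores, n=10)

first_decile = deciles[0]   # corresponds to the 10th percentile
eighth_decile = deciles[7]  # corresponds to the 80th percentile
print(len(deciles), first_decile, eighth_decile)
```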

Diagnostic Test:

    A test used to "diagnose" or analyze; that is, to locate an individual's specific areas of weakness or strength, to determine the nature of his weaknesses or deficiencies, and, wherever possible, to suggest their cause. Such a test yields measures of the components or subparts of some larger body of information or skill. Diagnostic achievement tests are most commonly prepared for the skill subjects.

Distribution (Frequency Distribution):

    A tabulation of the scores (or other attributes) of a group of individuals to show the number (frequency) of each score, or of those within the range of each interval.
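A brief sketch of both kinds of tabulation described above: the frequency of each individual score, and the frequency within each interval. The scores, the interval width of 5, and the helper name are all illustrative.

```python
from collections import Counter

# Hypothetical scores for a small group.
scores = [3, 7, 7, 8, 12, 14, 14, 14, 19]

# Frequency of each individual score.
frequency = Counter(scores)

def interval_frequency(values, width):
    """Count values in intervals [0, width), [width, 2*width), and so on.

    Each key of the result is the lower bound of its interval.
    """
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

grouped = interval_frequency(scores, 5)
print(frequency[14], grouped)
```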

Expected Growth:

    The average amount of change in test scores that occurs over a specified time interval for individuals with certain individual characteristics such as age or grade level.

Factor:

  1. Any variable, real or hypothetical, that is an aspect of a concept or construct.
  2. In measurement theory, a statistical dimension defined by factor analysis.
  3. In mental measurement, a hypothetical trait, ability, or component of ability that underlies and influences performance on two or more tests and hence causes scores on tests to be correlated. The term "factor" strictly refers to a theoretical variable, derived by the process of factor analysis from a table of interrelations among tests. However, it is also used to denote the psychological interpretation given to the variable - i.e., the mental trait assumed to be represented by the variable, as verbal ability, numerical ability, etc.

Factor Analysis:

    Any of several statistical methods of describing the interrelationships of a set of variables by statistically deriving new variables, called factors, that are fewer in number than the original set of variables. Factor analysis reveals how much of the variation in each of the original measures arises from, or is associated with, each of the hypothetical factors. Factor analysis has contributed to an understanding of the organization or components of intelligence, aptitudes, and personality; and it has pointed the way to the development of "purer" tests of several components.

Generalizability Coefficient:

    A reliability index encompassing one or more independent sources of error. It is formed as the ratio of (a) the sum of variances that are considered components of test score variance in the setting under study to (b) the foregoing sum plus the weighted sum of variances attributable to various error sources in this setting. Such indices, which arise from the application of generalizability theory, are typically interpreted in the same manner as reliability coefficients.

Grade Equivalent:

    The school grade level for a given population for which a given score is the median score in that population. Grade Equivalent scores are useful primarily because of three characteristics: 1) they indicate the developmental level of the pupil's performance, 2) they may be averaged for the purpose of making group comparisons, and 3) they are suitable for measuring growth. For example, if a student obtains a grade equivalent score of 6.3 on a math test we would say that his raw score is equivalent to the average raw score obtained by students in the norm group who were in their third month of the sixth grade. A grade equivalent score does not equate to performance in the classroom. Grade equivalents are the first step in the further analysis of raw data. All subsequent statistics are directly related to the grade equivalent.

Holistic Scoring:

    A method of obtaining a score on a test, or a test item, based on a judgement of overall performance using specified criteria. In holistic scoring, raters evaluate the effectiveness of responses in terms of a set of overall descriptions of categories relevant for responses to the task -- be it a written response, an oral response, or some other performance task (i.e., constructed response). The scoring process is holistic in that the score assigned to an examinee's performance reflects the overall effectiveness of the examinee response.

Internal Consistency Coefficient:

    An index of the reliability of test scores derived from the statistical interrelationships among item responses or among scores on separate parts of a test.

Median (Md):

    The middle score in a distribution or set of ranked scores; the point (score) that divides the group into two equal parts; the 50th percentile. Half of the scores are below the median and half above it, except when the median itself is one of the obtained scores.

Mode:

    The score or value that occurs most frequently in a distribution.

N:

    The symbol commonly used to represent the number of cases in a group.

Normal Curve Equivalents (NCE):

    Normal Curve Equivalents are normalized standard scores with a mean of 50 and a standard deviation of 21.06. The range of NCEs is from a score of 1 corresponding to a percentile rank of 1.0 to a score of 99 corresponding to a percentile rank of 99.0. NCEs have little direct normative meaning to the typical user. To interpret NCEs it is necessary to relate them to other status scores based on a single reference group such as percentile ranks or stanines. For those who are accustomed to interpreting stanines, NCEs may be thought of as roughly equivalent to stanines to one decimal place. For example, an NCE of 73 may be interpreted as a stanine of 7.3. The main advantage of NCEs is that they are derived through the use of comparable procedures by the publishers of the various tests used in federal projects. NCEs used in federal evaluation must be based on empirically established norms for a particular grade and time of year. This leads to standardization and comparability of reporting procedures. This does not mean that results from different test batteries are interchangeable, however. Tests differ in content, and norms are based on different samples tested at different points in time.
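The relationship between percentile ranks and NCEs can be sketched as follows, assuming the usual construction: convert the percentile rank to a normal deviate, then rescale to a mean of 50 and a standard deviation of 21.06. The function name is illustrative, and published norm tables, not this formula, are what a test user would actually consult.

```python
from statistics import NormalDist

def percentile_rank_to_nce(percentile_rank):
    """Approximate NCE for a percentile rank strictly between 0 and 100."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must be between 0 and 100")
    # Normal deviate (z) that has this proportion of the distribution below it.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    # Rescale to the NCE metric: mean 50, standard deviation 21.06.
    return 50 + 21.06 * z

for pr in (1, 25, 50, 75, 99):
    print(pr, round(percentile_rank_to_nce(pr), 1))
```

The two scales agree at 1, 50, and 99 by design, but differ in between because NCEs are equal-interval units while percentile ranks are not.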

Normalized Standard Score:

    A derived test score in which a numerical transformation has been chosen so that the score distribution closely approximates a normal distribution, for some specific population.

Norm-Referenced Test Interpretation:

    A score interpretation based on a comparison of a test taker's performance to the performance of other people in a specified reference population.

Norms:

    Statistics or tabular data that summarize the distribution of test performance for one or more specified groups, such as test takers of various ages or grades. Norms are usually designed to represent some larger population, such as test takers throughout the country. The group of examinees represented by the norms is referred to as the reference population.

Objective Mastery:

    These are generally associated with criterion-referenced testing though many norm-referenced tests report this information. Items are written measuring particular objectives. If enough items measuring a specific objective are answered correctly, then objective mastery is concluded. Some norm-referenced tests are written in a criterion-referenced mode so that categories of objectives can be measured. The degrees of mastery of these category objectives are reported as objective mastery.

    All three terms (norm-referenced, criterion-referenced, and objective-based) have been used as adjectives to apply to tests, purposes, and interpretations. Even though most criterion-referenced interpretations involve the use of skill or item norms, subjective standards are also important. Differences in ability levels within groups of pupils call for different standards and expectations. Discrepancies between expected and actual performance should be evaluated and interpreted in light of local visions for developing the particular skill.

    It should also be noted that the difficulty of a given item depends not only on the inherent difficulty of the skill tested, but also on 1) the level of mastery required by the item; 2) the setting in which the item was placed; 3) the attractiveness of the distractors, etc. For example, an item that 80% of the students in a given school answer correctly may represent a skill that is extremely important for all pupils and that should require immediate attention. On the other hand, an item that 40% of the pupils answer correctly may represent a difficult concept, with an item norm of 30% or so, that only the most able and talented pupils should be expected to master.

Out-of-Level Testing:

    Administering a test that is designed primarily for people of an age or grade level above or below that of the test taker.

p-value:

    The p-value is a test item statistic that represents the percentage of students who answered that item correctly out of a particular population group. A p-value may be calculated for a national standardization sample or for a class or school level population where a test has been administered. The p-value is calculated by dividing the number of correct responses on an item by the total number of students tested. It may be expressed as a decimal value or as a percentage (by multiplying the decimal value by 100).
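The calculation described above is a simple proportion. In this sketch the function name and the ten responses are hypothetical; each response is recorded as True for a correct answer and False otherwise.

```python
def item_p_value(responses):
    """Proportion of test takers who answered the item correctly."""
    if not responses:
        raise ValueError("at least one response is required")
    number_correct = sum(1 for r in responses if r)
    return number_correct / len(responses)

# Hypothetical class of ten students; seven answered the item correctly.
responses = [True, True, False, True, False, True, True, False, True, True]

p = item_p_value(responses)
print(f"p-value: {p:.2f} ({p * 100:.0f}%)")
```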

Percentile Rank:

    Most commonly, the percentage of scores in a specified distribution that fall below the point at which a given score lies. Sometimes the percentage is defined to include scores that fall at the point; sometimes the percentage is defined to include half of the scores at the point. Percentile ranks indicate the status or relative standing of a pupil in comparison to other pupils. The percentile rank tells the percent of pupils in a particular norm group who obtain lower scores; thus, for example, if Ann earns a percentile rank of 70 on a particular test, it means she scored better than 70 percent of the pupils in the norm group and 30 percent scored as well as or better than she did. The scale goes from 1 to 99 percent. If three points are used to divide the scale into four equal quarters, the points are called quartiles: quartile one, quartile two, and quartile three. Quartiles are points, not areas. A score does not fall in a quartile. A score can be above, at, or below a quartile, or between two quartiles. There is no fourth quartile. Many people have the misconception that since there are four quarters there should also be four quartiles, but this is not the case: the three quartiles divide the scale into four areas.
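Quartiles on the percentile scale:

    99th percentile
    75th percentile    quartile three
    50th percentile    quartile two (median)
    25th percentile    quartile one
    1st percentile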
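The three ways of counting scores that fall exactly at the point (exclude them, include them, or include half of them) can be compared side by side. This is a sketch only; the function name, the `at_score` option names, and the norm group are hypothetical.

```python
def percentile_rank(score, scores, at_score="exclude"):
    """Percentile rank of `score` within `scores`.

    at_score controls how scores equal to `score` are counted:
    "exclude" counts none of them, "half" counts half, "include" counts all.
    """
    weights = {"exclude": 0.0, "half": 0.5, "include": 1.0}
    if at_score not in weights:
        raise ValueError("at_score must be 'exclude', 'half', or 'include'")
    below = sum(1 for s in scores if s < score)
    at = sum(1 for s in scores if s == score)
    return 100 * (below + weights[at_score] * at) / len(scores)

norm_group = [10, 20, 20, 30, 40]  # hypothetical norm-group scores

pr_exclude = percentile_rank(20, norm_group, "exclude")
pr_half = percentile_rank(20, norm_group, "half")
pr_include = percentile_rank(20, norm_group, "include")
print(pr_exclude, pr_half, pr_include)
```

With tied scores the three conventions can differ noticeably, so a report should say which one was used.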

Power Test:

    A test intended to measure level of performance unaffected by speed of response; hence one in which there is either no time limit or a very generous one. Items are usually arranged in order of increasing difficulty.

Predictive Validity:

    A term used in the 1974 Standards to refer to a type of "criterion-related validity" that applies "when one wishes to infer from a test score an individual's most probable standing on some other variable called a criterion" (p. 26). In the 1985 Standards, the term criterion-related validity was changed to criterion-related evidence, emphasizing that it referred to one type of evidence within a unitary conception of validity. The current Standards document refers to "evidence based on relations to other variables" that include "test-criterion relationships." Predictive evidence indicates how accurately test data can predict criterion scores that are obtained at a later time.

Profile:

    A graphic representation of an individual's scores (or their relative magnitudes) on several tests (or subtests) that employ a single standard scale. See also battery.

Raw Score:

    A raw score is the number of items answered correctly on a given test. For example, if a test had 59 items and the student got 23 correct the raw score would be 23. Raw scores by themselves have little or no meaning. Raw scores are converted to 1) developmental scores such as grade equivalents or 2) status scores such as percentile rank, normal curve equivalents, or stanines in order to be interpreted meaningfully.

Reference Population:

    The population of test takers represented by test norms. The sample on which the test norms are based must permit accurate estimation of the test score distribution for the reference population. The reference population may be defined in terms of examinee age, grade, or clinical status at time of testing, or other characteristics.

Reliability:

    The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable, and repeatable for an individual test taker; the degree to which scores are free of errors of measurement for a given group.

Reliability Coefficient:

    A coefficient of correlation between two administrations of a test. The conditions of administration may involve variation in test forms, raters or scorers, or passage of time. These and other changes in conditions give rise to qualifying adjectives being used to describe the particular coefficient, e.g., parallel-form reliability, rater reliability, test-retest reliability, etc.

Scale:

    1. The system of numbers, and their units, by which a value is reported on some dimension of measurement. Length can be reported in the English system of feet and inches or in the metric system of meters and centimeters. 2. In testing, scale sometimes refers to the set of items or subtests used in the measurement and is distinguished from a test in the type of characteristic being measured. One speaks of a test of verbal ability, but a scale of extroversion-introversion.

Scaled Score:

    A scaled score is a score derived from the original raw score on a test. A scaled score carries mathematical properties that allow these scores to be examined in a variety of ways. Generally, it can be said that the scaled scores have a wide range and are equally intervaled. There is the same distance from one scaled score unit to the next across the entire scale. However, this does not mean that there is an equal scaled score interval between two raw score units.

Scaling:

    The process of creating a scale or a scaled score. Scaling may enhance test score interpretation by placing scores from different tests or test forms onto a common scale or by producing scale scores designed to support criterion-referenced or norm-referenced score interpretations.

Score:

    Any specific number resulting from the assessment of an individual; a generic term applied for convenience to such diverse measures as test scores, estimates of latent variables, production counts, absence records, course grades, ratings, and so forth.

Speed Test:

    A test in which performance is measured by the number of tasks performed in a given time. Examples are tests of typing speed and reading speed. Also, a test scored for accuracy where the test taker works under time pressure.

Speededness:

    A test characteristic, dictated by the test's time limits, that results in a test taker's score being dependent on the rate at which work is performed as well as the correctness of the responses. The term is not used to describe tests of speed. Speededness is often an undesirable characteristic.

Split-Halves Reliability Coefficient:

    An internal consistency coefficient obtained by using half the items on the test to yield one score and the other half of the items to yield a second, independent score. The correlation between the scores on these two half-tests, adjusted via the Spearman-Brown formula, provides an estimate of the alternate-form reliability of the total test. The Spearman-Brown formula is a formula derived within classical test theory that projects the reliability of a shortened or lengthened test from the reliability of a test of specified length.
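The Spearman-Brown adjustment mentioned above can be written as r_new = k * r / (1 + (k - 1) * r), where r is the reliability of the existing test and k is the factor by which its length changes. For a split-half correlation, k is 2. The function name and the example correlation of 0.60 are illustrative.

```python
def spearman_brown(reliability, length_factor=2.0):
    """Project reliability for a test `length_factor` times as long.

    length_factor=2 steps a half-test correlation up to full-test length;
    values below 1 project the reliability of a shortened test.
    """
    r, k = reliability, length_factor
    return k * r / (1 + (k - 1) * r)

half_test_correlation = 0.60  # hypothetical correlation between the two halves
full_test_estimate = spearman_brown(half_test_correlation)
print(round(full_test_estimate, 2))
```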

Standard Deviation (S.D.):

    A measure of the variability or dispersion of a distribution of scores. The most widely used measure of dispersion of a frequency distribution. It is equal to the positive square root of the population variance. The more the scores cluster around the mean, the smaller the standard deviation. For a normal distribution, approximately two thirds (68.3 percent) of the scores are within the range from one S.D. below the mean to one S.D. above the mean. Computation of the S.D. is based upon the square of the deviation of each score from the mean. The S.D. is sometimes called "sigma" and is represented by the symbol σ.
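A short worked computation may help: square each score's deviation from the mean, average those squares to get the population variance, and take the positive square root. The scores are hypothetical, and the result is checked against `statistics.pstdev`, the standard library's population standard deviation.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

mean = sum(scores) / len(scores)
# Squared deviation of each score from the mean.
squared_deviations = [(s - mean) ** 2 for s in scores]
# Population variance: the average squared deviation.
variance = sum(squared_deviations) / len(scores)
# Standard deviation: positive square root of the variance.
standard_deviation = math.sqrt(variance)

print(mean, variance, standard_deviation)
print(statistics.pstdev(scores))  # library result for comparison
```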

Standard Error of Measurement:

    The standard deviation of an individual's observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data cannot generally be collected, the standard error of measurement is usually estimated from group data.
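One common classical-test-theory estimate from group data, which this glossary does not itself state, is SEM = SD * sqrt(1 - reliability). The sketch below applies it; the function name, the standard deviation of 15, and the reliability of 0.91 are assumptions made purely for illustration.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Estimate the SEM from a group SD and a reliability coefficient."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=15, reliability=0.91)

# A band of one SEM on either side of a hypothetical observed score of 100.
observed = 100
band = (observed - sem, observed + sem)
print(round(sem, 2), tuple(round(b, 2) for b in band))
```

The more reliable the test, the smaller the SEM and the narrower the band around an observed score.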

Standard Score:

    A type of derived score such that the distribution of these scores for a specified population has convenient, known values for the mean and standard deviation. The term is sometimes used to signify a mean of 0.0 and a standard deviation of 1.0.

Standardization:

    1. In test administration, maintaining a constant testing environment and conducting the test according to detailed rules and specifications, so that testing conditions are the same for all test takers. 2. In test development, establishing scoring norms based on the test performance of a representative sample of individuals with which the test is intended to be used. 3. In statistical analysis, transforming a variable so that its standard deviation is 1.0 for some specified population or sample.

Stanine Scores:

    Stanine scores are normalized standard scores with a range of 1 to 9, a mean of five, and a standard deviation of two. Like percentile ranks they are status scores within a particular norm group. The first stanine is the lowest scoring group and the 9th stanine is the highest scoring group. Advocates of stanine reporting cite the fact that the single digit scale is simple and convenient to use and that its use minimizes the apparent importance of small score differences. On the other hand, the stanine scale may be regarded as unnecessarily coarse, particularly for relatively reliable tests. For example, all pupils scoring between the 40th and 60th percentiles are assigned a stanine of 5. However, a pupil scoring at the 59th percentile, which is in stanine 5, is probably much more similar in achievement level to a pupil scoring at the 61st percentile, stanine 6, than to one at the 41st, stanine 5. In some instances the width of the stanine band exceeds the standard error of measurement. Another reservation about the use of stanine scores is that there is evidence that skills development in the elementary schools is more variable in subjects such as reading, in which the pupils have many opportunities for advancing "on their own," than it is in subjects such as mathematics, in which pupil progress is more rigidly controlled through placement of concepts and processes in the curriculum. The distribution of percentages from low to high is as follows:

Distribution of Percentages from low to high

    Stanine    Percentage of cases    Description
    1           4%                    Low
    2           7%                    Low
    3          12%                    Low Average
    4          17%                    Average
    5          20%                    Average
    6          17%                    Average
    7          12%                    High Average
    8           7%                    High
    9           4%                    High
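Accumulating the percentages above gives percentile cut points of 4, 11, 23, 40, 60, 77, 89, and 96, which can be used to look up a stanine from a percentile rank. In this sketch the function name is illustrative, and the rule that a percentile rank falling exactly on a cut point goes to the lower stanine is an assumption; a test's own norm tables take precedence.

```python
import bisect

# Upper percentile boundary of stanines 1 through 8 (stanine 9 is everything above).
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_rank_to_stanine(percentile_rank):
    """Look up the stanine for a percentile rank from 1 to 99."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    # Assumed convention: a rank equal to a cut point stays in the lower stanine.
    return bisect.bisect_left(STANINE_CUTS, percentile_rank) + 1

for pr in (41, 59, 61):
    print(pr, percentile_rank_to_stanine(pr))
```

The printed values reproduce the example in the definition: the 41st and 59th percentiles share stanine 5, while the 61st falls in stanine 6.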

T-Score:

    A derived score on a scale having a mean score of 50 units and a standard deviation of 10 units.

Test Modification:

    Changes made in the content, format, and/or administration procedure of a test in order to accommodate test takers who are unable to take the original test under standard test conditions.

Test-Retest Reliability:

    A reliability coefficient obtained by administering the same test a second time to the same group after a time interval and correlating the two sets of scores.

Test-Retest Reliability Coefficient:

    A type of reliability coefficient obtained by administering the same test a second time, after a short interval, and correlating the two sets of scores. "Same test" was originally understood to mean identical content, i.e., the same form; currently, however, the term "test-retest" is also used to describe the administration of different forms of the same test, in which case this reliability coefficient becomes the same as the alternate form coefficient. In either case (1) fluctuations over time and in testing situation, and (2) any effect of the first test upon the second are involved. When the time interval between the two testings is considerable, as several months, a test-retest reliability coefficient reflects not only the consistency of measurement provided by the test, but also the stability of the examinee trait being measured.

Timed Tests:

    A test administered to a test taker who is allotted a strictly prescribed amount of time to respond to the test.

True Score:

    In classical test theory, the average of the scores that would be earned by an individual on an unlimited number of perfectly parallel forms of the same test. In item response theory, the error-free value of test taker proficiency, usually symbolized by θ.

Validation:

    The process through which the validity of the proposed interpretation of test scores is investigated.

Validity:

    The degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test. The capacity of a measuring instrument to predict what it was designed to predict; stated most often in terms of the correlation between values on the instrument and measures of performance on some criterion.

Variance:

    A measure of variability; the average squared deviation from the mean; the square of the standard deviation.

Z-score:

    A type of standard score scale in which the mean equals zero and the standard deviation equals one unit for the group used in defining the scale.
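The z-score and the T-score defined earlier are linear transformations of one another, as this sketch shows: a z-score expresses a raw score as a number of standard deviations from the group mean, and a T-score rescales it to a mean of 50 and a standard deviation of 10. The function names and the raw score, mean, and standard deviation are hypothetical.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations that `raw` lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mean) / sd

def t_score(z):
    """Rescale a z-score to the T-score metric (mean 50, SD 10)."""
    return 50 + 10 * z

# Hypothetical norm group: mean raw score 28, standard deviation 4.
z = z_score(raw=34, mean=28, sd=4)
t = t_score(z)
print(z, t)
```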
