Ability Testing:
The use of standardized tests to evaluate the current performance of
a person in some defined domain of cognitive, psychomotor, or physical
functioning.
Achievement Testing:
A test to evaluate the extent of knowledge or skill attained by a test
taker in a content domain in which the test taker had received instruction.
Age Equivalent:
The chronological age in a defined population for which a given score
is the median (middle) score. Thus, if children 10 years and 6 months
of age have a median score of 17 on a test, the score 17 is said to
have an age equivalent of 10-6 for that population.
Alternate Forms:
Two or more versions of a test that are considered interchangeable, in
that they measure the same constructs in the same ways, are intended
for the same purposes, and are administered using the same directions. Alternate
forms is a generic term used to refer to any of three categories. Parallel
forms have equal raw score means, equal standard deviations, equal
error structures, and equal correlations with other measures for any
given population. Equivalent forms do not have the statistical
similarity of parallel forms, but the dissimilarities in raw score
statistics are compensated for in the conversions to derived scores
or in form-specific norm tables. Comparable forms are highly
similar in content, but the degree of statistical similarity has not
been demonstrated.
Analytic Scoring:
A method of scoring in which each critical dimension of performance is
judged and scored separately, and the resultant values are combined
for an overall score. In some instances, scores on the separate dimensions
may also be used in interpreting performance. See holistic scoring.
Aptitude Test:
A test that estimates future performance on other tasks not necessarily
having evident similarity to the test tasks. Aptitude tests are often
aimed at indicating an individual's readiness to learn or to develop
proficiency in some particular area if education or training is provided.
Aptitude tests sometimes do not differ in form or substance from achievement
tests, but may differ in use and interpretation. See also ability test
and achievement test.
Battery:
A group of several tests standardized on the same sample population so
that results on the several tests are comparable. (Sometimes loosely
applied to any group of tests administered together, even though not
standardized on the same subjects.) The most common test batteries
are those of school achievement, which include subtests in the separate
learning areas.
Cognitive Assessment:
The process of systematically gathering test scores and related data
in order to make judgments about an individual's ability to perform
various mental activities involved in the processing, acquisition,
retention, conceptualization, and organization of sensory, perceptual,
verbal, spatial, and psychomotor information.
Composite Score:
A score that combines several scores according to a specified formula.
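As a minimal sketch of one possible "specified formula," the Python snippet below computes a composite as a weighted sum of subtest scores. The function name, subtest names, and weights are hypothetical; an actual test defines its own formula.

```python
def composite_score(scores, weights):
    """Combine several scores into one using a weighted sum (an assumed formula)."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same subtests")
    return sum(scores[name] * weights[name] for name in scores)


# Hypothetical subtest scores and weights, for illustration only.
scores = {"reading": 52.0, "math": 48.0, "language": 50.0}
weights = {"reading": 0.5, "math": 0.25, "language": 0.25}
print(composite_score(scores, weights))
```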
Construct:
The concept or the characteristic that a test is designed to measure.
Construct Domain:
The set of interrelated attributes (e.g., behaviors, attitudes, values)
that are included under a construct's label. A test typically samples
from this construct domain.
Constructed Response Item:
An exercise for which examinees must create their own responses or products
rather than choose a response from an enumerated set. Short-answer
items require a few words or a number as an answer, whereas extended-response
items require at least a few sentences.
Criterion-Referenced Test:
A test that allows its users to make score interpretations in relation
to a functional performance level, as distinguished from those
interpretations that are made in relation to the performance of others.
Examples of criterion-referenced interpretations include comparison
to cut scores, interpretations based on expectancy tables, and domain-referenced
score interpretations.
Derived Score:
A score to which raw scores are converted by numerical transformation
(e.g., conversion of raw scores to percentile ranks or standard scores).
Decile:
Any one of the nine points (scores) that divide a distribution into ten
parts, each containing one-tenth of all the scores or cases; every
tenth percentile. The first decile is the 10th percentile, the eighth
decile the 80th percentile, etc.
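The nine cut points can be illustrated with a short Python sketch. The score list is made up for illustration, and the standard library's `statistics.quantiles` uses one particular interpolation convention; other conventions give slightly different cut points.

```python
import statistics

# Hypothetical raw scores, for illustration only.
scores = [12, 15, 17, 18, 21, 22, 24, 25, 27, 29,
          30, 33, 35, 38, 41, 44, 47, 50, 53, 58]

# quantiles(n=10) returns the 9 cut points that divide the data into 10 parts.
deciles = statistics.quantiles(scores, n=10)

for k, cut in enumerate(deciles, start=1):
    print(f"decile {k} (percentile {10 * k}): {cut}")
```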
Diagnostic Test:
A test used to "diagnose" or analyze; that is, to locate an individual's
specific areas of weakness or strength, to determine the nature of his
weaknesses or deficiencies, and, wherever possible, to suggest their
cause. Such a test yields measures of the components or subparts of some
larger body of information or skill. Diagnostic achievement tests are
most commonly prepared for the skill subjects.
Expected Growth:
The average amount of change in test scores that occurs over a specified
time interval for individuals with certain individual characteristics
such as age or grade level.
Factor:
- Any variable, real or hypothetical, that is an aspect of a concept
or construct.
- In measurement theory, a statistical dimension defined by factor
analysis.
- In mental measurement, a hypothetical trait, ability, or component
of ability that underlies and influences performance on two or more
tests and hence causes scores on tests to be correlated. The term "factor" strictly
refers to a theoretical variable, derived by the process of factor
analysis from a table of interrelations among tests. However, it is
also used to denote the psychological interpretation given to the variable
- i.e., the mental trait assumed to be represented by the variable,
as verbal ability, numerical ability, etc.
Factor Analysis:
Any of several statistical methods of describing the interrelationships
of a set of variables by statistically deriving new variables, called
factors, that are fewer in number than the original set of variables.
Factor analysis reveals how much of the variation in each of the original
measures arises from, or is associated with, each of the hypothetical
factors. Factor analysis has contributed to an understanding of the
organization or components of intelligence, aptitudes, and personality;
and it has pointed the way to the development of "purer" tests of several
components.
Grade Equivalent:
The school grade level for a given population for which a given score
is the median score in that population. Grade Equivalent scores are
useful primarily because of three characteristics: 1) they indicate
the developmental level of the pupil's performance, 2) they may be
averaged for the purpose of making group comparisons, and 3) they are
suitable for measuring growth. For example, if a student obtains a
grade equivalent score of 6.3 on a math test we would say that his
raw score is equivalent to the average raw score obtained by students
in the norm group who were in their third month of the sixth grade.
A grade equivalent score does not equate to performance in the classroom.
Grade equivalents are the first step in the further analysis of raw
data. All subsequent statistics are directly related to the grade equivalent.
Holistic Scoring:
A method of obtaining a score on a test, or a test item, based on a judgement
of overall performance using specified criteria. In holistic scoring,
raters evaluate the effectiveness of responses in terms of a set of
overall descriptions of categories relevant for responses to the task
-- be it a written response, an oral response, or some other performance
task (i.e., constructed response). The scoring process is holistic
in that the score assigned to an examinee's performance reflects the
overall effectiveness of the examinee response.
Normal Curve Equivalents (NCE):
Normal Curve Equivalents are normalized standard scores with a mean of
50 and a standard deviation of 21.06. The range of NCEs is from a score
of 1 corresponding to a percentile rank of 1.0 to a score of 99 corresponding
to a percentile rank of 99.0. NCEs have little direct normative meaning
to the typical user. To interpret NCEs it is necessary to relate them
to other status scores based on a single reference group such as percentile
ranks or stanines. For those who are accustomed to interpreting stanines,
NCEs may be thought of as roughly equivalent to stanines to one decimal
place. For example, an NCE of 73 may be interpreted as a stanine of
7.3. The main advantage of NCEs is that they are derived through the
use of comparable procedures by the publishers of the various tests
used in federal projects. NCEs used in federal evaluation must be based
on empirically established norms for a particular grade and time of
year. This leads to standardization and comparability of reporting
procedures. This does not mean that results from different test batteries
are interchangeable, however. Tests differ in content, and norms are
based on different samples tested at different points in time.
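The relationship between percentile ranks and NCEs can be sketched in Python, assuming the usual construction in which a percentile rank is first converted to a normal deviate and then rescaled to a mean of 50 and a standard deviation of 21.06. The function name is hypothetical, and published NCE tables may round differently.

```python
from statistics import NormalDist

NCE_MEAN = 50.0
NCE_SD = 21.06  # mean and standard deviation as given in the definition above


def percentile_rank_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must be strictly between 0 and 100")
    # Normal deviate (z) that has this proportion of the normal curve below it.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return NCE_MEAN + NCE_SD * z


for pr in (1, 50, 99):
    print(pr, round(percentile_rank_to_nce(pr), 1))
```

With this construction, NCEs and percentile ranks agree at 1, 50, and 99 but differ in between, because NCEs are spaced in equal units of the normal deviate rather than in equal percentages of cases.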
Normalized Standard Score:
A derived test score in which a numerical transformation has been chosen
so that the score distribution closely approximates a normal distribution,
for some specific population.
Norm-Referenced Test Interpretation:
A score interpretation based on a comparison of a test taker's performance
to the performance of other people in a specified reference population.
Norms:
Statistics or tabular data that summarize the distribution of test performance
for one or more specified groups, such as test takers of various ages
or grades. Norms are usually designed to represent some larger population,
such as test takers throughout the country. The group of examinees
represented by the norms is referred to as the reference population.
Objective Mastery:
These are generally associated with criterion-referenced testing though
many norm-referenced tests report this information. Items are written
measuring particular objectives. If enough items measuring a specific
objective are answered correctly, then objective mastery is concluded.
Some norm-referenced tests are written in a criterion-referenced mode
so that categories of objectives can be measured. The degrees of mastery
of these category objectives are reported as objective mastery.
All three terms (norm-referenced, criterion-referenced, and objective-based) have
been used as adjectives to apply to tests, purposes, and interpretations.
Even though most criterion-referenced interpretations involve the use
of skill or item norms, subjective standards are also important. Differences
in ability levels within groups of pupils call for different standards
and expectations. Discrepancies between expected and actual performance
should be evaluated and interpreted in light of local visions for developing
the particular skill.
It should also be noted that the difficulty of a given item depends not only
on the inherent difficulty of the skill tested, but also on 1) the level
of mastery required by the item; 2) the setting in which the item was
placed; 3) the attractiveness of the distractors, etc. For example, an
item that 80% of the students in a given school answer correctly may
represent a skill that is extremely important for all pupils and that
should require immediate attention. On the other hand, an item that 40%
of the pupils answer correctly may represent a difficult concept, one
with an item norm of 30% or so, that only the most able and talented pupils
should be expected to master.
Out-of-Level Testing:
Administering a test that is designed primarily for people of an age
or grade level above or below that of the test taker.
Percentile Rank:
Most commonly, the percentage of scores in a specified distribution that
fall below the point at which a given score lies. Sometimes the percentage
is defined to include scores that fall at the point; sometimes the percentage
is defined to include half of the scores at the point. Percentile ranks
indicate the status or relative standing of a pupil in comparison to
other pupils. The percentile rank tells the percent of pupils in a
particular norm group who obtain lower scores; thus, for example, if
Ann earns a percentile rank of 70 on a particular test it means she
scored better than 70 percent of the pupils in the norm group and 30
percent scored as well as or better than she did. The scale goes from 1 to
99 percent. If three points are used to divide the scale into four
equal quarters the points are called quartiles; quartile one, quartile
two, and quartile three. Quartiles are points, not areas. A score does
not fall in a quartile. A score can be above, at, or below a quartile.
A score can fall between two quartiles. There is no fourth quartile.
Many people have the misconception that since there are four quarters
that there should also be four quartiles, but this is not the case.
Quartiles are points, not areas, so there are four areas divided by
the three quartiles, but there are not four quartiles.
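Quartiles (shown on the percentile scale, from high to low):
99th percentile
75th percentile (quartile three)
50th percentile (quartile two, the median)
25th percentile (quartile one)
1st percentile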
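The three counting conventions mentioned in this entry (scores below the point, scores at or below the point, and scores below plus half of those at the point) can be compared with a small Python sketch. The function name, the convention labels, and the norm-group scores are hypothetical.

```python
def percentile_rank(scores, score, convention="below"):
    """Percentage of scores in `scores` that fall below `score`.

    convention:
      "below"       - count only scores strictly below the point
      "at_or_below" - also count scores that fall at the point
      "midpoint"    - count half of the scores that fall at the point
    """
    below = sum(1 for s in scores if s < score)
    at = sum(1 for s in scores if s == score)
    if convention == "below":
        count = below
    elif convention == "at_or_below":
        count = below + at
    elif convention == "midpoint":
        count = below + 0.5 * at
    else:
        raise ValueError(f"unknown convention: {convention}")
    return 100.0 * count / len(scores)


# Hypothetical norm-group scores, for illustration only.
norm_group = [10, 12, 12, 15, 17, 17, 17, 20, 22, 25]
for convention in ("below", "at_or_below", "midpoint"):
    print(convention, percentile_rank(norm_group, 17, convention))
```

The three conventions can give noticeably different results when many scores are tied at the point, which is why a norm table should state which definition it uses.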
Power Test:
A test intended to measure level of performance unaffected by speed of
response; hence one in which there is either no time limit or a very
generous one. Items are usually arranged in order of increasing difficulty.
Profile:
A graphic representation of an individual's scores (or their relative
magnitudes) on several tests (or subtests) that employ a single standard
scale. See also battery.
Raw Score:
A raw score is the number of items answered correctly on a given test.
For example, if a test had 59 items and the student got 23 correct
the raw score would be 23. Raw scores by themselves have little or
no meaning. Raw scores are converted to 1) developmental scores such
as grade equivalents or 2) status scores such as percentile rank, normal
curve equivalents, or stanines in order to be interpreted meaningfully.
Reference Population:
The population of test takers represented by test norms. The sample on
which the test norms are based must permit accurate estimation of the
test score distribution for the reference population. The reference
population may be defined in terms of examinee age, grade, or clinical
status at time of testing, or other characteristics.
Scaled Score:
A scaled score is a score derived from the original raw score on a test.
A scaled score carries mathematical properties that allow these scores
to be examined in a variety of ways. Generally, it can be said that
the scaled scores have a wide range and are equally intervaled. There
is the same distance from one scaled score unit to the next across
the entire scale. However, this does not mean that there is an equal
scaled score interval between two raw score units.
Score:
Any specific number resulting from the assessment of an individual; a
generic term applied for convenience to such diverse measures as test
scores, estimates of latent variables, production counts, absence records,
course grades, ratings, and so forth.
Speed Test:
A test in which performance is measured by the number of tasks performed
in a given time. Examples are tests of typing speed and reading speed.
Also, a test scored for accuracy where the test taker works under time
pressure.
Speededness:
A test characteristic, dictated by the test's time limits, that results
in a test taker's score being dependent on the rate at which work is
performed as well as the correctness of the responses. The term is
not used to describe tests of speed. Speededness is often an undesirable
characteristic.
Standard Score:
A type of derived score such that the distribution of these scores for
a specified population has convenient, known values for the mean and
standard deviation. The term is sometimes used to signify a mean of
0.0 and a standard deviation of 1.0.
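A minimal Python sketch of the mean-0.0, standard-deviation-1.0 case, with an optional rescaling to another convenient mean and standard deviation. The function names and data are hypothetical, and the group is treated as the whole specified population (population standard deviation).

```python
import statistics


def z_scores(values):
    """Convert scores to standard scores with mean 0.0 and standard deviation 1.0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        raise ValueError("all scores are identical; standard scores are undefined")
    return [(v - mean) / sd for v in values]


def rescale(z, mean, sd):
    """Express a z-score on a scale with the chosen mean and standard deviation."""
    return mean + sd * z


# Hypothetical raw scores, for illustration only.
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(raw)
print([round(value, 2) for value in z])
print([round(rescale(value, 50, 10), 1) for value in z])
```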
Standardization:
1. In test administration, maintaining a constant testing environment
and conducting the test according to detailed rules and specifications,
so that testing conditions are the same for all test takers. 2. In
test development, establishing scoring norms based on the test performance
of a representative sample of individuals with which the test is intended
to be used. 3. In statistical analysis, transforming a variable so
that its standard deviation is 1.0 for some specified population or
sample.
Stanine Scores:
Stanine scores are normalized standard scores with a range of 1 to 9,
a mean of five, and a standard deviation of two. Like percentile ranks
they are status scores within a particular norm group. The first stanine
is the lowest scoring group and the 9th stanine is the highest scoring
group. Advocates of stanine reporting cite the fact that the single
digit scale is simple and convenient to use and that its use minimizes
the apparent importance of small score differences. On the other hand,
the stanine scale may be regarded as unnecessarily coarse particularly
for relatively reliable tests. For example, all pupils scoring between
the 40th and 60th percentiles are assigned a stanine of 5. However,
a pupil scoring at the 59th percentile, which is in stanine 5, is probably
much more similar in achievement level to a pupil scoring at the 61st
percentile, stanine 6, than to one at the 41st, stanine 5. In some
instances the width of the stanine band exceeds the standard error
of measurement. Another reservation about the use of stanine scores
is that there is evidence that skills development in the elementary
schools is more variable in subjects such as reading in which the pupils
have many opportunities for advancing "on their own" than it is
in subjects such as mathematics in which pupil progress is more rigidly
controlled through placement of concepts and processes in the curriculum.
The distribution of percentages from low to high is as follows:
Stanine   Percentage of cases   Description
1         4%                    Low
2         7%                    Low
3         12%                   Low Average
4         17%                   Average
5         20%                   Average
6         17%                   Average
7         12%                   High Average
8         7%                    High
9         4%                    High
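The percentages of cases per stanine (4, 7, 12, 17, 20, 17, 12, 7, 4) imply cumulative percentile boundaries, which the Python sketch below uses to map a percentile rank to a stanine. The function name is hypothetical, and the handling of a rank that falls exactly on a boundary (assigned here to the higher stanine) is an assumption; published conversion tables may differ.

```python
from bisect import bisect_right

# Percentage of cases in stanines 1 through 9, from the distribution above.
STANINE_PERCENTAGES = [4, 7, 12, 17, 20, 17, 12, 7, 4]

# Cumulative upper bounds for stanines 1-8: 4, 11, 23, 40, 60, 77, 89, 96.
UPPER_BOUNDS = []
running_total = 0
for pct in STANINE_PERCENTAGES[:-1]:
    running_total += pct
    UPPER_BOUNDS.append(running_total)


def percentile_rank_to_stanine(percentile_rank):
    """Map a percentile rank (1-99) to a stanine (1-9).

    Assumption: a rank exactly equal to a boundary goes to the higher stanine.
    """
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    return bisect_right(UPPER_BOUNDS, percentile_rank) + 1


# The 41st and 59th percentiles share stanine 5; the 61st falls in stanine 6.
for pr in (41, 59, 61):
    print(pr, percentile_rank_to_stanine(pr))
```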
Test Modification:
Changes made in the content, format, and/or administration procedure
of a test in order to accommodate test takers who are unable to take
the original test under standard test conditions.
Timed Tests:
A test administered to a test taker who is allotted a strictly prescribed
amount of time to respond to the test.
True Score:
In classical test theory, the average of the scores that would be earned
by an individual on an unlimited number of perfectly parallel forms
of the same test. In item response theory, the error-free value of
test taker proficiency, usually symbolized by θ.
Validation:
The process through which the validity of the proposed interpretation
of test scores is investigated.