Psychometrics is the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. The field is primarily concerned with the study of measurement instruments such as questionnaires and tests. It involves two major research tasks, namely: (i) the construction of instruments and procedures for measurement; and (ii) the development and refinement of theoretical approaches to measurement.
More recently, psychometric theory has been applied in the measurement of personality, attitudes and beliefs, academic achievement, and in health-related fields. Measurement of these unobservable phenomena is difficult, and much of the research and accumulated art in this discipline has been developed in an attempt to properly define and quantify such phenomena. Critics, including practitioners in the physical sciences and social activists, have argued that such definition and quantification is impossibly difficult, and that such measurements are often misused. Proponents of psychometric techniques can reply, though, that their critics often misuse data by not applying psychometric criteria, and also that various quantitative phenomena in the physical sciences, such as heat and forces, cannot be observed directly but must be inferred from their manifestations.
These divergent responses are reflected to a large extent within alternative approaches to measurement. For example, methods based on covariance matrices are typically employed on the premise that numbers, such as raw scores derived from assessments, are measurements. Such approaches implicitly entail Stevens' definition of measurement, which requires only that numbers are assigned according to some rule. The main research task, then, is generally considered to be the discovery of associations between scores, and of factors posited to underlie such associations. On the other hand, when measurement models such as the Rasch model are employed, numbers are not assigned based on a rule. Instead, in keeping with Reese's statement above, specific criteria for measurement are stated, and the objective is to construct procedures or operations that provide data which meet the relevant criteria. Measurements are estimated based on the models, and tests are conducted to ascertain whether it has been possible to meet the relevant criteria.
Psychometrics is applied widely in educational assessment to measure abilities in domains such as reading, writing, and mathematics. The main approaches in applying tests in these domains have been Classical Test Theory and the more modern Item Response Theory and Rasch measurement models. These modern approaches permit joint scaling of persons and assessment items, which provides a basis for mapping of developmental continua by allowing descriptions of the skills displayed at various points along a continuum. Such approaches provide powerful information regarding the nature of developmental growth within various domains.
Another major focus in psychometrics have been on personality testing. There have been a range of theoretical approaches to conceptualising and measuring personality. Some of the better known instruments include the Minnesota Multiphasic Personality Inventory, the Five-Factor Model (or "Big 5") and tools such as PAPI and Myers-Briggs Type Indicator. Attitudes have also been studied extensively in psychometrics. A common approach to the measurement of attitudes is the use of the Likert scale. An alternative approach involves the application of unfolding measurement models, the most general being the Hyperbolic Cosine Model (Andrich & Luo, 1993).
Psychometric theory involves several distinct areas of study. First, psychometricians have developed a large body of theory used in the development of mental tests and analysis of data collected from these tests. This work can be roughly divided into classical test theory (CTT) and the item response theory (IRT: Embretson & Reise, 2000; Hambleton & Swaminathan, 1985). An approach which seems mathematically to be similar to IRT but also quite distinctive, in terms of its origins and features, is represented by the Rasch model for measurement. The development of the Rasch model, and the broader class of models to which it belongs, was explicitly founded on requirements of measurement in the physical sciences (Rasch, 1960).
Second, psychometricians have developed methods for working with large matrices of correlations and covariances. Techniques in this general tradition include factor analysis (finding important underlying dimensions in the data), multidimensional scaling (finding a simple representation for high-dimensional data) and data clustering (finding objects which are like each other). In these multivariate descriptive methods, users try to simplify large amounts of data. More recently, structural equation modeling and path analysis represent more sophisticated approaches to solving this problem of large covariance matrices. These methods allow statistically sophisticated models to be fitted to data and tested to determine if they are adequate fits.
One of the main deficiencies in various factor analysis is a lack of cutting points. A usual procedure is to stop factoring when eigenvalues drop below one because the original sphere shrinks. The lack of the cutting points concerns other multivariate methods, too. At the bottom, psychometric spaces are Hilbertian but they are dealt with as if Cartesian. Therefore, the problem is more of interpretations than utilizing a method.
Both reliability and validity may be assessed conceptually and mathematically. Internal consistency may be assessed by correlating performance on two halves of a test (split-half reliability); the value of the Pearson product-moment correlation coefficient is adjusted with the Spearman-Brown prediction formula to correspond to the correlation between two full-length tests. Other approaches include the intra-class correlation (the ratio of variance of measurements of a given target to the variance of all targets). A commonly used measure is Cronbach's α, which is equivalent to the mean of all possible split-half coefficients. Stability over repeated measures is assessed with the Pearson coefficient, as is the equivalence of different versions of the same measure (different forms of an intelligence test, for example). Other measures are also used.
Validity may be assessed by correlating measures with a criterion measure known to be valid. When the criterion measure is collected at the same time as the measure being validated the goal is to establish concurrent validity; when the criterion is collected later the goal is to establish predictive validity. A measure has construct validity if it is related to other variables as required by theory. Content validity is simply a demonstration that the items of a test are drawn from the domain being measured. In a personnel selection example, test content is based on a defined statement or set of statements of knowledge, skill, ability, or other characteristics obtained from a job analysis.
Predictive or concurrent validity cannot exceed the square of the correlation between two versions of the same measure.
Item response theory models the relationship between latent traits and responses to test items. Among other advantages, IRT provides a basis for obtaining an estimate of the location of a test-taker on a given latent trait as well as the standard error of measurement of that location. For example, a university student's knowledge of history can be deduced from his or her score on a university test and then be compared reliably with a high school student's knowledge deduced from a less difficult test. Scores derived by classical test theory do not have this characteristic, and assessment of actual ability (rather than ability relative to other test-takers) must be assessed by comparing scores to those of a "norm group" randomly selected from the population. In fact, all measures derived from classical test theory are dependent on the sample tested, while, in principle, those derived from item response theory are not.
Each publication presents and elaborates a set of standards for use in a variety of educational settings. The standards provide guidelines for designing, implementing, assessing and improving the identified form of evaluation. Each of the standards has been placed in one of four fundamental categories to promote educational evaluations that are proper, useful, feasible, and accurate. In these sets of standards, validity and reliability considerations are covered under the accuracy topic. For example, the student accuracy standards help ensure that student evaluations will provide sound, accurate, and credible information about student learning and performance.