Inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement among raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.

## Joint probability of agreement

The joint probability of agreement is probably the simplest and least robust measure. It is the number of times the raters assign the same rating (e.g. 1, 2, ... 5) to an item, divided by the total number of items rated. It assumes that the data are entirely nominal, and it does not take into account that agreement may happen solely by chance.
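
As a rough illustration of the calculation for two raters, the sketch below counts the items on which both raters gave the same rating and divides by the number of items. The function name `joint_agreement` and the sample ratings are made up for this example and are not part of any library.

```python
def joint_agreement(ratings_a, ratings_b):
    """Fraction of items on which two raters gave the identical rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same items")
    if not ratings_a:
        raise ValueError("at least one rated item is required")
    # Ratings are compared only for equality, i.e. treated as nominal labels.
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return matches / len(ratings_a)


# Hypothetical ratings on a 1-5 scale for six items.
rater_1 = [1, 2, 3, 4, 5, 3]
rater_2 = [1, 2, 4, 4, 5, 2]
agreement = joint_agreement(rater_1, rater_2)
print(agreement)  # the raters match on 4 of the 6 items
```

Because only exact matches count, a near miss (3 versus 4) is penalised as heavily as a large disagreement (1 versus 5), which is one reason the measure is considered crude.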
## Kappa statistics

## Correlation coefficients

## Intra-class correlation coefficient

Another way of performing reliability testing is to use the intra-class correlation coefficient (ICC). There are several types of ICC; one is defined as "the proportion of variance of an observation due to between-subject variability in the true scores". The ICC ranges between 0.0 and 1.0 (under an early definition it could range between -1 and +1). The ICC will be high when there is little variation between the scores given to each item by the raters, e.g. if all raters give the same, or similar, scores to each of the items. The ICC is an improvement over Pearson's $r$ and Spearman's $\rho$, as it takes into account the differences in ratings for individual segments, along with the correlation between raters.
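
The text does not say which type of ICC is meant, so the following is only a sketch of one common variant: the one-way random-effects ICC for single ratings, often labelled ICC(1,1) in the notation of Shrout and Fleiss (1979). It is computed from the between-item and within-item mean squares of a one-way analysis of variance. The function name `icc_one_way` and the sample scores are assumptions made for this example.

```python
def icc_one_way(scores):
    """One-way random-effects ICC for single ratings.

    scores: list of rows, one row per rated item; each row holds the
    k ratings that the item received.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("at least two items are required")
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    # Between-item and within-item sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    )

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denominator


# Hypothetical data: three items, each scored by two raters.
scores = [[1, 2], [3, 4], [5, 6]]
print(icc_one_way(scores))
```

Note that this estimator can come out negative when the raters disagree more within items than the items differ from each other, even though the underlying proportion of variance cannot.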
## Limits of agreement

## Notes

## Further reading

## External links

There are a number of statistics which can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: joint-probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient and intra-class correlation.

- Main articles: Cohen's kappa, Fleiss' kappa

Cohen's kappa, which works for two raters, and Fleiss' kappa, an adaptation that works for any fixed number of raters, improve upon the joint probability in that they take into account the amount of agreement that could be expected to occur through chance. They suffer from the same problem as the joint probability in that they treat the data as nominal and assume the ratings have no natural ordering. If the data do have an order, these statistics do not make full use of the information in the measurements.
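
To make the chance correction concrete, here is a minimal sketch of Cohen's kappa for two raters. It uses the standard form (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected if the raters assigned categories independently at their own observed rates. The function name `cohens_kappa` and the sample labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item set")
    n = len(ratings_a)

    # Observed agreement: proportion of items with identical labels.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement: for each category, the product of the two
    # raters' marginal proportions, summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical yes/no judgements on eight items.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
rater_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(rater_1, rater_2)
print(kappa)
```

In this example the raters agree on 6 of 8 items (0.75), but with each rater saying "yes" half the time, 0.5 agreement is expected by chance, so kappa is noticeably lower than the raw agreement.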

- Main articles: Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient

Another approach to agreement (useful when there are only two raters) is to calculate the mean of the differences between the two raters. The confidence limits around the mean provide insight into how much random variation may be influencing the ratings. If the raters tend to agree, the mean will be near zero. If one rater is usually higher than the other by a consistent amount, the mean will be far from zero, but the confidence interval will be narrow. If the raters tend to disagree, but without a consistent pattern of one rating higher than the other, the mean will be near zero but the confidence interval will be wide.

Bland and Altman have expanded on this idea by graphing the difference of each point, the mean difference, and the confidence limits on the vertical against the average of the two ratings on the horizontal. The resulting Bland-Altman plot demonstrates not only the overall degree of agreement, but also whether the agreement is related to the underlying value of the item. For instance, two raters might agree closely in estimating the size of small items, but disagree about larger items.
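
The numbers behind such a plot can be sketched as follows. This is an illustrative calculation, not a reference implementation: it uses the normal-approximation multiplier 1.96 for both the confidence interval of the mean difference and the 95% limits of agreement (mean difference plus or minus 1.96 standard deviations of the differences). The function name `difference_summary`, the `z` parameter and the sample data are assumptions made for this example.

```python
import math
import statistics


def difference_summary(ratings_a, ratings_b, z=1.96):
    """Mean difference between two raters, with spread estimates."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) < 2:
        raise ValueError("need at least two items rated by both raters")

    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    averages = [(a + b) / 2 for a, b in zip(ratings_a, ratings_b)]

    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation

    # Approximate confidence interval for the mean difference itself.
    half_width = z * sd_diff / math.sqrt(len(diffs))

    return {
        "mean_difference": mean_diff,
        "ci_mean": (mean_diff - half_width, mean_diff + half_width),
        # Range expected to contain about 95% of individual differences.
        "limits_of_agreement": (mean_diff - z * sd_diff,
                                mean_diff + z * sd_diff),
        # (average, difference) pairs: the points of a Bland-Altman plot.
        "points": list(zip(averages, diffs)),
    }


# Hypothetical size estimates from two raters for five items.
rater_1 = [10, 12, 14, 16, 18]
rater_2 = [9, 12, 13, 17, 17]
summary = difference_summary(rater_1, rater_2)
print(summary["mean_difference"], summary["limits_of_agreement"])
```

Plotting the `points` pairs with horizontal lines at the mean difference and at the two limits gives the layout described above; for small samples a t-distribution quantile would be a more careful choice than 1.96.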

- Cohen, J. (1960) "A coefficient of agreement for nominal scales" in Educational and Psychological Measurement. Vol. 20, pp. 37–46
- Fleiss, J. L. (1971) "Measuring nominal scale agreement among many raters" in Psychological Bulletin. Vol. 76, No. 5, pp. 378–382
- Shrout, P. and Fleiss, J. L. (1979) "Intraclass correlation: uses in assessing rater reliability" in Psychological Bulletin. Vol. 86, No. 2, pp. 420–428
- Everitt, B. (1996) Making Sense of Statistics in Psychology (Oxford : Oxford University Press) ISBN 0-19-852366-1
- Bland, J. M. and Altman, D. G. (1986) "Statistical methods for assessing agreement between two methods of clinical measurement" in Lancet. Vol. i, pp. 307–310

- Gwet, K. (2001) Handbook of Inter-Rater Reliability, (Gaithersburg : StatAxis Publishing) ISBN 0-9708062-0-5

- Statistical Methods for Rater Agreement by John Uebersax
- Inter-rater Reliability Calculator by Medical Education Online
- Online (Multirater) Kappa Calculator

Wikipedia, the free encyclopedia © 2001-2006 Wikipedia contributors (Disclaimer)

This article is licensed under the GNU Free Documentation License.

Last updated on Wednesday October 08, 2008 at 05:08:33 PDT (GMT -0700)


Copyright © 2015 Dictionary.com, LLC. All rights reserved.