article thumbnail

Measuring Validity and Reliability of Human Ratings

The Unofficial Google Data Science Blog

If they roll two dice and apply a label if the dice rolls sum to 12 they will agree 85% of the time, purely by chance. While it may be a little abstract, this concept forms a key piece of Classical Test Theory (CTT) , a foundation of psychometrics. 2005) Partitioning variation in multilevel models (Goldstein et al.,