
Measuring Validity and Reliability of Human Ratings

The Unofficial Google Data Science Blog

Suppose two raters each apply a label completely at random with probability ⅙, as if each rolled a die and labeled on a six. Their raw agreement will be (⅚ × ⅚ + ⅙ × ⅙) = 72%, purely by chance. If instead they each roll two dice and apply the label only when the dice sum to 12 (probability 1/36), they will agree (35/36 × 35/36 + 1/36 × 1/36) ≈ 95% of the time, again purely by chance. Throughout, we'll refer to our model-derived measurement of inter-rater reliability as the Intraclass Correlation Coefficient (ICC).
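As a quick check on the arithmetic, the chance agreement between two independent raters who each apply a binary label with probability p is p² + (1 − p)². A minimal sketch (the function name is ours, for illustration):

```python
from fractions import Fraction

def chance_agreement(p):
    """Probability that two independent raters agree when each
    applies a binary label with probability p: both label it,
    or neither does."""
    return p * p + (1 - p) * (1 - p)

# One die, label on a six: p = 1/6 -> 26/36, about 72%
print(float(chance_agreement(Fraction(1, 6))))   # ~0.722

# Two dice, label on a sum of 12: p = 1/36 -> 1226/1296, about 95%
print(float(chance_agreement(Fraction(1, 36))))  # ~0.946
```

This is why raw agreement alone is a poor reliability metric: the rarer the label, the higher the agreement two random raters achieve for free.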