article thumbnail

Measuring Validity and Reliability of Human Ratings

The Unofficial Google Data Science Blog

Once we’ve answered that, we will then define and use metrics to understand the quality of human-labeled data, along with a measurement framework that we call Cross-replication Reliability or xRR. If they roll two dice and apply a label if the dice rolls sum to 12 they will agree 85% of the time, purely by chance.

article thumbnail

Data scientist as scientist

The Unofficial Google Data Science Blog

Note also that this account does not involve ambiguity due to statistical uncertainty. As you can see from the tiny confidence intervals on the graphs, big data ensured that measurements, even in the finest slices, were precise. We sliced and diced the experimental data in many many ways.