Inter-rater agreement

Two people can read the same document and label it differently. That is the central risk in any human-in-the-loop labeling program: the training data a team ships is only as consistent as the people producing it, and inconsistency is invisible until someone measures it. Piprio measures it.

The product can route a configurable share of incoming work to a second, independent reviewer, then score how often the two reviewers reach the same answer. The score is reported per labeling question, so a team can see exactly which parts of a labeling spec are clear and which are ambiguous. Low scores point to the guideline that needs tightening, not to the people applying it. This is the same discipline a research team would apply by hand, run continuously and without anyone setting up spreadsheets.

How kappa is computed

The agreement statistic is Cohen's kappa, a standard measure of how often two raters agree on a categorical judgment. Piprio computes it on the labeling questions where that kind of agreement is meaningful: single-select fields, where the reviewer picks one option from a list, and yes/no fields. Free-text answers, numbers, and dates are left out, because there is no clean notion of "the same category" for an open string or a continuous value.

The reason for kappa, rather than a plain "the reviewers matched 80 percent of the time," is that raw match rate flatters easy questions. If a field is "yes" 95 percent of the time, two reviewers who each answer "yes" 95 percent of the time without reading the document will still agree about 90 percent of the time on luck alone (0.95 × 0.95 for matching "yes" plus 0.05 × 0.05 for matching "no"). Kappa corrects for that. It subtracts the agreement expected by chance and reports only the agreement beyond it, as a share of the agreement that chance left available. A kappa of 1.0 means the reviewers matched on every comparison. A kappa of 0 means they agreed no more than two people guessing independently would. A negative kappa means they agreed less than chance, which usually signals that the two reviewers are reading the question in opposite ways.
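The chance correction can be sketched in a few lines. This is a minimal illustration of the standard Cohen's kappa formula, kappa = (observed agreement - chance agreement) / (1 - chance agreement), not Piprio's own code; the function name cohen_kappa and the sample answers are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length lists of categorical answers.

    Hypothetical helper for illustration; returns None when kappa is
    undefined because both raters used one identical category throughout.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty answer lists of equal length")
    n = len(first)

    # Observed agreement: share of pairs where both raters gave the same answer.
    p_observed = sum(a == b for a, b in zip(first, second)) / n

    # Chance agreement: probability that both raters land on the same category
    # if each answered independently at their own observed rates.
    counts_first, counts_second = Counter(first), Counter(second)
    p_chance = sum(counts_first[c] * counts_second[c] for c in counts_first) / (n * n)

    if p_chance == 1.0:
        return None  # 0/0: no variation in the answers to correct against
    return (p_observed - p_chance) / (1 - p_chance)


# 100 paired reviews of a field that is "yes" 95 percent of the time.
# The raters match on 90 documents, yet both simply lean on "yes".
first_rater = ["yes"] * 90 + ["yes"] * 5 + ["no"] * 5
second_rater = ["yes"] * 90 + ["no"] * 5 + ["yes"] * 5
kappa = cohen_kappa(first_rater, second_rater)
```

In this sample the raw match rate is 0.90 while the chance agreement is 0.905, so kappa comes out slightly below zero (about -0.05): a 90 percent match that carries no information beyond luck.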

Piprio sorts each question's kappa into three bands so a non-statistician can read the report at a glance:

At or above 0.8 is excellent. Reviewers are applying the guideline the same way.
At or above 0.6 but below 0.8 is good. Agreement is solid, with room to tighten.
Below 0.6 needs attention. The guideline, the field's options, or the training is ambiguous enough that reviewers diverge.
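The three bands amount to two threshold checks. The sketch below uses the cut-offs stated above; the function name quality_band and the choice to pass None for an unscorable question are assumptions for illustration.

```python
def quality_band(kappa):
    """Map a per-question kappa to its report band.

    Hypothetical helper: None stands for a question that could not be
    scored, which the report treats as needing attention.
    """
    if kappa is None:
        return "needs attention"
    if kappa >= 0.8:
        return "excellent"
    if kappa >= 0.6:
        return "good"
    return "needs attention"
```

Checking the higher threshold first is what keeps the bands from overlapping: a kappa of exactly 0.8 is excellent, and exactly 0.6 is good.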

A question with too few paired reviews to score, or where one reviewer left the field blank too often, is treated as needing attention rather than being scored on thin data. The report also shows a plain match rate alongside the kappa for each question, so a reader sees both the headline number and the chance-corrected one side by side.
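One way to picture the thin-data guard and the plain match rate is the sketch below. The thresholds MIN_SCORABLE_PAIRS and MAX_BLANK_SHARE are invented for the example, since the actual cut-offs are not stated here, and summarize_question is a hypothetical name.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_SCORABLE_PAIRS = 10   # fewer answered pairs than this is "thin data"
MAX_BLANK_SHARE = 0.5     # more blanks than this makes the score unreliable


def summarize_question(pairs):
    """Summarize one question from (first_answer, second_answer) pairs.

    None marks a blank answer. Only pairs with an answer on both sides
    count toward the match rate.
    """
    answered = [(a, b) for a, b in pairs if a is not None and b is not None]
    blank_share = 1 - len(answered) / len(pairs) if pairs else 0.0
    match_rate = (
        sum(a == b for a, b in answered) / len(answered) if answered else 0.0
    )
    scorable = len(answered) >= MIN_SCORABLE_PAIRS and blank_share <= MAX_BLANK_SHARE
    return {
        "answered_pairs": len(answered),
        "match_rate": match_rate,
        "scorable": scorable,  # False means: report as needs attention
    }
```

A question that fails the guard would skip kappa entirely and go straight to the needs-attention band, while its match rate and pair count stay visible.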

Reading the report

The agreement report lives in the quality area of the application and is open to anyone with a reviewer role or above. It opens with an overall match rate and the number of reviewer pairs it was computed from, then breaks the score down question by question.

Each row names a labeling question, shows its kappa rounded to three decimals, gives its quality band, and lists how many pairs of reviews had an answer on both sides. The questions that fall into the needs-attention band are also collected into a short list at the top, so a reviewer lead can see the trouble spots without scanning the whole table. When the overall picture is poor enough to warrant action, the report surfaces an "action recommended" prompt. An administrator can acknowledge that prompt for a given labeling spec so it stops re-appearing on every visit. Acknowledging it is a way to silence the reminder once someone has owned the problem. It does not change any score or hide the underlying numbers.

By default the report looks back over the last 90 days of completed reviews. The window is adjustable, from one week up to a full year, so a team can compare a recent stretch against a longer baseline.
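The look-back window can be thought of as a validated day count turned into a cut-off date. In this sketch the names window_start and the constants are hypothetical, and "a full year" is assumed to mean 365 days.

```python
from datetime import date, timedelta

DEFAULT_WINDOW_DAYS = 90
MIN_WINDOW_DAYS = 7      # one week
MAX_WINDOW_DAYS = 365    # assumption: "a full year" taken as 365 days


def window_start(today, days=DEFAULT_WINDOW_DAYS):
    """Return the earliest completion date included in the report."""
    if not MIN_WINDOW_DAYS <= days <= MAX_WINDOW_DAYS:
        raise ValueError(
            f"window must be between {MIN_WINDOW_DAYS} and {MAX_WINDOW_DAYS} days"
        )
    return today - timedelta(days=days)


# Comparing a recent stretch against a longer baseline, as of one fixed day.
as_of = date(2026, 5, 21)
recent_cutoff = window_start(as_of, days=30)
baseline_cutoff = window_start(as_of)  # default 90 days
```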

A labeling spec that has never been double-reviewed does not error. The report opens cleanly with zeros, which is the correct reading: there is no agreement to report yet because no pairs exist.

Resolving conflicts

The agreement score signals that reviewers diverge. The disagreement report shows where, on which documents, and on what answers. It is the input to a human decision, and it is open to reviewers and above.

The report lists only the document pairs where the two reviewers actually gave different answers on at least one question. Pairs where they agreed on everything drop out, so the list is always a worklist of real conflicts rather than a dump of every comparison. Each entry names the document, both reviewers, when the second review finished, and a list of the specific questions they split on, showing each reviewer's answer next to the other's. If one reviewer left a question blank, that side reads "(empty)" rather than disappearing, so a missed answer is as visible as a wrong one. The most recent conflicts come first.

Here is the shape of a single conflict, the way a reviewer lead reads it:

Document: Inspection report 1184
Reviewers: Alice and Bob, completed 21 May 2026
Surface finish: Alice chose "polished", Bob chose "as-cast"
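The filtering behind that worklist can be sketched as follows. The data layout (dictionaries with "first" and "second" answer maps) and the name find_conflicts are assumptions for illustration, as are the second document and the "Cracks visible" question in the sample data.

```python
from datetime import date


def find_conflicts(paired_reviews):
    """Return only the pairs with at least one differing answer, newest first."""

    def normalize(value):
        # Treat None and empty strings alike as a blank answer.
        return None if value in (None, "") else value

    conflicts = []
    for pair in paired_reviews:
        first, second = pair["first"], pair["second"]
        splits = []
        for question in sorted(set(first) | set(second)):
            a = normalize(first.get(question))
            b = normalize(second.get(question))
            if a != b:
                # A blank stays visible as "(empty)" instead of disappearing.
                splits.append((question, a or "(empty)", b or "(empty)"))
        if splits:  # pairs that agree on everything drop out
            conflicts.append(
                {
                    "document": pair["document"],
                    "reviewers": pair["reviewers"],
                    "completed": pair["completed"],
                    "splits": splits,
                }
            )
    conflicts.sort(key=lambda c: c["completed"], reverse=True)
    return conflicts


sample_pairs = [
    {
        "document": "Inspection report 1184",
        "reviewers": ("Alice", "Bob"),
        "completed": date(2026, 5, 21),
        "first": {"Surface finish": "polished", "Cracks visible": "no"},
        "second": {"Surface finish": "as-cast", "Cracks visible": "no"},
    },
    {
        "document": "Inspection report 1185",  # hypothetical second document
        "reviewers": ("Alice", "Bob"),
        "completed": date(2026, 5, 22),
        "first": {"Surface finish": "polished"},
        "second": {"Surface finish": "polished"},
    },
]
worklist = find_conflicts(sample_pairs)
```

Here the second document agrees on every question and drops out, leaving report 1184 with a single split on "Surface finish".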

Piprio does not pick a winner. There is no automatic tie-break and no silent overwrite of one reviewer's answer with the other's. That is deliberate. A disagreement is information about the guideline, and the resolution is a judgment a senior reviewer or data steward makes: clarify the written instructions, split an ambiguous option into two clearer ones, or sit down with the reviewers and walk through the case. Past reviews are part of the audit trail and are never edited after the fact, so resolving a conflict means improving the next batch of work, not rewriting the last one.

Training reviewers

A low score is almost always a guideline problem, not a reviewer problem. Two careful people reading the same vague instruction will land in different places, and no amount of effort on their part will fix an instruction that admits two readings. The two reports give a closed training loop:

The needs-attention list shows which questions to fix first. These are the questions where reviewers diverge most, so they are where a clearer definition, a worked example, or a cleaner set of options will buy the most consistency.
The disagreement report gives a team the raw material for that conversation: the actual documents reviewers read differently, with both answers laid out side by side. Those are the cases to review as a team, because they are the ones that genuinely confused real people, not hypotheticals.

To have anything to train on, double-review has to be turned on. Piprio routes a configurable share of incoming work to a second reviewer, picked from the same team as the first so the two share a queue and a rulebook. The share is off by default and can be raised to a maximum of 25 percent, so at most one in four documents carries the extra cost of a second pass. Raising the share applies to new work as it comes in, and an administrator can also sweep the current unassigned backlog to start producing pairs immediately rather than waiting for fresh intake. With the share at zero, no pairs are created and the report stays empty no matter how many documents are reviewed, so the first step in any training program is to turn sampling on.
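The sampling rule might look like the sketch below. How Piprio actually selects documents is not described here, so the per-document random draw is an assumption, and pick_second_reviewer and MAX_SHARE are hypothetical names.

```python
import random

MAX_SHARE = 0.25  # the documented ceiling: at most one document in four


def pick_second_reviewer(first_reviewer, team, share, rng=random):
    """Return a second reviewer for this document, or None if not sampled.

    Assumption: each document is sampled independently with probability
    `share`. The second reviewer comes from the first reviewer's team and
    is never the first reviewer, so the two reviews stay independent.
    """
    if not 0.0 <= share <= MAX_SHARE:
        raise ValueError(f"share must be between 0 and {MAX_SHARE}")
    if rng.random() >= share:
        return None  # not sampled; with share 0 this is always the case
    candidates = [member for member in team if member != first_reviewer]
    return rng.choice(candidates) if candidates else None
```

Passing a seeded random.Random instance as rng makes the draw repeatable, which helps when checking that a share of zero never produces a pair.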

After a guideline change, watch the per-question score over the next review window. A question that climbs out of the needs-attention band into good or excellent is the measurable result of the change, and the band makes that progress legible to a manager who never wants to think about a kappa formula. Give a change a few weeks of new reviews before reading the number as settled: the default window spans 90 days, so it still includes reviews completed under the old wording, and a freshly clarified guideline needs volume behind it before the score reflects it.