Confidence calibration

When the AI proposes a value for a field, it also reports how sure it is. That confidence is one of three bands: High, Medium, or Low. There is no percentage and no internal score behind it. The model is asked to say, in plain terms, whether a value was stated clearly in the document, inferred from context, or barely there at all, and it answers with one of the three bands.

That single signal drives a lot of what reviewers experience. It decides what surfaces first in the queue, what can be accepted in bulk, and what the model is shown about its own past mistakes the next time it runs. Piprio keeps score on how often each band turns out to be right, and feeds that back into both the review screen and the extraction prompt. This page explains where the confidence band comes from, how Piprio adjusts it, and why it shifts over time.

How scores are computed

The confidence band is the model's own report, not a number Piprio calculates after the fact. The model labels each proposed value High, Medium, or Low based on how directly the document supported it. Values a reviewer types by hand carry no confidence band at all, so AI judgment and human entry stay distinct in the record.

There is one case where Piprio overrides the model. If a proposed value fails the validation rule on its field, say a date that is not a valid date, or a category that is not one of the allowed options, the band is forced down to Low and the reason is kept on file. That is the only point where Piprio changes a freshly extracted band.
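The validation override can be sketched as follows. This is a minimal illustration, not Piprio's actual code: the Proposal record, the apply_validation helper, and the ISO date validator are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    # Hypothetical record shape; the real storage format is not described here.
    field: str
    value: str
    band: str  # "High", "Medium", or "Low", as reported by the model
    forced_low_reason: Optional[str] = None


def is_valid_iso_date(value: str) -> bool:
    # Example validation rule: the value must parse as a calendar date.
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def apply_validation(p: Proposal, rule: Callable[[str], bool]) -> Proposal:
    # A value that passes its field's rule keeps the model's band untouched.
    if rule(p.value):
        return p
    # A failing value is forced to Low, and the reason is kept alongside it.
    return replace(p, band="Low",
                   forced_low_reason=f"failed validation rule on {p.field!r}")
```

For example, a High proposal of "2024-02-30" for a date field would come back as Low with a reason attached, while "2024-02-29" would keep its High band.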

Each set of labels for a document also carries a roll-up: the lowest band across all its fields. The review queue sorts by that roll-up, lowest first, so the least certain work rises to the top of what reviewers see. Bulk accept uses the same idea from the other end. It will only clear a set where every field came back High, so a one-click approval never sweeps up a value the model was unsure about.
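The roll-up, the queue order, and the bulk-accept check can be sketched in a few lines. The function names and the dictionary shape are assumptions made for illustration, and the sketch assumes every field in a set carries an AI-reported band.

```python
# Lower number = less certain. Order taken from the three bands described above.
BAND_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def roll_up(bands: dict[str, str]) -> str:
    # The roll-up is the lowest band across all fields in the set.
    return min(bands.values(), key=BAND_ORDER.__getitem__)


def sort_queue(label_sets: list[dict]) -> list[dict]:
    # Lowest roll-up first, so the least certain documents surface at the top.
    return sorted(label_sets, key=lambda s: BAND_ORDER[roll_up(s["bands"])])


def can_bulk_accept(bands: dict[str, str]) -> bool:
    # Bulk accept only clears a set where every field came back High.
    return bool(bands) and all(b == "High" for b in bands.values())
```

A set with bands High, High, Medium rolls up to Medium, sorts ahead of an all-High set, and is not eligible for bulk accept.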

Everything past this point is bookkeeping layered on top of the model's report. None of it rewrites the original band. The adjustments below change what a reviewer is shown and what the model is told, not the underlying record.

Override adjustments

The band a reviewer sees on the review form is not always the raw band the model reported. Piprio tracks, for every field and every band, how often reviewers have ended up changing that value. When the track record is clear enough, it shifts the displayed band to match reality.

The adjustment kicks in once a field has accumulated at least ten proposals at a given band. Below that, there is not enough history to trust, so the raw band shows through unchanged. At or above ten:

If reviewers override more than 80 percent of the proposals at that band, the displayed band drops two steps.
If they override more than 60 percent, it drops one step.
If a Low-band field is accepted more than 80 percent of the time, it is promoted to Medium.

So a field whose High proposals get edited more than 80 percent of the time will display as Low, a warning to the reviewer to look closely. A Low field that reviewers keep accepting climbs to Medium, so it stops drawing scrutiny it has earned the right to skip. Drops bottom out at Low. Nothing goes below the floor.
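The rules above can be written out as a small function. This is a sketch under stated assumptions: displayed_band is a hypothetical name, and the accept count is taken to be proposals minus overrides, which the page does not state explicitly.

```python
BANDS = ["Low", "Medium", "High"]  # index 0 is the floor
MIN_PROPOSALS = 10


def displayed_band(raw_band: str, proposals: int, overrides: int) -> str:
    # Below ten proposals at this band, the raw band shows through unchanged.
    if proposals < MIN_PROPOSALS:
        return raw_band

    idx = BANDS.index(raw_band)
    accepts = proposals - overrides  # assumption: every proposal is one or the other

    # Integer comparisons avoid floating-point surprises at the thresholds.
    if overrides * 100 > 80 * proposals:
        idx -= 2  # overridden more than 80 percent: drop two steps
    elif overrides * 100 > 60 * proposals:
        idx -= 1  # overridden more than 60 percent: drop one step
    elif raw_band == "Low" and accepts * 100 > 80 * proposals:
        idx += 1  # Low accepted more than 80 percent: promote to Medium

    return BANDS[max(idx, 0)]  # drops bottom out at Low
```

With 20 High proposals of which 17 were overridden (85 percent), the displayed band is Low. With 13 overridden (65 percent), it is Medium. With exactly 80 percent overridden, only the one-step rule applies, because both thresholds are strict.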

The quality dashboard reads the same history. For each field it shows the override rate and a plain label: a field overridden more than 60 percent of the time is flagged overconfident, one overridden less than 20 percent of the time is well calibrated, and anything between is moderate. The list sorts worst first, so the fields the model is getting wrong sit at the top. This view is open to reviewers and above.
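As an illustration of the labeling and sort order, here is a minimal sketch. The function names and the (proposals, overrides) tuple shape are hypothetical, and the filter reflects the dashboard's minimum of ten proposals per field.

```python
MIN_PROPOSALS = 10


def calibration_label(proposals: int, overrides: int) -> str:
    # Thresholds are strict: exactly 60 or exactly 20 percent counts as moderate.
    if overrides * 100 > 60 * proposals:
        return "overconfident"
    if overrides * 100 < 20 * proposals:
        return "well calibrated"
    return "moderate"


def dashboard_rows(tallies: dict[str, tuple[int, int]]) -> list[tuple[str, float, str]]:
    # tallies maps field name -> (proposals, overrides).
    rows = [
        (field, overrides / proposals, calibration_label(proposals, overrides))
        for field, (proposals, overrides) in tallies.items()
        if proposals >= MIN_PROPOSALS  # too little history is left off the list
    ]
    # Worst first: highest override rate at the top.
    return sorted(rows, key=lambda row: row[1], reverse=True)
```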

Few-shot context injection

The same override history feeds back into extraction itself. Before the model runs on a new document, Piprio looks for fields it has been getting wrong and shows the model concrete examples of its past mistakes, paired with what the reviewer corrected them to. The model sees its own track record and adjusts.

A field qualifies for this once two conditions hold across all of its proposals: at least five samples, and an override rate of at least 25 percent. For each qualifying field, Piprio pulls the actual corrections reviewers made (what the model proposed against what the reviewer set it to), groups identical corrections so the most common pattern leads, and includes up to five examples per field. Those examples are written into the prompt for the next extraction.

The thresholds here are looser than the ones that change the displayed band. A field starts feeding examples back into the prompt at five samples and a 25 percent override rate, well before it reaches the ten samples and 60 percent rate that would downgrade what a reviewer sees. The intent is to start correcting the model early, while still waiting for a stronger signal before changing what the reviewer is told. If the lookup fails for any reason, extraction proceeds without the examples rather than blocking.
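The qualification test, the grouping of corrections, and the fail-open behavior can be sketched like this. All names are hypothetical, and the way examples are actually formatted into the prompt is not shown because the page does not describe it.

```python
from collections import Counter
from typing import Callable

MIN_SAMPLES = 5
MIN_OVERRIDE_PCT = 25
MAX_EXAMPLES = 5


def qualifies(samples: int, overrides: int) -> bool:
    # At least five samples and an override rate of at least 25 percent.
    return samples >= MIN_SAMPLES and overrides * 100 >= MIN_OVERRIDE_PCT * samples


def correction_examples(corrections: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Each correction is (what the model proposed, what the reviewer set).
    # Identical pairs are grouped; the most common pattern leads.
    grouped = Counter(corrections)
    return [pair for pair, _count in grouped.most_common(MAX_EXAMPLES)]


def examples_or_empty(lookup: Callable[[], list]) -> list:
    # If the lookup fails for any reason, extraction proceeds without examples.
    try:
        return lookup()
    except Exception:
        return []
```

A field with 8 samples and 2 overrides (25 percent) qualifies. One with 4 samples does not, whatever its override rate.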

When scores drift

Drift here means the gap between how confident the model says it is and how often that confidence holds up. A field where the model keeps reporting High while reviewers keep editing the value has drifted. Piprio is built to find that gap and close it, not to prevent it from opening.

Two separate records track different parts of this, and they behave differently:

A history of every decision. Each time a reviewer accepts or rejects an AI proposal, Piprio records, per field, what band the model reported and whether the value was kept or changed. This is append-only, it captures both accepts and rejects, and it is the source of the calibration summary that reports accept rate and edit rate for each of the three bands.
A running per-field tally. After an accepted set, Piprio recomputes how each field is faring at each band. This tally drives both the override adjustment on the review form and the example injection into the prompt.

The two records part ways on rejects. The decision history learns from both accepts and rejects. The per-field tally updates only on accepts, so the adjusted display and the prompt examples both learn from the values reviewers chose to keep, not the ones they threw out. The per-field recompute also runs only after the acceptance has fully committed, never speculatively mid-decision, so it never reflects a decision that did not stick.
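The difference between the two records can be made concrete with a small sketch. The class and field names are hypothetical, and the tally is incremented here for brevity, whereas Piprio is described as recomputing it after each accepted set.

```python
from collections import defaultdict


class CalibrationRecords:
    def __init__(self) -> None:
        # Append-only history: one entry per field per decision.
        self.history: list[dict] = []
        # Running tally keyed by (field, band).
        self.tally = defaultdict(lambda: {"proposals": 0, "overrides": 0})

    def record(self, set_accepted: bool, fields: list[tuple[str, str, bool]]) -> None:
        # fields holds (field name, band the model reported, value was changed).
        for field, band, changed in fields:
            self.history.append({
                "field": field,
                "band": band,
                "changed": changed,
                "set_accepted": set_accepted,
            })

        # Rejected sets stop here: the history learns from them, the tally does not.
        if not set_accepted:
            return

        # Assumed to run only once the acceptance has committed.
        for field, band, changed in fields:
            entry = self.tally[(field, band)]
            entry["proposals"] += 1
            entry["overrides"] += int(changed)
```

After one accepted set and one rejected set, the history holds entries from both, while the tally reflects only the accepted one.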

Drift shows up on the quality dashboard as a field flagged overconfident: at least ten proposals at a band, with an override rate past 60 percent. That flag is the signal that the model's self-reported confidence for that field no longer matches what reviewers actually do with it.

Recalibrating

There is no manual recalibration step and no model to retrain. Recalibration is continuous. Every accepted set updates the per-field tally, which immediately changes both the band shown on the next review and the correction examples written into the next extraction. The system is always chasing the current gap.

To see where a schema stands right now, the quality dashboard offers two views. One lists each field's override rate and its calibration label, showing only fields with at least ten proposals so the numbers mean something. The other summarizes the model's learning: how many fields currently have correction examples active, the average override rate across them, and the total number of examples available to the prompt.

One thing to plan for. The history is kept per schema version. When a team publishes a new version of a schema, the counts for that version start from zero. Each field in a new version shows raw, unadjusted confidence until it has gathered at least ten proposals at a band, and it gets no correction examples until it has at least five samples and a 25 percent override rate. Expect a settling period after any schema change before adjusted bands and prompt feedback take hold again. This is the cost of versioning done honestly: a changed field is a different field, and its old track record no longer applies.