Concepts
Piprio turns the files a manufacturer already has into labeled training data, with a record of who decided what. A connector finds a file. The file becomes an artifact. A schema says which values to capture from it. AI proposes those values, or a person enters them, and a reviewer accepts or rejects the result. Each step leaves an entry that cannot be edited after the fact, so the finished dataset carries its own history.
The rest of this page defines those terms and shows how each one connects to the next. A buyer evaluating Piprio against an in-house build will recognize most of these ideas. What is worth attention is how they fit together: the labeling spec is versioned, the source files are never copied, AI output stays separable from human judgment, and nothing in the chain can be quietly rewritten.
How a customer account is organized
Work in Piprio lives under a customer account. That account is the boundary for everything: the documents, the labeling specs, the people, the connectors, and the history. One company, one account.
Each account is isolated at the database layer. Two customers on the same Piprio deployment never share a table, a row, or a query. This is the strongest part of the isolation story and the first thing most procurement reviews ask about: each account's data sits in its own Postgres schema (a database namespace, unrelated to the labeling schemas described below), separated in the database itself rather than filtered at the application layer.
A person can belong to more than one account, with a different role in each. Their permissions in one account say nothing about their permissions in another. Inside an account, work is grouped by the labeling specs it runs against and by the teams responsible for it, which the sections below cover in turn.
Schemas and versions
A schema is the labeling spec for one kind of document. It names the values to capture, what type each value is, which are required, and which only appear when an earlier answer calls for them. A schema for an inspection report asks different questions than a schema for a build log, and a team defines as many schemas as it has document kinds.
Each schema can be turned on or off, and AI assistance and AI validation can be switched independently of each other and of the schema itself. A team can run a schema by hand for a month, watch the results, then enable AI proposals on its own timeline without rebuilding anything. A schema can also be saved as a reusable template, a starting point a team copies rather than a live spec it labels against.
Schemas are versioned, and the line between a version and an edit is deliberate. Some changes are breaking: adding a required value, removing an option from a controlled list, or changing a value's type. Those create a new version of the schema. Other changes are not: rewriting a hint, relabeling a field for clarity, changing a default, or reordering the form. Those edit the current version in place.
The reason the distinction matters is historical accuracy. Every batch of labels records the exact schema version it was captured against. When a schema gains a new version, work already labeled under the old version keeps pointing at the old version. Schema evolution never re-labels historical data, and an export can always say which version produced a given record.
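The version-versus-edit rule can be sketched as follows. This is a minimal illustration, not Piprio's implementation: the `Schema` class, the `apply_change` function, and the change names are all hypothetical, chosen to mirror the two lists of changes given in the text.

```python
from dataclasses import dataclass

# Hypothetical change names; the two groups mirror the lists in the text.
BREAKING = {"add_required_value", "remove_list_option", "change_value_type"}
NON_BREAKING = {"rewrite_hint", "relabel_field", "change_default", "reorder_form"}


@dataclass
class Schema:
    name: str
    version: int = 1


def apply_change(schema: Schema, change: str) -> Schema:
    """Return the schema that results from applying one change."""
    if change in BREAKING:
        # Breaking changes create a new version of the schema.
        return Schema(schema.name, schema.version + 1)
    if change in NON_BREAKING:
        # Non-breaking changes edit the current version in place.
        return schema
    raise ValueError(f"unknown change kind: {change}")


inspection = Schema("inspection_report")
# A batch records the version it was captured against and keeps pointing at it.
old_batch_version = inspection.version
inspection = apply_change(inspection, "rewrite_hint")       # still version 1
inspection = apply_change(inspection, "change_value_type")  # now version 2
```

After both changes, `old_batch_version` is still 1 while `inspection.version` is 2, which is the property that lets an export say which version produced a given record.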
A schema supports seven kinds of values: a controlled list of choices, yes or no, a whole number, a decimal number, a rating scale with named bounds, free text, and a date. Each value can carry a hint for the reviewer, a default, and a rule that shows or hides it based on another answer. At launch the conditional rule is a single show-or-hide condition, for example show the defect-code field only when the inspection result is "fail."
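One way to picture a field definition with a single show-or-hide condition is the sketch below. The class and attribute names (`ValueKind`, `Field`, `ShowWhen`, `visible_fields`) are assumptions made for illustration and are not Piprio's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class ValueKind(Enum):
    CHOICE = "choice"    # controlled list of choices
    BOOLEAN = "boolean"  # yes or no
    INTEGER = "integer"  # whole number
    DECIMAL = "decimal"  # decimal number
    RATING = "rating"    # rating scale with named bounds
    TEXT = "text"        # free text
    DATE = "date"        # date


@dataclass
class ShowWhen:
    """A single condition: show the field only when another answer equals a value."""
    field: str
    equals: Any


@dataclass
class Field:
    name: str
    kind: ValueKind
    required: bool = False
    hint: str = ""
    default: Any = None
    show_when: Optional[ShowWhen] = None


def visible_fields(fields, answers):
    """Return the names of the fields to display, given the answers so far."""
    names = []
    for f in fields:
        rule = f.show_when
        if rule is None or answers.get(rule.field) == rule.equals:
            names.append(f.name)
    return names


fields = [
    Field("inspection_result", ValueKind.CHOICE, required=True),
    Field("defect_code", ValueKind.TEXT,
          show_when=ShowWhen("inspection_result", "fail")),
]
```

With these definitions, `visible_fields(fields, {"inspection_result": "fail"})` includes `defect_code`, and any other inspection result hides it.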
Artifacts and ingestion
An artifact is one file, or one logical group of files, that a connector has found and made eligible for labeling. It records where the file came from, its display name, type, size, and a content fingerprint, along with whatever metadata the source system carried. The same path from the same connector always resolves to the same artifact, so re-running a crawl does not create duplicates.
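The no-duplicates guarantee amounts to keying each artifact on its connector and path. The sketch below assumes that key, uses SHA-256 as a stand-in for the content fingerprint (the source does not name the algorithm), and assumes a re-crawl refreshes the recorded size and fingerprint. `ArtifactIndex` and its methods are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Artifact:
    connector_id: str
    path: str
    display_name: str
    size_bytes: int
    fingerprint: str


class ArtifactIndex:
    """In-memory stand-in for the artifact table, keyed by (connector, path)."""

    def __init__(self):
        self._by_key = {}

    def register(self, connector_id: str, path: str, content: bytes) -> Artifact:
        key = (connector_id, path)
        # Assumption: SHA-256 stands in for the unspecified fingerprint.
        fingerprint = hashlib.sha256(content).hexdigest()
        existing = self._by_key.get(key)
        if existing is not None:
            # Re-crawl: refresh the existing artifact instead of adding a new one.
            existing.size_bytes = len(content)
            existing.fingerprint = fingerprint
            return existing
        artifact = Artifact(connector_id, path, path.rsplit("/", 1)[-1],
                            len(content), fingerprint)
        self._by_key[key] = artifact
        return artifact

    def __len__(self):
        return len(self._by_key)


index = ArtifactIndex()
first = index.register("share-01", "/reports/run-7.pdf", b"report bytes")
again = index.register("share-01", "/reports/run-7.pdf", b"report bytes")
```

Here `again` is the same object as `first`, and the index still holds one artifact after the second crawl.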
Piprio does not copy customer files into itself. A connector records the path and reads the bytes on demand at crawl and extraction time. This is a design choice with two consequences a buyer should weigh. The customer's files stay in the customer's systems, which simplifies a data-residency review. And removing an artifact is reversible: it is hidden and marked with who removed it and why, rather than being erased, so the audit trail survives a deletion.
Every artifact carries a status that tracks where it sits in the pipeline. A new artifact is Untagged. Once AI has written proposed values it is AI Proposed. Once a batch of labels is submitted it is In Review, and once a reviewer accepts it is Tagged. An artifact a reviewer deliberately sets aside is Skipped. Those five states drive the queues reviewers work from and the progress numbers a manager watches.
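The five statuses and the events that lead to them can be sketched as a small lookup. The event names are hypothetical, and the sketch leaves the status unchanged for any event the text does not describe (a rejection, for example), since the source does not say what status follows.

```python
from collections import Counter
from enum import Enum


class ArtifactStatus(Enum):
    UNTAGGED = "Untagged"
    AI_PROPOSED = "AI Proposed"
    IN_REVIEW = "In Review"
    TAGGED = "Tagged"
    SKIPPED = "Skipped"


# Hypothetical event names mapped to the status each one produces.
STATUS_AFTER = {
    "ai_proposal_written": ArtifactStatus.AI_PROPOSED,
    "batch_submitted": ArtifactStatus.IN_REVIEW,
    "batch_accepted": ArtifactStatus.TAGGED,
    "artifact_skipped": ArtifactStatus.SKIPPED,
}


def status_after(events):
    """Replay events in order; a new artifact starts as Untagged."""
    status = ArtifactStatus.UNTAGGED
    for event in events:
        # Events outside the table leave the status as it was.
        status = STATUS_AFTER.get(event, status)
    return status


def progress(statuses):
    """Count artifacts per status, the kind of number a manager watches."""
    return Counter(s.value for s in statuses)
```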
AI extraction runs in the background, separate from the review interface, so a long-running model call never stalls the screen a reviewer is working on. Files above the AI-extraction size limit still enter the queue. They arrive without an AI proposal and go straight to a person. The size limits are layered: files larger than 500 MB are rejected at crawl time, files larger than 100 MB are not downloaded, and files larger than 50 MB skip AI extraction and route to human labeling. Within those bounds Piprio handles the document types a manufacturer accumulates, including PDF reports, CSV and log exports, and CAD and build-output files.
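The layered limits read naturally as a chain of checks from largest to smallest. The sketch below uses the three thresholds from the text. The function name and return labels are hypothetical, and it assumes decimal megabytes, which the source does not specify.

```python
MB = 1_000_000  # assumption: decimal megabytes; the source does not say

CRAWL_REJECT_LIMIT = 500 * MB   # larger files are rejected at crawl time
DOWNLOAD_LIMIT = 100 * MB       # larger files are not downloaded
AI_EXTRACTION_LIMIT = 50 * MB   # larger files skip AI extraction


def route_by_size(size_bytes: int) -> str:
    """Return a hypothetical label for how a file of this size is handled."""
    if size_bytes > CRAWL_REJECT_LIMIT:
        return "rejected_at_crawl"
    if size_bytes > DOWNLOAD_LIMIT:
        return "not_downloaded"
    if size_bytes > AI_EXTRACTION_LIMIT:
        return "human_labeling_only"
    return "ai_extraction"
```

Each comparison is strict, matching the "larger than" wording: a file of exactly 50 MB still goes to AI extraction.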
Some work spans several files at once. A build job might pair a log, a model file, and a parameter file that only make sense together. Piprio groups those into one unit with a designated primary file, and AI extraction can read across every file in the group when it proposes values.
Reviewers and roles
Access is governed by a person's role in the account, not by per-document permissions. There are six roles, from most to least authority:
- Owner has full control, including billing and deleting the account.
- Admin has everything except billing and account deletion.
- Data Steward can manage schemas, connectors, and routing rules across the whole account, and can review and label.
- Senior Reviewer can manage routing rules, work the review surfaces, and reassign other reviewers' work.
- Reviewer can accept, reject, and edit labels and pull unassigned work from the queue.
- Viewer is read-only.
Teams sit alongside roles. A person can lead a team, and the product shows a Data Steward who leads a team under the name "Team Lead" to make the team-scoped controls easier to find. Leading a team is a presentation detail, not extra authority. Team membership never widens or narrows what a role can do, so a Reviewer added to a team is still a Reviewer.
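A role-based check along these lines might look like the sketch below. The capability names are hypothetical and cover only what the role list states, plus an assumed `read` capability for every role. A real permission model will be finer-grained. The point of the sketch is that team leadership changes the displayed name and nothing else.

```python
from enum import Enum


class Role(Enum):
    OWNER = "Owner"
    ADMIN = "Admin"
    DATA_STEWARD = "Data Steward"
    SENIOR_REVIEWER = "Senior Reviewer"
    REVIEWER = "Reviewer"
    VIEWER = "Viewer"


# Hypothetical capability names, limited to what the role list describes.
ALL = frozenset({
    "billing", "delete_account", "manage_schemas", "manage_connectors",
    "manage_routing_rules", "reassign_work", "review_labels",
    "pull_unassigned_work", "read",
})

CAPABILITIES = {
    Role.OWNER: ALL,
    Role.ADMIN: ALL - {"billing", "delete_account"},
    Role.DATA_STEWARD: frozenset({"manage_schemas", "manage_connectors",
                                  "manage_routing_rules", "review_labels", "read"}),
    Role.SENIOR_REVIEWER: frozenset({"manage_routing_rules", "review_labels",
                                     "reassign_work", "read"}),
    Role.REVIEWER: frozenset({"review_labels", "pull_unassigned_work", "read"}),
    Role.VIEWER: frozenset({"read"}),
}


def can(role: Role, capability: str, leads_team: bool = False) -> bool:
    """Check a capability. leads_team is accepted but deliberately ignored:
    team membership and leadership never change what a role can do."""
    return capability in CAPABILITIES[role]


def display_name(role: Role, leads_team: bool = False) -> str:
    """A Data Steward who leads a team is shown as "Team Lead"."""
    if role is Role.DATA_STEWARD and leads_team:
        return "Team Lead"
    return role.value
```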
Every batch of labels tracks who is responsible for it, by person, by team, or both, and records how that assignment was made, whether by a routing rule or by hand. A reviewer who is handed work that is not theirs returns it to the queue rather than completing it, which keeps responsibility unambiguous in the history.
AI proposals and overrides
When AI assistance is on, extraction proposes a value for each field the schema asks for, and the artifact becomes AI Proposed. A proposal is a starting point, not a verdict. A reviewer can keep it, change it, or replace it, and the original AI answer is preserved either way.
Each proposed value carries a confidence rating, and confidence is a category, not a number. Piprio reports it as High, Medium, or Low, which is what a reviewer needs to decide where to spend attention. There is no 0-to-1 probability for a reviewer to second-guess.
When a reviewer changes a proposed value, Piprio records the change as its own event and keeps the AI's original answer alongside the kept one. Because both sides are stored, the product can compare what the model proposed against what the reviewer decided. Piprio tracks how often reviewers override the AI for a given field, and that override rate feeds the model-feedback and calibration views a team uses to decide whether AI assistance is earning its place on a schema.
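Because both the AI's answer and the kept answer are stored, a per-field override rate is a simple ratio. The sketch below assumes a hypothetical `ReviewedValue` record and counts an override whenever the kept value differs from the proposal. How Piprio actually defines and aggregates the rate is not stated in the source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any


@dataclass
class ReviewedValue:
    field: str
    ai_value: Any
    ai_confidence: str  # "High", "Medium", or "Low"; a category, not a number
    kept_value: Any


def override_rates(values):
    """Return {field: share of reviewed values where the reviewer changed the AI answer}."""
    totals = defaultdict(int)
    overrides = defaultdict(int)
    for v in values:
        totals[v.field] += 1
        if v.kept_value != v.ai_value:
            overrides[v.field] += 1
    return {name: overrides[name] / totals[name] for name in totals}


reviewed = [
    ReviewedValue("defect_code", "D-12", "Low", "D-21"),
    ReviewedValue("defect_code", "D-07", "High", "D-07"),
    ReviewedValue("defect_code", "D-03", "Medium", "D-30"),
    ReviewedValue("defect_code", "D-09", "High", "D-09"),
    ReviewedValue("inspection_result", "fail", "High", "fail"),
    ReviewedValue("inspection_result", "pass", "High", "pass"),
]
rates = override_rates(reviewed)
```

In this made-up sample, reviewers changed two of four `defect_code` proposals and none of the `inspection_result` proposals.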
A batch of labels moves through its own lifecycle. A draft is in progress and not yet in anyone's queue. Submitting it puts it up for review. A reviewer either accepts it, at which point its values count as labeled data, or rejects it. A rejected batch is kept for the record, and a fresh human-entered batch is opened on the same artifact so the work can be redone cleanly. When a newer batch replaces an older one on the same artifact, the older one is marked as superseded rather than deleted.
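The batch lifecycle can be sketched as a handful of guarded transitions. The names below (`Batch`, `BatchState`, `submit`, `accept`, `reject`, `supersede`) are hypothetical, and the sketch covers only the transitions this section describes.

```python
from dataclasses import dataclass
from enum import Enum


class BatchState(Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    SUPERSEDED = "superseded"


@dataclass
class Batch:
    artifact_id: str
    source: str  # "ai" or "human"
    state: BatchState = BatchState.DRAFT


def _require(batch: Batch, expected: BatchState) -> None:
    if batch.state is not expected:
        raise ValueError(f"expected {expected.value}, got {batch.state.value}")


def submit(batch: Batch) -> None:
    _require(batch, BatchState.DRAFT)
    batch.state = BatchState.SUBMITTED


def accept(batch: Batch) -> None:
    _require(batch, BatchState.SUBMITTED)
    batch.state = BatchState.ACCEPTED


def reject(batch: Batch) -> Batch:
    """Keep the rejected batch and open a fresh human-entered draft on the same artifact."""
    _require(batch, BatchState.SUBMITTED)
    batch.state = BatchState.REJECTED
    return Batch(batch.artifact_id, source="human")


def supersede(older: Batch) -> None:
    """Mark an older batch as replaced; it is kept, not deleted."""
    older.state = BatchState.SUPERSEDED
```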
Accepting or rejecting a batch is guarded against collisions. If two reviewers open the same work and one acts first, the second is told the batch changed under them and is asked to refresh before acting, so concurrent edits are surfaced before one reviewer overwrites another.
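A common way to get this behavior is optimistic concurrency: each reviewer's action carries the revision they last saw, and the action is refused if the batch has moved on. The source does not say which mechanism Piprio uses, so the revision counter and all names below are assumptions.

```python
from dataclasses import dataclass


class StaleBatchError(Exception):
    """Raised when a reviewer acts on an out-of-date view of a batch."""


@dataclass
class Batch:
    state: str = "submitted"
    revision: int = 0  # hypothetical counter, bumped on every change


def decide(batch: Batch, decision: str, seen_revision: int) -> None:
    """Apply a decision only if the reviewer saw the current revision."""
    if seen_revision != batch.revision:
        raise StaleBatchError("batch changed; refresh before acting")
    batch.state = decision
    batch.revision += 1


batch = Batch()
seen_by_first = batch.revision   # both reviewers open the same work
seen_by_second = batch.revision

decide(batch, "accepted", seen_by_first)  # the first reviewer acts
try:
    decide(batch, "rejected", seen_by_second)  # the second is now stale
    second_was_blocked = False
except StaleBatchError:
    second_was_blocked = True
```

The second reviewer's decision is refused and the first reviewer's acceptance stands.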
The audit log
Provenance is recorded in two append-only layers: one inside each customer account, one across the whole platform.
Inside an account, every change to a batch of labels is written as an immutable event. Each of the following is its own entry, stamped with who did it and when: creating the batch, an AI proposal landing, submitting it for review, an assignment, a value being edited, an acceptance or rejection, a reopen, a schema-version change, a team assignment, and a validation warning. These entries are never updated and never deleted. Append-only is enforced as a hard rule of the system, not a convention. The trail powers the per-artifact history a reviewer sees, the lineage fields on an export, the disagreement analysis behind model feedback, and the inter-rater agreement comparisons a manager runs.
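The shape of an append-only trail can be sketched with frozen records and a log that offers no update or delete operation. This is an in-memory illustration with hypothetical names. It shows the interface only and does not reproduce however Piprio enforces the rule in its database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class LabelEvent:
    batch_id: str
    kind: str    # e.g. "created", "submitted", "value_edited", "accepted"
    actor: str
    at: datetime


class EventLog:
    """Exposes append and read only; there is no update or delete method."""

    def __init__(self):
        self._events = []

    def append(self, batch_id: str, kind: str, actor: str) -> LabelEvent:
        event = LabelEvent(batch_id, kind, actor, datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self, batch_id: str):
        """Return the events for one batch, oldest first, as an immutable tuple."""
        return tuple(e for e in self._events if e.batch_id == batch_id)


log = EventLog()
log.append("batch-1", "created", "ai-extractor")
log.append("batch-1", "submitted", "dana@example.com")
log.append("batch-2", "created", "lee@example.com")
log.append("batch-1", "accepted", "sam@example.com")
```

Calling `log.history("batch-1")` returns the three events for that batch in the order they were written, which is the per-batch view a reviewer's history screen would draw on.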
Across accounts, a separate platform log records privileged actions: role changes, billing changes, changes to AI configuration, and similar account-level events. Each entry stores what changed, the before and after, and who did it, including a snapshot of the actor's email so the responsible person stays identifiable even after their user record changes. A companion log records member-management actions with the actor, the affected person, and the originating address.
The two layers stay consistent with the data they describe because each change and its audit entry are written together. They either both land or both roll back. There is no window in which a value is updated but its history is missing, which is the property an auditor checks for first.
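The both-land-or-both-roll-back property is what a single database transaction provides. The sketch below demonstrates it with Python's built-in `sqlite3` module as a stand-in for Postgres. The table and column names are hypothetical and not Piprio's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE label_values (id INTEGER PRIMARY KEY, value TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE audit_events ("
    "id INTEGER PRIMARY KEY, value_id INTEGER NOT NULL, "
    "old_value TEXT, new_value TEXT NOT NULL, actor TEXT NOT NULL)"
)
conn.execute("INSERT INTO label_values (id, value) VALUES (1, 'pass')")
conn.commit()


def edit_value(conn, value_id, new_value, actor):
    """Write the change and its audit entry in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        old_value = conn.execute(
            "SELECT value FROM label_values WHERE id = ?", (value_id,)
        ).fetchone()[0]
        conn.execute(
            "UPDATE label_values SET value = ? WHERE id = ?", (new_value, value_id)
        )
        conn.execute(
            "INSERT INTO audit_events (value_id, old_value, new_value, actor) "
            "VALUES (?, ?, ?, ?)",
            (value_id, old_value, new_value, actor),
        )


def current_value(conn, value_id):
    return conn.execute(
        "SELECT value FROM label_values WHERE id = ?", (value_id,)
    ).fetchone()[0]


def audit_count(conn):
    return conn.execute("SELECT COUNT(*) FROM audit_events").fetchone()[0]


edit_value(conn, 1, "fail", "dana@example.com")  # both rows land together
```

If the audit insert fails, for example because the actor is missing and the column is `NOT NULL`, the update to `label_values` is rolled back with it, so the value and its history never disagree.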