Exports

A labeling program is only worth as much as the data a team can get back out of it. Piprio turns accepted labels into delivered datasets in the formats analytics and ML teams already use, on a schedule, with the provenance a model team needs to trust the data.

An export pulls the labels a team has accepted on its documents and writes them out as a single dataset file. The work runs in the background, so building a large export never ties up the review interface. Exports can be triggered on demand or run automatically on a schedule, and the same set of filters governs both.

Output formats

Four formats cover the range from spreadsheet review to a training pipeline.

  • CSV (flat) is one row per labeled document, with the column delimiter and file encoding both adjustable. Good for a quick load into a spreadsheet or a relational table, and the format most reviewers can open without any tooling.
  • JSON Lines writes one record per line, keeping each document's field values and confidence scores as nested objects. Good for streaming ingestion and for pipelines that prefer structured records over flat columns.
  • Parquet is the columnar format for analytics at scale, with a choice of Snappy, gzip, Zstandard, or no compression. Good for loading into a data warehouse or a query engine over large label volumes.
  • Hugging Face Dataset packages the same records as a Parquet file inside a ZIP, alongside a dataset description file and a generated README listing the fields. It loads directly into a Hugging Face training pipeline without a conversion step.

All four formats stream to disk as records are produced rather than holding the full dataset in memory, so export size is bounded by storage, not by available memory.
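
The streaming behavior can be pictured with a minimal Python sketch of a JSON Lines writer. It is an illustration under stated assumptions, not Piprio's implementation: the function names and the record keys (document_id, values, confidences) are hypothetical. The point is that records are consumed one at a time from an iterator, so memory use stays flat however many records pass through.

```python
import json
from typing import Iterable, Iterator, TextIO


def write_jsonl(records: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line, consuming records lazily.

    Only the current record is held in memory at any moment.
    """
    count = 0
    for record in records:
        out.write(json.dumps(record, ensure_ascii=False))
        out.write("\n")
        count += 1
    return count


def sample_records() -> Iterator[dict]:
    # Hypothetical record shape: field values and confidence scores
    # kept as nested objects, as the JSON Lines format describes.
    yield {
        "document_id": "doc-001",
        "values": {"invoice_total": "118.40"},
        "confidences": {"invoice_total": 0.97},
    }
    yield {
        "document_id": "doc-002",
        "values": {"invoice_total": "72.00"},
        "confidences": {"invoice_total": 0.81},
    }
```

Passing an open file handle as out writes the records to disk as they are produced; the return value is the record count.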

Destinations

Where an export lands depends on how it was started.

An on-demand export is uploaded to managed object storage and handed back as a download link valid for 48 hours. The link can be refreshed for another 48 hours at any point while the file is still retained, so a team that misses the first window does not have to rebuild the dataset. Files age out after a retention window set for the deployment. When a file ages out, the export record stays in the history with its record count and size intact, and the link simply goes inert. Re-running the export produces a fresh file.

A scheduled export is delivered straight to a destination the customer controls. Two destination types are supported: an Amazon S3 bucket or an SFTP server. The customer supplies the bucket name and prefix, or the host and remote path, along with the access credentials. Those credentials are encrypted before they are stored and are decrypted only at delivery time, never written or logged in the clear.

No other destinations exist. An export is either a download link or a push to the customer's own S3 or SFTP target.

Scheduling

A scheduled export runs the same query on a recurring cadence and delivers the result without anyone opening the application. Each schedule has a name, a destination, a format, and a cadence expressed as a standard cron expression. Cron gives the full range of common cadences, from hourly to nightly to a specific day of the month, and the expression is validated when the schedule is saved so an invalid cadence is rejected up front rather than failing silently at run time.
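
As an illustration of what up-front validation can look like, here is a minimal Python sketch of a checker for five-field cron expressions. It is an assumption-laden example, not Piprio's validator: the function names are hypothetical, and it accepts only numbers, *, ranges, lists, and steps. Month and weekday names such as "mon" are rejected by this sketch even though many cron implementations accept them.

```python
# Allowed numeric range per field: minute, hour, day of month, month,
# day of week (both 0 and 7 are commonly read as Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def _valid_part(part: str, lo: int, hi: int) -> bool:
    """Check one comma-separated item such as '5', '1-5', '*' or '*/15'."""
    base, slash, step = part.partition("/")
    if slash and (not step.isdecimal() or int(step) == 0):
        return False
    if base == "*":
        return True
    start, dash, end = base.partition("-")
    if not start.isdecimal():
        return False
    if dash:
        if not end.isdecimal():
            return False
        return lo <= int(start) <= int(end) <= hi
    return lo <= int(start) <= hi


def is_valid_cron(expr: str) -> bool:
    """Return True if expr is a well-formed five-field cron expression."""
    fields = expr.split()
    if len(fields) != len(FIELD_RANGES):
        return False
    return all(
        all(_valid_part(part, lo, hi) for part in field.split(","))
        for field, (lo, hi) in zip(fields, FIELD_RANGES)
    )
```

Under this sketch, "0 * * * *" (hourly), "0 2 * * *" (nightly at 02:00), and "30 6 1 * *" (06:30 on the first of the month) all pass, while "61 * * * *" and the four-field "0 2 * *" are rejected.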

A schedule can be scoped to a single labeling spec so it only delivers documents of one type, or left unscoped to cover everything accepted. Each schedule can be paused and resumed without deleting it. After every run, Piprio records when it last ran, whether the run succeeded, how many documents it delivered, and any error from a failed run, so an administrator can confirm deliveries are landing without checking the destination by hand.

Most scheduled exports run with the incremental option turned on (the Not yet exported filter described under Filters), so each run delivers only what is new since the last one.

Filters

The same filters apply to on-demand and scheduled exports, and they decide which labeled documents land in the file:

  • Labeling status, so an export can be limited to accepted work or widened to other states.
  • Labeling spec and version, to export only documents labeled against a particular specification.
  • Label origin, to separate AI-proposed values from human-entered ones, or to include both.
  • Document type, to restrict an export to one file type.
  • Source connector, to export only documents that came from a particular source.
  • Ingest date range, to bound the export to documents brought in within a window.
  • Not yet exported, which excludes any document that has already gone out in a previous export.

The Not yet exported filter is what makes a recurring cadence safe. With it on, each scheduled run delivers only the documents accepted since the previous run, so a downstream system does not receive the same records twice. Piprio tracks every document that has been exported and uses that history to compute the difference on each run. The same history surfaces in two places: the labeling audit view, where each document is marked as previously exported or not, and the quality metrics, which report how many accepted documents have never been exported and when the most recent export ran.
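
The difference computation can be sketched in a few lines of Python. This is a simplified model, assuming the export history is just a set of document IDs. The function name and the IDs are hypothetical, and the real history is kept by Piprio rather than in memory.

```python
def select_not_yet_exported(accepted_ids, exported_ids):
    """Return accepted document IDs that have never been exported, in order."""
    return [doc_id for doc_id in accepted_ids if doc_id not in exported_ids]


# Hypothetical export history shared across scheduled runs.
exported = set()

# First run: two documents have been accepted, none exported yet.
run1 = select_not_yet_exported(["doc-1", "doc-2"], exported)
exported.update(run1)

# Second run: one more document has been accepted since the first run.
run2 = select_not_yet_exported(["doc-1", "doc-2", "doc-3"], exported)
exported.update(run2)

print(run1)  # ['doc-1', 'doc-2']
print(run2)  # ['doc-3']
```

The second run delivers only doc-3, which is the property that keeps a downstream system from receiving the same record twice.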

When a customer's downstream systems need to react the moment a dataset is ready, an export-completion event is available as a webhook, carrying the format, record count, and file size so a pipeline can pick up the file without polling.
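
On the receiving side, a pipeline might parse the event along the following lines. The sketch is hypothetical throughout: the payload key names (format, record_count, file_size_bytes) and the class name are assumptions made for illustration, since the actual payload schema is not given here. Only the three pieces of information come from the description above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportCompleted:
    """The three facts the completion event is described as carrying."""
    format: str
    record_count: int
    file_size_bytes: int


def parse_export_completed(body: bytes) -> ExportCompleted:
    """Parse a webhook request body into a typed event.

    The key names are assumed; a missing key raises KeyError so that a
    malformed event fails loudly instead of being half-processed.
    """
    payload = json.loads(body)
    return ExportCompleted(
        format=str(payload["format"]),
        record_count=int(payload["record_count"]),
        file_size_bytes=int(payload["file_size_bytes"]),
    )


# Example body, invented for illustration.
example = b'{"format": "parquet", "record_count": 1200, "file_size_bytes": 348160}'
event = parse_export_completed(example)
```

A handler built on this could skip empty exports by checking event.record_count before starting any downstream work.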

Retry behavior

Exports are built to be safe to re-run, not to resume a half-finished file. If a build is interrupted, the job is marked failed and records the error, and the partial file is never delivered. Re-running starts the build again from the beginning.

The build itself is idempotent. The dataset is written to a temporary location first and only moved into place once it is complete, so a download link never points at a partly written file. If a retry finds the finished file already in storage, it skips the upload rather than rewriting it. When the system is under heavy load for one customer, an export waits and retries instead of failing, so a busy period delays delivery rather than dropping it.
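
The write-then-move pattern and the skip-if-present check can be sketched in Python as follows. This is a simplified model under stated assumptions: a local directory stands in for object storage, and the function name and the .partial suffix are hypothetical. It shows the general technique, not Piprio's code.

```python
import os
import tempfile
from pathlib import Path


def build_export(final_path: Path, lines) -> bool:
    """Write lines to a temporary file, then move it into place.

    Returns False when a finished file already exists, so a retry
    skips the work instead of rewriting the file.
    """
    if final_path.exists():
        return False
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file lives in the same directory so the final move
    # is a rename on one filesystem rather than a copy.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".partial")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # Readers see either no file or the complete file, never a partial one.
        os.replace(tmp_name, final_path)
    except BaseException:
        # An interrupted build leaves nothing behind at the final path.
        os.unlink(tmp_name)
        raise
    return True
```

Calling build_export a second time with the same path returns False and leaves the file untouched. If the source of lines raises midway, the partial file is removed and the final path never appears.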

What an export contains

An export holds one record per accepted label set on a document. Each record carries two layers: the provenance that lets a data team trust the data, and the labeled values themselves.

The provenance on every record includes the document's identity and original source path, its file type, when it was brought into Piprio, which version of the labeling spec it was labeled against, whether the values came from AI or a human, when the labels were accepted, and which reviewer accepted them. This is enough for a model team to filter by source, reconstruct the labeling spec a record was produced under, and separate AI-proposed data from human-reviewed data, all without leaving the dataset file.

Alongside the provenance, each record carries the labeled field values for that document and a confidence score for each value. In the flat CSV format, every field becomes a pair of columns: the value and its confidence. In JSON Lines and the Parquet-backed formats, the values and confidences are kept as structured objects. A team training or evaluating a model can weight or threshold on confidence directly from the export, with no separate lookup back into the platform.
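
The relationship between the structured and flat layouts, and a confidence threshold applied directly to a record, can be sketched in Python. The record keys (document_id, values, confidences) and the _confidence column suffix are assumptions made for illustration. The actual column names in a Piprio export may differ.

```python
def flatten(record: dict) -> dict:
    """Turn a structured record into a flat row of value/confidence column pairs."""
    row = {"document_id": record["document_id"]}
    for field, value in record["values"].items():
        row[field] = value
        row[f"{field}_confidence"] = record["confidences"].get(field)
    return row


def confident_values(record: dict, threshold: float) -> dict:
    """Keep only the field values whose confidence meets the threshold."""
    return {
        field: value
        for field, value in record["values"].items()
        if record["confidences"].get(field, 0.0) >= threshold
    }


# Hypothetical structured record, as it might appear in a JSON Lines export.
record = {
    "document_id": "doc-001",
    "values": {"vendor": "Acme", "invoice_total": "118.40"},
    "confidences": {"vendor": 0.99, "invoice_total": 0.62},
}

row = flatten(record)
kept = confident_values(record, threshold=0.9)
```

Here row holds the columns document_id, vendor, vendor_confidence, invoice_total, and invoice_total_confidence, while kept retains only the vendor field because invoice_total falls below the 0.9 threshold.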