Security overview

This page describes how we protect customer data in Piprio. It is written for a security reviewer running a procurement assessment. We state what the platform does today, and we state the limits plainly, because a reviewer who finds an overstatement stops trusting the rest of the document. Where a control is the operator's responsibility rather than ours, we say so. Where a process or certification is not yet in place, we mark it rather than imply it.

Architecture

Piprio is a document-labeling platform built as a web application, a set of background workers, a PostgreSQL database, a cache and job queue, and an object store for files and exports. Customers reach the platform over HTTPS through a reverse proxy. The web application serves the interface and the API. The workers run the slower jobs: crawling source systems, running AI extraction, exporting datasets, and sending notifications.

One boundary in that design matters to a security reviewer, so we call it out first. No language-model call ever runs inside the web application. Every AI extraction runs in a background worker, never in the request that serves a page or an API call. The model boundary sits in the worker tier, so a flaw in the request path cannot reach the model or the credentials that drive it. That separation is enforced in the codebase, not left to convention.

For the managed offering we operate this stack. Customers who require their data to stay inside their own network can run the same platform self-hosted, including fully air-gapped, with a local model and a bundled object store. The self-hosted package ships with the installer, backup, restore, and upgrade tooling described in the operations guides.

Tenant isolation

Each customer organization is isolated at the database layer using schema-per-tenant in PostgreSQL. Every organization owns its own schema, and that organization's labeling data, document metadata, labeling specifications, and label history live only inside it. Cross-organization records that have to be shared, such as the user registry and organization memberships, live in a separate common area.

The isolation rests on a single rule applied on every request. When a request arrives, the platform resolves which organization it belongs to and sets the active database search_path to that organization's schema, once, at the start of the request. From that point every query in the request reads and writes only that organization's data. We do not interpolate the schema name freely. A request can only ever point at a schema whose name matches a fixed pattern, so a crafted organization identifier cannot be turned into a path into another schema.
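The pattern check described above can be sketched in a few lines. This is a minimal illustration, not the platform's code: the `org_` plus hex naming convention, the `SCHEMA_PATTERN` constant, and the `search_path_statement` helper are all assumptions made for the example, since the actual pattern is not given here.

```python
import re

# Hypothetical naming convention: "org_" followed by 8 to 32 lowercase hex characters.
# The real pattern used by the platform is not stated in this document.
SCHEMA_PATTERN = re.compile(r"org_[0-9a-f]{8,32}")


def search_path_statement(schema_name: str) -> str:
    """Return a SET statement for a validated tenant schema, or raise ValueError."""
    # fullmatch rejects any input with characters before or after the pattern,
    # including quotes, semicolons, whitespace, and trailing newlines.
    if not SCHEMA_PATTERN.fullmatch(schema_name):
        raise ValueError(f"refusing schema name: {schema_name!r}")
    # Quoting is belt and braces: the pattern already excludes quote characters.
    return f'SET search_path TO "{schema_name}"'
```

Because the name is validated against a closed pattern before it is placed in the statement, an identifier such as `org_0123abcd; DROP SCHEMA x` is rejected outright rather than escaped.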

Holding that guarantee in place meant removing a class of bug we had seen before. If application code commits a transaction in the middle of a request, the database connection is returned to the pool, and a pool reset clears the active search_path back to the shared area. Any later query in that same request would then run against the wrong schema. The fix is a structural rule: service code never commits. Services prepare their work, and the request boundary commits exactly once, after everything is done, so the active schema cannot be cleared mid-request. We enforce that rule with an automated test, tests/test_no_service_commits.py, which fails the build if any service-layer file calls a commit. There is no exemption list. New code lands at zero violations.
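One way such a build-failing check could work is to parse each service file and look for calls to a method named `commit`. The sketch below assumes that approach; the `find_commit_calls` helper is hypothetical and is not taken from tests/test_no_service_commits.py.

```python
import ast


def find_commit_calls(source: str, filename: str = "<service>") -> list[int]:
    """Return the line numbers of every `<something>.commit(...)` call in source."""
    tree = ast.parse(source, filename=filename)
    lines = []
    for node in ast.walk(tree):
        # Match attribute calls such as session.commit() or db.session.commit().
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "commit"
        ):
            lines.append(node.lineno)
    return sorted(lines)
```

A test built on this helper would read every service-layer file and assert that the returned list is empty for each one, with no allow-list of exempt files.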

The shared common area carries a second layer of defense. The tables there that hold organization-scoped records, including memberships, single-sign-on configuration, outbound webhooks, and API keys, have row-level security enabled in PostgreSQL and are filtered by the active organization. This is defense in depth on the records that, by design, do not live inside a per-tenant schema.

Encryption in transit and at rest

In transit, all traffic is encrypted. The reverse proxy terminates TLS and redirects every plain-HTTP request to HTTPS. The proxy is configured to accept only TLS 1.2 and TLS 1.3. The Strict-Transport-Security header is sent only once a deployment is running a trusted certificate, so a browser is never told to enforce HTTPS against a self-signed certificate it cannot validate.

For self-hosted deployments the certificate story is explicit. The installer generates a self-signed certificate for the chosen domain so the platform comes up reachable over an encrypted connection on day one. That certificate is meant to be replaced. Installing a trusted certificate is a manual step: the operator drops the real certificate and key into the certificate directory, restarts the proxy, and then turns on strict transport security. The installer does not request a certificate from a public authority on the operator's behalf. We name this so a reviewer does not assume automatic certificate issuance that is not there.

At rest, encryption is targeted rather than blanket, and we draw the line precisely. Application-layer secrets are encrypted by the platform before they are stored, using AES-256-GCM with a 256-bit key and a fresh random nonce per value. This covers the credentials a customer entrusts to connectors, such as the password or key for a source system, and the key for a customer's own language model when one is configured. These values are never written to the database in plain form, never returned by any screen, and never written to a log or the audit trail. They are decrypted only inside a worker, at the moment a job needs them, and only in that process's memory for the duration of the call.

What that AES-256-GCM layer does not cover, we state directly. It is not full-disk encryption and it is not whole-database encryption. The bulk of the database, the labeling metadata and label history, is not separately encrypted by the application. Encryption of the underlying disk and the database volume is the operator's infrastructure responsibility, handled at the host or storage layer, the same way it would be for any self-managed PostgreSQL deployment. A reviewer evaluating a control that requires encryption of all data at rest should confirm that the hosting environment provides volume-level encryption.

Backups are a related gap worth stating in the same place. The bundled backup tool writes its bundle to disk in the clear, including the configuration archive that holds the signing key, the credential encryption key, the database password, and the object-store keys. Encrypting a backup, and encrypting it before it ever leaves the host, is the operator's responsibility today. The backup-restore operations guide spells out the required handling.

Authentication and access control

Customers sign in with an email and password, or through their own identity provider over single sign-on. Passwords are hashed with bcrypt at a deliberate work factor and are never stored in plain form. The platform enforces a password policy of at least twelve characters with a mix of upper case, lower case, a digit, and a special character. It also checks a new password against known-breached password lists. That check is advisory: a match produces a warning rather than a rejection, and a network failure during the lookup does not block the choice.
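The password rules can be illustrated with a short sketch. It assumes that "special character" means ASCII punctuation, and the `is_breached` lookup passed to `breach_warning` is a hypothetical stand-in for whatever breached-password service a deployment uses.

```python
import string
from typing import Callable


def password_policy_errors(password: str) -> list[str]:
    """Return the list of policy rules the password breaks (empty means it passes)."""
    errors = []
    if len(password) < 12:
        errors.append("must be at least 12 characters")
    if not any(c.isupper() for c in password):
        errors.append("must contain an upper-case letter")
    if not any(c.islower() for c in password):
        errors.append("must contain a lower-case letter")
    if not any(c.isdigit() for c in password):
        errors.append("must contain a digit")
    # Assumption: "special character" is read as ASCII punctuation.
    if not any(c in string.punctuation for c in password):
        errors.append("must contain a special character")
    return errors


def breach_warning(password: str, is_breached: Callable[[str], bool]) -> bool:
    """Return True when a warning should be shown; a failed lookup never blocks."""
    try:
        return is_breached(password)
    except OSError:
        # Network failure: skip the warning instead of rejecting the password.
        return False
```

The policy check is a hard gate and the breach check is not, which is why the two are separate functions with different failure behavior.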

A signed-in session is carried by a JSON Web Token signed with HMAC-SHA256, with an eight-hour lifetime, after which the user re-authenticates. Each token carries a unique identifier and a version counter, and revoked tokens are tracked so a sign-out takes effect immediately rather than waiting for the token to expire.
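The token handling above can be sketched with the standard library alone. The claim names `jti` and `ver`, the in-memory `revoked_ids` set, and the rule that a token's version must equal the user's current version are assumptions for illustration; the document does not state how the version counter is checked.

```python
import base64
import hashlib
import hmac
import json
import time
import uuid
from typing import Optional

TOKEN_LIFETIME_SECONDS = 8 * 60 * 60  # eight-hour lifetime, as stated above


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(secret: bytes, user_id: str, version: int,
                now: Optional[float] = None) -> str:
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user_id, "jti": uuid.uuid4().hex, "ver": version,
              "iat": issued, "exp": issued + TOKEN_LIFETIME_SECONDS}
    signing_input = (_b64url(json.dumps(header).encode()) + "."
                     + _b64url(json.dumps(claims).encode()))
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_token(secret: bytes, token: str, revoked_ids: set[str],
                 current_version: int, now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid, otherwise raise ValueError."""
    now = time.time() if now is None else now
    header_b64, claims_b64, signature_b64 = token.split(".")
    expected = _sign(secret, header_b64 + "." + claims_b64)
    # Constant-time comparison of the expected and presented signatures.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(claims_b64))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    if claims["jti"] in revoked_ids:
        raise ValueError("token revoked")  # sign-out takes effect immediately
    if claims["ver"] != current_version:
        raise ValueError("stale token version")  # assumed use of the counter
    return claims
```

Checking the revocation set on every request is what makes sign-out immediate: without it, a signed token would stay valid until its expiry time regardless of what the user did.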

For programmatic access, an organization can issue API keys. A key is shown once at creation and never again. We store only a hash of the key, never the key itself, so a database leak does not expose live keys.
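A minimal sketch of the store-only-a-hash pattern follows. The hash function is not named in this document, so SHA-256 is an assumption here, as are the `create_api_key` and `key_matches` helper names.

```python
import hashlib
import hmac
import secrets


def create_api_key() -> tuple[str, str]:
    """Return (plaintext_key, stored_hash). Only the hash would be persisted."""
    plaintext = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoded
    stored_hash = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, stored_hash


def key_matches(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it to the stored hash in constant time."""
    presented_hash = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)
```

A fast hash is acceptable in this sketch because the key is long and random, unlike a human-chosen password, so a leaked hash cannot realistically be reversed by guessing.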

Single sign-on uses SAML 2.0. An organization configures its identity provider, and the platform validates the signed assertion before creating a session. SSO can be set to enforced for an organization, in which case its members are redirected to their identity provider rather than signing in with a password. The platform requires that incoming assertions be signed.

Inside an organization, access follows a role model that separates day-to-day labeling work from administrative control. Reviewers and senior reviewers work on labels, while owners and administrators manage connectors, labeling specifications, members, and security settings. The administrative control plane that operators use to manage the platform itself is reachable only from an explicitly configured list of network addresses, and in a production deployment that list defaults to denying all traffic until it is set, so it fails closed. Request rates are capped at the proxy and again in the application, with stricter limits on the sign-in path than on general API traffic.
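The fail-closed behavior of the control-plane allow-list can be sketched as follows. The `admin_access_allowed` function and the shape of its configuration (a list of addresses or CIDR ranges) are assumptions for illustration.

```python
import ipaddress


def admin_access_allowed(client_ip: str, allowed_networks: list[str]) -> bool:
    """Return True only if client_ip falls inside an explicitly configured network."""
    # An empty or unset list denies everything: the check fails closed.
    if not allowed_networks:
        return False
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # An unparseable client address is never allowed.
        return False
    for entry in allowed_networks:
        # A malformed entry raises ValueError so a bad configuration surfaces loudly.
        network = ipaddress.ip_network(entry, strict=False)
        if address.version == network.version and address in network:
            return True
    return False
```

The important property is the first branch: forgetting to configure the list locks operators out rather than opening the control plane to everyone.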

Audit logging

Piprio records who did what, when, and to which record, in three streams. Administrators see a history of platform actions taken inside their organization. Every label carries a decision-by-decision history of how it was reached. User-management actions, such as invitations, role changes, and removals, are recorded as a separate stream. Each administrative and user-management record names the actor, the resource, the time, and the originating network address, and administrative records that changed a value keep both the prior value and the new one.

The decision history distinguishes machine-generated values from human ones, so AI output stays separable from reviewer judgment at every step. The decision-level trail exports to a flat file an auditor can open in a spreadsheet, scoped to a date range.

We are precise about the integrity property, because a reviewer will test it. The label-decision history is append-only as a matter of application discipline, held in place by code review and by the same commit-boundary test that protects tenant isolation. Audit records are written inside the same transaction as the change they describe, so a change and its record either both land or both roll back. There is no path that ships a change without its record.
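The same-transaction property can be demonstrated with a small sketch. It uses an in-memory SQLite database in place of PostgreSQL so that it runs anywhere, and the table layout and the `change_label` function are hypothetical. The `with conn` block stands in for the request boundary that commits exactly once.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE labels (id INTEGER PRIMARY KEY, value TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE audit (id INTEGER PRIMARY KEY, actor TEXT NOT NULL, "
        "label_id INTEGER NOT NULL, old_value TEXT, new_value TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO labels (id, value) VALUES (1, 'draft')")
    conn.commit()
    return conn


def change_label(conn: sqlite3.Connection, actor: str, label_id: int,
                 new_value: str) -> None:
    # `with conn` commits once on success and rolls back if anything raises,
    # so the change and its audit record land together or not at all.
    with conn:
        old_value = conn.execute(
            "SELECT value FROM labels WHERE id = ?", (label_id,)
        ).fetchone()[0]
        conn.execute("UPDATE labels SET value = ? WHERE id = ?", (new_value, label_id))
        conn.execute(
            "INSERT INTO audit (actor, label_id, old_value, new_value) "
            "VALUES (?, ?, ?, ?)",
            (actor, label_id, old_value, new_value),
        )
```

If the audit insert fails, for example because the actor is missing, the label update made just before it is rolled back as well, so no change survives without its record.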

What we do not claim is cryptographic tamper-evidence. There is no hash chain, no signature, and no database-level immutability such as triggers or revoked write privileges. A reviewer evaluating a control that requires cryptographic proof of tamper-evidence should treat that as absent today. Two boundaries on the append-only property are also worth stating plainly, because they are real. First, acknowledging a validation warning updates that one event in place to mark it handled, which is the single case where an existing decision event is modified rather than added to. Second, a right-to-erasure request removes the named person's entries from the user-management history. The administrative-action log handles that same case differently: it keeps the record and preserves a snapshot of the actor's identity, so deleting an account cannot quietly erase the account's history.

Personal data, retention, and deletion

Piprio processes a limited set of personal data. For the people who use the platform, that is account data: name, email, the hashed password, sign-on identity, and the audit records of actions they took. The labeling data itself is metadata about a customer's documents, the label values produced from them, and numerical embeddings derived from them. We hold pointers to source files in the customer's own systems, not the files. Source documents are read briefly at extraction time and discarded after the model call returns, so Piprio does not keep a persistent archive of source content. For a regulated manufacturer this narrows the personal-data footprint considerably: the source bytes that may contain personal data stay in the customer's systems.

The lawful basis for processing is set out in our data-processing agreement, which is being finalized.

Retention follows the data. Most data, including the audit history, has no fixed retention window and is kept for the life of the organization rather than aged out on a schedule. Three exceptions are deliberate. Export bundles are convenience files a customer can regenerate, so they are pruned automatically after a retention window of about a week. Self-hosted backup bundles are pruned after thirty days. Invoices are kept for a statutory multi-year retention period.
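The pruning rule for the two automatically pruned artifact types can be sketched as a simple age check. The `RETENTION` table and the `expired` function are hypothetical, and seven days is an assumed reading of "about a week".

```python
from datetime import datetime, timedelta, timezone

# Assumed windows: seven days for export bundles ("about a week"),
# thirty days for self-hosted backup bundles.
RETENTION = {"export": timedelta(days=7), "backup": timedelta(days=30)}


def expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """Return True when an artifact of this kind is old enough to prune."""
    window = RETENTION.get(kind)
    # Kinds with no configured window, such as audit history, are never pruned.
    if window is None:
        return False
    return now - created_at > window
```

Keeping the default as "never prune" means a new data type is retained until someone deliberately gives it a window.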

Deletion is supported at two levels. A right-to-erasure request for a person who no longer belongs to any organization removes that person's account and the records tied to it, while preserving the attributability of administrative actions through an identity snapshot, as described under Audit logging. Removing an entire organization tears down its data end to end, including dropping the organization's tenant schema, so that organization's labeling data is removed in one operation. Because source files live in the customer's own systems, erasure of the source bytes themselves rests with the customer.

Sub-processors

The set of third parties that may process data depends on how the platform is run, so we describe both modes honestly.

For AI extraction, the managed offering uses a hosted model provider by default. An organization on the Enterprise plan can instead point extraction at a model it controls, including a model that runs entirely inside its own network, in which case no extraction content leaves the customer's boundary for a third-party model. A fully self-hosted deployment can run a local model with no external model provider at all. So the model provider is a sub-processor only when a customer uses the default hosted path.

Our formal sub-processor list and data-processing agreement are being finalized. The current drafts are available to customers on request.

Incident response

We are documenting a formal incident-response process, including severity classification and breach-notification timelines. The platform already provides operational groundwork an incident process builds on: an audit trail of administrative and user-management actions with originating addresses, error tracking when an error-reporting endpoint is configured, and a record of background jobs that fail permanently.

Vulnerability disclosure

We are establishing a published security contact and a coordinated vulnerability-disclosure policy.

Compliance and certifications

Formal attestations and certifications, including SOC 2 and ISO 27001, are in progress. We are establishing an independent penetration-testing cadence.