Connectors

A connector points Piprio at a document source and brings the files in. Five source systems are supported for production use: Amazon S3, SMB and CIFS file shares, Microsoft SharePoint, Siemens Teamcenter, and PTC Windchill. Each one knows how to test its own connection, list the files that match a customer's filters, and download a single file on demand. The platform handles the rest the same way for every source: discovery, deduplication, and the labeling pipeline that follows.

A connector is owned by the organization and configured once. The credentials it needs are encrypted before they are stored and are decrypted only inside a background worker at the moment a crawl runs. They are never decrypted in the web application and never written to a log. The encryption is AES-256-GCM, with a unique nonce generated for each stored record. This handling matters for a security review: neither a database snapshot nor a log export exposes a source-system password in plaintext.

What a customer sets up depends on the source. Every connector accepts the same crawl controls regardless of system: include and exclude path patterns, a file-type filter, a maximum crawl depth, and one or more root paths to scope the search. The sections below describe what each connector reaches, how it proves who it is, and what a customer needs to supply.
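The shared crawl controls can be pictured as one predicate applied to every candidate file. The following is a minimal sketch, not Piprio's implementation: the CrawlFilters class, the is_selected function, the glob-style pattern syntax, and the convention that a file directly under the root has depth 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List, Optional, Set


@dataclass
class CrawlFilters:
    """Hypothetical container for the crawl controls every connector accepts."""
    include: List[str] = field(default_factory=list)    # glob patterns; empty means include everything
    exclude: List[str] = field(default_factory=list)    # glob patterns; a match always rejects the file
    file_types: Set[str] = field(default_factory=set)   # lowercase extensions without the dot; empty means any
    max_depth: Optional[int] = None                      # None means no depth limit


def is_selected(path: str, root: str, filters: CrawlFilters) -> bool:
    """Return True if a file under the given root path passes every filter."""
    try:
        rel = PurePosixPath(path).relative_to(root)
    except ValueError:
        return False  # the file is outside the configured root path
    depth = len(rel.parts) - 1  # assumption: a file directly under the root has depth 0
    if filters.max_depth is not None and depth > filters.max_depth:
        return False
    extension = rel.suffix.lower().lstrip(".")
    if filters.file_types and extension not in filters.file_types:
        return False
    rel_text = str(rel)
    if any(fnmatchcase(rel_text, pattern) for pattern in filters.exclude):
        return False
    if filters.include and not any(fnmatchcase(rel_text, pattern) for pattern in filters.include):
        return False
    return True
```

In this sketch an exclude match wins over an include match, which is a common convention but is not stated by the source.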

S3

The S3 connector reads files from an Amazon S3 bucket. A customer supplies the bucket name, an optional prefix to scope the crawl to one path inside the bucket, the AWS region, and a pair of static credentials:

region
bucket
prefix
access_key_id
secret_access_key

Authentication is an access key and secret key pair. The connector does not assume an IAM role, use an instance profile, or exchange short-lived session tokens, so the customer provisions a dedicated access key for Piprio with read access to the target bucket and nothing more. The connection test lists a single object to confirm the key works and the bucket is reachable. Files are enumerated through the standard bucket listing, and each file's identity is tracked by its entity tag (ETag), so a file that has not changed between crawls is recognized and skipped.

SMB

The SMB connector reads files from Windows and CIFS network shares. It walks the share's directory tree recursively, applying the customer's path patterns and depth limit as it goes. A customer supplies the host, the share name, an account, a root path to scope the crawl, and a port if the share does not use the default:

host
share
username
password
port
root_path

Authentication is a username and password presented to the file server over the SMB protocol. Kerberos and other single-sign-on schemes are not used, so the customer creates a service account with read access to the share. When the customer does not set an explicit file-type filter, the connector ingests a default set of common formats: logs, comma-separated values, JSON, plain text, PDF, and Excel workbooks. Each discovered file is hashed so that unchanged files are skipped on later crawls.
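Hash-based change detection of the kind described above can be sketched as follows. The source does not name the hash algorithm, so SHA-256 is an assumption here, and content_fingerprint and is_unchanged are hypothetical names. The file is read in chunks so that a large file is never held in memory whole.

```python
import hashlib
import io
from typing import BinaryIO, Optional

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time


def content_fingerprint(stream: BinaryIO) -> str:
    """Return a hex digest of the stream's content (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(stream: BinaryIO, previous_fingerprint: Optional[str]) -> bool:
    """True when the content hashes to the fingerprint recorded by an earlier crawl."""
    return previous_fingerprint is not None and content_fingerprint(stream) == previous_fingerprint


# Usage: an in-memory stream stands in for a file opened on the share.
first_crawl = content_fingerprint(io.BytesIO(b"pump-7 maintenance log"))
unchanged = is_unchanged(io.BytesIO(b"pump-7 maintenance log"), first_crawl)
```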

SharePoint

The SharePoint connector reads files from a SharePoint Online document library through the Microsoft Graph API. It resolves the configured site to its Graph identity, finds the named document library, and enumerates every file in it, walking into subfolders as it goes. A customer supplies the directory tenant, an application registration, the site address, and the library name:

tenant_id
client_id
client_secret
site_url
library_name

Authentication is app-only, using the OAuth client-credentials flow against the customer's Microsoft Entra directory. Piprio acquires a token for the registered application itself rather than acting on behalf of a signed-in person, so a customer grants the application read access to the target sites during onboarding and is not asked to keep a user account active for sync. When the library name is left blank, the connector defaults to the standard Documents library. File identity comes from the content hash that SharePoint reports for each file.
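For readers unfamiliar with the client-credentials flow, the token request looks roughly like the sketch below. It uses the Microsoft identity platform's v2.0 token endpoint and the Graph .default scope. How Piprio actually issues the request is not stated in the source, and build_token_request is a hypothetical helper. The sketch only builds the request and does not send it.

```python
import urllib.parse
import urllib.request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) an app-only token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",               # app-only: no signed-in user is involved
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://graph.microsoft.com/.default",  # the permissions granted to the app registration
    }).encode("ascii")
    return urllib.request.Request(
        url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Sending the request with urllib.request.urlopen would return a JSON body whose access_token field is then presented as a bearer token on Graph calls.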

Teamcenter

The Teamcenter connector reads datasets from Siemens Teamcenter through the Active Workspace REST interface. Teamcenter version 14.0 or later is required, because earlier releases do not expose this interface. A pre-14 site needs a SOAP-based integration that this connector does not provide. A customer supplies the server address, an account, the file-management address used for downloads, the dataset types to ingest, and either a folder to crawl or a saved search to run:

base_url
username
password
fms_url
dataset_types
folder_uid
search_query

Authentication uses the Teamcenter session service. The connector signs in with a username and password, receives a session, and reuses it across requests. If the session expires mid-crawl it signs in again automatically and continues. Discovery walks a folder tree or runs a saved search, then collects the datasets attached to each item, keeping only the dataset types the customer asked for. Downloads go through Teamcenter's File Management System: the connector requests a read ticket for a dataset and fetches the content with that ticket. Common dataset types are mapped to familiar file extensions, including Office documents, PDFs, images, and CAD formats such as JT, STEP, and CATIA parts. A dataset's identity is tracked by its identifier together with its last-modified date, so a revised dataset is recognized as updated on the next crawl.
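The sign-in-again-on-expiry behavior can be sketched as a thin wrapper around whatever function actually talks to the server. Everything named here is hypothetical: SessionExpired, ReauthenticatingSession, and the login and send callables are stand-ins for illustration, not Teamcenter or Piprio APIs.

```python
from typing import Callable, Optional


class SessionExpired(Exception):
    """Hypothetical signal that the server rejected the current session."""


class ReauthenticatingSession:
    """Reuses one session across requests and signs in again if it expires."""

    def __init__(self, login: Callable[[], str], send: Callable[[str, str], dict]):
        self._login = login   # performs the username/password sign-in, returns a session token
        self._send = send     # send(session_token, path) -> response; raises SessionExpired
        self._token: Optional[str] = None

    def request(self, path: str) -> dict:
        if self._token is None:
            self._token = self._login()  # first request: sign in lazily
        try:
            return self._send(self._token, path)
        except SessionExpired:
            # The session timed out mid-crawl: sign in once more and retry the same request.
            self._token = self._login()
            return self._send(self._token, path)
```

The retry happens only once per request in this sketch, so a server that rejects a freshly issued session surfaces the error instead of looping.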

The connector retries automatically when the server reports it is busy or rate-limited, backing off between attempts. Either a folder or a saved search must be configured. With neither set, a crawl finds nothing and reports an empty result rather than failing.
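A retry loop with backoff of the kind described above might look like the following sketch. The source does not give the number of attempts or the delay schedule, so the exponential doubling, the defaults, and the ServerBusy exception are assumptions chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerBusy(Exception):
    """Hypothetical error for a busy or rate-limited response from the server."""


def with_backoff(
    call: Callable[[], T],
    attempts: int = 5,                             # assumed default, not from the source
    base_delay: float = 1.0,                       # assumed default, in seconds
    sleep: Callable[[float], None] = time.sleep,   # injectable so tests need not wait
) -> T:
    """Run call(), retrying on ServerBusy with delays of base_delay * 1, 2, 4, ..."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ServerBusy:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```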

Windchill

The Windchill connector reads documents and parts from PTC Windchill through its OData REST interface. Windchill version 12.0 or later is required, because earlier releases expose only SOAP or proprietary endpoints that this connector does not support. A customer supplies the server address, an account, the object types to query, and an optional folder path and workspace to scope the crawl:

base_url
username
password
workspace
object_types
folder_path

Authentication is HTTP Basic, sent on every request, so the customer provides a Windchill account with read access to the target containers. The connector queries one collection per requested object type, paging through results in batches of one hundred. It can read engineering documents, business documents, parts, CAD documents, and drawings, mapping each to a sensible default file extension when the object name does not already carry one. Primary content is downloaded directly from the object. A document's identity is tracked by its object identifier together with its modification timestamp, so a new iteration is recognized as an update. Like the Teamcenter connector, it backs off and retries when the server is busy.
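Two mechanics from the paragraph above, the Basic header and paging in batches of one hundred, can be sketched as follows. The fetch_page callable is a hypothetical stand-in for the HTTP call. Expressing the page as a skip and top pair follows the standard OData $skip and $top query options, but whether the connector pages that way or follows server-provided next links is not stated in the source.

```python
import base64
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 100  # batch size stated in the text above


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build the Authorization header that HTTP Basic sends on every request."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def iter_collection(
    fetch_page: Callable[[int, int], List[dict]],  # fetch_page(skip, top) -> one batch of objects
    page_size: int = PAGE_SIZE,
) -> Iterator[dict]:
    """Yield every object in one collection, requesting page_size objects at a time."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short (or empty) page means the collection is exhausted
        skip += page_size
```

A crawl would call iter_collection once per requested object type, since each type is queried as its own collection.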

A local-upload option also exists, but it reads from the application's own filesystem and is intended for development and demonstration rather than production sources.

Pre-flight validation

Validation happens at two points, so a broken configuration is caught early rather than at the end of a long crawl.

At save time, a customer can test a connector before committing it. Two checks are available: one that tests a configuration that has not been saved yet, and one that tests a connector that already exists. Both run the same live connection test the source system would see during a real crawl, and both report success or the specific reason for failure, along with how long the round trip took. An Administrator, Owner, or Data Steward who is setting up a source learns immediately whether the address is reachable and the credentials are accepted, instead of saving a connector that fails the first time it runs.
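The shape of such a test report (success, the reason for failure, and the round-trip time) can be sketched as below. ConnectionTestReport and run_connection_test are hypothetical names, and reporting the time in milliseconds is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ConnectionTestReport:
    success: bool
    error: Optional[str]   # the specific reason for failure, or None on success
    elapsed_ms: float      # how long the round trip took


def run_connection_test(
    test: Callable[[], None],                       # the connector's own live connection test
    clock: Callable[[], float] = time.perf_counter,
) -> ConnectionTestReport:
    """Run a connection test and time it, turning any exception into a failure reason."""
    started = clock()
    try:
        test()
    except Exception as exc:
        return ConnectionTestReport(False, str(exc), (clock() - started) * 1000)
    return ConnectionTestReport(True, None, (clock() - started) * 1000)
```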

Before every crawl, the worker runs its own pre-flight pass. It checks whether the stored credentials have a recorded expiry date that has already passed, then runs the live connection test against the source. If either check fails, the crawl does not start: the job is marked failed, an alert is recorded against the connector, and the Owners and Administrators of the organization are notified that the run was skipped and why. A problem that only warrants a warning is logged and does not stop the crawl. A third gate prevents two crawls of the same connector from running at once, so a manual crawl cannot collide with a scheduled one.
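The three gates can be sketched as a single function that returns the first blocking reason it finds. This is an illustration under assumptions: PreflightResult, preflight, and the in-memory set of running connector ids are hypothetical, and a real worker would need a lock that is shared across processes rather than a local set.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional, Set, Tuple


@dataclass
class PreflightResult:
    ok: bool
    reason: Optional[str] = None  # why the crawl was skipped, for the alert and notification


def preflight(
    connector_id: str,
    credentials_expire_at: Optional[datetime],              # None when no expiry date is recorded
    test_connection: Callable[[], Tuple[bool, Optional[str]]],
    running: Set[str],                                      # ids of connectors with a crawl in progress
    now: Optional[datetime] = None,
) -> PreflightResult:
    now = now or datetime.now(timezone.utc)
    # Gate 1: a recorded expiry date that has already passed blocks the crawl.
    if credentials_expire_at is not None and credentials_expire_at <= now:
        return PreflightResult(False, "stored credentials have expired")
    # Gate 2: the same live connection test a real crawl would depend on.
    ok, error = test_connection()
    if not ok:
        return PreflightResult(False, f"connection test failed: {error}")
    # Gate 3: never run two crawls of one connector at the same time.
    if connector_id in running:
        return PreflightResult(False, "a crawl is already running for this connector")
    return PreflightResult(True)
```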

Sync behavior

A crawl discovers files, deduplicates them, and records what it found. It does not run AI labeling. That step is triggered separately after the files are in.

Two crawl modes are available. An incremental crawl asks the source only for files modified since the last successful crawl of that connector, which keeps routine syncs fast on large sources. A full rescan ignores the last-crawl timestamp and walks the entire source again, which a customer uses after changing filters or to recover from a source-side change that did not update modification times. Every file the source returns is checked against the connector's include and exclude patterns, file-type filter, and depth limit before it is considered.
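The difference between the two modes comes down to whether the last-crawl timestamp is applied. In the sketch below the comparison runs on the client side purely for illustration, whereas the text above says the source itself is asked only for modified files. SourceFile and select_for_crawl are hypothetical names.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class SourceFile(NamedTuple):
    path: str
    modified_at: datetime


def select_for_crawl(
    files: Iterable[SourceFile],
    mode: str,                                   # "incremental" or "full"
    last_successful_crawl: Optional[datetime],   # None if the connector has never crawled successfully
) -> List[SourceFile]:
    """Return the files a crawl in the given mode would consider."""
    if mode not in ("incremental", "full"):
        raise ValueError(f"unknown crawl mode: {mode}")
    if mode == "full" or last_successful_crawl is None:
        return list(files)  # a full rescan ignores the last-crawl timestamp
    return [f for f in files if f.modified_at > last_successful_crawl]
```

Treating a first incremental crawl with no previous success as a full walk is an assumption of the sketch.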

Within a crawl, each file is identified by a fingerprint drawn from the source. A file already on record with the same fingerprint is skipped, a changed file is recorded as an update, and a file not seen before is recorded as new. Each completed crawl reports counts of discovered, new, updated, and skipped files. Files larger than the size limit are skipped and counted separately rather than failing the crawl. That limit is 100 MB by default and can be raised or lowered.
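The classification and the per-crawl counts can be sketched as one pass over the discovered files. The names CrawlCounts and classify are hypothetical. Reading 100 MB as 100 * 1024 * 1024 bytes, and counting an oversized file as discovered, are assumptions the source does not settle.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

MAX_FILE_BYTES = 100 * 1024 * 1024  # default size limit; the binary interpretation is an assumption


@dataclass
class CrawlCounts:
    discovered: int = 0
    new: int = 0
    updated: int = 0
    skipped: int = 0
    oversized: int = 0  # counted separately and never fails the crawl


def classify(
    files: Iterable[Tuple[str, str, int]],  # (file id, fingerprint from the source, size in bytes)
    known: Dict[str, str],                  # fingerprints on record from earlier crawls, updated in place
    max_bytes: int = MAX_FILE_BYTES,
) -> CrawlCounts:
    counts = CrawlCounts()
    for file_id, fingerprint, size in files:
        counts.discovered += 1
        if size > max_bytes:
            counts.oversized += 1
            continue
        previous = known.get(file_id)
        if previous == fingerprint:
            counts.skipped += 1        # same fingerprint: nothing to do
            continue
        if previous is None:
            counts.new += 1            # never seen before
        else:
            counts.updated += 1        # seen before, but the fingerprint changed
        known[file_id] = fingerprint
    return counts
```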

Crawls run on demand or on a schedule. A customer can trigger a crawl manually, preview what a crawl would find before committing to it, or set a recurring schedule using a standard cron expression. A periodic dispatcher checks active connectors each minute and starts the ones whose schedule is due. A connector can also be set to start AI labeling automatically as soon as a crawl brings in new files, so a scheduled source moves from discovery to extraction without anyone in the loop. That behavior is off by default and turned on per connector.
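A once-a-minute dispatcher only has to answer one question per connector: does the cron expression match the current minute? The sketch below is a deliberately simplified matcher, not the scheduler Piprio uses. It supports numbers, ranges, lists, * and step values in five-field expressions, requires all five fields to match (standard cron treats day-of-month and day-of-week specially), and does not accept names or 7 for Sunday. is_due and due_connectors are hypothetical names.

```python
from datetime import datetime
from typing import Dict, List


def _field_matches(expr: str, value: int, lo: int, hi: int) -> bool:
    """Match one cron field such as '*', '5', '1-5', '*/15' or '0,30'."""
    for part in expr.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def is_due(cron_expr: str, now: datetime) -> bool:
    """True when a five-field cron expression matches the minute containing now."""
    minute, hour, day_of_month, month, day_of_week = cron_expr.split()
    cron_weekday = now.isoweekday() % 7  # cron counts Sunday as 0 and Monday as 1
    return (
        _field_matches(minute, now.minute, 0, 59)
        and _field_matches(hour, now.hour, 0, 23)
        and _field_matches(day_of_month, now.day, 1, 31)
        and _field_matches(month, now.month, 1, 12)
        and _field_matches(day_of_week, cron_weekday, 0, 6)
    )


def due_connectors(schedules: Dict[str, str], now: datetime) -> List[str]:
    """Return the ids of connectors whose schedule is due at this tick."""
    return [connector_id for connector_id, expr in schedules.items() if is_due(expr, now)]
```

A production dispatcher would more likely rely on an established cron library, since the full syntax has more corner cases than this sketch covers.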