Transfer logic
Understanding the Prequel data transfer logic
How transfers work
On an ongoing basis, Prequel performs transfers by querying the source for a given recipient's data and loading that data into the recipient's destination. The first transfer that runs for a given destination will automatically load all historical data (the "backfill"), and subsequent transfers will attempt to transfer only the data that has changed or been added since the previous transfer.
Transfer Lifecycle
Transfers are managed by an internal queue, which is used to dispatch transfers to workers. When a destination has the `enabled` flag set to `true`, Prequel will automatically enqueue transfers for that destination based on the `frequency` value of the destination or the organization's default frequency.
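As a rough sketch, the scheduling rule can be modeled like this. The `enabled` and `frequency` fields come from this page; the `Destination` type, the default value, and everything else are illustrative, not Prequel's actual implementation:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass
class Destination:
    name: str
    enabled: bool
    frequency: Optional[timedelta] = None  # None -> fall back to the org default

ORG_DEFAULT_FREQUENCY = timedelta(minutes=15)  # illustrative default, not a documented value

def effective_frequency(dest: Destination) -> Optional[timedelta]:
    """Cadence at which transfers are enqueued; None means never auto-enqueue."""
    if not dest.enabled:
        return None
    return dest.frequency or ORG_DEFAULT_FREQUENCY

# A destination without an explicit frequency falls back to the org default.
print(effective_frequency(Destination("acme-warehouse", enabled=True)))
```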
A transfer resource always has a status corresponding to its current phase of the lifecycle:
- `PENDING`: Transfers start as pending by default when they are created (enqueued). The `submitted_at` timestamp on the transfers resource corresponds to when the transfer was enqueued.
- `RUNNING`: A transfer is running after it has been dispatched to a worker. The `started_at` timestamp on the transfers resource corresponds to when the transfer changed to `RUNNING`.
- `ERROR`: A transfer is marked with an error if there is an issue dispatching the transfer, if the worker fails to connect to the source or destination, or if all the models fail to transfer. The `ended_at` timestamp on the transfers resource corresponds to when the transfer changed to `ERROR`.
- `PARTIAL_FAILURE`: A transfer is considered a partial failure if it reaches the running state but only some models succeed while others fail. The `ended_at` timestamp on the transfers resource corresponds to when the transfer changed to `PARTIAL_FAILURE`.
- `SUCCESS`: A transfer is successful if it was running and all models transfer without issues. The `ended_at` timestamp on the transfers resource corresponds to when the transfer changed to `SUCCESS`.
- `CANCELLED`: A transfer is cancelled if a user terminates it before it starts running.
- `KILLED`: A transfer is killed if a user terminates it while it is running.
- `EXPIRED`: A transfer becomes expired if it is blocked from being dispatched and remains pending for longer than 6 hours.
- `ORPHANED`: A transfer becomes orphaned if the worker dies ungracefully or stops communicating with the control plane.
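For reference, the statuses and their associated timestamp fields can be summarized in a small sketch. The status names and timestamp fields come from the list above; the summary mapping itself is an inference, not an official API contract:

```python
from enum import Enum

class TransferStatus(Enum):
    PENDING = "PENDING"                   # enqueued; stamps submitted_at
    RUNNING = "RUNNING"                   # dispatched to a worker; stamps started_at
    ERROR = "ERROR"                       # dispatch/connection failure or all models failed
    PARTIAL_FAILURE = "PARTIAL_FAILURE"   # some models succeeded, some failed
    SUCCESS = "SUCCESS"                   # all models transferred without issues
    CANCELLED = "CANCELLED"               # terminated by a user before running
    KILLED = "KILLED"                     # terminated by a user while running
    EXPIRED = "EXPIRED"                   # blocked and pending for longer than 6 hours
    ORPHANED = "ORPHANED"                 # worker died or stopped reporting in

# Timestamp on the transfers resource that records entry into each phase;
# the end states described above all stamp ended_at.
TIMESTAMP_FIELD = {
    TransferStatus.PENDING: "submitted_at",
    TransferStatus.RUNNING: "started_at",
    TransferStatus.ERROR: "ended_at",
    TransferStatus.PARTIAL_FAILURE: "ended_at",
    TransferStatus.SUCCESS: "ended_at",
}
```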
Backfills & full refreshes
The initial transfer (or "backfill") is often the largest transfer by volume. During this initial sync, all historical data for a given recipient is loaded into the destination.
If, for any reason, a destination needs to be reset (e.g., a destination admin accidentally drops the table), you can trigger a full refresh by adding the `"full_refresh": true` parameter to a transfer request. This will backfill the entire table as if it were the first transfer.
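For example, a full refresh request might look like the following sketch. Only the `"full_refresh": true` parameter comes from this page; the endpoint path, host, auth header, and payload shape are placeholder assumptions, so consult the API reference for the exact request format:

```python
import requests

# Hypothetical host, path, auth, and destination field; only "full_refresh"
# is documented on this page.
resp = requests.post(
    "https://api.example-prequel-host.com/transfers",
    headers={"Authorization": "Bearer <API_TOKEN>"},
    json={
        "destination_id": "<DESTINATION_ID>",
        "full_refresh": True,  # re-backfills the table as if it were the first transfer
    },
)
resp.raise_for_status()
```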
Backfill vs. incremental transfer performance
Because the initial backfill is often the most storage- and compute-intensive transfer, its sync time and performance should not be used as an indicator of ongoing transfer performance.
Incremental transfers
After each transfer (backfill or incremental), Prequel will record the most recent `updated_at` value that was transferred. This value will be used as the starting point for the subsequent transfer.
By default, every transfer of a given model (after a successful backfill) will be an "incremental transfer".
Incremental updates and eventually consistent data sources
By default, Prequel will query the source for slightly earlier data than the most recently transferred row. This is to provide a window in which data from eventually consistent sources can converge and still be transferred.
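A minimal sketch of such an incremental read, assuming an illustrative 15-minute convergence window and hypothetical table/column names (the actual window Prequel uses is not documented here):

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(minutes=15)  # illustrative convergence window, not a documented value

def incremental_query(table: str, tenant_id: str, watermark: datetime) -> str:
    """Build an incremental read that starts slightly before the watermark.

    String-built only to keep the sketch short; use bind parameters in real code.
    """
    since = (watermark - LOOKBACK).isoformat()
    return (
        f"SELECT * FROM {table} "
        f"WHERE organization_id = '{tenant_id}' AND updated_at >= '{since}'"
    )

print(incremental_query("invoices", "org_123", datetime(2024, 5, 1, 12, 0)))
```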
Transfer Parallelism and Concurrency
Transfer Concurrency:
Within an individual transfer, operations are optimistically concurrent. Transfers can download, upload, or serialize multiple data files concurrently, regardless of the model to which they belong. The `max_concurrent_queries_per_transfer` field on a source or destination limits the number of concurrent queries or API calls that can be made against that source or destination. The default for `max_concurrent_queries_per_transfer` is `1`.
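One way to picture this cap is a semaphore gating query issuance while file-level work fans out more widely. Aside from the field name and its default of `1`, everything in this sketch is illustrative:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_QUERIES_PER_TRANSFER = 1  # the documented default
query_gate = threading.Semaphore(MAX_CONCURRENT_QUERIES_PER_TRANSFER)

def run_query(sql: str) -> None:
    # At most MAX_CONCURRENT_QUERIES_PER_TRANSFER queries are in flight
    # against the source/destination at any moment.
    with query_gate:
        print(f"running: {sql}")

# File download/serialize/upload work can still fan out across models.
with ThreadPoolExecutor(max_workers=8) as pool:
    for part in ("model_a.part1", "model_a.part2", "model_b.part1"):
        pool.submit(run_query, f"UNLOAD {part}")
```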
Transfer Parallelism:
Transfers can run in parallel with each other as long as the following constraints hold:
- No simultaneous transfers are allowed for the same model to the same destination.
- No simultaneous integrity and transfer jobs can run against the same destination.
- The `max_concurrent_transfers` field exists on both the source and destination. It defaults to `10` for sources and `1` for destinations. This field represents a hard limit on the number of simultaneous transfers involving a particular source or destination.
Prequel's dispatcher will enforce the above rules. A transfer that is unable to be dispatched will remain pending until it can be dispatched.
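A toy model of those three dispatch rules, with hypothetical data structures throughout (this is not Prequel's actual dispatcher):

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Endpoint:
    id: str
    max_concurrent_transfers: int  # documented defaults: 10 for sources, 1 for destinations

@dataclass
class Dispatcher:
    # (source_id, destination_id, model) for each transfer currently running
    running: List[Tuple[str, str, str]] = field(default_factory=list)
    integrity_jobs: Set[str] = field(default_factory=set)  # destination ids

    def can_dispatch(self, source: Endpoint, dest: Endpoint, model: str) -> bool:
        if any(d == dest.id and m == model for _, d, m in self.running):
            return False  # rule 1: same model already transferring to this destination
        if dest.id in self.integrity_jobs:
            return False  # rule 2: integrity job running against this destination
        src_load = sum(1 for s, _, _ in self.running if s == source.id)
        dst_load = sum(1 for _, d, _ in self.running if d == dest.id)
        # rule 3: hard caps on simultaneous transfers per source/destination
        return (src_load < source.max_concurrent_transfers
                and dst_load < dest.max_concurrent_transfers)
```

A transfer for which `can_dispatch` would return false simply stays pending and is retried until it can run.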
Required columns on source table
| Required column | Description |
|---|---|
| Unique ID (e.g., `id`) | Every table to be transferred will need a primary key column (e.g., an `id` column) to facilitate `UPDATE`/`INSERT` ("upsert") operations on the destination. |
| Last modified (e.g., `updated_at`) | Every table to be transferred will need to be configured with a column that indicates when a row was last modified (i.e., an `updated_at` column). This column should contain timestamp data and will be used by Prequel to identify changes between transfers. |
| Tenant ID (e.g., `organization_id`) | Every source table will need some way to indicate its recipient. Prequel supports two tenancy modes: multi-tenant tables and schema-tenanted databases. For multi-tenant source tables, Prequel requires an `organization_id` column to filter the source data by tenant ID. To read more about the different tenancy modes, see the multi-tenancy docs. |
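Putting the three requirements together, a hypothetical multi-tenant source table might be defined as follows; the table name and the non-required column are purely illustrative:

```python
# Hypothetical, Postgres-flavored DDL showing all three required columns.
CREATE_SOURCE_TABLE = """
CREATE TABLE invoices (
    id              UUID PRIMARY KEY,     -- unique ID, enables upserts downstream
    organization_id UUID NOT NULL,        -- tenant ID, filters rows per recipient
    amount_cents    BIGINT NOT NULL,      -- ordinary business column
    updated_at      TIMESTAMPTZ NOT NULL  -- last-modified, drives incremental syncs
);
"""
```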
Staging buckets
Some sources and destinations supported by Prequel may require staging buckets to efficiently transfer data. Where possible, Prequel will use the built-in staging resources provided by the database or data warehouse, but where those do not exist, a staging bucket may need to be provided. The source/destination documentation provides instructions for configuring staging buckets where needed.
Safeguarding user data
As a matter of security and compliance, Prequel does not store or retain any of the data it transfers. Transferred data lives only within the ephemeral worker tasked with running a specific transfer, for the duration of the transfer and up to 24 hours afterwards. These workers are sandboxed from each other; a dedicated worker is spun up for each transfer and wound down afterwards. To facilitate incremental transfers, Prequel does store the timestamp corresponding to the most recent `last_updated_at` value for each transfer run. We consider this to be safe metadata rather than user data.