Data integrity checks
Auditing the data sent to your destinations
One of Prequel's core promises is the guarantee that data in the source will be accurately replicated in the destination. Every row will be transferred, no row will be dropped, and the destination will get all changes to existing rows.
We want every user to build that confidence for themselves instead of having to take our word for it. This is why we surface data integrity checks – a way to audit the integrity of the data present in the destination, and compare it to what's in the source.
Results
Data integrity checks return a range of information about the underlying destination table. Some results are informational, while others denote data integrity issues.
| Field | Action needed if greater than 0 | Meaning |
| --- | --- | --- |
| sampling_rate | None, this is expected. | Bounded from 0 to 1. A value of 1.0 indicates that no sampling was used. This number represents the fraction of (primary key, last_modified) tuples compared within partitions with differing hashes. |
| total_rows_compared | None, this is expected. | Only appears if hashing the partitions indicates that rows need to be compared. Indicates the number of (primary key, last_modified) tuples that were compared between the source and destination. |
| total_rows_pending | None, this is expected. | The number of rows present in the source that will be inserted into the destination with the next transfer. |
| total_rows_stale | None, this is expected. | The number of rows present in the source that have changed since they were last updated in the destination, and that will be updated in the destination with the next transfer. |
| total_rows_missing | Full refresh the destination and contact Prequel support. | The number of rows present in the source that should exist in the destination but do not. |
| total_rows_incorrect | Full refresh the destination and contact Prequel support. | The number of rows present in the source that exist in the destination but do not have the correct updated_at value. |
| total_rows_orphaned | None. | The number of rows that are present in the destination but not in the source. This is usually the result of hard deletes in the source, or of foreign data having been introduced in the destination by the recipient. |
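To make the table concrete, the sketch below shows how a recipient might act on a check result. The field names come from the table above, but the shape of the payload and the way it is retrieved are assumptions made for the sake of the example, not a description of the Prequel API.

```python
# Illustrative sketch: the result fields come from the table above; the dict
# shape and how you obtain it are assumptions for the sake of the example.

INTEGRITY_ISSUE_FIELDS = ("total_rows_missing", "total_rows_incorrect")

def needs_full_refresh(check_result: dict) -> bool:
    """Return True if the check surfaced a data integrity issue.

    total_rows_pending, total_rows_stale, and total_rows_orphaned are
    informational and expected; only missing or incorrect rows indicate that
    the destination should be full refreshed (and Prequel support contacted).
    """
    return any(check_result.get(field, 0) > 0 for field in INTEGRITY_ISSUE_FIELDS)

example_result = {
    "sampling_rate": 1.0,
    "total_rows_compared": 120_000,
    "total_rows_pending": 42,
    "total_rows_stale": 7,
    "total_rows_missing": 0,
    "total_rows_incorrect": 0,
    "total_rows_orphaned": 3,
}
print(needs_full_refresh(example_result))  # False
```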
Limitations
Data integrity checks work by comparing hashes of certain values between the source and the destination. This allows them to be relatively compute- and bandwidth-efficient. To work across systems without putting undue load on those systems at scale, they are subject to a handful of limitations.
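To illustrate the mechanism, here is a minimal Python sketch of the partition-hashing idea: hash the (primary key, updated_at) tuples in each partition on both sides, and only compare individual tuples within partitions whose hashes differ. The partitioning scheme, hash function, and helper names are assumptions for illustration, not Prequel's actual implementation.

```python
import hashlib

# Illustrative sketch only: partitioning scheme, hash function, and function
# names are assumptions, not Prequel's actual implementation.

def partition_hash(tuples):
    """Hash all (primary_key, updated_at) tuples in a partition into one digest."""
    digest = hashlib.sha256()
    for primary_key, updated_at in sorted(tuples):
        digest.update(f"{primary_key}|{updated_at}".encode())
    return digest.hexdigest()

def partitions_to_compare(source_partitions, destination_partitions):
    """Return the partition ids whose hashes differ between source and destination.

    Only these partitions need a row-level (primary key, updated_at) comparison,
    which keeps the check compute- and bandwidth-efficient. Partitions that exist
    only in the destination (orphaned rows) are ignored here for brevity.
    """
    mismatched = []
    for partition_id, source_tuples in source_partitions.items():
        destination_tuples = destination_partitions.get(partition_id, [])
        if partition_hash(source_tuples) != partition_hash(destination_tuples):
            mismatched.append(partition_id)
    return mismatched
```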
Supported systems
Data integrity checks are currently only supported on the following systems:
- Athena
- BigQuery
- Databricks
- Postgres
- Redshift
- Snowflake
Other limitations
- Data integrity checks only verify the integrity of primary keys and updated_at timestamps; this is sufficient to build assurance that the data in the destination is complete and up to date.
- On large tables, integrity checks leverage random sampling to build probabilistic confidence in the integrity of the data. This prevents the checks from overloading either system and allows them to run in a finite time (a simplified sketch of the sampling approach follows this list).
- Automated sampling is likely to occur on tables receiving more than 1B rows per day.
- You can also leverage manual sampling and specify a set of primary keys to compare across systems (regardless of table size).
- Data integrity checks will not run at the same time as transfers to the same destination, and vice versa. This prevents overloading the resources of the source or destination, and ensures that integrity measurements are not taken on a “moving target”. If a data integrity check is in progress, all transfers to that destination will remain in the PENDING state until the check completes.
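The sketch below illustrates the sampling idea from the list above: compare a random (or manually specified) subset of (primary key, updated_at) tuples rather than every row. Function and parameter names are hypothetical and are not part of Prequel's API.

```python
import random

# Illustrative sketch only: function and parameter names are hypothetical.
# It shows the idea of comparing a sample of (primary key, updated_at) tuples
# instead of every row on very large tables.

def sample_primary_keys(source_keys, sampling_rate, seed=None):
    """Pick a random subset of primary keys to compare across systems."""
    rng = random.Random(seed)
    sample_size = min(len(source_keys), max(1, int(len(source_keys) * sampling_rate)))
    return rng.sample(list(source_keys), sample_size)

def compare_sampled_rows(source_rows, destination_rows, keys_to_check):
    """Compare updated_at values for the sampled (or manually specified) keys.

    source_rows and destination_rows map primary key -> updated_at.
    """
    missing, incorrect = [], []
    for key in keys_to_check:
        if key not in destination_rows:
            missing.append(key)
        elif destination_rows[key] != source_rows[key]:
            incorrect.append(key)
    return {
        "total_rows_missing": len(missing),
        "total_rows_incorrect": len(incorrect),
    }
```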