Apache Iceberg
Configuring your Apache Iceberg destination.
Setting up Apache Iceberg
Apache Iceberg requires a central catalog to manage table metadata and provide atomic transactions. We support several catalog options, each with its own setup guide below:
- AWS Glue Catalog
- AWS S3 Tables Catalog
- Iceberg REST Catalog (including R2 Data Catalog and Tabular)
Setting up with AWS Glue Catalog
How this works
- The Glue catalog stores Iceberg table metadata and the pointer to each table's location.
- The destination S3 bucket stores your Iceberg data, metadata files, and is used during staging.
Prerequisites
- By default, S3 authentication uses role-based access. You will need a trust policy prepopulated with the data syncing service's identifier to grant access. It should look similar to the following JSON object, with the proper service account identifier filled in:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRoleWithWebIdentity"
      ],
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Condition": {
        "StringEquals": {
          "accounts.google.com:oaud": "<some_organization_identifier>",
          "accounts.google.com:sub": "<some_service_account_identifier>"
        }
      }
    }
  ]
}
Step 1: Set up destination S3 bucket
Create bucket
- Navigate to the S3 service page.
- Click Create bucket.
- Enter a Bucket name and modify any of the default settings as desired. Note: Object Ownership can be set to "ACLs disabled" and Block Public Access settings for this bucket can be set to "Block all public access" as recommended by AWS. Make note of the Bucket name and AWS Region.
- Click Create bucket.
Step 2: Create policy and IAM role
Create policy
- Navigate to the IAM service page.
- Navigate to the Policies navigation tab, and click Create policy.
- Click the JSON tab, and paste the following policy, being sure to replace BUCKET_NAME, ACCOUNT_ID, and SCHEMA with your specific values.
Why are these permissions necessary?
- The listed Glue permissions are needed to manage catalog metadata and handle table operations, including cleaning up temporary tables during syncs.
- The listed S3 permissions are needed to upload data files, list bucket contents, read Iceberg metadata, and manage files during compaction.
- Note: The glue:CreateDatabase permission is required if the database does not yet exist. If you wish to use an existing Glue database, you can remove this action and provide us with the name of your pre-existing database.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGlueAccessToDestinationDatabaseAndTables",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabases",
        "glue:GetDatabase",
        "glue:GetTables",
        "glue:GetTable",
        "glue:GetPartitions",
        "glue:CreateTable",
        "glue:CreateDatabase",
        "glue:UpdateTable",
        "glue:DeleteTable"
      ],
      "Resource": [
        "arn:aws:glue:*:ACCOUNT_ID:catalog",
        "arn:aws:glue:*:ACCOUNT_ID:database/SCHEMA",
        "arn:aws:glue:*:ACCOUNT_ID:database/default",
        "arn:aws:glue:*:ACCOUNT_ID:table/SCHEMA/*"
      ]
    },
    {
      "Sid": "AllowS3AccessToBucket",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::BUCKET_NAME",
        "arn:aws:s3:::BUCKET_NAME/*"
      ]
    }
  ]
}
- Click Next: Tags, then click Next: Review.
- Name the policy, add a description, and click Create policy.
Create role
- Navigate to the IAM service page.
- Navigate to the Roles navigation tab, and click Create role.
- Select Custom trust policy and paste the provided trust policy to allow AssumeRole access to the new role. Click Next.
- Add the permissions policy created above, and click Next.
- Enter a Role name, for example, transfer-role, and click Create role.
- Once successfully created, search for the created role in the Roles list, click the role name, and make a note of the ARN value.
Step 3: Add your destination
Securely share your bucket name, bucket region, role ARN, and Glue database name with us to complete the connection.
Setting up with AWS S3 Tables Catalog
How this works
- The S3 Tables bucket stores your Iceberg data and metadata.
- A separate S3 bucket is required to stage data during transfers.
Prerequisites
- S3 Tables authentication uses role-based access. You will need the trust policy prepopulated with the data syncing service's identifier to grant access.
- The IAM role must also have a trust relationship with itself to function correctly with the S3 Tables API. Your final trust policy should include two principals: our service and the role itself. Be sure to replace YOUR_ACCOUNT_ID and YOUR_ROLE_NAME with the appropriate identifiers.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_ROLE_NAME" },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Action": ["sts:AssumeRoleWithWebIdentity"],
      "Principal": { "Federated": "accounts.google.com" },
      "Condition": {
        "StringEquals": {
          "accounts.google.com:oaud": "<some_oaud_identifier>",
          "accounts.google.com:sub": "<some_service_account_identifier>"
        }
      }
    }
  ]
}
Step 1: Set up S3 Tables bucket
- Navigate to the S3 service page.
- In the left navigation, click Table buckets.
- Click Create bucket.
- Enter a Bucket name and choose the same AWS Region you plan to use for your destination S3 bucket. This bucket will be used as your S3 Tables bucket.
- Click Create bucket.
Step 2: Set up staging S3 bucket
- Navigate to the S3 service page.
- Click Create bucket.
- Enter a Bucket name and modify any of the default settings as desired. Note: Object Ownership can be set to "ACLs disabled" and Block Public Access settings for this bucket can be set to "Block all public access" as recommended by AWS. Make note of the Bucket name and AWS Region.
- Click Create bucket.
Step 3: Create policy and IAM role
Create policy
- Navigate to the IAM service page.
- Navigate to the Policies navigation tab, and click Create policy.
- Click the JSON tab, and paste the following policy, replacing ACCOUNT_ID, REGION, S3_TABLES_BUCKET_NAME, and S3_STAGING_BUCKET_NAME with the appropriate values.
Why are these permissions necessary?
- The listed S3 Table data permissions are needed to read/write Iceberg data files and manage metadata locations in your S3 Tables bucket.
- The listed S3 Table management permissions are needed to create/manage tables and namespaces (including cleaning up temporary tables during syncs) in your S3 Tables bucket.
- The listed S3 permissions are needed to write data files to your staging S3 bucket, list bucket contents, and clean up staged or test files.
- The permissions to create and manage namespaces (s3tables:CreateNamespace, etc.) are required if the namespace does not already exist. If you wish to use an existing namespace, you can remove these actions.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3TableDataActions",
      "Effect": "Allow",
      "Action": [
        "s3tables:GetTable",
        "s3tables:DeleteTable",
        "s3tables:GetTableData",
        "s3tables:PutTableData",
        "s3tables:GetTableMetadataLocation",
        "s3tables:UpdateTableMetadataLocation"
      ],
      "Resource": "arn:aws:s3tables:REGION:ACCOUNT_ID:bucket/S3_TABLES_BUCKET_NAME/table/*"
    },
    {
      "Sid": "AllowS3TableManagementAndNamespaceActions",
      "Effect": "Allow",
      "Action": [
        "s3tables:GetTableBucket",
        "s3tables:CreateTable",
        "s3tables:ListTables",
        "s3tables:CreateNamespace",
        "s3tables:GetNamespace",
        "s3tables:ListNamespaces",
        "s3tables:DeleteNamespace"
      ],
      "Resource": "arn:aws:s3tables:REGION:ACCOUNT_ID:bucket/S3_TABLES_BUCKET_NAME"
    },
    {
      "Sid": "AllowS3AccessToDestinationBucket",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::S3_STAGING_BUCKET_NAME",
        "arn:aws:s3:::S3_STAGING_BUCKET_NAME/*"
      ]
    }
  ]
}
- Click Next: Tags, then click Next: Review.
- Name the policy, and click Create policy.
Create role
- Navigate to the IAM service page.
- Navigate to the Roles navigation tab, and click Create role.
- Select Custom trust policy. Leave the default placeholder trust policy as-is for now (do not paste the final trust policy yet), and click Next. Because the trust policy must reference the role itself, and a role cannot be referenced before it exists, you will update the trust policy after the role is created.
- Add the permissions policy created above, and click Next.
- Enter a Role name and click Create role.
- Once successfully created, search for the created role in the Roles list and click the role name.
- In the role detail view, navigate to the Trust relationships tab, click Edit trust policy, and replace the default trust policy with the trust policy JSON in the Prerequisites section above. Click Update policy to save.
AWS IAM Propagation Delay
After updating the trust policy, AWS IAM changes can take 5-10 minutes or longer to propagate. Please wait for the propagation to complete before testing the connection.
Step 4: Add your destination
Securely share your S3 Tables bucket ARN, destination S3 bucket name, destination S3 bucket region, role ARN, and chosen namespace with us to complete the connection.
Setting up with Iceberg REST Catalog
The Iceberg REST catalog is an open standard for interacting with an Iceberg catalog over HTTP. Below are instructions for two popular implementations.
Note on Credentials
Prequel connects to REST catalogs as a standard client using the credentials you provide. We do not support credential vendoring (issuing temporary credentials) for downstream access.
R2 Data Catalog (Cloudflare)
Tip: Zero Egress Fees
Cloudflare R2 charges no egress fees, making it a cost-effective option if you plan to query your Iceberg data from external locations or other cloud providers.
Step 1: Create R2 bucket and API token
- Log in to your Cloudflare dashboard.
- Follow the Cloudflare documentation to create an R2 bucket. Make a note of the Bucket Name and your R2 Account ID.
- Follow the Cloudflare documentation to create an R2 API token with Admin Read & Write permissions. Make a note of the generated Access Key ID and Secret Access Key.
Step 2: Add your destination
Securely share the following credentials with us to complete the connection:
- Catalog URI: https://api.cloudflare.com/client/v4/accounts/YOUR_R2_ACCOUNT_ID/r2/catalog
- API Token (as the credential)
- Bucket Name and Region
- R2 Access Key ID and R2 Secret Access Key
R2 catalog path requirement
If you customize the folder or path used for the R2 Data Catalog, it must start with __r2_data_catalog. The R2 API does not validate this upfront, so an incorrect prefix will result in runtime failures when creating or querying tables.
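For instance, a compliant custom path would look like the following (a hypothetical sketch; the folder key is illustrative rather than a literal configuration field, and my_folder is a placeholder):

{
  "folder": "__r2_data_catalog/my_folder"
}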
Tabular
Step 1: Create a Tabular credential
- Log in to your Tabular organization's dashboard.
- Navigate to the credentials section and create a new credential with permissions to create tables and write data.
- Make a note of the generated Client ID and Client Secret.
Step 2: Add your destination
Securely share the following with us to complete the connection:
- Catalog URI: https://api.tabular.io/ws
- Client ID
- Client Secret
Understanding Iceberg Configuration Options
Warning: Changing these attributes on an existing destination table will not take effect until you perform a full refresh of the table.
Managing staged data
Purpose:
During each transfer, batches are first written to a staging prefix in your object storage bucket before they are committed into the final Iceberg table. This prefix is always named _write_ahead_staging (for example: your_folder/_write_ahead_staging/<table_name>/<transfer_id> or _write_ahead_staging/<table_name>/<transfer_id> if no folder/schema is configured).
Recommendation:
We recommend configuring an object storage lifecycle policy to automatically delete objects under the _write_ahead_staging prefix after 30 days. This provides a safety net for any orphaned staged files that are not cleaned up due to failed or interrupted runs.
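As a concrete illustration, an S3 lifecycle configuration implementing this recommendation might look like the following (a minimal sketch, assuming no folder/schema is configured; if one is, include it in the prefix, e.g. your_folder/_write_ahead_staging/):

{
  "Rules": [
    {
      "ID": "expire-orphaned-staged-files",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "_write_ahead_staging/"
      },
      "Expiration": {
        "Days": 30
      }
    }
  ]
}

On AWS, a configuration like this can be applied with aws s3api put-bucket-lifecycle-configuration; other object stores, such as Cloudflare R2, offer equivalent lifecycle rules.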
retention_window_days
Purpose:
Sets the number of days for which historical data (e.g., previous table snapshots used for time travel or auditing) is retained.
Recommendation:
Set this value according to your organization's internal data retention policies.
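For example, a destination that should keep one week of snapshot history for time travel might carry a setting along these lines (a hypothetical sketch; only the retention_window_days name comes from this page, and the surrounding JSON shape is illustrative):

{
  "retention_window_days": 7
}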
FAQ
Q: How is data transferred into my Iceberg tables?
A: Prequel first stages batch files into an object storage bucket, then uses your chosen catalog to atomically commit them into the final Iceberg table. For Glue and REST catalogs, the same S3 bucket is used for both staging and permanent table data, with different prefixes. For S3 Tables, batches are staged into your staging S3 bucket, then the finalized Iceberg data and metadata are written to the managed S3 Tables bucket.
Q: Should I use AWS Glue or AWS S3 Tables?
A: There are tradeoffs to consider when choosing between Glue and S3 Tables:
- Glue Catalog: Glue stores the table metadata, and your S3 bucket stores both staged files and the final Iceberg data under different prefixes. We run snapshot expiry and compaction, and you control the S3 layout. This is a good fit if you already use Glue as your central catalog or want to keep data in a single S3 bucket you manage directly.
- S3 Tables Catalog: The S3 Tables bucket is a fully managed table bucket where AWS stores the finalized Iceberg data and metadata. We write batches to a separate staging S3 bucket, and the catalog writes the final data into the S3 Tables bucket and handles maintenance on your behalf. This is a good fit if you prefer automatic maintenance and plan to query through engines that natively support S3 Tables.
Q: What is Apache Iceberg and why should I use it?
A: Apache Iceberg is an open table format designed for huge analytic datasets on object stores. It delivers warehouse-native capabilities such as ACID transactions, time travel, and schema evolution, with the simplicity, scalability, and secure permissions model of an object storage bucket. By using a central catalog, Iceberg provides reliable transactions and enables multiple engines to work concurrently on the same data. It also keeps data sharing isolated from your warehouse, so you can receive data without exposing your internal resources.
Q: Why do you need permissions to delete data?
A: Iceberg performs background maintenance operations to manage the table's health and performance. These include expiring old snapshots and compacting small data files. The writer must have delete permissions to safely remove obsolete files without compromising data integrity.
Q: Can I send the data to a specific prefix in a bucket?
A: Yes, you can direct data to a specific prefix (warehouse path). However, we recommend using a completely isolated bucket to receive data. This minimizes security risks and reduces the chance of accidental interference with other datasets.
Q: Do I need to perform maintenance operations on the Iceberg table?
A: No. The data writer is responsible for expiring snapshots and compacting data as needed; for managed catalogs like R2 and S3 Tables, the catalog itself runs compaction and snapshot expiration automatically. Data consumers should not run any non-read queries on the table.
Important: You should treat the destination tables as read-only. Executing write or delete operations manually may corrupt the table state and break data synchronization.
Q: How do I know when a table has been updated?
A: You can query the table's metadata to see the history of snapshots. Each snapshot represents a version of the table. For example, in Spark SQL, you can run:
SELECT snapshot_id, committed_at FROM my_glue_catalog.my_db.my_table.snapshots ORDER BY committed_at DESC LIMIT 1;
This command returns the most recent snapshot details. Additionally, most bucket providers offer the capability to trigger a webhook or lambda when objects are created, which can be configured to monitor the table's metadata directory for new manifest lists.
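On AWS S3, for instance, this could be an event notification that invokes a Lambda function when new metadata files appear (a sketch, assuming a table rooted at the my_table/ prefix and a placeholder Lambda ARN; Iceberg manifest lists are Avro files written under the table's metadata/ directory):

{
  "LambdaFunctionConfigurations": [
    {
      "Id": "notify-on-new-manifest-list",
      "LambdaFunctionArn": "arn:aws:lambda:REGION:ACCOUNT_ID:function:YOUR_FUNCTION",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "my_table/metadata/" },
            { "Name": "suffix", "Value": ".avro" }
          ]
        }
      }
    }
  ]
}

This is the JSON shape accepted by aws s3api put-bucket-notification-configuration.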
Note: We write a version-hint.txt file to the metadata directory. This allows tools like PyIceberg and DuckDB to read the table directly from S3 without needing to connect to the catalog service, by pointing them to the table's root location.
Q: Are there any limitations on data sizes?
A: We do not enforce size limits, including on JSON fields. Downstream query engines may have their own limits (e.g., Amazon Redshift's SUPER and VARCHAR size limits). Ensure your data fits within your query engine's constraints to avoid query failures.
Q: Can I mount my Iceberg data to BigQuery?
A: Coming soon! Contact us if you'd like to be notified when this feature is available.
Mounting/Reading an Iceberg Table
You can mount/read an Iceberg table into your data warehouse of choice. Below are the supported catalog types and links to the corresponding vendor documentation:
- ClickHouse
- DuckDB / MotherDuck
- Spark
- Athena / Redshift
- Snowflake