Apache Iceberg

Beta destination: Apache Iceberg is currently a beta destination with an upper limit of 1 billion rows per month. Please contact our team if you have any questions.

Setting up Apache Iceberg

Apache Iceberg requires a central catalog to manage table metadata and provide atomic transactions. We support several catalog options, each with its own setup guide below:

AWS Glue Catalog
AWS S3 Tables Catalog
Iceberg REST Catalog (including R2 Data Catalog and Tabular)
Google Lakehouse Catalog

Setting up with AWS Glue Catalog

How this works

The Glue catalog stores Iceberg table metadata and the pointer to each table’s location.
The destination S3 bucket stores your Iceberg data and metadata files, and is also used for staging.

Prerequisites

By default, S3 authentication uses role-based access. You will need the trust policy prepopulated with our identifier to grant access. It should look similar to the following JSON object with a proper service account identifier:

Trust policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRoleWithWebIdentity"
      ],
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Condition": {
        "StringEquals": {
          "accounts.google.com:oaud": "<some_organization_identifier>",
          "accounts.google.com:sub": "<some_service_account_identifier>"
        }
      }
    }
  ]
}

Set up destination S3 bucket

Create bucket

Navigate to the S3 service page.
Click Create bucket.
Enter a Bucket name and modify any of the default settings as desired. Note: Object Ownership can be set to “ACLs disabled” and Block Public Access settings for this bucket can be set to “Block all public access” as recommended by AWS. Make note of the Bucket name and AWS Region.
Click Create bucket.

Create policy and IAM role

Create policy

Navigate to the IAM service page.
Navigate to the Policies navigation tab, and click Create policy.
Click the JSON tab, and paste the following policy, being sure to replace BUCKET_NAME, ACCOUNT_ID, and DATABASE with your specific values.

Why are these permissions necessary?

The listed Glue permissions are needed to manage catalog metadata and handle table operations, including cleaning up temporary tables during syncs.
The listed S3 permissions are needed to upload data files, list bucket contents, read Iceberg metadata, and manage files during compaction.
Note: The glue:CreateDatabase permission is required if the database does not yet exist. If you wish to use an existing Glue database, you can remove this action and provide us with the name of your pre-existing database.

Access policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGlueAccessToDestinationDatabaseAndTables",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases",
                "glue:GetDatabase",
                "glue:GetTables",
                "glue:GetTable",
                "glue:GetPartitions",
                "glue:CreateTable",
                "glue:CreateDatabase",
                "glue:UpdateTable",
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:*:ACCOUNT_ID:catalog",
                "arn:aws:glue:*:ACCOUNT_ID:database/DATABASE",
                "arn:aws:glue:*:ACCOUNT_ID:database/default",
                "arn:aws:glue:*:ACCOUNT_ID:table/DATABASE/*"
            ]
        },
        {
            "Sid": "AllowS3AccessToBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_NAME",
                "arn:aws:s3:::BUCKET_NAME/*"
            ]
        }
    ]
}

KMS encryption (optional): If your S3 bucket uses KMS encryption (CMK), add the following statement to the Statement array of your IAM policy to allow data encryption/decryption with your KMS key. Encryption with SSE-C is not currently supported.

KMS statement

{
  "Effect": "Allow",
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "arn:aws:kms:REGION_NAME:ACCOUNT_ID:key/KEY_ID"
}

Replace REGION_NAME, ACCOUNT_ID, and KEY_ID with your values.

Click Next: Tags, then click Next: Review.
Name the policy, add a description, and click Create policy.
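
The placeholder substitution described above can also be scripted before pasting the result into the JSON tab. The following is a minimal sketch rather than an official tool: the function name render_glue_policy and the example values are hypothetical, and the statements are copied from the access policy and KMS statement above.

```python
import json


def render_glue_policy(account_id, database, bucket_name, kms_key_arn=None):
    """Fill in the access policy above; append the KMS statement if a key ARN is given."""
    statements = [
        {
            "Sid": "AllowGlueAccessToDestinationDatabaseAndTables",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases",
                "glue:GetDatabase",
                "glue:GetTables",
                "glue:GetTable",
                "glue:GetPartitions",
                "glue:CreateTable",
                "glue:CreateDatabase",
                "glue:UpdateTable",
                "glue:DeleteTable",
            ],
            "Resource": [
                f"arn:aws:glue:*:{account_id}:catalog",
                f"arn:aws:glue:*:{account_id}:database/{database}",
                f"arn:aws:glue:*:{account_id}:database/default",
                f"arn:aws:glue:*:{account_id}:table/{database}/*",
            ],
        },
        {
            "Sid": "AllowS3AccessToBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:DeleteObject",
            ],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        },
    ]
    if kms_key_arn is not None:
        # Optional statement for buckets encrypted with a customer managed key.
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
                "Resource": kms_key_arn,
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Example values are hypothetical; replace them with your own.
policy = render_glue_policy("123456789012", "analytics", "my-iceberg-bucket")
print(json.dumps(policy, indent=4))
```

If you use an existing Glue database, remove glue:CreateDatabase from the rendered output as described in the note above.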

Create role

Navigate to the IAM service page.
Navigate to the Roles navigation tab, and click Create role.
Select Custom trust policy and paste the provided trust policy to allow AssumeRole access to the new role. Click Next.
Add the permissions policy created above, and click Next.
Enter a Role name, for example, transfer-role, and click Create role.
Once successfully created, search for the created role in the Roles list, click the role name, and make a note of the ARN value.

Add your destination

Use the following details to complete the connection setup: bucket name, bucket region, role ARN, and Glue database name.

Setting up with AWS S3 Tables Catalog

How this works

The S3 Tables bucket stores your Iceberg data and metadata.
A separate staging S3 bucket is required for staging data.

Prerequisites

S3 Tables authentication uses role-based access. You will need the trust policy prepopulated with our identifier to grant access.
The IAM role must also have a trust relationship with itself to function correctly with the S3 Tables API. Your final trust policy should include two principals: our service and the role itself. Be sure to replace YOUR_ACCOUNT_ID and YOUR_ROLE_NAME with the appropriate identifiers.

Trust policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_ROLE_NAME" },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Action": ["sts:AssumeRoleWithWebIdentity"],
      "Principal": { "Federated": "accounts.google.com" },
      "Condition": {
        "StringEquals": {
          "accounts.google.com:oaud": "<some_oaud_identifier>",
          "accounts.google.com:sub": "<some_service_account_identifier>"
        }
      }
    }
  ]
}
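
Because the role must trust itself, the role ARN in the first statement has to name the same role that carries this policy. A minimal sketch of assembling the document, assuming a hypothetical helper name build_s3_tables_trust_policy; the oaud and sub arguments are the identifiers from the prepopulated trust policy and are passed through unchanged.

```python
import json


def build_s3_tables_trust_policy(account_id, role_name, oaud, sub):
    """Return the two-principal trust policy shown above as a dict."""
    # The self-referencing principal: the role trusts its own ARN.
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "sts:AssumeRole",
            },
            {
                "Effect": "Allow",
                "Action": ["sts:AssumeRoleWithWebIdentity"],
                "Principal": {"Federated": "accounts.google.com"},
                "Condition": {
                    "StringEquals": {
                        "accounts.google.com:oaud": oaud,
                        "accounts.google.com:sub": sub,
                    }
                },
            },
        ],
    }


# Account ID and role name are hypothetical; the identifiers stay as placeholders here.
trust_policy = build_s3_tables_trust_policy(
    "123456789012",
    "transfer-role",
    "<some_oaud_identifier>",
    "<some_service_account_identifier>",
)
print(json.dumps(trust_policy, indent=2))
```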

Set up S3 Tables bucket

Navigate to the S3 service page.
In the left navigation, click Table buckets.
Click Create bucket.
Enter a Bucket name and choose the same AWS Region you plan to use for your staging S3 bucket. This bucket will be used as your S3 Tables bucket.
Click Create bucket.

Set up staging S3 bucket

Navigate to the S3 service page.
Click Create bucket.
Enter a Bucket name and modify any of the default settings as desired. Note: Object Ownership can be set to “ACLs disabled” and Block Public Access settings for this bucket can be set to “Block all public access” as recommended by AWS. Make note of the Bucket name and AWS Region.
Click Create bucket.

Create policy and IAM role

Create policy

Navigate to the IAM service page.
Navigate to the Policies navigation tab, and click Create policy.
Click the JSON tab, and paste the following policy, replacing ACCOUNT_ID, REGION, S3_TABLES_BUCKET_NAME, and S3_STAGING_BUCKET_NAME with the appropriate values.

Why are these permissions necessary?

The listed S3 Table data permissions are needed to read/write Iceberg data files and manage metadata locations in your S3 Tables bucket.
The listed S3 Table management permissions are needed to create/manage tables and namespaces (including cleaning up temporary tables during syncs) in your S3 Tables bucket.
The listed S3 permissions are needed to write data files to your staging S3 bucket, list bucket contents, and clean up staged or test files.
The permissions to create and manage namespaces (s3tables:CreateNamespace, etc.) are required if the namespace does not already exist. If you wish to use an existing namespace, you can remove these actions.

Access policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3TableDataActions",
            "Effect": "Allow",
            "Action": [
                "s3tables:GetTable",
                "s3tables:DeleteTable",
                "s3tables:GetTableData",
                "s3tables:PutTableData",
                "s3tables:GetTableMetadataLocation",
                "s3tables:UpdateTableMetadataLocation"
            ],
            "Resource": "arn:aws:s3tables:REGION:ACCOUNT_ID:bucket/S3_TABLES_BUCKET_NAME/table/*"
        },
        {
            "Sid": "AllowS3TableManagementAndNamespaceActions",
            "Effect": "Allow",
            "Action": [
                "s3tables:GetTableBucket",
                "s3tables:CreateTable",
                "s3tables:ListTables",
                "s3tables:CreateNamespace",
                "s3tables:GetNamespace",
                "s3tables:ListNamespaces",
                "s3tables:DeleteNamespace"
            ],
            "Resource": "arn:aws:s3tables:REGION:ACCOUNT_ID:bucket/S3_TABLES_BUCKET_NAME"
        },
        {
            "Sid": "AllowS3AccessToDestinationBucket",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::S3_STAGING_BUCKET_NAME",
                "arn:aws:s3:::S3_STAGING_BUCKET_NAME/*"
            ]
        }
    ]
}

KMS encryption (optional): If your S3 staging bucket uses KMS encryption (CMK), add the following statement to the Statement array of your IAM policy to allow data encryption/decryption with your KMS key. Encryption with SSE-C is not currently supported.

KMS statement

{
  "Effect": "Allow",
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "arn:aws:kms:REGION_NAME:ACCOUNT_ID:key/KEY_ID"
}

Replace REGION_NAME, ACCOUNT_ID, and KEY_ID with your values.

Click Next: Tags, then click Next: Review.
Name the policy, and click Create policy.

Create role

Navigate to the IAM service page.
Navigate to the Roles navigation tab, and click Create role.
Select Custom trust policy. Leave the default placeholder trust policy as-is for now (do not paste the final trust policy yet), and click Next. The policy must be self-assuming, which is not allowed until the role is created, so it must be updated after the role is created.
Add the permissions policy created above, and click Next.
Enter a Role name and click Create role.
Once successfully created, search for the created role in the Roles list and click the role name.
In the role detail view, navigate to the Trust relationships tab, click Edit trust policy, and replace the default trust policy with the trust policy JSON in the Prerequisites section above. Click Update policy to save.

AWS IAM Propagation Delay: After updating the trust policy, AWS IAM changes can take 5-10 minutes or longer to propagate. Please wait for the propagation to complete before testing the connection.

Add your destination

Use the following details to complete the connection setup: S3 Tables bucket ARN, destination S3 bucket name, destination S3 bucket region, role ARN, and chosen namespace.
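
The S3 Tables bucket ARN follows the same pattern as the Resource entries in the access policy above. A small sketch for assembling it, assuming a hypothetical helper name and example values; you can also copy the ARN directly from the table bucket's page in the AWS console.

```python
def s3_tables_bucket_arn(region, account_id, bucket_name):
    """Build the table bucket ARN in the form used by the access policy above."""
    return f"arn:aws:s3tables:{region}:{account_id}:bucket/{bucket_name}"


# Example values are hypothetical.
print(s3_tables_bucket_arn("us-east-1", "123456789012", "my-table-bucket"))
```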

Setting up with Iceberg REST Catalog

The Iceberg REST catalog is an open standard for interacting with an Iceberg catalog over HTTP. Below are instructions for three popular implementations.

Note on Credentials: We connect to REST catalogs as a standard client using the credentials you provide. We do not support credential vending (issuing temporary credentials) for downstream access.

R2 Data Catalog (Cloudflare)

Tip: Zero Egress Fees. Cloudflare R2 charges no egress fees, making it a cost-effective option if you plan to query your Iceberg data from external locations or other cloud providers.

Create R2 bucket and API token

Log in to your Cloudflare dashboard.
Follow the Cloudflare documentation to create an R2 bucket. Make a note of the Bucket Name and your R2 Account ID.
Follow the Cloudflare documentation to create an R2 API token with Admin Read & Write permissions. Make a note of the generated Access Key ID and Secret Access Key.

Add your destination

Use the following details to complete the connection setup:

Catalog URI: https://api.cloudflare.com/client/v4/accounts/YOUR_R2_ACCOUNT_ID/r2/catalog
API Token (as the credential)
Bucket Name and Region
R2 Access Key ID and R2 Secret Access Key

R2 catalog path requirement: If you customize the folder or path used for the R2 Data Catalog, it must start with __r2_data_catalog. The R2 API does not validate this upfront, so an incorrect prefix will result in runtime failures when creating or querying tables.
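
Since the R2 API does not validate the path prefix upfront, it can help to check it locally before saving the connection. A minimal sketch under stated assumptions: the function names and example values are hypothetical, and the check is a plain string comparison against the required prefix.

```python
R2_CATALOG_PREFIX = "__r2_data_catalog"


def r2_catalog_uri(account_id):
    """Build the Catalog URI from your R2 Account ID."""
    return f"https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/catalog"


def check_r2_catalog_path(path):
    """Raise early if a customized catalog path lacks the required prefix."""
    if not path.startswith(R2_CATALOG_PREFIX):
        raise ValueError(
            f"R2 catalog path must start with {R2_CATALOG_PREFIX!r}, got {path!r}"
        )
    return path


# Example values are hypothetical.
print(r2_catalog_uri("YOUR_R2_ACCOUNT_ID"))
print(check_r2_catalog_path("__r2_data_catalog/analytics"))
```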

Google BigLake (Lakehouse Catalog)

How this works

The Google Lakehouse catalog stores Iceberg table metadata.
The destination GCS bucket stores your Iceberg data and metadata files, and is also used for staging.
With Google BigLake, the Iceberg tables become queryable directly from BigQuery. No external table definition or separate mount step is required.

Prerequisites

You will need a Google Cloud service account with permissions to read from and write to your GCS bucket and to manage your Lakehouse catalog.
By default, authentication uses role-based access via service account impersonation. You will need our service account name available to grant access. It should look like some-name@some-project.iam.gserviceaccount.com.

Set up destination GCS bucket

Navigate to the Cloud Storage service page.
Click Create bucket.
Provide a name and choose the appropriate region. Make note of the Bucket name.

Create a Lakehouse Catalog

Create a Lakehouse catalog in your Google Cloud Project by following the Lakehouse catalog documentation.
Important: When creating the catalog, ensure that you configure it to use end-user credentials. Do NOT use credential vending, as we need to access the underlying storage directly with the provided service account credentials.

Set up IAM permissions

Ensure your service account has the necessary permissions to access the bucket and catalog. Then, allow the service to impersonate it.

1. Grant permissions to your service account

You can use preconfigured GCP roles or define a fine-tuned custom role for least-privilege access.

Option A: Preconfigured GCP Roles

Assign the following roles to your service account:

Storage Object Admin (roles/storage.objectAdmin) on the destination bucket.
BigLake Editor (roles/biglake.editor) on the target project/catalog to allow creation and management of Iceberg tables via BigLake.

Option B: Fine-Tuned Custom Permissions

For a least-privilege approach, create a custom IAM role with the following individual permissions:

Cloud Storage Permissions (applied to the destination bucket):

storage.buckets.get
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.list
storage.objects.update

BigLake Permissions (applied to the target project or catalog):

biglake.catalogs.get
biglake.catalogs.list
biglake.databases.create
biglake.databases.delete
biglake.databases.get
biglake.databases.list
biglake.databases.update
biglake.tables.create
biglake.tables.delete
biglake.tables.get
biglake.tables.list
biglake.tables.update
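
For Option B, the two permission sets can be written out as custom role definition files. The sketch below is one possible approach, not a required step: it emits the YAML layout (title, description, stage, includedPermissions) accepted by gcloud iam roles create with the --file flag. The helper name, titles, and descriptions are hypothetical, and titles are assumed to be plain text without YAML special characters.

```python
STORAGE_PERMISSIONS = [
    "storage.buckets.get",
    "storage.objects.create",
    "storage.objects.delete",
    "storage.objects.get",
    "storage.objects.list",
    "storage.objects.update",
]

BIGLAKE_PERMISSIONS = [
    "biglake.catalogs.get",
    "biglake.catalogs.list",
    "biglake.databases.create",
    "biglake.databases.delete",
    "biglake.databases.get",
    "biglake.databases.list",
    "biglake.databases.update",
    "biglake.tables.create",
    "biglake.tables.delete",
    "biglake.tables.get",
    "biglake.tables.list",
    "biglake.tables.update",
]


def role_definition_yaml(title, description, permissions):
    """Render a custom role definition as YAML text."""
    lines = [
        f"title: {title}",
        f"description: {description}",
        "stage: GA",
        "includedPermissions:",
    ]
    lines += [f"- {permission}" for permission in permissions]
    return "\n".join(lines) + "\n"


# Two separate roles, because the permission sets are granted at different scopes.
print(role_definition_yaml(
    "Iceberg Destination Storage", "Bucket access for Iceberg data", STORAGE_PERMISSIONS
))
print(role_definition_yaml(
    "Iceberg Destination BigLake", "Catalog access for Iceberg tables", BIGLAKE_PERMISSIONS
))
```

Keeping the roles separate lets you grant the storage role on the destination bucket only and the BigLake role on the target project or catalog.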

2. Allow the service to impersonate your service account

Navigate to IAM & Admin > Service Accounts. Select the service account you just created. Select the Permissions tab, click Grant Access, enter our service account name (from the prerequisite), and select the Service Account Token Creator role.

Add your destination

Use the following details to complete the connection setup: Google Cloud Project ID, Catalog Name, Schema, Bucket Name, and your service account credentials.

Tabular

Create a Tabular credential

Log in to your Tabular organization’s dashboard.
Navigate to the credentials section and create a new credential with permissions to create tables and write data.
Make a note of the generated Client ID and Client Secret.

Add your destination

Use the following details to complete the connection setup:

Catalog URI: https://api.tabular.io/ws
Client ID
Client Secret

Understanding Iceberg configuration options

Changing these attributes on an existing destination table will not take effect until you perform a full refresh of the table.

Managing staged data

Purpose: During each transfer, batches are first written to a staging prefix in your object storage bucket before they are committed into the final Iceberg table. This prefix is always named _write_ahead_staging (for example: your_folder/_write_ahead_staging/<table_name>/<transfer_id> or _write_ahead_staging/<table_name>/<transfer_id> if no folder/schema is configured).

Recommendation: We recommend configuring an object storage lifecycle policy to automatically delete objects under the _write_ahead_staging prefix after 30 days. This provides a safety net for any orphaned staged files that are not cleaned up due to failed or interrupted runs.
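
For S3 buckets, the recommended lifecycle policy can be expressed as a single expiration rule scoped to the staging prefix. The sketch below builds that rule as a plain dictionary; the function names and rule ID are hypothetical. The resulting structure has the shape that boto3's put_bucket_lifecycle_configuration accepts as its LifecycleConfiguration argument. Other object stores use their own lifecycle rule formats.

```python
import json

STAGING_DIR = "_write_ahead_staging"


def staging_prefix(folder=None):
    """Return the staging prefix, with or without a configured folder/schema."""
    if folder:
        return f"{folder.strip('/')}/{STAGING_DIR}/"
    return f"{STAGING_DIR}/"


def staging_lifecycle_configuration(folder=None, days=30):
    """Build an S3 lifecycle configuration that expires staged objects."""
    return {
        "Rules": [
            {
                "ID": "expire-write-ahead-staging",  # hypothetical rule name
                "Filter": {"Prefix": staging_prefix(folder)},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Example folder name is taken from the path example above.
print(json.dumps(staging_lifecycle_configuration("your_folder"), indent=2))
```

Note that applying a lifecycle configuration replaces any existing one on the bucket, so merge this rule with any rules you already have.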

retention_window_days

Purpose: Sets the number of days for which historical data (e.g., previous table snapshots used for time travel or auditing) is retained.

Recommendation: Set this value according to your organization’s internal data retention policies.

FAQ

How is data transferred into my Iceberg tables?

We first stage batch files into an object storage bucket, then use your chosen catalog to atomically commit them into the final Iceberg table. For Glue and REST catalogs, the same bucket is used for both staging and permanent table data, with different prefixes. For S3 Tables, batches are staged into your staging S3 bucket, then the finalized Iceberg data and metadata are written to the managed S3 Tables bucket.

Should I use AWS Glue or AWS S3 Tables?

There are tradeoffs to consider when choosing between Glue and S3 Tables:

Glue Catalog: Glue stores the table metadata, and your S3 bucket stores both staged files and the final Iceberg data under different prefixes. We run snapshot expiry and compaction, and you control the S3 layout. This is a good fit if you already use Glue as your central catalog or want to keep data in a single S3 bucket you manage directly.
S3 Tables Catalog: The S3 Tables bucket is a fully managed table bucket where AWS stores the finalized Iceberg data and metadata. We write batches to a separate staging S3 bucket, and the catalog writes the final data into the S3 Tables bucket and handles maintenance on your behalf. This is a good fit if you prefer automatic maintenance and plan to query through engines that natively support S3 Tables.

What is Apache Iceberg and why should I use it?

Apache Iceberg is an open table format designed for analytic datasets on object stores. It delivers warehouse-native capabilities such as ACID transactions, time travel, and schema evolution with the simplicity, scalability, and secure permissions model of an object storage bucket. By using a central catalog, Iceberg provides reliable transactions and enables multiple engines to work concurrently on the same data. This enables your warehouse to be isolated from data sharing, so you can receive data without exposing your internal resources.

Why do you need permissions to delete data?

Iceberg performs background maintenance operations to manage the table’s health and performance. These include expiring old snapshots and compacting small data files. The writer must have delete permissions to safely remove obsolete files without compromising data integrity.

Can I send the data to a specific prefix in a bucket?

Yes, you can direct data to a specific prefix (warehouse path). However, we recommend using a completely isolated bucket to receive data. This minimizes security risks and reduces the chance of accidental interference with other datasets.

Do I need to perform maintenance operations on the Iceberg table?

No. The data writer is responsible for expiring snapshots and compacting data as needed, and data consumers should not run any non-read queries on the table. The exception is managed catalogs such as R2 Data Catalog and S3 Tables, which run compaction and snapshot expiration automatically.

You should treat the destination tables as read-only. Executing write or delete operations manually may corrupt the table state and break data synchronization.

How do I know when a table has been updated?

You can query the table’s metadata to see the history of snapshots. Each snapshot represents a version of the table. For example, in Spark SQL, you can run:

Check latest snapshot

SELECT snapshot_id, committed_at
FROM my_glue_catalog.my_db.my_table.snapshots
ORDER BY committed_at DESC
LIMIT 1;

This query returns the most recent snapshot details. Additionally, most bucket providers offer the capability to trigger a webhook or Lambda function when objects are created, which can be configured to monitor the table’s metadata directory for new manifest lists.

Note: We write a version-hint.txt file to the metadata directory. This allows tools like PyIceberg and DuckDB to read the table directly from S3 without needing to connect to the catalog service, by pointing them to the table’s root location.

Are there any limitations on data sizes?

We do not enforce size limits, including JSON fields. Downstream query engines may have their own limits (e.g., Amazon Redshift’s SUPER and VARCHAR sizes). Ensure your data fits within your query engine’s constraints to avoid query failures.

Why are two service accounts involved with Google BigLake? Why is service account impersonation required?

You create one service account in your project with BigLake/Storage permissions, and we use our service account to impersonate yours. This means we never handle your private keys, all operations appear in your audit logs, access is via short-lived tokens, and you can revoke access anytime through your own IAM permissions. Direct service account access is not supported.

Can I mount my Iceberg data to BigQuery?

Yes. Set up the destination with the Google Lakehouse Catalog: tables written through Google BigLake are queryable directly from BigQuery, with no external table definition or separate mount step required. Mounting to BigQuery is not currently supported for the other catalog options.

Mounting/reading an Iceberg table

You can mount/read an Iceberg table into your data warehouse of choice. Below are the supported catalog types and links to the corresponding vendor documentation:

ClickHouse
DuckDB / MotherDuck
Spark
Athena / Redshift
- Glue catalog
- S3 Tables catalog
Snowflake
- REST catalog
