Setting up Apache Iceberg
Apache Iceberg requires a central catalog to manage table metadata and provide atomic transactions. We support several catalog options, each with its own setup guide below:
- AWS Glue Catalog
- AWS S3 Tables Catalog
- Iceberg REST Catalog (including R2 Data Catalog and Tabular)
Setting up with AWS Glue Catalog
- The Glue catalog stores Iceberg table metadata and the pointer to each table’s location.
- The destination S3 bucket stores your Iceberg data and metadata files, and is also used for staging.
Prerequisites
- By default, S3 authentication uses role-based access. You will need the trust policy prepopulated with our identifier to grant access. It should look similar to the following JSON object with a proper service account identifier:
Set up destination S3 bucket
Create bucket
- Navigate to the S3 service page.
- Click Create bucket.
- Enter a Bucket name and modify any of the default settings as desired. Note: Object Ownership can be set to “ACLs disabled” and Block Public Access settings for this bucket can be set to “Block all public access” as recommended by AWS. Make note of the Bucket name and AWS Region.
- Click Create bucket.
Create policy and IAM role
Create policy
- Navigate to the IAM service page.
- Navigate to the Policies navigation tab, and click Create policy.
- Click the JSON tab, and paste the following policy, being sure to replace BUCKET_NAME, ACCOUNT_ID, and SCHEMA with your specific values.
- The listed Glue permissions are needed to manage catalog metadata and handle table operations, including cleaning up temporary tables during syncs.
- The listed S3 permissions are needed to upload data files, list bucket contents, read Iceberg metadata, and manage files during compaction.
- Note: The glue:CreateDatabase permission is required if the database does not yet exist. If you wish to use an existing Glue database, you can remove this action and provide us with the name of your pre-existing database.
- Click Next: Tags, then click Next: Review.
- Name the policy, add a description, and click Create policy.
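The placeholder replacement in the policy step can also be done programmatically before pasting. The following is a minimal sketch, not part of the official setup: `fill_placeholders` is a hypothetical helper, and `TEMPLATE` is a two-statement illustrative fragment standing in for the full policy you were given (the real policy contains more actions). The function refuses to proceed if any of the three values is missing.

```python
import json

# Illustrative fragment only: the real policy has more statements and actions.
TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::BUCKET_NAME"
    },
    {
      "Effect": "Allow",
      "Action": ["glue:CreateDatabase"],
      "Resource": "arn:aws:glue:*:ACCOUNT_ID:database/SCHEMA"
    }
  ]
}
"""

PLACEHOLDERS = ("BUCKET_NAME", "ACCOUNT_ID", "SCHEMA")


def fill_placeholders(template: str, values: dict) -> dict:
    """Substitute every placeholder and return the parsed policy."""
    missing = [name for name in PLACEHOLDERS if not values.get(name)]
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    text = template
    for name in PLACEHOLDERS:
        text = text.replace(name, values[name])
    # json.loads fails loudly if the substitution broke the document.
    return json.loads(text)
```

Calling `json.dumps(fill_placeholders(TEMPLATE, {...}), indent=2)` yields text that can be pasted into the JSON tab.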
Create role
- Navigate to the IAM service page.
- Navigate to the Roles navigation tab, and click Create role.
- Select Custom trust policy and paste the provided trust policy to allow AssumeRole access to the new role. Click Next.
- Add the permissions policy created above, and click Next.
- Enter a Role name, for example, transfer-role, and click Create role.
- Once successfully created, search for the created role in the Roles list, click the role name, and make a note of the ARN value.
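The ARN noted in the last step follows the standard IAM form `arn:aws:iam::<account-id>:role/<role-name>`. As a small sanity check on a copied value, something like the sketch below may help; `parse_role_arn` is a hypothetical helper, not part of any AWS SDK.

```python
import re

# IAM role ARNs: arn:<partition>:iam::<12-digit account>:role/<optional path/><name>
_ROLE_ARN = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/(.+)$")


def parse_role_arn(arn: str) -> tuple[str, str]:
    """Return (account_id, role_name) or raise ValueError for a malformed ARN."""
    match = _ROLE_ARN.match(arn.strip())
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    account_id, role_path = match.group(2), match.group(3)
    # Roles may carry a path prefix; the role name is the last path segment.
    return account_id, role_path.rsplit("/", 1)[-1]
```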
Setting up with AWS S3 Tables Catalog
- The S3 Tables bucket stores your Iceberg data and metadata.
- A separate staging S3 bucket is required for staging data.
Prerequisites
- S3 Tables authentication uses role-based access. You will need the trust policy prepopulated with our identifier to grant access.
- The IAM role must also have a trust relationship with itself to function correctly with the S3 Tables API. Your final trust policy should include two principals: our service and the role itself. Be sure to replace YOUR_ACCOUNT_ID and YOUR_ROLE_NAME with the appropriate identifiers.
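The self-trust requirement above can be sketched as a small transformation on the trust policy you were given. This is an illustration under assumptions: `add_self_trust` is a hypothetical helper, the provided policy is passed in untouched as a dict, and expressing the role's own principal as a separate `sts:AssumeRole` statement is one possible layout, which may differ from the prepopulated policy you receive.

```python
import copy


def add_self_trust(provided_policy: dict, account_id: str, role_name: str) -> dict:
    """Return a copy of the provided trust policy plus a self-assume statement."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    policy = copy.deepcopy(provided_policy)  # leave the caller's dict untouched
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # IAM allows a single statement object; normalise it to a list.
        statements = [statements]
    statements.append(
        {
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "sts:AssumeRole",
        }
    )
    policy["Statement"] = statements
    return policy
```

Because the role's ARN is only valid as a principal once the role exists, the resulting policy can be applied only after role creation.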
Set up S3 Tables bucket
- Navigate to the S3 service page.
- In the left navigation, click Table buckets.
- Click Create bucket.
- Enter a Bucket name and choose the same AWS Region you plan to use for your destination S3 bucket. This bucket will be used as your S3 Tables bucket.
- Click Create bucket.
Set up staging S3 bucket
- Navigate to the S3 service page.
- Click Create bucket.
- Enter a Bucket name and modify any of the default settings as desired. Note: Object Ownership can be set to “ACLs disabled” and Block Public Access settings for this bucket can be set to “Block all public access” as recommended by AWS. Make note of the Bucket name and AWS Region.
- Click Create bucket.
Create policy and IAM role
Create policy
- Navigate to the IAM service page.
- Navigate to the Policies navigation tab, and click Create policy.
- Click the JSON tab, and paste the following policy, replacing ACCOUNT_ID, REGION, S3_TABLES_BUCKET_NAME, and S3_STAGING_BUCKET_NAME with the appropriate values.
- The listed S3 Table data permissions are needed to read/write Iceberg data files and manage metadata locations in your S3 Tables bucket.
- The listed S3 Table management permissions are needed to create/manage tables and namespaces (including cleaning up temporary tables during syncs) in your S3 Tables bucket.
- The listed S3 permissions are needed to write data files to your staging S3 bucket, list bucket contents, and clean up staged or test files.
- The permissions to create and manage namespaces (s3tables:CreateNamespace, etc.) are required if the namespace does not already exist. If you wish to use an existing namespace, you can remove these actions.
- Click Next: Tags, then click Next: Review.
- Name the policy, and click Create policy.
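The four placeholders in the policy step typically end up inside resource ARNs. The sketch below shows the standard AWS ARN formats for an S3 table bucket and a general-purpose S3 bucket; the function names are hypothetical, and which of these ARNs your policy actually references depends on the policy text you were given.

```python
def table_bucket_arn(region: str, account_id: str, bucket_name: str) -> str:
    """ARN of an S3 Tables bucket (regional, account-scoped)."""
    return f"arn:aws:s3tables:{region}:{account_id}:bucket/{bucket_name}"


def staging_bucket_arns(bucket_name: str) -> tuple[str, str]:
    """ARNs for the staging bucket itself and for every object inside it."""
    bucket = f"arn:aws:s3:::{bucket_name}"
    # Bucket-level actions (listing) target the bucket ARN;
    # object-level actions (put, get, delete) target the /* form.
    return bucket, f"{bucket}/*"
```

Note that table bucket ARNs include the Region and account ID, while general-purpose bucket ARNs do not.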
Create role
- Navigate to the IAM service page.
- Navigate to the Roles navigation tab, and click Create role.
- Select Custom trust policy. Leave the default placeholder trust policy as-is for now (do not paste the final trust policy yet), and click Next. The final trust policy names the role itself as a principal, which AWS does not allow until the role exists, so you will update the trust policy after the role is created.
- Add the permissions policy created above, and click Next.
- Enter a Role name and click Create role.
- Once successfully created, search for the created role in the Roles list and click the role name.
- In the role detail view, navigate to the Trust relationships tab, click Edit trust policy, and replace the default trust policy with the trust policy JSON in the Prerequisites section above. Click Update policy to save.
Setting up with Iceberg REST Catalog
The Iceberg REST catalog is an open standard for interacting with an Iceberg catalog over HTTP. Below are instructions for two popular implementations.
R2 Data Catalog (Cloudflare)
Create R2 bucket and API token
- Log in to your Cloudflare dashboard.
- Follow the Cloudflare documentation to create an R2 bucket. Make a note of the Bucket Name and your R2 Account ID.
- Follow the Cloudflare documentation to create an R2 API token with Admin Read & Write permissions. Make a note of the generated Access Key ID and Secret Access Key.
Add your destination
- Catalog URI: https://api.cloudflare.com/client/v4/accounts/YOUR_R2_ACCOUNT_ID/r2/catalog
- API Token (as the credential)
- Bucket Name and Region
- R2 Access Key ID and R2 Secret Access Key
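As a small illustration of assembling these values, the sketch below fills the R2 Account ID into the catalog URI listed above and reports which destination fields are still empty. The helper names and the dictionary keys (`catalog_uri`, `api_token`, and so on) are hypothetical labels for the fields in the list, not names used by any API.

```python
from urllib.parse import quote

CATALOG_URI_TEMPLATE = (
    "https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/catalog"
)


def r2_catalog_uri(account_id: str) -> str:
    """Fill the R2 Account ID into the catalog URI shown above."""
    account_id = account_id.strip()
    if not account_id:
        raise ValueError("R2 Account ID is required")
    # quote() guards against stray characters pasted along with the ID.
    return CATALOG_URI_TEMPLATE.format(account_id=quote(account_id, safe=""))


def missing_fields(settings: dict) -> list[str]:
    """Return the names of required destination fields that are empty."""
    required = (
        "catalog_uri", "api_token", "bucket_name",
        "region", "access_key_id", "secret_access_key",
    )
    return [name for name in required if not settings.get(name)]
```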
__r2_data_catalog. The R2 API does not validate this upfront, so an incorrect prefix will result in runtime failures when creating or querying tables.
Tabular
Create a Tabular credential
- Log in to your Tabular organization’s dashboard.
- Navigate to the credentials section and create a new credential with permissions to create tables and write data.
- Make a note of the generated Client ID and Client Secret.
Understanding Iceberg configuration options
Managing staged data
Purpose: During each transfer, batches are first written to a staging prefix in your object storage bucket before they are committed into the final Iceberg table. This prefix is always named _write_ahead_staging (for example: your_folder/_write_ahead_staging/<table_name>/<transfer_id>, or _write_ahead_staging/<table_name>/<transfer_id> if no folder/schema is configured).
Recommendation:
We recommend configuring an object storage lifecycle policy to automatically delete objects under the _write_ahead_staging prefix after 30 days. This provides a safety net for any orphaned staged files that are not cleaned up due to failed or interrupted runs.
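For S3-backed destinations, the recommended 30-day cleanup can be expressed as an S3 lifecycle configuration. The following is a minimal sketch: the `Rules` / `Filter` / `Expiration` shape is S3's lifecycle configuration format, while the function names and the rule ID are hypothetical. Other object stores, such as R2, have their own lifecycle settings and are not covered here.

```python
def staging_prefix(folder: str = "") -> str:
    """Prefix under which staged batches are written, with a trailing slash."""
    folder = folder.strip("/")
    base = "_write_ahead_staging/"
    return f"{folder}/{base}" if folder else base


def staging_lifecycle_rule(folder: str = "", days: int = 30) -> dict:
    """Lifecycle configuration that expires staged objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": "expire-write-ahead-staging",  # hypothetical rule name
                "Filter": {"Prefix": staging_prefix(folder)},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }
```

Serializing the result with `json.dumps` produces a document that can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://lifecycle.json`. Scoping the rule to the staging prefix keeps committed Iceberg data and metadata unaffected.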
retention_window_days
Purpose: Sets the number of days for which historical data (e.g., previous table snapshots used for time travel or auditing) is retained.
Recommendation: Set this value according to your organization’s internal data retention policies.
FAQ
How is data transferred into my Iceberg tables?
Should I use AWS Glue or AWS S3 Tables?
- Glue Catalog: Glue stores the table metadata, and your S3 bucket stores both staged files and the final Iceberg data under different prefixes. We run snapshot expiry and compaction, and you control the S3 layout. This is a good fit if you already use Glue as your central catalog or want to keep data in a single S3 bucket you manage directly.
- S3 Tables Catalog: The S3 Tables bucket is a fully managed table bucket where AWS stores the finalized Iceberg data and metadata. We write batches to a separate staging S3 bucket, and the catalog writes the final data into the S3 Tables bucket and handles maintenance on your behalf. This is a good fit if you prefer automatic maintenance and plan to query through engines that natively support S3 Tables.
What is Apache Iceberg and why should I use it?
Why do you need permissions to delete data?
Can I send the data to a specific prefix in a bucket?
Do I need to perform maintenance operations on the Iceberg table?
How do I know when a table has been updated?
version-hint.txt file to the metadata directory. This allows tools like PyIceberg and DuckDB to read the table directly from S3 without needing to connect to the catalog service, by pointing them to the table’s root location.
Are there any limitations on data sizes?
SUPER and VARCHAR sizes). Ensure your data fits within your query engine’s constraints to avoid query failures.
Can I mount my Iceberg data to BigQuery?
Mounting/reading an Iceberg table
You can mount/read an Iceberg table into your data warehouse of choice. Below are the supported catalog types and links to the corresponding vendor documentation:
- ClickHouse
- DuckDB / MotherDuck
- Spark
- Athena / Redshift
- Snowflake