S3

Prerequisites

By default, S3 authentication uses role-based access. To grant access, you need the trust policy, which comes prepopulated with the data syncing service's identifiers. It should look similar to the following JSON object, with actual organization and service account identifiers in place of the placeholders:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRoleWithWebIdentity"
      ],
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Condition": {
        "StringEquals": {
          "accounts.google.com:oaud": "<some_organization_identifier>",
          "accounts.google.com:sub": "<some_service_account_identifier>"
        }
      }
    }
  ]
}
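Before pasting the trust policy into AWS, it can help to confirm that the placeholders were actually replaced. The following is a minimal sketch, not part of the service: the function name check_trust_policy is hypothetical, and it assumes the policy has the single-statement shape shown above with plain string claim values.

```python
import json

EXPECTED_ACTION = "sts:AssumeRoleWithWebIdentity"
CLAIM_KEYS = ("accounts.google.com:oaud", "accounts.google.com:sub")


def check_trust_policy(policy_json: str) -> list:
    """Return a list of problems found; an empty list means the checks passed.

    Assumption: one statement with string-valued StringEquals conditions,
    as in the example trust policy above.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if not statements:
        return ["policy has no Statement entries"]

    problems = []
    statement = statements[0]

    # "Action" may be a single string or a list of strings in IAM policies.
    actions = statement.get("Action", [])
    if isinstance(actions, str):
        actions = [actions]
    if EXPECTED_ACTION not in actions:
        problems.append(f"Action does not include {EXPECTED_ACTION}")

    claims = statement.get("Condition", {}).get("StringEquals", {})
    for key in CLAIM_KEYS:
        value = claims.get(key, "")
        # Values such as "<some_organization_identifier>" are template text.
        if not value or (value.startswith("<") and value.endswith(">")):
            problems.append(f"{key} is missing or still a placeholder")
    return problems
```

Running this against the template as printed above reports both claims as placeholders; a policy with real identifiers returns an empty list.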

Step 1: Set up destination S3 bucket

Create bucket

Navigate to the S3 service page.
Click Create bucket.
Enter a Bucket name and modify any of the default settings as desired. Note: Object Ownership can be set to "ACLs disabled" and Block Public Access settings for this bucket can be set to "Block all public access" as recommended by AWS. Make note of the Bucket name and AWS Region.
Click Create bucket.

📘
Recommendation: dedicated bucket for data transfers
Use a unique bucket for these transfers. This:
- Prevents resource contention with other workloads
- Avoids accidental data loss from mixed lifecycle or cleanup rules
- Improves security by reducing surface area and enabling tighter, destination-scoped policies

🧹
Optional: Add a short retention lifecycle policy
Because the bucket is not used to persist data, you may configure a lifecycle rule on the staging bucket to automatically delete objects older than 2 days. In the bucket's Management tab, click Create lifecycle rule and set an expiration action for current versions of objects with an age of 2 days. This step is optional, because the transfer logic automatically cleans up files after each transfer completes.
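If you prefer to define the lifecycle rule as configuration rather than through the console, the rule can be expressed as a lifecycle configuration document. The sketch below only builds that document; the function name and the rule ID are illustrative assumptions, and the empty prefix filter is assumed to mean "apply to every object in the bucket".

```python
import json


def build_expiration_rule(days: int = 2, rule_id: str = "expire-transfer-files") -> dict:
    """Build an S3 lifecycle configuration that expires current object versions.

    rule_id is a hypothetical name chosen for illustration.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                # An empty prefix matches all objects in the bucket.
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": days},
            }
        ]
    }


def to_json(config: dict) -> str:
    """Serialize the configuration, for example to save it as lifecycle.json."""
    return json.dumps(config, indent=2)
```

The saved file could then be applied with the AWS CLI, for example: aws s3api put-bucket-lifecycle-configuration --bucket BUCKET_NAME --lifecycle-configuration file://lifecycle.json.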

Step 2: Create policy and IAM role

Create policy

Navigate to the IAM service page.
Navigate to the Policies navigation tab, and click Create policy.
Click the JSON tab, and paste the following policy, being sure to replace BUCKET_NAME with the name of the bucket chosen in Step 1.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

📘
Understanding the s3:DeleteObject requirement
By default, a connection test is performed against the destination during initial configuration, and s3:DeleteObject is required to clean up the test artifacts. Once the test has succeeded and the destination has been added, this action can be safely removed, as S3 destinations are append-only by default.

🔐
KMS encryption (optional)
If your S3 destination bucket uses KMS encryption (CMK), add the following statement to the Statement array of your IAM policy to allow data encryption/decryption with your KMS key. Encryption with SSE-C is not currently supported.
{
  "Effect": "Allow",
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "arn:aws:kms:REGION_NAME:ACCOUNT_ID:key/KEY_ID"
}
Replace REGION_NAME, ACCOUNT_ID, and KEY_ID with your values.

Click Next: Tags, then click Next: Review.
Name the policy, add a description, and click Create policy.
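The permissions policy, with or without the optional KMS statement, can also be assembled programmatically. This is a minimal sketch under stated assumptions: the function name and its parameters are hypothetical, and the output simply mirrors the JSON shown in this step.

```python
import json
from typing import Optional


def build_permissions_policy(
    bucket_name: str,
    kms_key_arn: Optional[str] = None,
    include_delete: bool = True,
) -> dict:
    """Build the IAM permissions policy for the destination bucket.

    include_delete=False drops s3:DeleteObject, which is only needed for the
    initial connection test. kms_key_arn adds the KMS statement when set.
    """
    actions = ["s3:PutObject"]
    if include_delete:
        actions.append("s3:DeleteObject")

    statements = [
        {
            "Effect": "Allow",
            "Action": actions,
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        }
    ]
    if kms_key_arn:
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
                "Resource": kms_key_arn,
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


def to_json(policy: dict) -> str:
    """Serialize the policy so it can be pasted into the JSON tab."""
    return json.dumps(policy, indent=4)
```

Calling build_permissions_policy with include_delete=False produces the reduced policy that remains sufficient once the connection test has passed.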

Create role

Navigate to the IAM service page.
Navigate to the Roles navigation tab, and click Create role.
Select Custom trust policy and paste the trust policy from the Prerequisites section, which allows the data syncing service to assume the new role. Click Next.
Add the permissions policy created above, and click Next.
Enter a Role name, for example, transfer-role, and click Create role.
Once successfully created, search for the created role in the Roles list, click the role name, and make a note of the ARN value.

📘
Alternative authentication method: AWS User with HMAC Access Key ID & Secret Access Key
Role-based authentication is the preferred authentication mode for S3, in line with AWS recommendations. However, an AWS user with an HMAC Access Key ID and Secret Access Key can be used as an alternative if preferred.

Navigate to the IAM service page.

Navigate to the Users navigation tab, and click Add users.

Enter a User name for the service, for example, transfer-service, and click Next. Under Select AWS access type, select the Access key - Programmatic access option. Click Next: Permissions.

Click the Attach existing policies directly option, and search for the name of the policy created in the previous step. Select the policy, and click Next: Tags.

Click Next: Review and click Create user.

In the Success screen, record the Access key ID and the Secret access key.

Step 3: Add your destination

Use the following details to complete the connection setup: bucket name, bucket region, and role ARN (or, if you chose the alternative authentication method, the Access key ID and Secret access key).

Permissions checklist

IAM policy on the role allows:
- s3:PutObject on arn:aws:s3:::BUCKET_NAME/*
- s3:DeleteObject on arn:aws:s3:::BUCKET_NAME/* (only required for initial connection test; may be removed after setup)
If using KMS encryption (CMK), IAM policy also allows:
- kms:GenerateDataKey and kms:Decrypt on your CMK ARN
Bucket exists in the intended region; folder prefix (if any) is configured as desired
Trust policy allows the data transfer service to assume the role

FAQ

Q: How is the S3 connection secured?

A: The recommended approach is role-based access using an IAM Role with a scoped permissions policy. The role is assumed via a trust policy and short-lived credentials, so no long-lived access keys are required. Alternatively, access can be configured with HMAC access keys if your policies require it. For at-rest encryption, S3-managed encryption and KMS CMKs are supported (see the KMS callout above for the required actions). Grant only the minimum permissions needed: s3:PutObject, plus s3:DeleteObject for the initial connection test.

Q: What are the `oaud` vs `sub` IDs used for?

A: These are identity claims used in the IAM trust policy when federating from GCP to AWS. The sub claim uniquely identifies our Google principal in the federation. The oaud claim is an additional claim that binds role assumption to your organization.

Q: How is data organized in the bucket?

A: Data lands in Hive-style partitions per model: <folder>/<model_name>/dt=<transfer_date>/<file_part>_<transfer_timestamp>.<ext>. You can set <folder> during configuration.
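A downstream reader often needs to recover the model name and partition date from an object key. The following is a minimal sketch of that parsing, assuming keys follow the pattern above exactly; the function name is hypothetical, and the file name is returned unparsed because the exact formats of <file_part> and <transfer_timestamp> are not specified here.

```python
def parse_object_key(key: str) -> dict:
    """Split a key of the form <folder>/<model_name>/dt=<transfer_date>/<file>.

    The folder may itself contain slashes, so the last three path segments
    are taken as model name, partition, and file name.
    """
    parts = key.split("/")
    if len(parts) < 3:
        raise ValueError(f"unexpected key layout: {key!r}")

    model_name, partition, filename = parts[-3:]
    if not partition.startswith("dt="):
        raise ValueError(f"missing dt= partition in key: {key!r}")

    return {
        "folder": "/".join(parts[:-3]),
        "model_name": model_name,
        "transfer_date": partition[len("dt="):],
        "filename": filename,
    }
```

For instance, a key under a hypothetical folder named exports for a model named orders would yield "exports" and "orders" in the corresponding fields, with the date taken from the dt= segment as an opaque string.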

Q: What file formats are supported?

A: Parquet (default/recommended), CSV, and JSON/JSONL.

Q: How are large datasets written?

A: Files are automatically split; multiple files may be written per model per transfer.

Q: How do I know when a transfer completed?

A: Each transfer writes one manifest file per model under the _manifests folder, which is created automatically at the root of the bucket. Manifest files use the following format: _manifests/<model_name>/dt=<transfer_date>/manifest_<transfer_id>.json.
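One way to detect completion is to check whether the expected manifest key exists among the bucket's object keys. The sketch below only constructs and looks up the key; the function names are hypothetical, the listing of bucket keys is assumed to be obtained separately, and nothing is assumed about the manifest file's contents.

```python
from typing import Iterable


def manifest_key(model_name: str, transfer_date: str, transfer_id: str) -> str:
    """Build the manifest key for one model and one transfer."""
    return f"_manifests/{model_name}/dt={transfer_date}/manifest_{transfer_id}.json"


def is_transfer_complete(
    existing_keys: Iterable[str],
    model_name: str,
    transfer_date: str,
    transfer_id: str,
) -> bool:
    """Return True if the manifest for this model and transfer is present.

    existing_keys is assumed to be a listing of object keys already fetched
    from the bucket by whatever client you use.
    """
    return manifest_key(model_name, transfer_date, transfer_id) in set(existing_keys)
```

A pipeline could poll with this check and start reading a model's data files only after its manifest appears.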

Q: Why do I sometimes see duplicates?

A: Object storage is append-only. The change detection process uses a lookback window to ensure no data is missed, which can create duplicates. Downstream pipelines should deduplicate on primary keys, prioritizing the most recent transfer window; manifest files can help bound the set of files to read.
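The deduplication described above can be sketched as keeping, for each primary key, the row from the most recent transfer. This is an illustration only: the field names id and transfer_timestamp are assumptions, not columns guaranteed by the service, and rows are represented as plain dictionaries.

```python
from typing import Dict, Hashable, Iterable, List


def deduplicate(
    rows: Iterable[dict],
    primary_key: str = "id",
    recency_field: str = "transfer_timestamp",
) -> List[dict]:
    """Keep one row per primary key, preferring the largest recency value.

    On ties, the row that appears later in the input wins. Output order
    follows the first appearance of each primary key.
    """
    latest: Dict[Hashable, dict] = {}
    for row in rows:
        key = row[primary_key]
        current = latest.get(key)
        if current is None or row[recency_field] >= current[recency_field]:
            latest[key] = row
    return list(latest.values())
```

In a warehouse, the same idea is usually expressed as a window function partitioned by the primary key and ordered by transfer recency.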
