
Google Cloud Storage

Configuring your Google Cloud Storage destination.

Prerequisites

  • By default, GCS authentication uses role-based access. You will need the data syncing service's service account email on hand to grant access; it should look like [email protected].

Step 1: Create a service account

  1. In the GCP console, navigate to the IAM & Admin menu, click into the Service Accounts tab, and click Create service account at the top of the menu.
  2. In the first step, name the service account that will be used to transfer data into Cloud Storage and click Create and Continue. Click Continue in the following optional step without assigning any roles.
  3. In the Grant users access to this service account step, within the Service account users role field, enter the provided Service account (see prerequisite) and click Done.
  4. Once created, find the new service account in the service accounts list, click the Service account name to view its details, and make a note of the email (note: this is a different email from the data syncing service's service account in the prerequisite).
  5. Select the Permissions tab, find the provided principal (the Service account from the prerequisite), click the Edit principal button (pencil icon), click Add another role, select the Service Account Token Creator role, and click Save. A scripted sketch of this step follows the list.
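If you prefer to script this step, the sketch below does the same thing with the IAM API via the Google API Python client. It is a minimal sketch, not a definitive implementation: the project ID, the new account ID, and the prerequisite service account email are placeholders you would replace with your own values.

```python
# Minimal sketch of Step 1: create the destination service account, then grant
# the data syncing service's service account (prerequisite) the Service Account
# User and Service Account Token Creator roles on it.
import google.auth
from googleapiclient import discovery

PROJECT_ID = "my-gcp-project"                                   # placeholder
DATA_SYNC_SA = "sync-service@example.iam.gserviceaccount.com"   # placeholder for the prerequisite account

credentials, _ = google.auth.default()
iam = discovery.build("iam", "v1", credentials=credentials)

# Steps 1-2: create the service account that will receive the transferred data.
account = iam.projects().serviceAccounts().create(
    name=f"projects/{PROJECT_ID}",
    body={
        "accountId": "gcs-transfer-destination",                # placeholder account ID
        "serviceAccount": {"displayName": "GCS transfer destination"},
    },
).execute()
print("Created:", account["email"])  # note this email for Steps 2 and 3

# Steps 3 and 5: let the prerequisite service account use and mint tokens for it.
resource = f"projects/{PROJECT_ID}/serviceAccounts/{account['email']}"
policy = iam.projects().serviceAccounts().getIamPolicy(resource=resource).execute()
for role in ("roles/iam.serviceAccountUser", "roles/iam.serviceAccountTokenCreator"):
    policy.setdefault("bindings", []).append(
        {"role": role, "members": [f"serviceAccount:{DATA_SYNC_SA}"]}
    )
iam.projects().serviceAccounts().setIamPolicy(
    resource=resource, body={"policy": policy}
).execute()
```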
🚧

Alternative authentication method: HMAC Access Key & Secret

Role-based authentication is the preferred mode for Google Cloud Storage, in line with GCP recommendations; however, an HMAC Access Key ID & Secret Access Key can be used as an alternative if preferred. An HMAC key is a type of credential that can be associated with a service account or a user account to access Google Cloud Storage.

  1. Navigate to the Cloud Storage page.
  2. Click into the Settings tab on the left side menu.
  3. Navigate to the Interoperability tab and click the Create a key for a Service Account button.
  4. Select the Service Account created in Step 1, and click Create key.
  5. Make a note of the Access key and Secret (a scripted equivalent follows this list).
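For reference, the same HMAC key can be created programmatically. This is a hedged sketch using the google-cloud-storage Python library; the project ID and service account email are placeholders.

```python
# Minimal sketch: create an HMAC key for the Step 1 service account.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")   # placeholder project ID
hmac_key, secret = client.create_hmac_key(
    service_account_email="gcs-transfer-destination@my-gcp-project.iam.gserviceaccount.com",  # placeholder
)
print("Access key:", hmac_key.access_id)
print("Secret:", secret)  # shown only once; store it securely
```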

Step 2: Create destination GCS bucket

  1. Navigate to the Cloud Storage page.
  2. Click Create.
  3. Enter a bucket name and choose a region. Note: at the Choose how to control access to objects step, we recommend selecting Enforce public access prevention on this bucket.
  4. After choosing your preferences for the remaining steps, click Create.
📘

Recommendation: dedicated bucket for data transfers

Use a unique bucket for these transfers. This:

  • Prevents resource contention with other workloads
  • Avoids accidental data loss from mixed lifecycle or cleanup rules
  • Improves security by reducing surface area and enabling tighter, destination-scoped policies
  5. On the Bucket details page for the bucket you created, select the Permissions tab, and click Grant access.
  6. Grant access to the service account you created in Step 1 (not the service account from the prerequisite), and assign the Role: Storage Legacy Bucket Writer. Click Save (a scripted sketch of Step 2 follows this list).
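The equivalent, scripted with the google-cloud-storage Python library, is sketched below under the same assumptions: the project ID, bucket name, location, and the Step 1 service account email are placeholders.

```python
# Minimal sketch of Step 2: create the destination bucket with public access
# prevention enforced, then grant the Step 1 service account write access.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")              # placeholder project ID

bucket = client.bucket("my-transfer-bucket")                    # placeholder bucket name
bucket.iam_configuration.public_access_prevention = "enforced"
bucket = client.create_bucket(bucket, location="us-central1")   # placeholder region

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.legacyBucketWriter",
    "members": {"serviceAccount:gcs-transfer-destination@my-gcp-project.iam.gserviceaccount.com"},  # placeholder
})
bucket.set_iam_policy(policy)
```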

Step 3: Add your destination

Securely share your bucket name, your chosen folder name for the data, and your Service account email with us to complete the connection.

Permissions checklist

  • Service account has write access to the bucket (e.g., Storage Legacy Bucket Writer), or an equivalent custom role including:
    • storage.buckets.get
    • storage.objects.list, storage.objects.get, storage.objects.create, storage.objects.delete
  • If using service account impersonation, the Service Account Token Creator role is granted to the impersonating principal (the data syncing service's service account from the prerequisite)
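One way to sanity-check this list is to ask GCS which of the required permissions the caller actually holds on the bucket. The sketch below assumes the google-cloud-storage Python library and that you run it authenticated as (or impersonating) the Step 1 service account; project and bucket names are placeholders.

```python
# Report any required bucket permissions the caller is missing.
from google.cloud import storage

REQUIRED = [
    "storage.buckets.get",
    "storage.objects.list",
    "storage.objects.get",
    "storage.objects.create",
    "storage.objects.delete",
]

client = storage.Client(project="my-gcp-project")   # placeholder project ID
bucket = client.bucket("my-transfer-bucket")         # placeholder bucket name
granted = bucket.test_iam_permissions(REQUIRED)
missing = sorted(set(REQUIRED) - set(granted))
print("Missing permissions:", missing or "none")
```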

FAQ

Q: How is the GCS connection secured?

A: We recommend a service account with role-based access (no long-lived user credentials). HMAC keys can be used when policy requires them, but short-lived tokens and least-privilege roles are preferred.
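As an illustration of the role-based approach, short-lived credentials can be obtained by impersonating the Step 1 service account with the google-auth Python library. This is a sketch under the assumptions above (placeholder project and email); it requires the caller to hold the Service Account Token Creator role on the target account.

```python
# Minimal sketch: obtain short-lived credentials by impersonating the
# destination service account, then use them with the Storage client.
import google.auth
from google.auth import impersonated_credentials
from google.cloud import storage

source_credentials, _ = google.auth.default()
target = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal="gcs-transfer-destination@my-gcp-project.iam.gserviceaccount.com",  # placeholder
    target_scopes=["https://www.googleapis.com/auth/devstorage.read_write"],
    lifetime=3600,  # seconds; tokens expire instead of living forever
)
client = storage.Client(project="my-gcp-project", credentials=target)  # placeholder project ID
```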

Q: How is data organized in the bucket?

A: Data lands in Hive-style partitions per model: <folder>/<model_name>/dt=<transfer_date>/<file_part>_<transfer_timestamp>.<ext>. To write to the bucket root, enter . as the folder name.
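For example, reading one model's partition for a given transfer date comes down to a prefix listing. A small sketch with the google-cloud-storage Python library, using placeholder bucket, folder, model, and date values:

```python
# List the objects written for one model on one transfer date.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")             # placeholder project ID
prefix = "my-folder/my_model/dt=2024-01-15/"                   # <folder>/<model_name>/dt=<transfer_date>/
for blob in client.list_blobs("my-transfer-bucket", prefix=prefix):  # placeholder bucket name
    print(blob.name, blob.size)
```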

Q: What file formats are supported?

A: Parquet (default/recommended), CSV, and JSON/JSONL.

Q: How are large datasets written?

A: Files are automatically split; multiple files may be written per model per transfer.

Q: How do I know when a transfer completed?

A: Each transfer writes a manifest file per model under _manifests, in the format _manifests/<model_name>/dt=<transfer_date>/manifest_<transfer_id>.json.
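Below is a hedged sketch of checking for manifests, assuming the path format above and placeholder names; the manifest's JSON contents are not specified here, so the sketch simply parses and prints them.

```python
# List a model's manifests for one transfer date and print their contents.
import json

from google.cloud import storage

client = storage.Client(project="my-gcp-project")         # placeholder project ID
prefix = "_manifests/my_model/dt=2024-01-15/"              # placeholder model and date
for blob in client.list_blobs("my-transfer-bucket", prefix=prefix):  # placeholder bucket
    manifest = json.loads(blob.download_as_bytes())
    print(blob.name, "->", manifest)
```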

Q: Why do I sometimes see duplicates?

A: Writes to object storage are append-only, and the change detection process uses a lookback window to ensure no data is missed, which can create duplicates across transfers. Downstream pipelines should deduplicate on primary keys, prioritizing the most recent transfer window; manifest files can help bound the set of files to read.
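As an illustration only, the pandas sketch below deduplicates two overlapping transfer windows on a primary key, keeping the most recent row. The column names (id, value) and the _transfer_ts column (carried from the transfer timestamp in the file name) are hypothetical; adapt them to your schema.

```python
# Deduplicate rows from overlapping transfer windows, keeping the latest per key.
import pandas as pd

# Suppose these frames were read from two transfers, e.g. via
# pd.read_parquet("gs://my-transfer-bucket/my-folder/my_model/dt=.../...").
older = pd.DataFrame({"id": [1, 2], "value": ["a", "b"], "_transfer_ts": ["20240114", "20240114"]})
newer = pd.DataFrame({"id": [2, 3], "value": ["b2", "c"], "_transfer_ts": ["20240115", "20240115"]})

deduped = (
    pd.concat([older, newer], ignore_index=True)
    .sort_values("_transfer_ts")                     # oldest first
    .drop_duplicates(subset=["id"], keep="last")     # keep the most recent row per key
)
print(deduped)
```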

Q: What if I change the bucket or folder?

A: New files are appended to the new location. Existing data remains in the old location.

Q: Are there file size limits?

A: No explicit size/row limits for GCS; files are split automatically based on volume and performance heuristics.