Athena

Instructions for connecting to an AWS Athena service and the underlying S3 storage as a source

Prerequisites

  • By default, Athena authentication uses role-based access. You will need the trust policy prepopulated with the data-syncing service's identifier to grant access. It should look similar to the following JSON object with a proper service account identifier:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRoleWithWebIdentity"
      ],
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "<some_service_account_identifier>"
        }
      }
    }
  ]
}

πŸ“˜

Trust policies for AWS deployments

If you are running Prequel on a self-hosted AWS deployment, the Trust Policy JSON will reference an AWS federation account instead of the Google Cloud federation service seen in the above JSON document.

Step 1: Create a staging bucket, service policy, and role

Create Athena staging bucket

Follow these steps to create a bucket to be used for staging data before transferring to a destination.

  1. Navigate to the S3 service page.
  2. Click Create bucket.
  3. Enter a Bucket name, select the same AWS Region as your Athena S3 source, and modify any of the default settings as desired. Note: Object Ownership can be set to "ACLs disabled" and Block Public Access settings for this bucket can be set to "Block all public access" as recommended by AWS. Make note of the Bucket name and AWS Region.
  4. Click Create bucket.

Create Athena access policy

  1. Navigate to the IAM service page, click on the Policies navigation tab, and click Create policy.
  2. Click the JSON tab, and paste the following policy, being sure to replace ACCOUNT_ID, WORKGROUP, BUCKET_SOURCE and BUCKET_STAGING with the your account information.
    1. WORKGROUP should be primary unless otherwise specified during connection configuration.
    2. BUCKET_SOURCE should refer to the bucket(s) containing the underlying data to be queried by Athena.
    3. BUCKET_STAGING should refer to the staging bucket created in the previous step. Note: a staging folder is created automatically in the provided staging bucket to temporarily "stage" query results before transferring to destinations.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAthenaAccess",
            "Effect": "Allow",
            "Action": [
                "athena:GetQueryResults",
                "athena:StartQueryExecution",
                "athena:StartSession",
                "athena:GetDatabase",
                "athena:GetDataCatalog",
                "athena:GetWorkGroup",
                "athena:GetTableMetadata",
                "athena:GetQueryExecution"
            ],
            "Resource": [
                "arn:aws:athena:*:ACCOUNT_ID:workgroup/WORKGROUP"
            ]
        },
        {
          "Sid": "AllowGlueAccessToSourceDatabaseAndTables",
          "Effect": "Allow",
            "Action": [
                "glue:GetDatabases",
                "glue:GetDatabase",
                "glue:GetTables",
                "glue:GetTable",
                "glue:GetPartitions",
                "glue:CreateTable",
                "glue:CreateDatabase",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:*:ACCOUNT_ID:catalog",
                "arn:aws:glue:*:ACCOUNT_ID:database/DATABASE_SOURCE",
                "arn:aws:glue:*:ACCOUNT_ID:database/default",
                "arn:aws:glue:*:ACCOUNT_ID:table/DATABASE_SOURCE/*"
            ]
        },
        {
            "Sid": "AllowGlueAccessToDeleteSnapshotTables",
            "Effect": "Allow",
            "Action": [
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:*:ACCOUNT_ID:catalog",
                "arn:aws:glue:*:ACCOUNT_ID:database/DATABASE_SOURCE",
                "arn:aws:glue:*:ACCOUNT_ID:table/DATABASE_SOURCE/prequel*"
            ]
        },
        {
            "Sid": "AllowS3AccessToSourceBucket",  
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_SOURCE",
                "arn:aws:s3:::BUCKET_SOURCE/*"
            ]
        },
        {
            "Sid": "AllowS3AccessToStagingBucket",    
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_STAGING",
                "arn:aws:s3:::BUCKET_STAGING/*"
            ]
        }
    ]
}

πŸ“˜

Athena vs. S3 permissions

Because Athena uses S3 as a data source, the Resource access requested in the policy are scoped down via resource-specific permissions in the S3 actions.

πŸ“˜

BUCKET_SOURCE vs. BUCKET_STAGING

During source configuration, you will notice both a BUCKET_SOURCE and a BUCKET_STAGING. They are used for two different purposes here:

  • BUCKET_SOURCE should be the name of the S3 bucket in which the source data is located. Permissions on this resource are limited to read-only operations.
  • BUCKET_STAGING should be the name of the S3 bucket in which the Athena query results are written (i.e., the automatically generated athena_output/ data).
  1. Click through to the Review step, choose a name for the policy, for example, transfer-service-policy (this will be referenced in the next step), add a description, and click Create policy.

Create role

  1. Navigate to the IAM service page.
  2. Navigate to the Roles navigation tab, and click Create role.
  3. Select Custom trust policy and paste the provided trust policy (referenced in the prerequisite at the top of this guide) to allow AssumeRole access to this role. Click Next.
  4. Add the permissions policy created above, and click Next.
  5. Enter a Role name, for example, transfer-role, and click Create role.
  6. Once successfully created, search for the created role in the Roles list, click the role name, and make a note of the ARN value.

🚧

Alternative authentication method: AWS User with HMAC Access Key ID & Secret Access Key

Role based authentication is the preferred authentication mode for Athena based on AWS recommendations, however, HMAC Access Key ID & Secret Access Key is an alternative authentication method that can be used if preferred.

  1. Navigate to the IAM service page.
  2. Navigate to the Users navigation tab, and click Add users.
  3. Enter a User name for the service, for example, transfer-service, click Next. Under Select AWS access type, select the Access key - Programatic access option. Click Next: Permissions.
  4. Click the Attach existing policies directly option, and search for the name of the policy created in the previous step. Select the policy, and click Next: Tags.
  5. Click Next: Review and click Create user.
  6. In the Success screen, record the Access key ID and the Secret access key.

Step 2: Add source to Prequel

Test the connection & add the configured source to Prequel in the Prequel UI or via cURL request.