
Athena on Outerbounds

Introduction

Welcome to the Athena on Outerbounds journey!

📈 Learning objectives

The goal of this self-contained lesson is to configure your Outerbounds account to work with Amazon Athena. You will:

  • configure an S3 bucket and form an Athena database on top of it
  • set up an IAM role that allows Outerbounds to interact with Athena, and
  • run SQL queries using Athena from Outerbounds workstations and Metaflow tasks.

Create Athena resources

Amazon Athena lets you run SQL queries over data stored in S3. If you already have a bucket you want to use, skip this section and look up the IAM role that can access your Athena account. If you want to set up a test bucket, follow the rest of this section.

Create an S3 bucket

Go to the AWS console and create a bucket. Copy the bucket's ARN, or keep the bucket page open in a separate browser tab: you will need the ARN in the next section, where you create an IAM role that can operate over this bucket using Athena.
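If you prefer to script this step instead of using the console, the following is a minimal sketch using boto3. The function names, the bucket name, and the default region are assumptions made for illustration; they are not part of the journey content. The ARN helper assumes a general-purpose bucket in the standard aws partition.

```python
def bucket_arn(bucket_name: str) -> str:
    """Return the ARN of a general-purpose S3 bucket in the standard `aws` partition."""
    return f"arn:aws:s3:::{bucket_name}"


def create_test_bucket(bucket_name: str, region: str = "us-west-2") -> str:
    """Create the bucket and return its ARN. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so bucket_arn() works without boto3 installed

    s3 = boto3.client("s3", region_name=region)
    if region == "us-east-1":
        # us-east-1 is the default location and is not passed as a constraint.
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return bucket_arn(bucket_name)
```

Calling create_test_bucket("my-athena-test-bucket") with a hypothetical, globally unique bucket name would return the ARN to paste into the role policy in the next section.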

Create role

Using Athena requires an IAM role with the necessary permissions and a query result location in S3. You can see a full guide here.

Create an IAM role with the following minimum permissions:

  • AmazonAthenaFullAccess managed policy
  • S3 bucket access for query results
  • Glue Data Catalog access if using Glue catalogs

If you don't know where to start, here is a starter policy template you can adapt:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET>/*",
        "arn:aws:s3:::<YOUR_BUCKET>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetWorkGroup"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetTable",
        "glue:GetDatabase",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:UpdateDatabase",
        "glue:UpdateTable",
        "glue:DeleteTable"
      ],
      "Resource": "*"
    }
  ]
}
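
To avoid editing the <YOUR_BUCKET> placeholder by hand, you could generate the same policy from Python. This is a sketch only: the function names starter_policy and render_policy are hypothetical, and the statements simply mirror the template above.

```python
import json


def starter_policy(bucket_name: str) -> dict:
    """Build the starter policy from the template for a given bucket name."""
    bucket = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                ],
                # Object-level actions use the /* resource, bucket-level ones the bucket ARN.
                "Resource": [f"{bucket}/*", bucket],
            },
            {
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:GetWorkGroup",
                ],
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetTable",
                    "glue:GetDatabase",
                    "glue:CreateDatabase",
                    "glue:CreateTable",
                    "glue:UpdateDatabase",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                ],
                "Resource": "*",
            },
        ],
    }


def render_policy(bucket_name: str) -> str:
    """Return the policy as a JSON string ready to paste into the IAM console."""
    return json.dumps(starter_policy(bucket_name), indent=2)
```

For example, print(render_policy("my-athena-test-bucket")) would produce a document you can paste into the IAM policy editor; the bucket name here is a placeholder.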

Configure query results location

  1. Create an S3 bucket for Athena to put query results and metadata in.
  2. In the Athena console, set the query result location to your S3 bucket.
  3. Ensure your IAM role has access to this bucket.
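
Step 2 can also be done programmatically. The sketch below sets the result location on an Athena workgroup with boto3; it assumes you want a workgroup-level setting on the default "primary" workgroup, and the prefix "athena-results/" and both function names are hypothetical. Note that the caller needs the athena:UpdateWorkGroup permission, which the starter policy above does not grant.

```python
def results_location(bucket_name: str, prefix: str = "athena-results/") -> str:
    """Build the s3:// URI Athena should write query results to."""
    prefix = prefix.strip("/")
    return f"s3://{bucket_name}/{prefix}/" if prefix else f"s3://{bucket_name}/"


def set_workgroup_results_location(
    bucket_name: str, workgroup: str = "primary", region: str = "us-west-2"
) -> None:
    """Point an Athena workgroup at the results bucket. Requires boto3 and credentials."""
    import boto3  # lazy import: results_location() has no AWS dependency

    athena = boto3.client("athena", region_name=region)
    athena.update_work_group(
        WorkGroup=workgroup,
        ConfigurationUpdates={
            "ResultConfigurationUpdates": {
                "OutputLocation": results_location(bucket_name)
            }
        },
    )
```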

Chain Athena role with Outerbounds task role

Follow the instructions in the Integrations view of your Outerbounds deployment to:

  1. Update your role's trust policy to add the Outerbounds task role as a principal.
  2. Tag your role with the key outerbounds.com/accessible-by-deployment and the value shown for your Outerbounds deployment.
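
For orientation, the two steps above could look roughly like this with boto3. This is a sketch under assumptions: the exact trust statement, the task role ARN, and the tag value must come from the Integrations view, and the function names are hypothetical. The helper appends a statement to the existing trust policy rather than replacing it, so existing principals are kept.

```python
import copy
import json
from urllib.parse import unquote

TAG_KEY = "outerbounds.com/accessible-by-deployment"


def add_task_role(trust_policy: dict, task_role_arn: str) -> dict:
    """Return a copy of trust_policy that also lets task_role_arn assume the role."""
    updated = copy.deepcopy(trust_policy)
    updated.setdefault("Version", "2012-10-17")
    statements = updated.get("Statement", [])
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]
    statements.append(
        {
            "Effect": "Allow",
            "Principal": {"AWS": task_role_arn},
            "Action": "sts:AssumeRole",
        }
    )
    updated["Statement"] = statements
    return updated


def chain_role(role_name: str, task_role_arn: str, deployment_value: str) -> None:
    """Update the trust policy and tag the role. Requires boto3 and IAM permissions."""
    import boto3  # lazy import so add_task_role() can be used without AWS access

    iam = boto3.client("iam")
    current = iam.get_role(RoleName=role_name)["Role"]["AssumeRolePolicyDocument"]
    if isinstance(current, str):  # defensive: decode if returned URL-encoded
        current = json.loads(unquote(current))
    iam.update_assume_role_policy(
        RoleName=role_name,
        PolicyDocument=json.dumps(add_task_role(current, task_role_arn)),
    )
    iam.tag_role(RoleName=role_name, Tags=[{"Key": TAG_KEY, "Value": deployment_value}])
```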

Download the content to your workstation

Run the following command to download the content to your workstation:

outerbounds tutorials pull --url https://outerbounds-journeys-content.s3.us-west-2.amazonaws.com/main/journeys.tar.gz --destination-dir ~/learn

The downloaded content may include code packages for several journeys; the one we are interested in will reside under ~/learn/athena.

If you are not running this example on Outerbounds, you can change the ~/learn directory to a destination of your choice. If you are running on the platform, click next once you see the message "Tutorials pulled successfully".

Setup

Open the notebook in 00-setup from the ~/learn/athena directory. This notebook will guide you through the process of putting data into your S3 bucket to work with Athena.

Use Athena in a workstation notebook

Open the notebook in 01-nb from the ~/learn/athena directory. This notebook will guide you through the process of running SQL queries using Athena. You will:

  • connect to Athena using your configured role,
  • write and execute SQL queries, and
  • retrieve and analyze query results.
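
To give a feel for these three steps before you open the notebook, here is a minimal sketch using plain boto3. It is not the notebook's code, which may connect differently: the function names, the session name, the default region, and the polling interval are all assumptions. It assumes the role with STS, submits a query, polls until it finishes, and converts the first page of results into dictionaries.

```python
import time


def athena_client(role_arn: str, region: str = "us-west-2"):
    """Assume the Athena role and return an Athena client. Requires boto3."""
    import boto3  # lazy import: the helpers below work with any compatible client

    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName="athena-notebook"
    )["Credentials"]
    return boto3.client(
        "athena",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def rows_to_dicts(result: dict) -> list:
    """Convert a GetQueryResults response into a list of {column: value} dicts."""
    result_set = result["ResultSet"]
    columns = [c["Name"] for c in result_set["ResultSetMetadata"]["ColumnInfo"]]
    rows = [
        [cell.get("VarCharValue") for cell in row["Data"]]  # missing value -> None
        for row in result_set["Rows"]
    ]
    # For SELECT queries the first row repeats the column names; drop it.
    if rows and rows[0] == columns:
        rows = rows[1:]
    return [dict(zip(columns, row)) for row in rows]


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Submit sql, wait for it to finish, and return the first page of rows."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")
    # Only the first page is fetched here; larger results need pagination.
    return rows_to_dicts(athena.get_query_results(QueryExecutionId=query_id))
```

The same run_query helper could be called from inside a Metaflow step, since it only needs an Athena client and an S3 output location.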

Use Athena in a workflow

Open the 02-flow directory from the ~/learn/athena directory. This directory contains a Metaflow flow that runs SQL queries using Athena.

Run the flow with the following command:

python flow.py --environment=fast-bakery run --with kubernetes

Next steps

You have completed the primary steps of this journey, showing how you can use Athena features from Outerbounds. There are many more ways to integrate Outerbounds with other AWS services!

Some potential next steps:

  • Create more complex queries combining multiple data sources.
  • Build automated reporting workflows using Athena.
  • Integrate Athena queries into your ML pipelines.