Athena on Outerbounds
Introduction
Welcome to the Athena on Outerbounds journey!
Learning objectives
The goal of this self-contained lesson is to configure your Outerbounds account to work with Amazon Athena. You will:
- configure an S3 bucket and create an Athena database on top of it,
- set up an IAM role that allows Outerbounds to interact with Athena, and
- run SQL queries using Athena from Outerbounds workstations and Metaflow tasks.
Create Athena resources
Amazon Athena lets you run SQL queries over data stored in S3. If you want to use your own bucket, skip this section and fetch the IAM role that can access your Athena account. If you want to set up a test bucket, follow the rest of this section.
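To make "SQL over data in S3" concrete, here is a minimal sketch of how an Athena table is declared on top of an S3 prefix. The helper name `build_create_table_ddl`, the database, table, and column names, and the CSV layout are all assumptions for illustration; the setup notebook in this journey may define its tables differently.

```python
def build_create_table_ddl(database, table, columns, s3_location):
    """Return a CREATE EXTERNAL TABLE statement for CSV files under s3_location.

    `columns` is a list of (name, athena_type) pairs. All names here are
    hypothetical examples, not values taken from the tutorial notebooks.
    """
    if not s3_location.startswith("s3://"):
        raise ValueError("s3_location must start with s3://")
    # Athena tables point at a prefix (folder), so make sure it ends with "/".
    location = s3_location if s3_location.endswith("/") else s3_location + "/"
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'\n"
        # Assumes each CSV file starts with a header row.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


ddl = build_create_table_ddl(
    "demo_db",
    "events",
    [("user_id", "string"), ("amount", "double")],
    "s3://<YOUR_BUCKET>/events",
)
print(ddl)
```

The table only stores metadata: the data itself stays in the bucket, and Athena reads it at query time.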
Create an S3 bucket
Go to the AWS console and create a bucket. Copy the bucket's ARN, or keep the bucket page open in a separate browser tab: you will need it in the next section, where you create an IAM role that can operate over this bucket using Athena.
Create role
Athena requires an IAM role with the necessary permissions and a query result location in S3. You can see a full guide here.
Create an IAM role with the following minimum permissions:
- AmazonAthenaFullAccess managed policy
- S3 bucket access for query results
- Glue Data Catalog access if using Glue catalogs
If you don't know where to start, here is a starter policy template. Replace <YOUR_BUCKET> with the name of your bucket.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET>/*",
        "arn:aws:s3:::<YOUR_BUCKET>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:StartQueryExecution",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetWorkGroup"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetTable",
        "glue:GetDatabase",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:UpdateDatabase",
        "glue:UpdateTable",
        "glue:DeleteTable"
      ],
      "Resource": "*"
    }
  ]
}
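If you prefer to generate the policy document rather than edit the placeholder by hand, the template can be built from a bucket name. This is a minimal sketch: the function name `build_athena_policy` and the example bucket name are assumptions, while the actions are copied from the template above.

```python
import json


def build_athena_policy(bucket):
    """Return the starter policy as a dict, filled in for `bucket`."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read and write objects in the bucket, and list the bucket itself.
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket",
                ],
                "Resource": [f"{bucket_arn}/*", bucket_arn],
            },
            {
                # Start queries, poll their status, and fetch results.
                "Effect": "Allow",
                "Action": [
                    "athena:StartQueryExecution",
                    "athena:GetQueryExecution",
                    "athena:GetQueryResults",
                    "athena:GetWorkGroup",
                ],
                "Resource": "*",
            },
            {
                # Manage databases and tables in the Glue Data Catalog.
                "Effect": "Allow",
                "Action": [
                    "glue:GetTable",
                    "glue:GetDatabase",
                    "glue:CreateDatabase",
                    "glue:CreateTable",
                    "glue:UpdateDatabase",
                    "glue:UpdateTable",
                    "glue:DeleteTable",
                ],
                "Resource": "*",
            },
        ],
    }


# "my-athena-test-bucket" is a hypothetical bucket name.
print(json.dumps(build_athena_policy("my-athena-test-bucket"), indent=2))
```

The printed JSON can be pasted into the IAM console's policy editor. Note that the S3 statement lists both the bucket ARN and the `/*` object ARN, because `s3:ListBucket` applies to the bucket while `s3:GetObject` and `s3:PutObject` apply to the objects inside it.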
Configure query results location
- Create an S3 bucket for Athena to put query results and metadata in.
- In the Athena console, set the query result location to your S3 bucket.
- Ensure your IAM role has access to this bucket.
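As an alternative to the console, the result location of a workgroup can also be set through the Athena API. The sketch below assumes boto3 is installed, that the caller's credentials are allowed to call `athena:UpdateWorkGroup` (which the starter policy template does not grant), and that you use the default `primary` workgroup. The helper name and the bucket placeholder are hypothetical.

```python
def set_query_result_location(athena_client, output_location, workgroup="primary"):
    """Point a workgroup's query results at an S3 location.

    `athena_client` is expected to behave like boto3.client("athena").
    """
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must look like s3://bucket/prefix/")
    athena_client.update_work_group(
        WorkGroup=workgroup,
        ConfigurationUpdates={
            "ResultConfigurationUpdates": {"OutputLocation": output_location}
        },
    )


# Usage, assuming AWS credentials are configured in your environment:
#
#   import boto3
#   set_query_result_location(
#       boto3.client("athena"), "s3://<YOUR_RESULTS_BUCKET>/results/"
#   )
```

Passing the client in as an argument keeps the helper easy to try against a stub before pointing it at a real account.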
Chain Athena role with Outerbounds task role
Follow instructions on the Integrations view in your Outerbounds deployment to:
- Update your role's trust policy to add the Outerbounds task role as a principal.
- Tag your role with the key outerbounds.com/accessible-by-deployment and the value from your Outerbounds deployment.
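The two steps above can be sketched in Python as plain data transformations. This is only an illustration of the general shape of a trust policy update: the exact statement and tag value to use are the ones shown on your Integrations view, and the role ARN and deployment value below are hypothetical placeholders. Only the tag key comes from this page.

```python
import copy

TAG_KEY = "outerbounds.com/accessible-by-deployment"


def _trusts(statement, principal_arn):
    """True if `statement` is an Allow statement naming `principal_arn`."""
    principal = statement.get("Principal")
    if not isinstance(principal, dict):
        return False
    aws = principal.get("AWS", [])
    principals = [aws] if isinstance(aws, str) else list(aws)
    return statement.get("Effect") == "Allow" and principal_arn in principals


def add_trusted_principal(trust_policy, principal_arn):
    """Return a copy of `trust_policy` that lets `principal_arn` assume the role."""
    updated = copy.deepcopy(trust_policy)
    updated.setdefault("Version", "2012-10-17")
    statements = updated.setdefault("Statement", [])
    # Do nothing if the principal is already named in an Allow statement.
    if any(_trusts(statement, principal_arn) for statement in statements):
        return updated
    statements.append(
        {
            "Effect": "Allow",
            "Principal": {"AWS": principal_arn},
            "Action": "sts:AssumeRole",
        }
    )
    return updated


def deployment_tag(deployment_value):
    """Return the role tag in the Key/Value shape used by the IAM API."""
    return {"Key": TAG_KEY, "Value": deployment_value}


# Hypothetical values; copy the real ones from the Integrations view.
task_role_arn = "arn:aws:iam::123456789012:role/example-outerbounds-task-role"
new_policy = add_trusted_principal({"Version": "2012-10-17", "Statement": []}, task_role_arn)
tag = deployment_tag("example-deployment")
```

With boto3, the results could then be applied with `iam.update_assume_role_policy(RoleName=..., PolicyDocument=json.dumps(new_policy))` and `iam.tag_role(RoleName=..., Tags=[tag])`, or entered by hand in the IAM console.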
Download the content to your workstation
Run the following command to download the content to your workstation:
outerbounds tutorials pull --url https://outerbounds-journeys-content.s3.us-west-2.amazonaws.com/main/journeys.tar.gz --destination-dir ~/learn
The downloaded content may include code packages for several journeys; the one we are interested in will reside under ~/learn/athena.
If you are not running this example on Outerbounds, you can change the ~/learn directory to a destination of your choice. If you are running on the platform, click Next once you see the message "Tutorials pulled successfully".
Setup
Open the notebook in 00-setup from the ~/learn/athena directory. This notebook will guide you through the process of putting data into your S3 bucket to work with Athena.
Use Athena in a workstation notebook
Open the notebook in 01-nb from the ~/learn/athena directory. This notebook will guide you through the process of running SQL queries using Athena. You will:
- connect to Athena using your configured role,
- write and execute SQL queries, and
- retrieve and analyze query results.
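For orientation, the query lifecycle behind these steps looks roughly like the sketch below: start a query, poll until it finishes, then fetch the rows. It is not the notebook's code. It assumes a boto3 Athena client whose credentials already map to your configured role, and the helper names, database, and bucket placeholder are hypothetical. It uses only the three Athena actions granted by the starter policy template.

```python
import time


def rows_to_dicts(response):
    """Convert a get_query_results response into a list of dicts.

    Assumes a SELECT query, where the first row holds the column names.
    """
    rows = response["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, [cell.get("VarCharValue") for cell in row["Data"]]))
        for row in rows[1:]
    ]


def run_query(athena, sql, database, output_location,
              poll_seconds=1.0, timeout_seconds=300.0):
    """Run `sql` and return the first page of results as dicts."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        state = status["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "")
            raise RuntimeError(f"Query {query_id} {state}: {reason}")
        if time.monotonic() > deadline:
            raise TimeoutError(f"Query {query_id} did not finish in time")
        time.sleep(poll_seconds)
    # Only the first page is fetched here; large results need pagination.
    return rows_to_dicts(athena.get_query_results(QueryExecutionId=query_id))


# Usage, assuming AWS credentials are configured in your environment:
#
#   import boto3
#   rows = run_query(
#       boto3.client("athena"),
#       "SELECT * FROM events LIMIT 10",
#       database="demo_db",
#       output_location="s3://<YOUR_RESULTS_BUCKET>/results/",
#   )
```

All values come back as strings (or `None` for nulls), so convert numeric columns before analysis.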
Use Athena in a workflow
Open the 02-flow directory under ~/learn/athena. This directory contains a Metaflow flow that runs SQL queries using Athena.
Run the flow with the following command:
python flow.py --environment=fast-bakery run --with kubernetes
Next steps
You have completed the primary steps of this journey, showing how you can use Athena features from Outerbounds. There are many more ways to integrate Outerbounds with other AWS services!
Some potential next steps:
- Create more complex queries combining multiple data sources.
- Build automated reporting workflows using Athena.
- Integrate Athena queries into your ML pipelines.