Amazon Athena makes it easy to analyze semi-structured and non-structured data like json, csv & xml directly in Amazon S3 using SQL. However, it also allows you to easily query a number of relational databases hosted in AWS such as mySQL and PostgreSQL. For any Data Scientist, this opens up a world of potential because now it’s possible to write SQL queries that combine data from both relational and non-relational data that’s stored in your AWS data lake in a single query!
In this post, I’ll bring you through all the steps ( and all the gotchas!) to get up and running with the Athena federated query service.
Create a new Data Source
The first step, to get going, is to open the Athena service and click Connect data Source to setup a new connection.
Choose Query a data source then PostgreSQL as the data source that you want to query:
Athena uses data source connectors that run on AWS Lambda to run federated queries. Luckily, AWS have a number of prebuilt Athena data source connectors for JDBC-compliant relational data sources. In this next step we will choose to Configure a new AWS Lambda function to run our federated queries.
Selecting Configure a new AWS Lambda will open up the prebuilt Lambda function in the AWS Lambda service. All that’s left for you to do is to configure it to point to your PostgreSQL DB.
The first thing that you need to configure is the SpillBucket. Any query result sets larger than ~6MB, will spill this S3 bucket. The value supplied should be the bucket name and NOT the ARN. Also, do make sure that the account that you will be running the queries with has write access to the bucket.
The next bit of configuration that needs to be added is the default connection string for your PostgreSQL DB. It takes the form: postgres://jdbc:postgresql://<database endpoint>:<database port>/<database name>?user=<username>&password=<password>. All the details for the connection string, can be found in the Connectivity & security section for your DB in the Amazon RDS service.