When setting up JDBC or ODBC drivers for AWS Athena proves too difficult, is there another way? Well, yes! With the new RAthena and noctua packages, R users can now connect to AWS Athena using the AWS SDKs.
What is AWS Athena?
Amazon Athena is an interactive query service that makes it easy
to analyse data in Amazon S3 using standard SQL
select * from iris
My S3 Bucket
├── R_data
│   ├── iris
│   │   └── iris.csv
│   └── mtcars
│       └── mtcars.csv
└── nyc_taxi
    └── yellow_taxi_trip
        ├── year=2019
        │   ├── month=01
        │   │   └── yellow_taxi_trip.csv
        │   ├── month=02
        │   │   └── yellow_taxi_trip.csv
        │   └── ...
        └── ...
select * from yellow_taxi_trip

select * from yellow_taxi_trip
where year = '2019'

select * from yellow_taxi_trip
where year = '2019'
and month = '01'
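The nyc_taxi layout above is what Athena calls a partitioned table: the year=/month= folders become partition columns, so the WHERE clauses above let Athena scan only the matching files instead of the whole prefix. A hedged sketch of the DDL such a layout implies (the column names and the bucket path are illustrative, not from the original):

```sql
-- hypothetical DDL for the partitioned layout above
CREATE EXTERNAL TABLE yellow_taxi_trip (
  pickup_datetime  timestamp,   -- illustrative columns only
  fare_amount      double
)
PARTITIONED BY (year string, month string)   -- matches the year=/month= folders
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://path/to/my/bucket/nyc_taxi/yellow_taxi_trip/';
```

After creating the table, running `MSCK REPAIR TABLE yellow_taxi_trip` tells Athena to register the partitions it finds under that location.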
How to connect to AWS Athena?
# connecting using ODBC
library(DBI)
con <- dbConnect(odbc::odbc(), ...)
# connecting using JDBC
library(DBI)
library(RJDBC)
drv <- JDBC(...)
con <- dbConnect(drv, ...)
# connecting using AWS SDK
????
A new Connection Method …
# connecting using the AWS SDK via reticulate and boto3
library(reticulate)
boto <- import("boto3")
athena <- boto$Session()$client("athena")

# query athena
res <- athena$start_query_execution(
  QueryString = "select * from iris",
  QueryExecutionContext = list("Database" = "default"),
  ResultConfiguration = list("OutputLocation" = "s3://path/to/my/bucket/")
)

# check status of query
status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)

# if query was successful then download output to local computer
if (status$QueryExecution$Status$State == "SUCCEEDED") {
  s3 <- boto$Session()$resource("s3")
  key <- paste0(res$QueryExecutionId, ".csv")
  s3$Bucket("s3://path/to/my/bucket/")$download_file(key, "c://my/computer/file.csv")
}
That is a lot of code just to query a database! Plus there is no polling function to tell us
whether the query has succeeded or failed. Is there a better way?
What, an R SDK!?
# connecting using the paws AWS SDK
athena <- paws::athena()

# query Athena
res <- athena$start_query_execution(
  QueryString = "select * from iris",
  QueryExecutionContext = list("Database" = "default"),
  ResultConfiguration = list("OutputLocation" = "s3://path/to/my/bucket/")
)

# check status of query
status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)

# if query was successful then download output to local computer
if (status$QueryExecution$Status$State == "SUCCEEDED") {
  s3 <- paws::s3()
  key <- paste0(res$QueryExecutionId, ".csv")
  obj <- s3$get_object(Bucket = "s3://path/to/my/bucket/", Key = key)
  writeBin(obj$Body, "c://my/computer/file.csv")
}
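This is where the RAthena and noctua packages mentioned at the start come in: they wrap these SDK calls behind the standard DBI interface and poll the query status for you. A minimal sketch with noctua (the staging directory and query are the same placeholders used above):

```r
# connect to AWS Athena through the DBI interface;
# noctua uses the paws SDK under the hood and handles
# polling and result retrieval for us
library(DBI)

con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://path/to/my/bucket/")

# one call sends the query, waits for it to succeed,
# and returns the result as a data.frame
df <- dbGetQuery(con, "select * from iris")

dbDisconnect(con)
```

RAthena offers the same DBI interface but calls boto3 through reticulate instead of paws, so the choice between the two packages largely comes down to whether you want a Python dependency.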