When setting up JDBC or ODBC drivers for AWS Athena proves too difficult, is there another way? Well, yes! With the new RAthena and noctua packages, R users can now connect to AWS Athena using AWS SDKs.
2.
What is AWS Athena?
Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3 using standard SQL.
select * from iris
My S3 Bucket
  R_data/
    iris/
      iris.csv
    mtcars/
      mtcars.csv
  nyc_taxi/
    yellow_taxi_trip/
      year=2019/
        month=01/
          yellow_taxi_trip.csv
        month=02/
          yellow_taxi_trip.csv
        ...
      ...
-- scans every partition
select * from yellow_taxi_trip

-- restricts the scan to the year=2019 partitions
select * from yellow_taxi_trip
where year = '2019'

-- restricts the scan to a single month partition
select * from yellow_taxi_trip
where year = '2019'
and month = '01'
3.
How to connect to AWS Athena?
# connecting using ODBC
library(DBI)
con <- dbConnect(odbc::odbc(), ...)
# connecting using jdbc
library(DBI)
library(RJDBC)
drv <- JDBC(...)
con <- dbConnect(drv, ...)
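For illustration only, the "..." above might be filled in along these lines; the driver name, region and authentication settings are assumptions for the Simba Athena ODBC driver and will vary per setup.
# hypothetical ODBC example (parameter names assume the Simba Athena ODBC driver)
library(DBI)
con <- dbConnect(odbc::odbc(),
                 Driver = "Simba Athena ODBC Driver",          # assumed driver name
                 AwsRegion = "eu-west-1",                      # assumed region
                 S3OutputLocation = "s3://path/to/my/bucket/",
                 AuthenticationType = "Default Credentials")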
# connecting using AWS SDK
????
4.
A new Connection Method …
# connecting using AWS SDK (boto3 via reticulate)
library(reticulate)
boto <- import("boto3")
athena <- boto$Session()$client("athena")
# query athena
res <- athena$start_query_execution(
  QueryString = "select * from iris",
  QueryExecutionContext = list("Database" = "default"),
  ResultConfiguration = list("OutputLocation" = "s3://path/to/my/bucket/"))
# check status of query
status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)
# if query was successful then download output to local computer
if (status$QueryExecution$Status$State == "SUCCEEDED") {
  s3 <- boto$Session()$resource("s3")
  key <- paste0(res$QueryExecutionId, ".csv")
  s3$Bucket("path/to/my/bucket")$download_file(key, "c:/my/computer/file.csv")
}
That is a lot of code just to query a database! Plus there is no polling function to tell us whether the query succeeded. Is there a better way?
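For instance, a minimal polling sketch, assuming the athena client and res objects from the snippet above (wait_for_query is a hypothetical helper, not part of boto3):
# hypothetical helper: poll Athena until the query reaches a terminal state
wait_for_query <- function(athena, execution_id, poll_interval = 1) {
  repeat {
    status <- athena$get_query_execution(QueryExecutionId = execution_id)
    state <- status$QueryExecution$Status$State
    if (state %in% c("SUCCEEDED", "FAILED", "CANCELLED")) return(state)
    Sys.sleep(poll_interval)
  }
}
state <- wait_for_query(athena, res$QueryExecutionId)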
5.
Introducing RAthena
library(DBI)
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = "s3://path/to/my/bucket/")
dbGetQuery(con, "select * from iris")
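RAthena talks to AWS through the boto3 SDK, so existing AWS credentials can be reused. As a sketch, assuming credentials live in a named AWS profile (the profile and region below are made-up values), the connection can also be configured explicitly:
# assumed profile name and region, shown for illustration only
con <- dbConnect(RAthena::athena(),
                 profile_name = "my_profile",
                 region_name = "eu-west-1",
                 s3_staging_dir = "s3://path/to/my/bucket/")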
6.
What about an R SDK!?
# connecting using AWS SDK
athena <- paws::athena()
# query Athena
res <- athena$start_query_execution(
  QueryString = "select * from iris",
  QueryExecutionContext = list("Database" = "default"),
  ResultConfiguration = list("OutputLocation" = "s3://path/to/my/bucket/"))
# check status of query
status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)
# if query was successful then download output to local computer
if (status$QueryExecution$Status$State == "SUCCEEDED") {
  s3 <- paws::s3()
  key <- paste0(res$QueryExecutionId, ".csv")
  obj <- s3$get_object(Bucket = "path/to/my/bucket", Key = key)
  writeBin(obj$Body, "c:/my/computer/file.csv")
}
7.
Introducing noctua
library(DBI)
con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://path/to/my/bucket/")
dbGetQuery(con, "select * from iris")
9.
Other features
library(DBI)
library(dplyr)
con <- dbConnect(noctua::athena(),
                 s3_staging_dir = "s3://path/to/my/bucket/")
dbWriteTable(con, "mtcars", mtcars)
# or dplyr method
copy_to(con, mtcars)
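To give a flavour of what this enables, a small follow-on sketch (assuming the mtcars table written above): dplyr verbs are translated to SQL and executed in Athena, and only collect() pulls the result into R.
# lazy dplyr query: filter/select run in Athena, collect() brings the result into R
tbl(con, "mtcars") %>%
  filter(cyl == 8) %>%
  select(mpg, cyl, wt) %>%
  collect()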