R and Athena … there is another way!?

When setting up JDBC or ODBC drivers for AWS Athena proves too difficult, is there another way? Well, yes! With the new RAthena and noctua packages, R users can now connect to AWS Athena through the AWS SDKs.


  1. R and Athena … there is another way!?
  2. What is AWS Athena?
     Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3 using standard SQL.

     Example S3 bucket layout:

     My S3 Bucket
       R_data/
         iris/iris.csv
         mtcars/mtcars.csv
       nyc_taxi/
         yellow_taxi_trip/
           year=2019/
             month=01/yellow_taxi_trip.csv
             month=02/yellow_taxi_trip.csv
             ...
           ...

     Queries against this layout:

     select * from iris
     select * from yellow_taxi_trip
     select * from yellow_taxi_trip where year = '2019'
     select * from yellow_taxi_trip where year = '2019' and month = '01'
  3. How to connect to AWS Athena?

     # connecting using ODBC
     library(DBI)
     con <- dbConnect(odbc::odbc(), ...)

     # connecting using JDBC
     library(DBI)
     library(RJDBC)
     drv <- JDBC(...)
     con <- dbConnect(drv, ...)

     # connecting using an AWS SDK
     ????

     (a fuller ODBC example, with assumed driver settings, is sketched after the slides)
  4. A new Connection Method …

     # connecting using the AWS SDK (boto3 via reticulate)
     library(reticulate)
     boto <- import("boto3")
     athena <- boto$Session()$client("athena")

     # query Athena
     res <- athena$start_query_execution(
       QueryString = "select * from iris",
       QueryExecutionContext = list("Database" = "default"),
       ResultConfiguration = list("OutputLocation" = "s3://path/to/my/bucket/"))

     # check the status of the query
     status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)

     # if the query was successful, download the output to the local computer
     if (status$QueryExecution$Status$State == "SUCCEEDED") {
       s3 <- boto$Session()$resource("s3")
       key <- paste0(res$QueryExecutionId, ".csv")
       s3$Bucket("path/to/my/bucket")$download_file(key, "c:/my/computer/file.csv")
     }

     That is a lot of code just to query a database! Plus there is no poll function to tell us whether the query has finished (a simple polling loop is sketched after the slides). Is there a better way?
  5. Introducing RAthena

     library(DBI)
     con <- dbConnect(RAthena::athena(),
                      s3_staging_dir = "s3://path/to/my/bucket/")
     dbGetQuery(con, "select * from iris")
  6. What about an R SDK!?

     # connecting using the AWS SDK (paws)
     athena <- paws::athena()

     # query Athena
     res <- athena$start_query_execution(
       QueryString = "select * from iris",
       QueryExecutionContext = list("Database" = "default"),
       ResultConfiguration = list("OutputLocation" = "s3://path/to/my/bucket/"))

     # check the status of the query
     status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)

     # if the query was successful, download the output to the local computer
     if (status$QueryExecution$Status$State == "SUCCEEDED") {
       s3 <- paws::s3()
       key <- paste0(res$QueryExecutionId, ".csv")
       obj <- s3$get_object(Bucket = "path/to/my/bucket", Key = key)
       writeBin(obj$Body, "c:/my/computer/file.csv")
     }
  7. Introducing noctua

     library(DBI)
     con <- dbConnect(noctua::athena(),
                      s3_staging_dir = "s3://path/to/my/bucket/")
     dbGetQuery(con, "select * from iris")

     (passing credentials and a region explicitly is sketched after the slides)
  8. RStudio integration
  9. Other features

     library(DBI)
     library(dplyr)
     con <- dbConnect(noctua::athena(),
                      s3_staging_dir = "s3://path/to/my/bucket/")

     dbWriteTable(con, "mtcars", mtcars)
     # or dplyr method
     copy_to(con, mtcars)

     (querying the uploaded table with dplyr verbs is sketched after the slides)
  10. Useful links …
      • AWS Athena: https://aws.amazon.com/athena/
      • Boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
      • reticulate: https://rstudio.github.io/reticulate/
      • paws: https://paws-r.github.io/
      • RAthena: https://dyfanjones.github.io/RAthena/
      • noctua: https://dyfanjones.github.io/noctua/

      Stuff about me:
      • https://github.com/DyfanJones
      • https://dyfanjones.me/
      • https://www.linkedin.com/in/dyfan-jones-a8261799/
  11. Any Questions? Thanks for listening
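
Sketch referenced from slide 3 — what a fuller ODBC connection might look like. The driver name and connection keys below (Driver, AwsRegion, AuthenticationType, S3OutputLocation) are assumptions based on a typical Simba Athena ODBC driver install, not part of the original slides; check your driver's documentation for the exact names and values.

    # ODBC connection sketch -- driver name and key names are assumed,
    # as found in a typical Simba Athena ODBC driver install
    library(DBI)
    con <- dbConnect(odbc::odbc(),
                     Driver             = "Simba Athena ODBC Driver",  # assumed driver name
                     AwsRegion          = "eu-west-1",                 # placeholder region
                     AuthenticationType = "Default Credentials",       # assumed auth setting
                     S3OutputLocation   = "s3://path/to/my/bucket/")
    dbGetQuery(con, "select * from iris")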
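Sketch referenced from slide 4 — a hand-rolled polling loop, since the raw SDK gives no poll function. This is a minimal sketch assuming the boto3/reticulate objects from slide 4 (athena, res) already exist; the two-second sleep is arbitrary.

    # poll until the query leaves the QUEUED/RUNNING states
    repeat {
      status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)
      state  <- status$QueryExecution$Status$State
      if (!state %in% c("QUEUED", "RUNNING")) break
      Sys.sleep(2)  # wait before checking again
    }

    if (state == "SUCCEEDED") {
      message("Query finished; results are in the S3 output location.")
    } else {
      # FAILED or CANCELLED: surface the reason Athena reports
      stop(status$QueryExecution$Status$StateChangeReason)
    }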
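Sketch referenced from slide 7 — RAthena and noctua normally pick up credentials from the usual AWS sources (environment variables, ~/.aws/credentials, instance profiles), but they can also be given explicitly. The argument names below (profile_name, region_name) follow the packages' dbConnect() interface as documented; the profile and region values are placeholders.

    library(DBI)
    # connect with an explicit AWS profile and region
    con <- dbConnect(noctua::athena(),
                     profile_name   = "my-aws-profile",
                     region_name    = "eu-west-1",
                     s3_staging_dir = "s3://path/to/my/bucket/")
    dbListTables(con)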
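Sketch referenced from slide 9 — once a table has been written to Athena, dplyr can query it lazily through dbplyr, translating the verbs to Athena SQL and only pulling rows back on collect(). A minimal sketch, assuming the connection and the mtcars table from slide 9:

    library(dplyr)
    # lazily reference the Athena table, filter in SQL, then pull results into R
    tbl(con, "mtcars") %>%
      filter(cyl == 8) %>%
      select(mpg, cyl, wt) %>%
      collect()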
