
R and Athena … there is another way!?

Feb. 21, 2020

  1. R and Athena … there is another way!?
  2. What is AWS Athena? Amazon Athena is an interactive query service that makes it easy to analyse data in Amazon S3 using standard SQL, e.g. select * from iris. Example S3 bucket layout:
     My S3 Bucket
       R_data
         iris/iris.csv
         mtcars/mtcars.csv
       nyc_taxi
         yellow_taxi_trip/year=2019/month=01/yellow_taxi_trip.csv
         yellow_taxi_trip/year=2019/month=02/yellow_taxi_trip.csv
         ...
     The year=/month= prefixes act as partitions, so queries can filter on them:
     select * from yellow_taxi_trip
     select * from yellow_taxi_trip where year = '2019'
     select * from yellow_taxi_trip where year = '2019' and month = '01'
  3. How to connect to AWS Athena?
     # connecting using ODBC
     library(DBI)
     con <- dbConnect(odbc::odbc(), ...)
     # connecting using JDBC
     library(DBI)
     library(RJDBC)
     drv <- JDBC(...)
     con <- dbConnect(drv, ...)
     # connecting using an AWS SDK
     ????
     (a fuller ODBC connection sketch is given after the slide list)
  4. A new Connection Method ...
     # connecting using the AWS SDK for Python (boto3) via reticulate
     library(reticulate)
     boto <- import("boto3")
     athena <- boto$Session()$client("athena")
     # query Athena
     res <- athena$start_query_execution(
       QueryString = "select * from iris",
       QueryExecutionContext = list(Database = "default"),
       ResultConfiguration = list(OutputLocation = "s3://path/to/my/bucket/"))
     # check status of query
     status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)
     # if query was successful then download the output to the local computer
     if (status$QueryExecution$Status$State == "SUCCEEDED") {
       s3 <- boto$Session()$resource("s3")
       # the staging path s3://path/to/my/bucket/ splits into bucket "path" and key prefix "to/my/bucket/"
       key <- paste0("to/my/bucket/", res$QueryExecutionId, ".csv")
       s3$Bucket("path")$download_file(key, "c:/my/computer/file.csv")
     }
     That is a lot of code just to query a database! And there is no polling helper to tell us whether the query has finished or failed. Is there a better way? (A minimal polling sketch is given after the slide list.)
  5. Introducing RAthena
     library(DBI)
     con <- dbConnect(RAthena::athena(), s3_staging_dir = "s3://path/to/my/bucket/")
     dbGetQuery(con, "select * from iris")
  6. What?! An R SDK!?
     # connecting using the AWS SDK for R (paws)
     athena <- paws::athena()
     # query Athena
     res <- athena$start_query_execution(
       QueryString = "select * from iris",
       QueryExecutionContext = list(Database = "default"),
       ResultConfiguration = list(OutputLocation = "s3://path/to/my/bucket/"))
     # check status of query
     status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)
     # if query was successful then download the output to the local computer
     if (status$QueryExecution$Status$State == "SUCCEEDED") {
       s3 <- paws::s3()
       # again the bucket name ("path") and key prefix ("to/my/bucket/") come from the staging path
       key <- paste0("to/my/bucket/", res$QueryExecutionId, ".csv")
       obj <- s3$get_object(Bucket = "path", Key = key)
       writeBin(obj$Body, "c:/my/computer/file.csv")
     }
  7. Introducing noctua
     library(DBI)
     con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://path/to/my/bucket/")
     dbGetQuery(con, "select * from iris")
     (a connection sketch with explicit credentials is given after the slide list)
  8. RStudio integration
  9. Other features
     library(DBI)
     library(dplyr)
     con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://path/to/my/bucket/")
     dbWriteTable(con, "mtcars", mtcars)
     # or dplyr method
     copy_to(con, mtcars)
     (a dplyr query sketch is given after the slide list)
  10. Useful links ...
      • AWS Athena: https://aws.amazon.com/athena/
      • Boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
      • reticulate: https://rstudio.github.io/reticulate/
      • paws: https://paws-r.github.io/
      • RAthena: https://dyfanjones.github.io/RAthena/
      • noctua: https://dyfanjones.github.io/noctua/
      Stuff about Me:
      • https://github.com/DyfanJones
      • https://dyfanjones.me/
      • https://www.linkedin.com/in/dyfan-jones-a8261799/
  11. Any Questions? Thanks for listening
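
Sketch for slide 3 (ODBC): a minimal connection with the parameters filled in. This assumes the Simba Athena ODBC driver is installed; the driver name, the AuthenticationType value, the region and the parameter spellings (AwsRegion, S3OutputLocation) vary by driver version, so treat them as assumptions rather than the exact incantation.

     # minimal ODBC sketch, assuming the Simba Athena ODBC driver is installed;
     # the parameter names and values below are assumptions and vary by driver version
     library(DBI)
     con <- dbConnect(odbc::odbc(),
                      Driver             = "Simba Athena ODBC Driver",  # assumed driver name
                      AwsRegion          = "eu-west-1",                 # assumed region
                      S3OutputLocation   = "s3://path/to/my/bucket/",   # staging area for query results
                      AuthenticationType = "Default Credentials")       # use credentials from the environment
     dbGetQuery(con, "select * from iris")
     dbDisconnect(con)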
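
Sketch for slide 4 (polling): the slide points out that boto3 gives no helper to wait for the query, so a minimal loop has to keep calling get_query_execution() until Athena reports a terminal state. The one-second sleep is an arbitrary choice.

     # minimal polling sketch, reusing the reticulate/boto3 client from slide 4
     library(reticulate)
     boto <- import("boto3")
     athena <- boto$Session()$client("athena")
     res <- athena$start_query_execution(
       QueryString = "select * from iris",
       QueryExecutionContext = list(Database = "default"),
       ResultConfiguration = list(OutputLocation = "s3://path/to/my/bucket/"))
     # Athena reports QUEUED/RUNNING until the query reaches a terminal state
     state <- "QUEUED"
     while (state %in% c("QUEUED", "RUNNING")) {
       Sys.sleep(1)  # arbitrary poll interval
       status <- athena$get_query_execution(QueryExecutionId = res$QueryExecutionId)
       state <- status$QueryExecution$Status$State
     }
     if (state != "SUCCEEDED") {
       stop("Athena query ", state, ": ", status$QueryExecution$Status$StateChangeReason)
     }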
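
Sketch for slides 5 and 7 (explicit connection): the same one-liner, but with credentials and region passed explicitly instead of being picked up from the environment, followed by ordinary DBI calls. The argument names follow the RAthena/noctua documentation; the region value is a placeholder.

     # minimal sketch: an explicit noctua connection plus ordinary DBI usage;
     # drop the key arguments to fall back on credentials already configured locally
     library(DBI)
     con <- dbConnect(noctua::athena(),
                      aws_access_key_id     = Sys.getenv("AWS_ACCESS_KEY_ID"),
                      aws_secret_access_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
                      region_name           = "eu-west-1",               # placeholder region
                      s3_staging_dir        = "s3://path/to/my/bucket/")
     dbListTables(con)                      # tables Athena can see
     dbGetQuery(con, "select * from iris")  # run a query and fetch the result
     dbDisconnect(con)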
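
Sketch for slide 9 (dplyr): because noctua is a DBI backend, dplyr (via dbplyr) builds the SQL lazily and Athena only runs it when collect() is called. The table and column names reuse the yellow_taxi_trip example from slide 2.

     # minimal dplyr sketch: the filter becomes a WHERE clause on the partition
     # columns and nothing runs until collect() is called (needs dbplyr installed)
     library(DBI)
     library(dplyr)
     con <- dbConnect(noctua::athena(), s3_staging_dir = "s3://path/to/my/bucket/")
     taxi <- tbl(con, "yellow_taxi_trip")           # lazy table reference, no data pulled yet
     jan_2019 <- taxi %>%
       filter(year == "2019", month == "01") %>%    # translated to Athena SQL
       collect()                                    # query runs in Athena, result returned to R
     dbDisconnect(con)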