
R Packages githubinstall magicfor dplyr.teradata

Introduces three R packages: githubinstall, magicfor, and dplyr.teradata.
Global Tokyo.R #2
https://japanr.connpass.com/event/54006/

Published in: Data & Analytics

  1. No package? OK, develop it !!!
     Koji Makiyama, Global Tokyo.R #2, 2017/04/01
  2. About Me
     HOXO-M Inc., President & CEO
  3. HOXO-M Inc.
     • Consists of
       – Awesome young data scientists and
       – R-Ojisan: people who love R too much.
     • Our stance: “No package? OK, develop it !!!” (If it doesn't exist, build it! That is the HOXO-M oath!)
  4. Developed Packages
     • CRAN: githubinstall, batade, RODBCDBI, densratio, jpmesh, magicfor
     • GitHub: pforeach, jaguchi, RFinanceJ, SparkRext, easyRFM, healthplanet, dplyr.teradata, rOpenWeatherMap, etc.
  5. Today’s Story
     • Introduce a selection of our packages
       – githubinstall
       – magicfor
       – dplyr.teradata
  6. githubinstall
  7. githubinstall
     • Install R packages hosted on GitHub without developer names
     • Using devtools
         > install_github("hadley/dplyr")
     • Using githubinstall
         > githubinstall("dplyr")
  8. Motivation
     • There is a great package, ggfortify.
     • In the past, it was not on CRAN.
     • I used to install it using devtools
         > install_github("sinhrks/ggfortify")
     • It is hard for me to remember the correct spelling of “sinhrks”.
  9. Motivation
     • I wanted to install packages even if I forget who created them.
     • “No package? OK, develop it !!!”
  10. How Does It Work?
     • Gepuro Task Views http://rpkg.gepuro.net
       – Crawls GitHub repositories every day
     • Atsushi Hayakawa
       – One of the awesome data scientists at HOXO-M
  11. Preparation
     • Install
         > install.packages("githubinstall")
     • Load library
         > library(githubinstall)
  12. Basics
     • Install packages hosted on GitHub
         > githubinstall("package_name")
     • Example: twitter/AnomalyDetection
         > githubinstall("AnomalyDetection")
  13. Fuzzy Matching
     • Example: DiagrammeR
         > githubinstall("DiagramR")
         Suggestion:
          - rich-iannone/DiagrammeR
         Do you want to install the package? (Y/n)
  14. Specify Git References
     • From branches
         > githubinstall("ggplot2", ref="sf")
     • From tags
         > githubinstall("ggplot2", ref="v1.1.0")
     • From commits
         > githubinstall("ggplot2", ref="f4398b")
  15. Suggest
     • Find repository names without installing
         > gh_suggest("DiagramR")
         [1] "rich-iannone/DiagrammeR"
     • Fuzzy search for developer names
         > gh_suggest_username("yuhui")
         [1] "yihui"
  16. List Packages
     • List packages by developer name
     • Example: packages created by Hadley
         > hadleyverse <- gh_list_packages(username="hadley")
         > head(hadleyverse)
           username package_name title
         1 hadley   RcppDateTime
         2 hadley   S3           Helpers for Programming with
         3 hadley   assertthat   User friendly assertions for
         4 hadley   babynames    An R package contain all bab
         5 hadley   bench        Bechmarking tools for
         6 hadley   bigrquery    An interface to Google's big
  17. Search Packages
     • Search packages by keyword
     • Example: search for packages related to lasso
         > lasso_packages <- gh_search_packages("lasso")
         > head(lasso_packages)
         username         package_name title
         CY-dev           sparseSVM    Solution Paths of Spar
         ChingChuan-Chen  milr         multiple-instance logi
         FrankD           fuser        Fused lasso for high-d
         ManuSetty        SeqGL        SeqGL is a group lasso
         PingYangChen     milr         multiple-instance logi
         TaddyLab         gamlr        Gamma lasso regression
  18. Summary
     • The githubinstall package provides helper functions to install and find packages hosted on GitHub.
  19. magicfor
  20. magicfor
     • Remember printed values in for loops
         > magic_for(print)
         > for (i in 1:3) {
         +   squared <- i ^ 2
         +   print(squared)
         + }
         > magic_result_as_vector()
         [1] 1 4 9
  21. Motivation
     • Values printed in for loops are lost once they are displayed.
         > for (i in 1:3) {
         +   squared <- i ^ 2
         +   print(squared)
         + }
         [1] 1
         [1] 4
         [1] 9
  22. Motivation
     • To keep them, we need to change the code.
         > result <- vector("numeric", 3)
         > for (i in 1:3) {
         +   squared <- i ^ 2
         +   result[i] <- squared
         + }
         > result
         [1] 1 4 9
  23. Motivation
     • It is too much hassle to carefully:
       – Prepare containers
       – With the correct length and
       – Add assignment statements.
     • I don’t want to do that.
     • “No package? OK, develop it !!!”
  24. magicfor
     • Insert a one-line spell, magic_for()
         > magic_for(print)
         > for (i in 1:3) {
         +   squared <- i ^ 2
         +   print(squared)
         + }
     • You can take the values out later.
         > magic_result_as_vector()
         [1] 1 4 9
  25. How Does It Work?
     • Magic
  26. How Does It Work?
     • If you really want to know the magic, read “Advanced R.”
     • Then check out the magicfor code on GitHub. https://github.com/hoxo-m/magicfor
  27. Preparation
     • Install
         > install.packages("magicfor")
     • Load library
         > library(magicfor)
  28. Basics
     • magic_for(func, progress, test, silent)
     • Arguments
       – func: function that prints values (e.g. print, cat)
       – progress: whether to display a progress bar
       – test: number of iterations to run as a test
       – silent: whether to suppress messages
  29. Choose Print Function
     • You can choose the function used to print values
         > magic_for(cat)
         > for (i in 1:3) {
         +   squared <- i ^ 2
         +   cat(squared)
         + }
         > magic_result_as_vector()
         [1] 1 4 9
     • The default print function is put().
  30. put()
     • put() displays values with high flexibility.
         > x <- 2; y <- 3
         > put(x)
         x: 2
         > put(x, y)
         x: 2, y: 3
         > put(x, x^2, x^3)
         x: 2, x^2: 4, x^3: 8
         > put(x, squared = x^2, cubed = x^3)
         x: 2, squared: 4, cubed: 8
  31. magicfor with put()
     • magicfor and put() work very well together
         > magic_for()
         > for (i in 1:3) {
         +   put(x = i, squared = i^2, cubed = i^3)
         + }
         > magic_result_as_dataframe(F)
           x squared cubed
         1 1       1     1
         2 2       4     8
         3 3       9    27
  32. Summary
     • The magicfor package provides a magic function that stores values from for loops automatically.
  33. dplyr.teradata
  34. dplyr.teradata
     • Teradata backend for dplyr
         > tera_db <- src_teradata("schema_name")
         > table <- tbl(tera_db, "table_name")
         > query <- count(table, gender)
         > collect(query)
           gender   n
         1      ♀ 123
         2      ♂ 456
  35. Motivation
     • I would like to extract data from Teradata with dplyr verbs.
     • I found the teradata.dplyr package on GitHub, but it is not maintained.
     • “No package? OK, develop it !!!”
  36. How Does It Work?
     • There is a way to add new SQL backends to dplyr.
         > vignette("new-sql-backend")
  37. Preparation
     • Install the Teradata ODBC driver
     • Install the package
         > library(githubinstall)
         > githubinstall("dplyr.teradata")
     • Load library
         > library(dplyr.teradata)
  38. Usage
     • Connect to Teradata
         > tera_db <- src_teradata("schema_name")
     • Create a table object
         > table <- tbl(tera_db, "table_name")
     • Construct a query
         > query <- count(table, column_name)
     • Send the query
         > result <- collect(query)
  39. Construct Query
     • Construct queries using dplyr verbs
         > query <- table %>%
         +   select(gender, reg_date) %>%
         +   filter(reg_date == "2017-04-01") %>%
         +   group_by(gender) %>%
         +   summarise(count = n())
     • They are converted to SQL implicitly.
  40. Check SQL
     • Check the converted SQL
         > show_query(query)
         <SQL>
         SELECT "gender", count(*) AS "count"
         FROM (SELECT *
         FROM (SELECT "gender" AS "gender", "reg_date" AS "reg_date"
         FROM table_name) AS "nwkksckhfq"
         WHERE ("reg_date" = '2017-04-01')) AS "vmmugivqkw"
         GROUP BY "gender"
  41. NOTE
     • Generated SQL is usually redundant.
     • We can only pray that the database's query optimizer handles it well.
  42. TIPS
     • In most cases, a large amount of data is stored in Teradata.
     • To save time when extracting data, keep a few points in mind:
       – Establish the connection with a schema
       – Use filter() and select()
       – Use summarise() whenever possible
       – Check explain(query) before sending the query
  43. Query Execution Plan
         > explain(query)
         <PLAN>
         1) First, we lock stable.table_name for read on a reserved RowHash to prevent global deadlock.
         2) Next, we lock stable.table_name for read.
         3) We do an all-AMPs SUM step to aggregate from ..
         5) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request.
         -> The contents of Spool 1 are sent back to the user as the result of statement 1.
            The total estimated time is 2.05 seconds.
  44. Join Across Schemas
     • When you need to join tables across schemas, establish the connection without a schema.
         > db <- src_teradata("")
         > table1 <- tbl(db, "schema1.table1")
         > table2 <- tbl(db, "schema2.table2")
         > left_join(table1, table2, by="id")
  45. Summary
     • The dplyr.teradata package provides a way to extract data from Teradata using dplyr verbs.
     • It is a beta version.
     • We welcome your bug reports and contributions. https://github.com/hoxo-m/dplyr.teradata
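
The githubinstall slides show fuzzy matching ("DiagramR" suggests rich-iannone/DiagrammeR) but do not say how the match is computed. The following is a minimal sketch of one possible approach using edit distance from base R's adist(); it is not the package's actual implementation. The candidate list and the helper name suggest_repo are hypothetical, standing in for the crawled repository index mentioned in the slides.

```r
# Hypothetical candidate list; the real package consults a crawled index
# of GitHub repositories, which is not reproduced here.
candidates <- c(
  "rich-iannone/DiagrammeR",
  "hadley/dplyr",
  "sinhrks/ggfortify",
  "twitter/AnomalyDetection"
)

# Hypothetical helper: return the repositories whose package name is
# closest to `query` by (case-insensitive) edit distance.
suggest_repo <- function(query, repos) {
  pkg_names <- sub("^.*/", "", repos)  # drop the "username/" prefix
  distances <- adist(query, pkg_names, ignore.case = TRUE)[1, ]
  repos[distances == min(distances)]
}

suggest_repo("DiagramR", candidates)
# [1] "rich-iannone/DiagrammeR"
```

Ties are returned together, which mirrors the idea of presenting suggestions and asking the user to confirm before installing.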
