
Sparklyr: Big Data enabler for R users - Serena Signorelli, ICTEAM


  1. Sparklyr: Big Data enabler for R users
     Serena Signorelli
     Data Science Milan, May 15th 2017
     ICTeam S.p.A., Design Division presentation
  2. Outline
     • About me
     • The Data Science process
     • Package and its functionalities
     • SparkR vs Sparklyr
     • Demo on NYC taxi data
  3. About me
     Experience:
     • Business administration and management
     • Research grants in Economics Statistics
     • PhD in Analytics for Economics and Business
     • Traineeship at Eurostat Big Data Task Force
     • Data scientist at ICTeam SpA
     Why Sparklyr?
     • R user
     • No computer science background
     • Need to handle Big Data
  4. R language
     • Open source
     • 5th most popular programming language in 2016 (IEEE Spectrum ranking)
     • Data analysis, statistical modelling and visualization
     • Historically limited to in-memory data
  5. Data Science process
     1. Import data into memory
     2. Clean and tidy the data
     3. Cyclical process called "understand":
        1. making transformations to tidied data
        2. using the transformed data to fit models
        3. visualizing results
     4. Communicate the results
  6. Data Science process with big data
     Problem: data is too large to download into memory
     Workaround: use a very small sample or download as much data as possible
  7. Data Science process with big data
     Limitations: the sample may not be representative, and every iteration of importing, exploring and modeling involves a long wait
     Solution: use Sparklyr to access and analyze the data inside Spark and only bring results into R
  8. Data Science process with big data
  9. Sparklyr: R interface for Apache Spark
     • First release: 0.4 (September 24th, 2016)
     • Current release: 0.5.4 (April 25th, 2017)
  10. Dplyr
      • Dplyr verbs:
          • select ~ SELECT
          • filter ~ WHERE
          • arrange ~ ORDER BY
          • summarise ~ aggregators: sum, min, sd, etc.
          • mutate ~ operators: +, *, log, etc.
      • Grouping: group_by ~ GROUP BY
      • Window functions: rank, dense_rank, percent_rank, ntile, row_number, cume_dist, first_value, last_value, lag, lead
      • Performing joins: inner_join, semi_join, left_join, anti_join, full_join
      • Sampling: sample_n, sample_frac
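The verb-to-SQL mapping on this slide can be tried on a small table. The following is a minimal sketch, not taken from the talk: it assumes a local Spark installation (for example one set up with `spark_install()`), and the `mtcars` data set and the table name `mtcars_verbs` are placeholders chosen for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started in local mode.
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark. In a real project the data
# would already live in Spark or Hive.
cars_tbl <- copy_to(sc, mtcars, "mtcars_verbs", overwrite = TRUE)

# select ~ SELECT, filter ~ WHERE, group_by ~ GROUP BY,
# summarise ~ aggregators, arrange ~ ORDER BY
by_cyl <- cars_tbl %>%
  select(cyl, mpg, wt) %>%
  filter(wt > 2) %>%
  group_by(cyl) %>%
  summarise(n = n(), mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()  # only the small aggregated result is brought into R

print(by_cyl)

spark_disconnect(sc)
```

Only the final `collect()` moves data into R; everything before it is executed by Spark.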
  11. Dplyr SQL translation
      • Basic math operators: +, -, *, /, %%, ^
      • Math functions: abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
      • Logical comparisons: <, <=, !=, >=, >, ==, %in%
      • Boolean operations: &, &&, |, ||, !
      • Character functions: paste, tolower, toupper, nchar
      • Casting: as.double, as.integer, as.logical, as.character, as.date
      • Basic aggregations: mean, sum, min, max, sd, var, cor, cov, n
  12. Dplyr in Sparklyr
      • Hive functions: many of Hive's built-in functions (UDFs) and built-in aggregate functions (UDAFs) can be called inside dplyr's mutate and summarize
      • Reading and writing data: spark_read_csv, spark_read_json, spark_read_parquet, spark_write_csv, spark_write_json, spark_write_parquet
      • Collecting to R: collect()
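As an illustration of the read, write and collect functions listed on this slide, here is a minimal round-trip sketch. It assumes a local Spark installation; the temporary file paths and the table names `cars_csv` and `cars_parquet` are hypothetical, and argument names may differ slightly between sparklyr versions.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started in local mode.
sc <- spark_connect(master = "local")

# Write a small CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the CSV into a Spark table.
cars_tbl <- spark_read_csv(sc, name = "cars_csv", path = csv_path,
                           header = TRUE, infer_schema = TRUE)

# Write the Spark table out as Parquet, then read it back.
parquet_path <- tempfile()
spark_write_parquet(cars_tbl, path = parquet_path, mode = "overwrite")
cars_parquet_tbl <- spark_read_parquet(sc, name = "cars_parquet",
                                       path = parquet_path)

# collect() brings the Spark table into R as a local data frame.
cars_back <- collect(cars_parquet_tbl)

spark_disconnect(sc)
```

On a cluster the paths would normally point to HDFS or another shared storage location rather than to a local temporary file.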
  13. Dplyr in Sparklyr
      Characteristics:
      • Laziness
          • It never pulls data into R unless you explicitly ask for it
          • It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step
      • Piping %>%
          • From package magrittr
          • Provides a mechanism for chaining commands with a forward-pipe operator
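Laziness and piping can be observed directly. The sketch below assumes a local Spark installation and uses `mtcars` and the table name `mtcars_lazy` as placeholders; it builds a pipeline, prints the SQL that would be sent to Spark, and only then executes it.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started in local mode.
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars_lazy", overwrite = TRUE)

# Building the pipeline does not pull any data into R:
# `query` is a lazy reference to a Spark computation.
query <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(hp_mean = mean(hp, na.rm = TRUE))

# Print the SQL generated for the whole pipeline.
show_query(query)

# The query is executed and its result brought into R only here.
result <- collect(query)

spark_disconnect(sc)
```

Because the whole chain is translated at once, Spark receives a single query rather than one request per verb.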
  14. Dplyr in Sparklyr: an example

      dplyr:

          jfk_pickup_tbl <- yellow_taxi_raw_data_prepared_tbl %>%
            filter(pickup_ntacode == 'QN98') %>%
            filter(!is.na(dropoff_ntacode)) %>%
            mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
            group_by(dropoff_ntacode, dropoff_ntaname) %>%
            summarize(n = n(),
                      trip_time_mean = mean(trip_time),
                      trip_dist_mean = mean(trip_distance),
                      dropoff_latitude = mean(dropoff_latitude),
                      dropoff_longitude = mean(dropoff_longitude),
                      passenger_mean = mean(passenger_count),
                      fare_amount = mean(fare_amount),
                      tip_amount = mean(tip_amount))

      SparkSQL:

          SELECT `dropoff_ntacode`, `dropoff_ntaname`,
                 count(*) AS `n`,
                 AVG(`trip_time`) AS `trip_time_mean`,
                 AVG(`trip_distance`) AS `trip_dist_mean`,
                 AVG(`dropoff_latitude`) AS `dropoff_latitude`,
                 AVG(`dropoff_longitude`) AS `dropoff_longitude`,
                 AVG(`passenger_count`) AS `passenger_mean`,
                 AVG(`fare_amount`) AS `fare_amount`,
                 AVG(`tip_amount`) AS `tip_amount`
          FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`,
                       `trip_distance`, `pickup_longitude`, `pickup_latitude`, `ratecodeid`,
                       `store_and_fwd_flag`, `dropoff_longitude`, `dropoff_latitude`,
                       `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`,
                       `tolls_amount`, `improvement_surcharge`, `total_amount`,
                       `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`, `pickup_ntaname`,
                       `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
                       UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
                FROM (SELECT *
                      FROM (SELECT *
                            FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                            WHERE (`pickup_ntacode` = 'QN98')) `jxjgtsgzwv`
                      WHERE (NOT((`dropoff_ntacode`) IS NULL))) `cobkqjitky`) `vquwddaabv`
          GROUP BY `dropoff_ntacode`, `dropoff_ntaname`
  15. Dplyr in Sparklyr: an example

      dplyr:

          Jfk_pickup <- jfk_pickup_tbl %>%
            mutate(n_rank = min_rank(desc(n))) %>%
            filter(n_rank <= 25)

      SparkSQL:

          SELECT *
          FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, `n`, `trip_time_mean`,
                       `trip_dist_mean`, `dropoff_latitude`, `dropoff_longitude`,
                       `passenger_mean`, `fare_amount`, `tip_amount`,
                       rank() OVER (PARTITION BY `dropoff_ntacode` ORDER BY `n` DESC) AS `n_rank`
                FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`,
                             count(*) AS `n`,
                             AVG(`trip_time`) AS `trip_time_mean`,
                             AVG(`trip_distance`) AS `trip_dist_mean`,
                             AVG(`dropoff_latitude`) AS `dropoff_latitude`,
                             AVG(`dropoff_longitude`) AS `dropoff_longitude`,
                             AVG(`passenger_count`) AS `passenger_mean`,
                             AVG(`fare_amount`) AS `fare_amount`,
                             AVG(`tip_amount`) AS `tip_amount`
                      FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`,
                                   `trip_distance`, `pickup_longitude`, `pickup_latitude`, `ratecodeid`,
                                   `store_and_fwd_flag`, `dropoff_longitude`, `dropoff_latitude`,
                                   `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`,
                                   `tolls_amount`, `improvement_surcharge`, `total_amount`,
                                   `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`, `pickup_ntaname`,
                                   `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
                                   UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
                            FROM (SELECT *
                                  FROM (SELECT *
                                        FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                                        WHERE (`pickup_ntacode` = 'QN98')) `zrhxchievt`
                                  WHERE (NOT((`dropoff_ntacode`) IS NULL))) `wedupsfkki`) `hkpwsclpve`
                      GROUP BY `dropoff_ntacode`, `dropoff_ntaname`) `bugdzqpxlv`) `haexysqfhn`
          WHERE (`n_rank` <= 25.0)
  16. ML in Sparklyr
      Sparklyr gives access to the machine learning routines provided by the spark.ml package.
      Three families of functions:
      • Machine learning algorithms for analyzing data (ml_*)
      • Feature transformers for manipulating individual features (ft_*)
      • Functions for manipulating Spark DataFrames (sdf_*)
      Example workflow:
      • Perform SQL queries through the sparklyr dplyr interface
      • Use the sdf_* and ft_* families of functions to generate new columns or partition your data set
      • Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data
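The three-step workflow on this slide might look as follows. This is a sketch under several assumptions: a local Spark installation, `mtcars` as stand-in data, and the function and argument names of recent sparklyr releases (`sdf_random_split`, `ml_predict`, `input_col`/`output_col`). The 0.5.x release current at the time of the talk used older names such as `sdf_partition` and `sdf_predict`.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started in local mode.
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars_ml", overwrite = TRUE)

# 1. dplyr interface: select the columns of interest.
prepared_tbl <- cars_tbl %>%
  select(mpg, wt, cyl, hp)

# 2. ft_*: derive a new column; sdf_*: split into training and test sets.
partitions <- prepared_tbl %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 150) %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 42)

# 3. ml_*: fit a model on the training set.
fit <- ml_linear_regression(partitions$training, mpg ~ wt + cyl + high_hp)

# Score the test set inside Spark and bring the predictions into R.
predictions <- ml_predict(fit, partitions$test) %>%
  collect()

spark_disconnect(sc)
```

The model is trained and applied inside Spark; only the scored test rows are collected.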
  17. Extensions in Sparklyr
      Extensions can be created to call the full Spark API and to provide interfaces to Spark packages.

      Package          Description
      spark.sas7bdat   Read in SAS data in parallel into Apache Spark.
      rsparkling       Extension for using H2O machine learning algorithms against Spark Data Frames.
      sparkhello       Simple example of including a custom JAR file within an extension package.
      rddlist          Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
      sparkwarc        Load WARC files into Apache Spark with sparklyr.
      sparkavro        Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
  18. Sparklyr help: the RStudio cheat sheet
  19. SparkR vs Sparklyr
      SparkR:
      • natively included in Spark after version 1.6.2
      • syntax:
            df <- createDataFrame(flights)
            head(select(df, df$distance, df$origin))
            # or
            head(df[, c('distance', 'origin')])
            filter(df, df$distance > 3000)
      • documentation through R's help
      Sparklyr:
      • developed by RStudio, available on CRAN and GitHub
      • can download and install Spark for development purposes
      • syntax:
            df <- copy_to(sc2, flights)
            head(select(df, distance, origin))
            filter(df, distance > 3000)
      • documentation through R's help
  20. SparkR vs Sparklyr
      SparkR:
      • spark.logit, spark.mlp, spark.naiveBayes, spark.survreg, spark.glm, spark.gbt, spark.randomForest, spark.kmeans, spark.lda, spark.isoreg, spark.gaussianMixture, spark.als, spark.kstest
      • UDF functions
      Sparklyr:
      • ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_survival_regression, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_random_forest, ml_kmeans, ml_lda, ml_linear_regression, ml_decision_tree, ml_pca, ml_one_vs_rest
      • UDF functions (but can invoke Scala code)
  21. SparkR vs Sparklyr in Google Trends
  22. SparkR vs Sparklyr in Google Trends
  23. Demo on NYC taxi data
      • 1 billion NYC taxi trip records
      • Original analysis by Todd W. Schneider [1], November 2015, and RStudio webinar [2], October 2016
      • 77 GB of data stored in a Hive table
      [1] http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
      [2] https://www.rstudio.com/resources/webinars/using-spark-with-shiny-and-r-markdown/
  24. Thank you for your attention
      serena.signorelli@icteam.it
