1 ICTeam S.p.A. – Presentazione della Divisione Progettazione
Sparklyr: Big Data enabler for R users
Serena Signorelli
Data Science Milan, May 15th 2017
Outline
 About me
 The Data Science process
 Package and its functionalities
 SparkR vs Sparklyr
 Demo on NYC taxi data
About me
Experience:
 Business administration and management
 Research grants in Economics Statistics
 PhD in Analytics for Economics and Business
 Traineeship at Eurostat Big Data Task Force
 Data scientist at ICTeam SpA
Why Sparklyr?
 R user
 No computer science background
 Need to handle Big Data
R language
 Open source
 5th most popular programming language in
2016 (IEEE Spectrum ranking)
 Data analysis, statistical modelling and visualization
 Historically limited to in-memory data
Data Science process
1. Import data into memory
2. Clean and tidy the data
3. Cyclical process called "understand":
   1. making transformations to tidied data
   2. using the transformed data to fit models
   3. visualizing results
4. Communicate the results
Data Science process with big data
Problem: data is too large to download into memory
Workaround: use a very small sample or download as
much data as possible
Limitations: the sample may not be representative, long
waiting time in every iteration of importing, exploring and
modeling
Solution: use Sparklyr to access and analyze the data
inside Spark and only bring results into R
Sparklyr: R interface for Apache Spark
 First release: 0.4 – September 24th, 2016
 Current release: 0.5.4 – April 25th, 2017
Dplyr
 Dplyr verbs:
    select ~ SELECT
    filter ~ WHERE
    arrange ~ ORDER BY
    summarise ~ aggregators: sum, min, sd, etc.
    mutate ~ operators: +, *, log, etc.
 Grouping: group_by ~ GROUP BY
 Window functions: rank, dense_rank, percent_rank, ntile,
row_number, cume_dist, first_value, last_value, lag, lead
 Performing joins: inner_join, semi_join, left_join, anti_join,
full_join
 Sampling: sample_n, sample_frac
SQL translation:
 Basic math operators: +, -, *, /, %%, ^
 Math functions: abs, acos, asin, asinh, atan, atan2,
ceiling, cos, cosh, exp, floor, log, log10, round, sign,
sin, sinh, sqrt, tan, tanh
 Logical comparisons: <, <=, !=, >=, >, ==, %in%
 Boolean operations: &, &&, |, ||, !
 Character functions: paste, tolower, toupper, nchar
 Casting: as.double, as.integer, as.logical,
as.character, as.date
 Basic aggregations: mean, sum, min, max, sd, var,
cor, cov, n
Dplyr in Sparklyr
 Hive functions:
many of Hive’s built-in functions (UDF) and built-in aggregate
functions (UDAF) can be called inside dplyr’s mutate and
summarize
 Reading and writing data:
spark_read_csv, spark_read_json, spark_read_parquet,
spark_write_csv, spark_write_json, spark_write_parquet
 Collecting to R:
collect()
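As a rough illustration of how a Hive function passes through dplyr, the sketch below builds a query that calls unix_timestamp inside mutate and prints the SQL that would be sent. It does not use a real Spark connection: dbplyr's simulate_hive() and lazy_frame() are used here as a stand-in (an assumption of this sketch, not part of the talk), so the exact SQL text may differ from what sparklyr generates. The table and its two sample values are hypothetical.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for a Spark table: only the column names matter here,
# because the query is translated to SQL and never executed.
trips <- lazy_frame(
  pickup_datetime  = "2016-06-01 10:00:00",
  dropoff_datetime = "2016-06-01 10:30:00",
  con = simulate_hive()
)

# unix_timestamp is not an R function: dplyr does not know it, so it is
# passed through to the SQL as-is and left for Hive / Spark SQL to evaluate.
with_duration <- trips %>%
  mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime))

# Render the SQL text instead of running the query.
sql_text <- as.character(sql_render(with_duration))
cat(sql_text, "\n")
```

With a real connection, the same mutate call would run inside Spark, and collect() would be the step that brings the resulting rows into R.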
Characteristics:
 Laziness
    It never pulls data into R unless you explicitly ask for it
    It delays doing any work until the last possible moment:
     it collects together everything you want to do and then
     sends it to the database in one step
 Piping %>%
    From package magrittr
    Provides a mechanism for chaining commands with a
     forward-pipe operator
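The two characteristics above can be sketched together. The example below chains verbs with %>% and then checks that the result is still a lazy query rather than data. It assumes dbplyr's simulate_hive() and lazy_frame() as a stand-in for a real Spark connection, and the trips table and its values are hypothetical.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for a remote table; no data is ever queried.
trips <- lazy_frame(
  pickup_ntacode = "QN98",
  fare_amount    = 10,
  con = simulate_hive()
)

# Piped form: each verb reads top to bottom and only extends the query.
query <- trips %>%
  filter(pickup_ntacode == "QN98") %>%
  summarise(mean_fare = mean(fare_amount, na.rm = TRUE))

# The same steps without the pipe have to be read inside-out.
nested <- summarise(
  filter(trips, pickup_ntacode == "QN98"),
  mean_fare = mean(fare_amount, na.rm = TRUE)
)

# Laziness: `query` is a description of work to do, not a result.
print(inherits(query, "tbl_lazy"))

# All verbs are combined into a single SQL statement.
show_query(query)
```

Against a real Spark table, nothing would be computed until the result is printed or collect() is called.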
Dplyr in Sparklyr: an example
dplyr:

  jfk_pickup_tbl <- yellow_taxi_raw_data_prepared_tbl %>%
    filter(pickup_ntacode == 'QN98') %>%
    filter(!is.na(dropoff_ntacode)) %>%
    mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
    group_by(dropoff_ntacode, dropoff_ntaname) %>%
    summarize(n = n(),
              trip_time_mean = mean(trip_time),
              trip_dist_mean = mean(trip_distance),
              dropoff_latitude = mean(dropoff_latitude),
              dropoff_longitude = mean(dropoff_longitude),
              passenger_mean = mean(passenger_count),
              fare_amount = mean(fare_amount),
              tip_amount = mean(tip_amount))

Spark SQL:

  SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`,
         AVG(`trip_time`) AS `trip_time_mean`,
         AVG(`trip_distance`) AS `trip_dist_mean`,
         AVG(`dropoff_latitude`) AS `dropoff_latitude`,
         AVG(`dropoff_longitude`) AS `dropoff_longitude`,
         AVG(`passenger_count`) AS `passenger_mean`,
         AVG(`fare_amount`) AS `fare_amount`,
         AVG(`tip_amount`) AS `tip_amount`
  FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
               `pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
               `dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
               `improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
               `pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
               UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
        FROM (SELECT *
              FROM (SELECT *
                    FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                    WHERE (`pickup_ntacode` = 'QN98')) `jxjgtsgzwv`
              WHERE (NOT((`dropoff_ntacode`) IS NULL))) `cobkqjitky`) `vquwddaabv`
  GROUP BY `dropoff_ntacode`, `dropoff_ntaname`
Dplyr in Sparklyr: an example
dplyr:

  Jfk_pickup <- jfk_pickup_tbl %>%
    mutate(n_rank = min_rank(desc(n))) %>%
    filter(n_rank <= 25)

Spark SQL:

  SELECT *
  FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, `n`, `trip_time_mean`, `trip_dist_mean`,
               `dropoff_latitude`, `dropoff_longitude`, `passenger_mean`, `fare_amount`, `tip_amount`,
               rank() OVER (PARTITION BY `dropoff_ntacode` ORDER BY `n` DESC) AS `n_rank`
        FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`,
                     AVG(`trip_time`) AS `trip_time_mean`,
                     AVG(`trip_distance`) AS `trip_dist_mean`,
                     AVG(`dropoff_latitude`) AS `dropoff_latitude`,
                     AVG(`dropoff_longitude`) AS `dropoff_longitude`,
                     AVG(`passenger_count`) AS `passenger_mean`,
                     AVG(`fare_amount`) AS `fare_amount`,
                     AVG(`tip_amount`) AS `tip_amount`
              FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
                           `pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
                           `dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
                           `improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
                           `pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
                           UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
                    FROM (SELECT *
                          FROM (SELECT *
                                FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                                WHERE (`pickup_ntacode` = 'QN98')) `zrhxchievt`
                          WHERE (NOT((`dropoff_ntacode`) IS NULL))) `wedupsfkki`) `hkpwsclpve`
              GROUP BY `dropoff_ntacode`, `dropoff_ntaname`) `bugdzqpxlv`) `haexysqfhn`
  WHERE (`n_rank` <= 25.0)
ML in Sparklyr
Sparklyr gives access to the machine learning routines
provided by the spark.ml package
Three families of functions:
 Machine learning algorithms for analyzing data (ml_*)
 Feature transformers for manipulating individual features
(ft_*)
 Functions for manipulating Spark DataFrames (sdf_*)
Example:
 Perform SQL queries through the sparklyr dplyr interface
 Use the sdf_* and ft_* family of functions to generate new
columns, or partition your data set
 Choose an appropriate machine learning algorithm from the
ml_* family of functions to model your data
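The three steps of the example could look roughly like the sketch below, which uses R's built-in mtcars data rather than the taxi data. It assumes a local Spark installation (for instance one set up with spark_install()) and only runs the Spark part when one is found. Function and argument names follow recent sparklyr releases; in older releases such as 0.5.x the split function was named sdf_partition and some argument names differ. The helper name run_mtcars_workflow, the table name mtcars_demo and the threshold of 150 are hypothetical choices for illustration.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper: runs the three steps against an open connection `sc`.
run_mtcars_workflow <- function(sc) {
  cars_tbl <- copy_to(sc, mtcars, "mtcars_demo", overwrite = TRUE)

  # 1. dplyr interface: translated to Spark SQL and executed inside Spark
  prepared <- cars_tbl %>%
    filter(!is.na(mpg)) %>%
    select(mpg, wt, hp)

  # 2. ft_* to generate a new column, sdf_* to partition the data set
  prepared <- ft_binarizer(prepared, input_col = "hp",
                           output_col = "high_hp", threshold = 150)
  splits <- sdf_random_split(prepared, training = 0.7, test = 0.3, seed = 42)

  # 3. ml_* to fit a model on the training split and score the test split
  model <- ml_linear_regression(splits$training, formula = mpg ~ wt + high_hp)
  ml_predict(model, splits$test) %>% collect()
}

# Run only when a local Spark installation is available.
if (nrow(spark_installed_versions()) > 0) {
  sc <- spark_connect(master = "local")
  predictions <- run_mtcars_workflow(sc)
  print(head(predictions))
  spark_disconnect(sc)
}
```

Only the final collect() moves data into R; the filtering, feature transformation, split and model fit all happen inside Spark.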
Extensions in Sparklyr
Extensions can be created to call the full Spark API and to
provide interfaces to Spark packages
Package          Description
spark.sas7bdat   Read in SAS data in parallel into Apache Spark.
rsparkling       Extension for using H2O machine learning algorithms against Spark Data Frames.
sparkhello       Simple example of including a custom JAR file within an extension package.
rddlist          Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
sparkwarc        Load WARC files into Apache Spark with sparklyr.
sparkavro        Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
Sparklyr help: the RStudio cheat sheet
SparkR vs Sparklyr
SparkR:
 included in Apache Spark since version 1.4
 documentation through R’s help
 example:
    df <- createDataFrame(flights)
    head(select(df, df$distance, df$origin))
    # or, equivalently
    head(df[, c('distance', 'origin')])
    filter(df, df$distance > 3000)
Sparklyr:
 developed by RStudio, available on CRAN and GitHub
 can download and install Spark for development purposes
 documentation through R’s help
 example:
    df <- copy_to(sc2, flights)
    head(select(df, distance, origin))
    filter(df, distance > 3000)
SparkR vs Sparklyr
Machine learning functions:
SparkR                  Sparklyr
spark.logit             ml_logistic_regression
spark.mlp               ml_multilayer_perceptron
spark.naiveBayes        ml_naive_bayes
spark.survreg           ml_survival_regression
spark.glm               ml_generalized_linear_regression
spark.gbt               ml_gradient_boosted_trees
spark.randomForest      ml_random_forest
spark.kmeans            ml_kmeans
spark.lda               ml_lda
spark.isoreg            ml_linear_regression
spark.gaussianMixture   ml_decision_tree
spark.als               ml_pca
spark.kstest            ml_one_vs_rest
UDF functions:
SparkR: supported
Sparklyr: not supported (but can invoke Scala code)
SparkR vs Sparklyr in Google Trends
Demo on NYC taxi data
 1 billion NYC taxi trip records
 Original analysis by Todd W. Schneider [1], November 2015
 RStudio webinar [2], October 2016
 77 GB of data stored in a Hive table
[1] http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
[2] https://www.rstudio.com/resources/webinars/using-spark-with-shiny-and-r-markdown/
serena.signorelli@icteam.it
Thank you for your attention

Configuring Single Sign-On (SSO) via Identity Management | MuleSoft Mysore Me...
MysoreMuleSoftMeetup
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
heathfieldcps1
 

Recently uploaded (20)

NAEYC Code of Ethical Conduct Resource Book
NAEYC Code of Ethical Conduct Resource BookNAEYC Code of Ethical Conduct Resource Book
NAEYC Code of Ethical Conduct Resource Book
 
Chapter III - Vital force: Herbert A. Roberts
Chapter III - Vital force: Herbert A. RobertsChapter III - Vital force: Herbert A. Roberts
Chapter III - Vital force: Herbert A. Roberts
 
How to Add a Filter in the Odoo 17 - Odoo 17 Slides
How to Add a Filter in the Odoo 17 - Odoo 17 SlidesHow to Add a Filter in the Odoo 17 - Odoo 17 Slides
How to Add a Filter in the Odoo 17 - Odoo 17 Slides
 
The Cruelty of Animal Testing in the Industry.pdf
The Cruelty of Animal Testing in the Industry.pdfThe Cruelty of Animal Testing in the Industry.pdf
The Cruelty of Animal Testing in the Industry.pdf
 
How To Update One2many Field From OnChange of Field in Odoo 17
How To Update One2many Field From OnChange of Field in Odoo 17How To Update One2many Field From OnChange of Field in Odoo 17
How To Update One2many Field From OnChange of Field in Odoo 17
 
(T.L.E.) Agriculture: Essentials of Gardening
(T.L.E.) Agriculture: Essentials of Gardening(T.L.E.) Agriculture: Essentials of Gardening
(T.L.E.) Agriculture: Essentials of Gardening
 
Genetics Teaching Plan: Dr.Kshirsagar R.V.
Genetics Teaching Plan: Dr.Kshirsagar R.V.Genetics Teaching Plan: Dr.Kshirsagar R.V.
Genetics Teaching Plan: Dr.Kshirsagar R.V.
 
RDBMS Lecture Notes Unit4 chapter12 VIEW
RDBMS Lecture Notes Unit4 chapter12 VIEWRDBMS Lecture Notes Unit4 chapter12 VIEW
RDBMS Lecture Notes Unit4 chapter12 VIEW
 
How to Manage Large Scrollbar in Odoo 17 POS
How to Manage Large Scrollbar in Odoo 17 POSHow to Manage Large Scrollbar in Odoo 17 POS
How to Manage Large Scrollbar in Odoo 17 POS
 
2024 KWL Back 2 School Summer Conference
2024 KWL Back 2 School Summer Conference2024 KWL Back 2 School Summer Conference
2024 KWL Back 2 School Summer Conference
 
formative Evaluation By Dr.Kshirsagar R.V
formative Evaluation By Dr.Kshirsagar R.Vformative Evaluation By Dr.Kshirsagar R.V
formative Evaluation By Dr.Kshirsagar R.V
 
Bài tập bộ trợ anh 7 I learn smart world kì 1 năm học 2022 2023 unit 1.doc
Bài tập bộ trợ anh 7 I learn smart world kì 1 năm học 2022 2023 unit 1.docBài tập bộ trợ anh 7 I learn smart world kì 1 năm học 2022 2023 unit 1.doc
Bài tập bộ trợ anh 7 I learn smart world kì 1 năm học 2022 2023 unit 1.doc
 
How To Create a Transient Model in Odoo 17
How To Create a Transient Model in Odoo 17How To Create a Transient Model in Odoo 17
How To Create a Transient Model in Odoo 17
 
DANH SÁCH THÍ SINH XÉT TUYỂN SỚM ĐỦ ĐIỀU KIỆN TRÚNG TUYỂN ĐẠI HỌC CHÍNH QUY N...
DANH SÁCH THÍ SINH XÉT TUYỂN SỚM ĐỦ ĐIỀU KIỆN TRÚNG TUYỂN ĐẠI HỌC CHÍNH QUY N...DANH SÁCH THÍ SINH XÉT TUYỂN SỚM ĐỦ ĐIỀU KIỆN TRÚNG TUYỂN ĐẠI HỌC CHÍNH QUY N...
DANH SÁCH THÍ SINH XÉT TUYỂN SỚM ĐỦ ĐIỀU KIỆN TRÚNG TUYỂN ĐẠI HỌC CHÍNH QUY N...
 
AZ-900 Microsoft Azure Fundamentals Summary.pdf
AZ-900 Microsoft Azure Fundamentals Summary.pdfAZ-900 Microsoft Azure Fundamentals Summary.pdf
AZ-900 Microsoft Azure Fundamentals Summary.pdf
 
How to Manage Access Rights & User Types in Odoo 17
How to Manage Access Rights & User Types in Odoo 17How to Manage Access Rights & User Types in Odoo 17
How to Manage Access Rights & User Types in Odoo 17
 
matatag curriculum education for Kindergarten
matatag curriculum education for Kindergartenmatatag curriculum education for Kindergarten
matatag curriculum education for Kindergarten
 
How to Manage Early Receipt Printing in Odoo 17 POS
How to Manage Early Receipt Printing in Odoo 17 POSHow to Manage Early Receipt Printing in Odoo 17 POS
How to Manage Early Receipt Printing in Odoo 17 POS
 
Configuring Single Sign-On (SSO) via Identity Management | MuleSoft Mysore Me...
Configuring Single Sign-On (SSO) via Identity Management | MuleSoft Mysore Me...Configuring Single Sign-On (SSO) via Identity Management | MuleSoft Mysore Me...
Configuring Single Sign-On (SSO) via Identity Management | MuleSoft Mysore Me...
 
The basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptxThe basics of sentences session 10pptx.pptx
The basics of sentences session 10pptx.pptx
 

Sparklyr: Big Data enabler for R users

  • 1. Sparklyr: Big Data enabler for R users
      Serena Signorelli
      Data Science Milan, May 15th 2017
  • 2. Outline
      • About me
      • The Data Science process
      • Package and its functionalities
      • SparkR vs Sparklyr
      • Demo on NYC taxi data
  • 3. About me
      Experience:
      • Business administration and management
      • Research grants in Economics Statistics
      • PhD in Analytics for Economics and Business
      • Traineeship at Eurostat Big Data Task Force
      • Data scientist at ICTeam SpA
      Why Sparklyr?
      • R user
      • No computer science background
      • Need to handle Big Data
  • 4. R language
      • Open source
      • 5th most popular programming language in 2016 (IEEE Spectrum ranking)
      • Data analysis, statistical modelling and visualization
      • Historically limited to in-memory data
  • 5. Data Science process
      1. Import data into memory
      2. Clean and tidy the data
      3. Cyclical process called understand:
         1. making transformations to tidied data
         2. using the transformed data to fit models
         3. visualizing results
      4. Communicate the results
  • 6. Data Science process with big data
      Problem: data is too large to download into memory
      Workaround: use a very small sample or download as much data as possible
  • 7. Data Science process with big data
      Limitations: the sample may not be representative, long waiting time in every iteration of importing, exploring and modeling
      Solution: use Sparklyr to access and analyze the data inside Spark and only bring results into R
  • 8. Data Science process with big data
  • 9. Sparklyr: R interface for Apache Spark
      • First release: 0.4 – September 24th, 2016
      • Current release: 0.5.4 – April 25th, 2017
  • 10. Dplyr
      • Dplyr verbs:
          • select ~ SELECT
          • filter ~ WHERE
          • arrange ~ ORDER BY
          • summarise ~ aggregators: sum, min, sd, etc.
          • mutate ~ operators: +, *, log, etc.
      • Grouping: group_by ~ GROUP BY
      • Window functions: rank, dense_rank, percent_rank, ntile, row_number, cume_dist, first_value, last_value, lag, lead
      • Performing joins: inner_join, semi_join, left_join, anti_join, full_join
      • Sampling: sample_n, sample_frac
  • 11. Dplyr
      SQL translation:
      • Basic math operators: +, -, *, /, %%, ^
      • Math functions: abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
      • Logical comparisons: <, <=, !=, >=, >, ==, %in%
      • Boolean operations: &, &&, |, ||, !
      • Character functions: paste, tolower, toupper, nchar
      • Casting: as.double, as.integer, as.logical, as.character, as.date
      • Basic aggregations: mean, sum, min, max, sd, var, cor, cov, n
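To get a feel for the SQL that the verbs and functions listed above turn into, the sketch below uses dbplyr's lazy_frame() together with simulate_hive(). This is an assumption made for illustration: it needs no Spark cluster and relies on dbplyr's generic Hive dialect, which is close to, but not necessarily identical to, what sparklyr generates. The column names are borrowed from the taxi example in this deck and the values are made up.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() describes a table without any database behind it;
# simulate_hive() asks dbplyr to emit Hive-flavoured SQL (an approximation
# of sparklyr's own translation, used here only for illustration).
trips <- lazy_frame(
  trip_distance = 1.5,   # made-up values, only the column names matter
  fare_amount   = 10,
  payment_type  = "CSH",
  con = simulate_hive()
)

# filter ~ WHERE, group_by ~ GROUP BY, summarise ~ aggregators
query <- trips %>%
  filter(trip_distance > 0) %>%
  group_by(payment_type) %>%
  summarise(n = n(), fare_mean = mean(fare_amount, na.rm = TRUE))

# Print the generated SQL instead of running anything
show_query(query)
```

Because nothing is executed, this is a cheap way to check how a given R expression would be translated before pointing the same pipeline at a real Spark table.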
  • 12. Dplyr in Sparklyr
      • Hive functions: many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize
      • Reading and writing data: spark_read_csv, spark_read_json, spark_read_parquet, spark_write_csv, spark_write_json, spark_write_parquet
      • Collecting to R: collect()
  • 13. Dplyr in Sparklyr
      Characteristics:
      • Laziness
          • It never pulls data into R unless you explicitly ask for it
          • It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step
      • Piping %>%
          • From package magrittr
          • Provides a mechanism for chaining commands with a forward-pipe operator
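The laziness and piping described above can be sketched as follows. The example assumes a recent sparklyr release and a local Spark installation (for instance one set up with spark_install()); the built-in mtcars data set and the table name "mtcars_spark" are stand-ins chosen for illustration, not part of the original demo.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started in local mode
sc <- spark_connect(master = "local")

# Copy a small built-in data set into Spark (a stand-in for real big data)
cars_tbl <- copy_to(sc, mtcars, name = "mtcars_spark", overwrite = TRUE)

# The pipe chains the verbs; at this point only a query description is built,
# no rows are pulled into R
by_cyl <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(n = n(), mpg_mean = mean(mpg, na.rm = TRUE))

# Inspect the single SQL statement that will be sent to Spark
show_query(by_cyl)

# collect() is the explicit request that runs the query and returns
# the (small) result to R as a local data frame
result <- collect(by_cyl)

spark_disconnect(sc)
```

Only the aggregated rows travel back to R, which is the point of keeping the heavy work inside Spark.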
  • 14. Dplyr in Sparklyr: an example
      SparkSQL:
          SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`,
                 AVG(`trip_time`) AS `trip_time_mean`,
                 AVG(`trip_distance`) AS `trip_dist_mean`,
                 AVG(`dropoff_latitude`) AS `dropoff_latitude`,
                 AVG(`dropoff_longitude`) AS `dropoff_longitude`,
                 AVG(`passenger_count`) AS `passenger_mean`,
                 AVG(`fare_amount`) AS `fare_amount`,
                 AVG(`tip_amount`) AS `tip_amount`
          FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`,
                       `trip_distance`, `pickup_longitude`, `pickup_latitude`, `ratecodeid`,
                       `store_and_fwd_flag`, `dropoff_longitude`, `dropoff_latitude`,
                       `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`,
                       `tolls_amount`, `improvement_surcharge`, `total_amount`,
                       `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`, `pickup_ntaname`,
                       `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
                       UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
                FROM (SELECT *
                      FROM (SELECT *
                            FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                            WHERE (`pickup_ntacode` = 'QN98')) `jxjgtsgzwv`
                      WHERE (NOT((`dropoff_ntacode`) IS NULL))) `cobkqjitky`) `vquwddaabv`
          GROUP BY `dropoff_ntacode`, `dropoff_ntaname`
      dplyr:
          jfk_pickup_tbl <- yellow_taxi_raw_data_prepared_tbl %>%
            filter(pickup_ntacode == 'QN98') %>%
            filter(!is.na(dropoff_ntacode)) %>%
            mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
            group_by(dropoff_ntacode, dropoff_ntaname) %>%
            summarize(n = n(),
                      trip_time_mean = mean(trip_time),
                      trip_dist_mean = mean(trip_distance),
                      dropoff_latitude = mean(dropoff_latitude),
                      dropoff_longitude = mean(dropoff_longitude),
                      passenger_mean = mean(passenger_count),
                      fare_amount = mean(fare_amount),
                      tip_amount = mean(tip_amount))
  • 15. Dplyr in Sparklyr: an example
      SparkSQL:
          SELECT *
          FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, `n`, `trip_time_mean`, `trip_dist_mean`,
                       `dropoff_latitude`, `dropoff_longitude`, `passenger_mean`, `fare_amount`, `tip_amount`,
                       rank() OVER (PARTITION BY `dropoff_ntacode` ORDER BY `n` DESC) AS `n_rank`
                FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`,
                             AVG(`trip_time`) AS `trip_time_mean`,
                             AVG(`trip_distance`) AS `trip_dist_mean`,
                             AVG(`dropoff_latitude`) AS `dropoff_latitude`,
                             AVG(`dropoff_longitude`) AS `dropoff_longitude`,
                             AVG(`passenger_count`) AS `passenger_mean`,
                             AVG(`fare_amount`) AS `fare_amount`,
                             AVG(`tip_amount`) AS `tip_amount`
                      FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`,
                                   `trip_distance`, `pickup_longitude`, `pickup_latitude`, `ratecodeid`,
                                   `store_and_fwd_flag`, `dropoff_longitude`, `dropoff_latitude`,
                                   `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`,
                                   `tolls_amount`, `improvement_surcharge`, `total_amount`,
                                   `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`, `pickup_ntaname`,
                                   `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
                                   UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
                            FROM (SELECT *
                                  FROM (SELECT *
                                        FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                                        WHERE (`pickup_ntacode` = 'QN98')) `zrhxchievt`
                                  WHERE (NOT((`dropoff_ntacode`) IS NULL))) `wedupsfkki`) `hkpwsclpve`
                      GROUP BY `dropoff_ntacode`, `dropoff_ntaname`) `bugdzqpxlv`) `haexysqfhn`
          WHERE (`n_rank` <= 25.0)
      dplyr:
          Jfk_pickup <- jfk_pickup_tbl %>%
            mutate(n_rank = min_rank(desc(n))) %>%
            filter(n_rank <= 25)
  • 16. ML in Sparklyr
      Sparklyr gives access to the machine learning routines provided by the spark.ml package.
      Three families of functions:
      • Machine learning algorithms for analyzing data (ml_*)
      • Feature transformers for manipulating individual features (ft_*)
      • Functions for manipulating Spark DataFrames (sdf_*)
      Example workflow:
      • Perform SQL queries through the sparklyr dplyr interface
      • Use the sdf_* and ft_* families of functions to generate new columns or partition your data set
      • Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data
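The three-step workflow above might look like the following sketch. It assumes a recent sparklyr release (function names and arguments such as sdf_random_split, ml_predict and input_col differ from the 0.5.x versions current at the time of the talk) and a local Spark installation; mtcars, the table name "mtcars_ml", the threshold of 150 and the chosen model are illustrative assumptions only.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and can be started in local mode
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, name = "mtcars_ml", overwrite = TRUE)

# 1. dplyr interface: select and filter inside Spark
prepared <- cars_tbl %>%
  filter(!is.na(mpg)) %>%
  select(mpg, wt, cyl, hp)

# 2. ft_*: derive a new column; sdf_*: partition the data set
prepared <- prepared %>%
  ft_binarizer(input_col = "hp", output_col = "high_hp", threshold = 150)

splits <- sdf_random_split(prepared, training = 0.7, test = 0.3, seed = 42)

# 3. ml_*: fit a model on the training partition
fit <- ml_linear_regression(splits$training, mpg ~ wt + cyl + high_hp)

# Score the test partition in Spark, then bring only the predictions into R
pred_local <- ml_predict(fit, splits$test) %>%
  select(mpg, prediction) %>%
  collect()

spark_disconnect(sc)
```

The data stays in Spark for every step; only the final predictions are collected into a local data frame.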
  • 17. Extensions in Sparklyr
      Extensions can be created to call the full Spark API and to provide interfaces to Spark packages

      Package          Description
      spark.sas7bdat   Read in SAS data in parallel into Apache Spark.
      rsparkling       Extension for using H2O machine learning algorithms against Spark Data Frames.
      sparkhello       Simple example of including a custom JAR file within an extension package.
      rddlist          Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
      sparkwarc        Load WARC files into Apache Spark with sparklyr.
      sparkavro        Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
  • 18. Sparklyr help: the RStudio cheat sheet
  • 19. SparkR vs Sparklyr
      SparkR
      • natively included in Spark after version 1.6.2
      • example:
          df <- createDataFrame(flights)
          head(select(df, df$distance, df$origin))
          # or: head(df[, c('distance', 'origin')])
          filter(df, df$distance > 3000)
      • documentation through R’s help
      Sparklyr
      • developed by RStudio, available on CRAN and GitHub
      • it can download and install Spark for development purposes
      • example:
          df <- copy_to(sc2, flights)
          head(select(df, distance, origin))
          filter(df, distance > 3000)
      • documentation through R’s help
  • 20. SparkR vs Sparklyr
      SparkR
      • spark.logit, spark.mlp, spark.naiveBayes, spark.survreg, spark.glm, spark.gbt, spark.randomForest, spark.kmeans, spark.lda, spark.isoreg, spark.gaussianMixture, spark.als, spark.kstest
      • UDF functions
      Sparklyr
      • ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_survival_regression, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_random_forest, ml_kmeans, ml_lda, ml_linear_regression, ml_decision_tree, ml_pca, ml_one_vs_rest
      • UDF functions (but can invoke Scala code)
  • 21. SparkR vs Sparklyr in Google Trends
  • 22. SparkR vs Sparklyr in Google Trends
  • 23. Demo on NYC taxi data
      • 1 billion NYC taxi trip records
      • Original analysis by Todd W. Schneider [1], November 2015
      • RStudio webinar [2], October 2016
      • 77 GB of data stored in a Hive table
      [1] http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
      [2] https://www.rstudio.com/resources/webinars/using-spark-with-shiny-and-r-markdown/
  • 24. Thank you for your attention
      serena.signorelli@icteam.it