1 ICTeam S.p.A. – Presentazione della Divisione Progettazione
Sparklyr: Big Data enabler for R users
Serena Signorelli
Data Science Milan, May 15th 2017
Outline
• About me
• The Data Science process
• Package and its functionalities
• SparkR vs Sparklyr
• Demo on NYC taxi data
About me
Experience:
• Business administration and management
• Research grants in Economics Statistics
• PhD in Analytics for Economics and Business
• Traineeship at Eurostat Big Data Task Force
• Data scientist at ICTeam SpA
Why Sparklyr?
• R user
• No computer science background
• Need to handle Big Data
R language
• Open source
• 5th most popular programming language in 2016 (IEEE Spectrum ranking)
• Data analysis, statistical modelling and visualization
• Historically limited to in-memory data
Data Science process
1. Import data into memory
2. Clean and tidy the data
3. Cyclical process called "understand":
   1. making transformations to tidied data
   2. using the transformed data to fit models
   3. visualizing results
4. Communicate the results
Data Science process with big data
Problem: data is too large to download into memory
Workaround: use a very small sample or download as much data as possible
Data Science process with big data
Limitations: the sample may not be representative, and every iteration of importing, exploring and modeling involves a long wait
Solution: use Sparklyr to access and analyze the data inside Spark and only bring results into R
Sparklyr: R interface for Apache Spark
• First release: 0.4 – September 24th, 2016
• Current release: 0.5.4 – April 25th, 2017
Dplyr
• Dplyr verbs:
  ◦ select ~ SELECT
  ◦ filter ~ WHERE
  ◦ arrange ~ ORDER BY
  ◦ summarise ~ aggregators: sum, min, sd, etc.
  ◦ mutate ~ operators: +, *, log, etc.
• Grouping: group_by ~ GROUP BY
• Window functions: rank, dense_rank, percent_rank, ntile, row_number, cume_dist, first_value, last_value, lag, lead
• Performing joins: inner_join, semi_join, left_join, anti_join, full_join
• Sampling: sample_n, sample_frac
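The verbs listed above can be tried on an ordinary in-memory data frame before pointing them at Spark. The following is a minimal sketch that is not part of the original slides: it assumes the dplyr package is installed, uses R's built-in mtcars data set, and the column names kpl and mean_kpl are made up for the illustration.

```r
library(dplyr)

# Local illustration on the built-in mtcars data frame
# (assumption: this data set does not appear in the slides).
by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%                        # select    ~ SELECT
  filter(wt < 4) %>%                              # filter    ~ WHERE
  mutate(kpl = mpg * 0.425) %>%                   # mutate    ~ operators (approx. mpg to km per litre)
  group_by(cyl) %>%                               # group_by  ~ GROUP BY
  summarise(n = n(), mean_kpl = mean(kpl)) %>%    # summarise ~ aggregators
  arrange(desc(mean_kpl))                         # arrange   ~ ORDER BY

print(by_cyl)
```

The same chain of verbs is what sparklyr would translate to Spark SQL when the input is a Spark table instead of a local data frame.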
Dplyr
SQL translation:
• Basic math operators: +, -, *, /, %%, ^
• Math functions: abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
• Logical comparisons: <, <=, !=, >=, >, ==, %in%
• Boolean operations: &, &&, |, ||, !
• Character functions: paste, tolower, toupper, nchar
• Casting: as.double, as.integer, as.logical, as.character, as.date
• Basic aggregations: mean, sum, min, max, sd, var, cor, cov, n
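One way to see how such R expressions turn into SQL without a running cluster is sketched below. This is an assumption-laden illustration, not the author's code: it uses the dbplyr package (which carries dplyr's SQL translation in current releases) and its simulated Hive connection as a stand-in for a real Spark connection, so the SQL that sparklyr emits may differ in detail. The column names x and y are hypothetical.

```r
library(dbplyr)

# Simulated Hive connection: a stand-in (assumption) for a live Spark
# connection, so the translation can be inspected offline.
con <- simulate_hive()

# Math functions and operators
sql_math <- translate_sql(abs(x) + sqrt(y), con = con)

# Logical comparisons and boolean operations
sql_cmp <- translate_sql(x %in% c(1, 2) & y != 0, con = con)

# Casting
sql_cast <- translate_sql(as.character(x), con = con)

# Aggregation; window = FALSE asks for a plain aggregate, not a window function
sql_agg <- translate_sql(mean(x, na.rm = TRUE), con = con, window = FALSE)

print(sql_math)
print(sql_cmp)
print(sql_cast)
print(sql_agg)
```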
Dplyr in Sparklyr
• Hive functions: many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize
• Reading and writing data: spark_read_csv, spark_read_json, spark_read_parquet, spark_write_csv, spark_write_json, spark_write_parquet
• Collecting to R: collect()
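The read, aggregate, collect and write steps named above could be combined roughly as follows. This is a sketch under stated assumptions, not the demo code: it presumes a local Spark installation (for example one set up with spark_install()), and the file paths and the table name "cars" are invented for the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is available locally; a real deployment would use a
# cluster master URL instead of "local".
sc <- spark_connect(master = "local")

# Write a small CSV first so the example is self-contained
# (hypothetical file; the slides read from a Hive table instead).
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Reading: register the CSV as a Spark table named "cars"
cars_tbl <- spark_read_csv(sc, name = "cars", path = csv_path)

# The aggregation runs inside Spark; collect() brings only the
# small summary back into R as a local data frame.
mpg_by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# Writing: persist the Spark table as Parquet
parquet_path <- file.path(tempdir(), "cars_parquet")
spark_write_parquet(cars_tbl, path = parquet_path, mode = "overwrite")

spark_disconnect(sc)
```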
Dplyr in Sparklyr
Characteristics:
• Laziness
  ◦ It never pulls data into R unless you explicitly ask for it
  ◦ It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step
• Piping %>%
  ◦ From package magrittr
  ◦ Provides a mechanism for chaining commands with a forward-pipe operator
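Both characteristics can be observed in a short session. The sketch below is illustrative only and assumes a local Spark installation; the table name "mtcars_spark" and the object names are made up. The piped and the nested versions describe the same query, and the aggregation is not run until collect() is called.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# Piped form: each step feeds the next. This only builds a query
# description; the aggregation has not been run yet.
heavy_cars <- cars_tbl %>%
  filter(wt > 3) %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_hp = mean(hp, na.rm = TRUE))

# Equivalent nested form without the pipe (same query, harder to read).
heavy_cars_nested <- summarise(
  group_by(filter(cars_tbl, wt > 3), cyl),
  n = n(), mean_hp = mean(hp, na.rm = TRUE)
)

# Inspect the SQL that would be sent to Spark.
show_query(heavy_cars)

# Only now is the query executed and the result pulled into R.
result <- collect(heavy_cars)

spark_disconnect(sc)
```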
Dplyr in Sparklyr: an example
dplyr:
    jfk_pickup_tbl <- yellow_taxi_raw_data_prepared_tbl %>%
      filter(pickup_ntacode == 'QN98') %>%
      filter(!is.na(dropoff_ntacode)) %>%
      mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
      group_by(dropoff_ntacode, dropoff_ntaname) %>%
      summarize(n = n(),
                trip_time_mean = mean(trip_time),
                trip_dist_mean = mean(trip_distance),
                dropoff_latitude = mean(dropoff_latitude),
                dropoff_longitude = mean(dropoff_longitude),
                passenger_mean = mean(passenger_count),
                fare_amount = mean(fare_amount),
                tip_amount = mean(tip_amount))
Spark SQL:
    SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`, AVG(`trip_time`) AS `trip_time_mean`,
    AVG(`trip_distance`) AS `trip_dist_mean`, AVG(`dropoff_latitude`) AS `dropoff_latitude`,
    AVG(`dropoff_longitude`) AS `dropoff_longitude`, AVG(`passenger_count`) AS `passenger_mean`,
    AVG(`fare_amount`) AS `fare_amount`, AVG(`tip_amount`) AS `tip_amount`
    FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
    `pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
    `dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
    `improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
    `pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
    UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
    FROM (SELECT *
    FROM (SELECT *
    FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
    WHERE (`pickup_ntacode` = 'QN98')) `jxjgtsgzwv`
    WHERE (NOT((`dropoff_ntacode`) IS NULL))) `cobkqjitky`) `vquwddaabv`
    GROUP BY `dropoff_ntacode`, `dropoff_ntaname`
Dplyr in Sparklyr: an example
dplyr:
    jfk_pickup <- jfk_pickup_tbl %>%
      mutate(n_rank = min_rank(desc(n))) %>%
      filter(n_rank <= 25)
Spark SQL:
    SELECT *
    FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, `n`, `trip_time_mean`, `trip_dist_mean`,
    `dropoff_latitude`, `dropoff_longitude`, `passenger_mean`, `fare_amount`, `tip_amount`, rank() OVER
    (PARTITION BY `dropoff_ntacode` ORDER BY `n` DESC) AS `n_rank`
    FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`, AVG(`trip_time`) AS
    `trip_time_mean`, AVG(`trip_distance`) AS `trip_dist_mean`, AVG(`dropoff_latitude`) AS
    `dropoff_latitude`, AVG(`dropoff_longitude`) AS `dropoff_longitude`, AVG(`passenger_count`) AS
    `passenger_mean`, AVG(`fare_amount`) AS `fare_amount`, AVG(`tip_amount`) AS `tip_amount`
    FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
    `pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
    `dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
    `improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
    `pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
    UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
    FROM (SELECT *
    FROM (SELECT *
    FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
    WHERE (`pickup_ntacode` = 'QN98')) `zrhxchievt`
    WHERE (NOT((`dropoff_ntacode`) IS NULL))) `wedupsfkki`) `hkpwsclpve`
    GROUP BY `dropoff_ntacode`, `dropoff_ntaname`) `bugdzqpxlv`) `haexysqfhn`
    WHERE (`n_rank` <= 25.0)
ML in Sparklyr
Sparklyr provides access to the machine learning routines of the spark.ml package
Three families of functions:
• Machine learning algorithms for analyzing data (ml_*)
• Feature transformers for manipulating individual features (ft_*)
• Functions for manipulating Spark DataFrames (sdf_*)
Example workflow:
• Perform SQL queries through the sparklyr dplyr interface
• Use the sdf_* and ft_* families of functions to generate new columns or partition your data set
• Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data
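The three-step workflow above might look like the following. This is a rough sketch rather than the presenter's code: it assumes a local Spark installation and a recent sparklyr release, so function and argument names (input_col, sdf_random_split, ml_predict) may differ from the 0.5.x version cited in the slides, where older releases offered sdf_partition for the splitting step. The table name "mtcars_ml" and the derived column cyl8 are hypothetical.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars_ml", overwrite = TRUE)

# Step 1: dplyr interface (translated to Spark SQL)
prepared <- cars_tbl %>%
  filter(!is.na(mpg)) %>%
  select(mpg, wt, cyl)

# Step 2: ft_* to derive a feature, sdf_* to partition the data
prepared <- ft_binarizer(prepared, input_col = "cyl",
                         output_col = "cyl8", threshold = 7)
splits <- sdf_random_split(prepared, training = 0.7, test = 0.3, seed = 42)

# Step 3: ml_* to fit a model, then score the held-out partition
fit <- ml_linear_regression(splits$training, formula = mpg ~ wt + cyl8)
pred <- ml_predict(fit, splits$test) %>% collect()

print(head(pred))
spark_disconnect(sc)
```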
Extensions in Sparklyr
Extensions can be created to call the full Spark API and to provide interfaces to Spark packages

Package          Description
spark.sas7bdat   Read in SAS data in parallel into Apache Spark.
rsparkling       Extension for using H2O machine learning algorithms against Spark Data Frames.
sparkhello       Simple example of including a custom JAR file within an extension package.
rddlist          Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
sparkwarc        Load WARC files into Apache Spark with sparklyr.
sparkavro        Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
Sparklyr help: the RStudio cheat sheet
SparkR vs Sparklyr
SparkR
• Natively included in Spark after version 1.6.2
• Documentation through R’s help
• Example:
    df <- createDataFrame(flights)
    head(select(df, df$distance, df$origin))
    # or
    head(df[, c('distance', 'origin')])
    filter(df, df$distance > 3000)

Sparklyr
• Developed by RStudio, available on CRAN and GitHub
• Can download and install Spark for development purposes
• Documentation through R’s help
• Example:
    df <- copy_to(sc2, flights)
    head(select(df, distance, origin))
    filter(df, distance > 3000)
SparkR vs Sparklyr
SparkR                  Sparklyr
spark.logit             ml_logistic_regression
spark.mlp               ml_multilayer_perceptron
spark.naiveBayes        ml_naive_bayes
spark.survreg           ml_survival_regression
spark.glm               ml_generalized_linear_regression
spark.gbt               ml_gradient_boosted_trees
spark.randomForest      ml_random_forest
spark.kmeans            ml_kmeans
spark.lda               ml_lda
spark.isoreg            ml_linear_regression
spark.gaussianMixture   ml_decision_tree
spark.als               ml_pca
spark.kstest            ml_one_vs_rest
UDF functions           UDF functions (but can invoke Scala code)
SparkR vs Sparklyr in Google Trends
Demo on NYC taxi data
• Data on 1 billion NYC taxi trips
• Original analysis by Todd W. Schneider [1], November 2015, plus the RStudio webinar [2], October 2016
• 77 GB of data stored in a Hive table
[1] http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
[2] https://www.rstudio.com/resources/webinars/using-spark-with-shiny-and-r-markdown/
serena.signorelli@icteam.it
Thank you for your attention

More Related Content

What's hot

Machine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro NegroMachine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro NegroGraphAware
 
Spark Hearts GraphLab Create
Spark Hearts GraphLab CreateSpark Hearts GraphLab Create
Spark Hearts GraphLab CreateAmanda Casari
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...Databricks
 
MATLAB Simulink Research Help
MATLAB Simulink Research HelpMATLAB Simulink Research Help
MATLAB Simulink Research HelpPhD Direction
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science PlatformQAware GmbH
 
Hadoop Mapreduce Projects
Hadoop Mapreduce ProjectsHadoop Mapreduce Projects
Hadoop Mapreduce ProjectsPhD Direction
 
Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon
 
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 DatasetGraph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 DatasetTigerGraph
 
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...Seldon
 
Data Visualization Project Presentation
Data Visualization Project PresentationData Visualization Project Presentation
Data Visualization Project PresentationShubham Shrivastava
 
MLSEV. BigML Workshop II
MLSEV. BigML Workshop IIMLSEV. BigML Workshop II
MLSEV. BigML Workshop IIBigML, Inc
 
MLSD18. Automating Machine Learning Workflows
MLSD18. Automating Machine Learning WorkflowsMLSD18. Automating Machine Learning Workflows
MLSD18. Automating Machine Learning WorkflowsBigML, Inc
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementDatabricks
 
Graph analytics in Linkurious Enterprise
Graph analytics in Linkurious EnterpriseGraph analytics in Linkurious Enterprise
Graph analytics in Linkurious EnterpriseLinkurious
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...Databricks
 
Seamless End-to-End Production Machine Learning with Seldon and MLflow
 Seamless End-to-End Production Machine Learning with Seldon and MLflow Seamless End-to-End Production Machine Learning with Seldon and MLflow
Seamless End-to-End Production Machine Learning with Seldon and MLflowDatabricks
 

What's hot (20)

Machine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro NegroMachine Learning Powered by Graphs - Alessandro Negro
Machine Learning Powered by Graphs - Alessandro Negro
 
Spark Hearts GraphLab Create
Spark Hearts GraphLab CreateSpark Hearts GraphLab Create
Spark Hearts GraphLab Create
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...No REST till Production – Building and Deploying 9 Models to Production in 3 ...
No REST till Production – Building and Deploying 9 Models to Production in 3 ...
 
MATLAB Simulink Research Help
MATLAB Simulink Research HelpMATLAB Simulink Research Help
MATLAB Simulink Research Help
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
 
Hadoop Mapreduce Projects
Hadoop Mapreduce ProjectsHadoop Mapreduce Projects
Hadoop Mapreduce Projects
 
Seldon: Deploying Models at Scale
Seldon: Deploying Models at ScaleSeldon: Deploying Models at Scale
Seldon: Deploying Models at Scale
 
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 DatasetGraph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
Graph Gurus Episode 37: Modeling for Kaggle COVID-19 Dataset
 
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...
TensorFlow London 18: Dr Alastair Moore, Towards the use of Graphical Models ...
 
Data Visualization Project Presentation
Data Visualization Project PresentationData Visualization Project Presentation
Data Visualization Project Presentation
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
MLSEV. BigML Workshop II
MLSEV. BigML Workshop IIMLSEV. BigML Workshop II
MLSEV. BigML Workshop II
 
Mp resume
Mp resumeMp resume
Mp resume
 
MLSD18. Automating Machine Learning Workflows
MLSD18. Automating Machine Learning WorkflowsMLSD18. Automating Machine Learning Workflows
MLSD18. Automating Machine Learning Workflows
 
TigerGraph.js
TigerGraph.jsTigerGraph.js
TigerGraph.js
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
 
Graph analytics in Linkurious Enterprise
Graph analytics in Linkurious EnterpriseGraph analytics in Linkurious Enterprise
Graph analytics in Linkurious Enterprise
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 
Seamless End-to-End Production Machine Learning with Seldon and MLflow
 Seamless End-to-End Production Machine Learning with Seldon and MLflow Seamless End-to-End Production Machine Learning with Seldon and MLflow
Seamless End-to-End Production Machine Learning with Seldon and MLflow
 

Similar to Sparklyr: Big Data enabler for R users - Serena Signorelli, ICTEAM

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packagesAjay Ohri
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with RDatabricks
 
SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Datasamuel shamiri
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherDatabricks
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)Spark Summit
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPaco Nathan
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for EveryoneGiovanna Roda
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Deep Learning - Luca Grazioli, ICTEAM
Deep Learning - Luca Grazioli, ICTEAMDeep Learning - Luca Grazioli, ICTEAM
Deep Learning - Luca Grazioli, ICTEAMData Science Milan
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsDatabricks
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Databricks
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r publishedDipendra Kusi
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark ApplicationsTzach Zohar
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineDatabricks
 
Real timeml visualizationwithspark_v6
Real timeml visualizationwithspark_v6Real timeml visualizationwithspark_v6
Real timeml visualizationwithspark_v6sfbiganalytics
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with SparkAlpine Data
 

Similar to Sparklyr: Big Data enabler for R users - Serena Signorelli, ICTEAM (20)

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
r,rstats,r language,r packages
r,rstats,r language,r packagesr,rstats,r language,r packages
r,rstats,r language,r packages
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with R
 
SparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big DataSparkR-Advance Analytic for Big Data
SparkR-Advance Analytic for Big Data
 
Tactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark TogetherTactical Data Science Tips: Python and Spark Together
Tactical Data Science Tips: Python and Spark Together
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
 
Pattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and HadoopPattern: PMML for Cascading and Hadoop
Pattern: PMML for Cascading and Hadoop
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for Everyone
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Deep Learning - Luca Grazioli, ICTEAM
Deep Learning - Luca Grazioli, ICTEAMDeep Learning - Luca Grazioli, ICTEAM
Deep Learning - Luca Grazioli, ICTEAM
 
Recent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced AnalyticsRecent Developments In SparkR For Advanced Analytics
Recent Developments In SparkR For Advanced Analytics
 
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS...
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science Project
 
Monitoring Spark Applications
Monitoring Spark ApplicationsMonitoring Spark Applications
Monitoring Spark Applications
 
Fast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL EngineFast and Reliable Apache Spark SQL Engine
Fast and Reliable Apache Spark SQL Engine
 
Real timeml visualizationwithspark_v6
Real timeml visualizationwithspark_v6Real timeml visualizationwithspark_v6
Real timeml visualizationwithspark_v6
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with Spark
 

More from Data Science Milan

ML & Graph algorithms to prevent financial crime in digital payments
ML & Graph  algorithms to prevent  financial crime in  digital paymentsML & Graph  algorithms to prevent  financial crime in  digital payments
ML & Graph algorithms to prevent financial crime in digital paymentsData Science Milan
 
How to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plansHow to use the Economic Complexity Index to guide innovation plans
How to use the Economic Complexity Index to guide innovation plansData Science Milan
 
Robustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsRobustness Metrics for ML Models based on Deep Learning Methods
Robustness Metrics for ML Models based on Deep Learning MethodsData Science Milan
 
"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companies"You don't need a bigger boat": serverless MLOps for reasonable companies
"You don't need a bigger boat": serverless MLOps for reasonable companiesData Science Milan
 
Question generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AIQuestion generation using Natural Language Processing by QuestGen.AI
Question generation using Natural Language Processing by QuestGen.AIData Science Milan
 
Speed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSSpeed up data preparation for ML pipelines on AWS
Speed up data preparation for ML pipelines on AWSData Science Milan
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureData Science Milan
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraData Science Milan
 
Time Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraTime Series Classification with Deep Learning | Marco Del Pra
Time Series Classification with Deep Learning | Marco Del PraData Science Milan
 
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AILudwig: A code-free deep learning toolbox | Piero Molino, Uber AI
Ludwig: A code-free deep learning toolbox | Piero Molino, Uber AIData Science Milan
 
Sparklyr: Big Data enabler for R users - Serena Signorelli, ICTEAM

  • 1. Sparklyr: Big Data enabler for R users
       Serena Signorelli
       Data Science Milan, May 15th 2017
       ICTeam S.p.A. – Presentazione della Divisione Progettazione
  • 2. Outline
       - About me
       - The Data Science process
       - Package and its functionalities
       - SparkR vs Sparklyr
       - Demo on NYC taxi data
  • 3. About me
       Experience:
       - Business administration and management
       - Research grants in Economics Statistics
       - PhD in Analytics for Economics and Business
       - Traineeship at Eurostat Big Data Task Force
       - Data scientist at ICTeam SpA
       Why Sparklyr?
       - R user
       - No computer science background
       - Need to handle Big Data
  • 4. R language
       - Open source
       - 5th most popular programming language in 2016 (IEEE Spectrum ranking)
       - Data analysis, statistical modelling and visualization
       - Historically limited to in-memory data
  • 5. Data Science process
       1. Import data into memory
       2. Clean and tidy the data
       3. Cyclical process called understand:
          1. making transformations to tidied data
          2. using the transformed data to fit models
          3. visualizing results
       4. Communicate the results
  • 6. Data Science process with big data
       Problem: data is too large to download into memory
       Workaround: use a very small sample or download as much data as possible
  • 7. Data Science process with big data
       Limitations: the sample may not be representative, and every iteration of importing, exploring and modeling involves a long wait
       Solution: use Sparklyr to access and analyze the data inside Spark and only bring results into R
  • 8. Data Science process with big data
  • 9. Sparklyr: R interface for Apache Spark
       - First release: 0.4 – September 24th, 2016
       - Current release: 0.5.4 – April 25th, 2017
  • 10. Dplyr
        - Dplyr verbs:
            - select ~ SELECT
            - filter ~ WHERE
            - arrange ~ ORDER BY
            - summarise ~ aggregators: sum, min, sd, etc.
            - mutate ~ operators: +, *, log, etc.
        - Grouping: group_by ~ GROUP BY
        - Window functions: rank, dense_rank, percent_rank, ntile, row_number, cume_dist, first_value, last_value, lag, lead
        - Performing joins: inner_join, semi_join, left_join, anti_join, full_join
        - Sampling: sample_n, sample_frac
  • 11. Dplyr
        SQL translation:
        - Basic math operators: +, -, *, /, %%, ^
        - Math functions: abs, acos, asin, asinh, atan, atan2, ceiling, cos, cosh, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
        - Logical comparisons: <, <=, !=, >=, >, ==, %in%
        - Boolean operations: &, &&, |, ||, !
        - Character functions: paste, tolower, toupper, nchar
        - Casting: as.double, as.integer, as.logical, as.character, as.date
        - Basic aggregations: mean, sum, min, max, sd, var, cor, cov, n
  • 12. Dplyr in Sparklyr
        - Hive functions: many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize
        - Reading and writing data: spark_read_csv, spark_read_json, spark_read_parquet, spark_write_csv, spark_write_json, spark_write_parquet
        - Collecting to R: collect()
  • 13. Dplyr in Sparklyr
        Characteristics:
        - Laziness
            - It never pulls data into R unless you explicitly ask for it
            - It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step
        - Piping %>%
            - From package magrittr
            - Provides a mechanism for chaining commands with a forward-pipe operator
  • 14. Dplyr in Sparklyr: an example
        dplyr:
            jfk_pickup_tbl <- yellow_taxi_raw_data_prepared_tbl %>%
              filter(pickup_ntacode == 'QN98') %>%
              filter(!is.na(dropoff_ntacode)) %>%
              mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
              group_by(dropoff_ntacode, dropoff_ntaname) %>%
              summarize(n = n(),
                        trip_time_mean = mean(trip_time),
                        trip_dist_mean = mean(trip_distance),
                        dropoff_latitude = mean(dropoff_latitude),
                        dropoff_longitude = mean(dropoff_longitude),
                        passenger_mean = mean(passenger_count),
                        fare_amount = mean(fare_amount),
                        tip_amount = mean(tip_amount))
        Spark SQL:
            SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`,
              AVG(`trip_time`) AS `trip_time_mean`,
              AVG(`trip_distance`) AS `trip_dist_mean`,
              AVG(`dropoff_latitude`) AS `dropoff_latitude`,
              AVG(`dropoff_longitude`) AS `dropoff_longitude`,
              AVG(`passenger_count`) AS `passenger_mean`,
              AVG(`fare_amount`) AS `fare_amount`,
              AVG(`tip_amount`) AS `tip_amount`
            FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`,
                `trip_distance`, `pickup_longitude`, `pickup_latitude`, `ratecodeid`,
                `store_and_fwd_flag`, `dropoff_longitude`, `dropoff_latitude`, `payment_type`,
                `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
                `improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`,
                `pickup_ntacode`, `pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`,
                `dropoff_ntacode`, `dropoff_ntaname`,
                UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
              FROM (SELECT *
                FROM (SELECT *
                  FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                  WHERE (`pickup_ntacode` = 'QN98')) `jxjgtsgzwv`
                WHERE (NOT((`dropoff_ntacode`) IS NULL))) `cobkqjitky`) `vquwddaabv`
            GROUP BY `dropoff_ntacode`, `dropoff_ntaname`
  • 15. Dplyr in Sparklyr: an example
        dplyr:
            Jfk_pickup <- jfk_pickup_tbl %>%
              mutate(n_rank = min_rank(desc(n))) %>%
              filter(n_rank <= 25)
        Spark SQL:
            SELECT *
            FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, `n`, `trip_time_mean`,
                `trip_dist_mean`, `dropoff_latitude`, `dropoff_longitude`, `passenger_mean`,
                `fare_amount`, `tip_amount`,
                rank() OVER (PARTITION BY `dropoff_ntacode` ORDER BY `n` DESC) AS `n_rank`
              FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`,
                  AVG(`trip_time`) AS `trip_time_mean`,
                  AVG(`trip_distance`) AS `trip_dist_mean`,
                  AVG(`dropoff_latitude`) AS `dropoff_latitude`,
                  AVG(`dropoff_longitude`) AS `dropoff_longitude`,
                  AVG(`passenger_count`) AS `passenger_mean`,
                  AVG(`fare_amount`) AS `fare_amount`,
                  AVG(`tip_amount`) AS `tip_amount`
                FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`,
                    `trip_distance`, `pickup_longitude`, `pickup_latitude`, `ratecodeid`,
                    `store_and_fwd_flag`, `dropoff_longitude`, `dropoff_latitude`, `payment_type`,
                    `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
                    `improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`,
                    `pickup_ntacode`, `pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`,
                    `dropoff_ntacode`, `dropoff_ntaname`,
                    UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
                  FROM (SELECT *
                    FROM (SELECT *
                      FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
                      WHERE (`pickup_ntacode` = 'QN98')) `zrhxchievt`
                    WHERE (NOT((`dropoff_ntacode`) IS NULL))) `wedupsfkki`) `hkpwsclpve`
                GROUP BY `dropoff_ntacode`, `dropoff_ntaname`) `bugdzqpxlv`) `haexysqfhn`
            WHERE (`n_rank` <= 25.0)
  • 16. ML in Sparklyr
        Sparklyr gives access to the machine learning routines provided by the spark.ml package
        Three families of functions:
        - Machine learning algorithms for analyzing data (ml_*)
        - Feature transformers for manipulating individual features (ft_*)
        - Functions for manipulating Spark DataFrames (sdf_*)
        Example:
        - Perform SQL queries through the sparklyr dplyr interface
        - Use the sdf_* and ft_* families of functions to generate new columns or partition your data set
        - Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data
  • 17. Extensions in Sparklyr
        Extensions can be created to call the full Spark API and to provide interfaces to Spark packages
        Package         Description
        spark.sas7bdat  Read in SAS data in parallel into Apache Spark.
        rsparkling      Extension for using H2O machine learning algorithms against Spark Data Frames.
        sparkhello      Simple example of including a custom JAR file within an extension package.
        rddlist         Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
        sparkwarc       Load WARC files into Apache Spark with sparklyr.
        sparkavro       Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
  • 18. Sparklyr help: the RStudio cheat sheet
  • 19. SparkR vs Sparklyr
        SparkR:
        - natively included in Spark after version 1.6.2
        - example:
            df <- createDataFrame(flights)
            head(select(df, df$distance, df$origin))
            or
            head(df[, c('distance', 'origin')])
            filter(df, df$distance > 3000)
        - documentation through R’s help
        Sparklyr:
        - developed by RStudio, available on CRAN and GitHub
        - it can download and install Spark for development purposes
        - example:
            df <- copy_to(sc2, flights)
            head(select(df, distance, origin))
            filter(df, distance > 3000)
        - documentation through R’s help
  • 20. SparkR vs Sparklyr
        SparkR:
        - spark.logit, spark.mlp, spark.naiveBayes, spark.survreg, spark.glm, spark.gbt, spark.randomForest, spark.kmeans, spark.lda, spark.isoreg, spark.gaussianMixture, spark.als, spark.kstest
        - UDF functions
        Sparklyr:
        - ml_logistic_regression, ml_multilayer_perceptron, ml_naive_bayes, ml_survival_regression, ml_generalized_linear_regression, ml_gradient_boosted_trees, ml_random_forest, ml_kmeans, ml_lda, ml_linear_regression, ml_decision_tree, ml_pca, ml_one_vs_rest
        - UDF functions (but can invoke Scala code)
  • 21. SparkR vs Sparklyr in Google Trends
  • 22. SparkR vs Sparklyr in Google Trends
  • 23. Demo on NYC taxi data
        - 1 billion NYC taxi trip records
        - Original analysis by Todd W. Schneider [1], November 2015, plus RStudio webinar [2], October 2016
        - 77 GB of data stored in a Hive table
        [1] http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
        [2] https://www.rstudio.com/resources/webinars/using-spark-with-shiny-and-r-markdown/
  • 24. Thank you for your attention
        serena.signorelli@icteam.it
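
The lazy dplyr workflow described on slides 12 to 15 (build a pipeline, let dplyr translate it to Spark SQL, and only then bring the result into R) can be sketched roughly as follows. This is a minimal illustration, not the code from the demo: it assumes sparklyr and dplyr are installed and that a local Spark installation is available, and the `trips` data frame, its values and the name `jfk_summary` are hypothetical stand-ins for the Hive taxi table used in the talk.

```r
# Minimal sketch, assuming sparklyr and dplyr are installed and a local
# Spark installation exists (sparklyr::spark_install() can download one).
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical toy data standing in for the taxi table from the slides.
trips <- data.frame(
  pickup_ntacode  = c("QN98", "QN98", "QN98", "MN17"),
  dropoff_ntacode = c("MN17", "MN17", NA, "QN98"),
  trip_distance   = c(17.2, 16.8, 3.1, 18.0),
  fare_amount     = c(52, 52, 12.5, 52),
  stringsAsFactors = FALSE
)
trips_tbl <- copy_to(sc, trips, "trips", overwrite = TRUE)

# Laziness: this pipeline only describes a query; no rows are pulled into R.
jfk_summary <- trips_tbl %>%
  filter(pickup_ntacode == "QN98") %>%
  filter(!is.na(dropoff_ntacode)) %>%
  group_by(dropoff_ntacode) %>%
  summarise(
    n = n(),
    trip_dist_mean = mean(trip_distance, na.rm = TRUE),
    fare_mean = mean(fare_amount, na.rm = TRUE)
  )

# Print the Spark SQL that dplyr generated for the pipeline.
show_query(jfk_summary)

# collect() runs the query in Spark and returns a local data frame.
jfk_local <- collect(jfk_summary)
print(jfk_local)

spark_disconnect(sc)
```

Comparing the output of `show_query()` with the pipeline gives the same kind of side-by-side view as the dplyr and Spark SQL panels on slides 14 and 15, while `collect()` is the single point at which data leaves Spark.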