Sparklyr: Big Data enabler for R users - Serena Signorelli, ICTEAM
1. ICTeam S.p.A. – Presentation of the Design Division
Sparklyr: Big Data enabler for R users
Serena Signorelli
Data Science Milan, May 15th 2017
2.
Outline
About me
The Data Science process
Package and its functionalities
SparkR vs Sparklyr
Demo on NYC taxi data
3.
About me
Experience:
Business administration and management
Research grants in Economics Statistics
PhD in Analytics for Economics and Business
Traineeship at Eurostat Big Data Task Force
Data scientist at ICTeam SpA
Why Sparklyr?
R user
No computer science background
Need to handle Big Data
4.
R language
Open source
5th most popular programming language in 2016 (IEEE Spectrum ranking)
Data analysis, statistical modelling and visualization
Historically limited to in-memory data
5.
Data Science process
1. Import data into memory
2. Clean and tidy the data
3. Understand the data, a cyclical process of:
   1. making transformations to the tidied data
   2. using the transformed data to fit models
   3. visualizing the results
4. Communicate the results
6.
Data Science process with big data
Problem: the data is too large to download into memory
Workaround: use a very small sample or download as much data as possible
7.
Data Science process with big data
Limitations: the sample may not be representative, and every iteration of importing, exploring and modeling involves a long wait
Solution: use Sparklyr to access and analyze the data inside Spark, and bring only the results into R
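The idea of analyzing data inside Spark and collecting only the results can be sketched as follows. This is a minimal sketch, not code from the talk: it assumes a local Spark installation reachable through spark_connect(master = "local") and uses R's built-in mtcars data as a small stand-in for a table that would normally be too large for memory.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

# Stand-in for a large table that already lives in Spark.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# The aggregation is executed by Spark; only the small summary
# table is transferred into the R session by collect().
mpg_by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), mpg_mean = mean(mpg, na.rm = TRUE)) %>%
  collect()

print(mpg_by_cyl)

spark_disconnect(sc)
```

The object cars_tbl is only a reference to the Spark table, so the size of the R session stays independent of the size of the data.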
8.
Data Science process with big data
9.
Sparklyr: R interface for Apache Spark
First release: 0.4 – September 24th, 2016
Current release: 0.5.4 – April 25th, 2017
10.
Dplyr
Dplyr verbs:
select ~ SELECT
filter ~ WHERE
arrange ~ ORDER BY
summarise ~ aggregators: sum, min, sd, etc.
mutate ~ operators: +, *, log, etc.
Grouping: group_by ~ GROUP BY
Window functions: rank, dense_rank, percent_rank, ntile, row_number, cume_dist, first_value, last_value, lag, lead
Performing joins: inner_join, semi_join, left_join, anti_join, full_join
Sampling: sample_n, sample_frac
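The correspondence between dplyr verbs and SQL clauses listed above can be tried on an ordinary in-memory data frame. The sketch below uses R's built-in mtcars data, which is an assumption made for illustration; the SQL clause each verb corresponds to is given in the comments.

```r
library(dplyr)

# Each verb is annotated with the SQL clause it corresponds to.
result <- mtcars %>%
  select(mpg, cyl, wt) %>%                    # SELECT mpg, cyl, wt
  filter(wt < 4) %>%                          # WHERE wt < 4
  mutate(kpl = mpg * 0.425144) %>%            # computed column (miles/gallon to km/litre)
  group_by(cyl) %>%                           # GROUP BY cyl
  summarise(n = n(), kpl_mean = mean(kpl)) %>% # aggregators: count(*), AVG
  arrange(desc(kpl_mean))                     # ORDER BY kpl_mean DESC

# A window function: rank cars by mpg within each cylinder group
# (RANK() OVER (PARTITION BY cyl ORDER BY mpg DESC)) and keep the top two ranks.
top2 <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_rank = min_rank(desc(mpg))) %>%
  filter(mpg_rank <= 2) %>%
  ungroup()

print(result)
print(top2)
```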
12.
Dplyr in Sparklyr
Hive functions:
many of Hive’s built-in functions (UDFs) and built-in aggregate functions (UDAFs) can be called inside dplyr’s mutate and summarize
Reading and writing data:
spark_read_csv, spark_read_json, spark_read_parquet,
spark_write_csv, spark_write_json, spark_write_parquet
Collecting to R:
collect()
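A compact sketch of reading, transforming with Hive functions, writing and collecting is given below. It is illustrative only: the toy trips data, the temporary file paths and the local Spark connection are assumptions, while unix_timestamp (used in the talk's own example) and percentile_approx are Hive/Spark SQL functions that dplyr passes through to Spark untranslated.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Hypothetical toy data written to a temporary CSV file.
trips <- data.frame(
  pickup_datetime  = c("2016-06-01 10:00:00", "2016-06-01 11:00:00", "2016-06-01 12:00:00"),
  dropoff_datetime = c("2016-06-01 10:10:00", "2016-06-01 11:30:00", "2016-06-01 12:20:00"),
  stringsAsFactors = FALSE
)
csv_path <- tempfile(fileext = ".csv")
write.csv(trips, csv_path, row.names = FALSE)

# Reading: load the CSV file into a Spark table.
trips_tbl <- spark_read_csv(sc, name = "trips", path = csv_path, header = TRUE)

# Hive UDF inside mutate, Hive UDAF inside summarise.
trip_stats_tbl <- trips_tbl %>%
  mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
  summarise(n = n(), trip_time_median = percentile_approx(trip_time, 0.5))

# Writing: store the result as Parquet without passing through R.
spark_write_parquet(trip_stats_tbl, tempfile(fileext = ".parquet"), mode = "overwrite")

# Collecting: bring the (small) result into R as a local data frame.
trip_stats <- collect(trip_stats_tbl)
print(trip_stats)

spark_disconnect(sc)
```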
13.
Dplyr in Sparklyr
Characteristics:
Laziness
  It never pulls data into R unless you explicitly ask for it
  It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step
Piping (%>%)
  From the magrittr package
  Provides a mechanism for chaining commands with a forward-pipe operator
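Laziness and piping can be observed together in a short session. The sketch below assumes a local Spark connection and uses mtcars as stand-in data, neither of which comes from the talk; show_query() prints the SQL that dplyr has accumulated without running it.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars_lazy", overwrite = TRUE)

# Piping: x %>% f(y) is equivalent to f(x, y), so the steps read top to bottom.
heavy_cars <- cars_tbl %>%
  filter(wt > 3) %>%
  select(mpg, cyl, wt)

# Laziness: heavy_cars is only a description of a query at this point.
# show_query() prints the generated SQL; no rows have been moved into R.
show_query(heavy_cars)

# collect() sends the whole query to Spark in one step and returns a local tibble.
heavy_local <- collect(heavy_cars)

spark_disconnect(sc)
```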
14.
Dplyr in Sparklyr: an example
dplyr:
jfk_pickup_tbl <- yellow_taxi_raw_data_prepared_tbl %>%
  filter(pickup_ntacode == 'QN98') %>%
  filter(!is.na(dropoff_ntacode)) %>%
  mutate(trip_time = unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) %>%
  group_by(dropoff_ntacode, dropoff_ntaname) %>%
  summarize(n = n(),
            trip_time_mean = mean(trip_time),
            trip_dist_mean = mean(trip_distance),
            dropoff_latitude = mean(dropoff_latitude),
            dropoff_longitude = mean(dropoff_longitude),
            passenger_mean = mean(passenger_count),
            fare_amount = mean(fare_amount),
            tip_amount = mean(tip_amount))
Spark SQL:
SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`, AVG(`trip_time`) AS `trip_time_mean`,
AVG(`trip_distance`) AS `trip_dist_mean`, AVG(`dropoff_latitude`) AS `dropoff_latitude`,
AVG(`dropoff_longitude`) AS `dropoff_longitude`, AVG(`passenger_count`) AS `passenger_mean`,
AVG(`fare_amount`) AS `fare_amount`, AVG(`tip_amount`) AS `tip_amount`
FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
`pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
`dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
`improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
`pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
FROM (SELECT *
FROM (SELECT *
FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
WHERE (`pickup_ntacode` = 'QN98')) `jxjgtsgzwv`
WHERE (NOT((`dropoff_ntacode`) IS NULL))) `cobkqjitky`) `vquwddaabv`
GROUP BY `dropoff_ntacode`, `dropoff_ntaname`
15.
Dplyr in Sparklyr: an example
dplyr:
jfk_pickup <- jfk_pickup_tbl %>%
  mutate(n_rank = min_rank(desc(n))) %>%
  filter(n_rank <= 25)
Spark SQL:
SELECT *
FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, `n`, `trip_time_mean`, `trip_dist_mean`,
`dropoff_latitude`, `dropoff_longitude`, `passenger_mean`, `fare_amount`, `tip_amount`, rank() OVER
(PARTITION BY `dropoff_ntacode` ORDER BY `n` DESC) AS `n_rank`
FROM (SELECT `dropoff_ntacode`, `dropoff_ntaname`, count(*) AS `n`, AVG(`trip_time`) AS
`trip_time_mean`, AVG(`trip_distance`) AS `trip_dist_mean`, AVG(`dropoff_latitude`) AS
`dropoff_latitude`, AVG(`dropoff_longitude`) AS `dropoff_longitude`, AVG(`passenger_count`) AS
`passenger_mean`, AVG(`fare_amount`) AS `fare_amount`, AVG(`tip_amount`) AS `tip_amount`
FROM (SELECT `vendorid`, `pickup_datetime`, `dropoff_datetime`, `passenger_count`, `trip_distance`,
`pickup_longitude`, `pickup_latitude`, `ratecodeid`, `store_and_fwd_flag`, `dropoff_longitude`,
`dropoff_latitude`, `payment_type`, `fare_amount`, `extra`, `mta_tax`, `tip_amount`, `tolls_amount`,
`improvement_surcharge`, `total_amount`, `pickup_borocode`, `pickup_boroname`, `pickup_ntacode`,
`pickup_ntaname`, `dropoff_borocode`, `dropoff_boroname`, `dropoff_ntacode`, `dropoff_ntaname`,
UNIX_TIMESTAMP(`dropoff_datetime`) - UNIX_TIMESTAMP(`pickup_datetime`) AS `trip_time`
FROM (SELECT *
FROM (SELECT *
FROM `yellow_taxi_raw_data_2009_2016_june_geo_partitioned`
WHERE (`pickup_ntacode` = 'QN98')) `zrhxchievt`
WHERE (NOT((`dropoff_ntacode`) IS NULL))) `wedupsfkki`) `hkpwsclpve`
GROUP BY `dropoff_ntacode`, `dropoff_ntaname`) `bugdzqpxlv`) `haexysqfhn`
WHERE (`n_rank` <= 25.0)
16.
ML in Sparklyr
Sparklyr gives access to the machine learning routines provided by the spark.ml package
Three families of functions:
  Machine learning algorithms for analyzing data (ml_*)
  Feature transformers for manipulating individual features (ft_*)
  Functions for manipulating Spark DataFrames (sdf_*)
Example workflow:
  1. Perform SQL queries through the sparklyr dplyr interface
  2. Use the sdf_* and ft_* families of functions to generate new columns or partition your data set
  3. Choose an appropriate machine learning algorithm from the ml_* family of functions to model your data
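The three-step workflow above might look like the following. This is a hedged sketch rather than the demo from the talk: it assumes a local Spark connection, uses mtcars as stand-in data, and follows the function and argument names of recent sparklyr releases (sdf_random_split, input_col, ml_predict); releases contemporary with the talk used different names for some of these, for example sdf_partition.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, "mtcars_ml", overwrite = TRUE)

# 1. Prepare the data through the dplyr interface.
prepared <- cars_tbl %>%
  filter(hp >= 50) %>%
  select(mpg, wt, hp)

# 2. ft_*: derive a 0/1 feature from hp; sdf_*: split into training and test sets.
prepared <- ft_binarizer(prepared, input_col = "hp", output_col = "high_hp", threshold = 100)
splits <- sdf_random_split(prepared, training = 0.7, test = 0.3, seed = 42)

# 3. ml_*: fit a linear regression inside Spark and score the test set.
fit <- ml_linear_regression(splits$training, mpg ~ wt + high_hp)
predictions <- ml_predict(fit, splits$test) %>% collect()

print(head(predictions))

spark_disconnect(sc)
```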
17.
Extensions in Sparklyr
Extensions can be created to call the full Spark API and to provide interfaces to Spark packages

Package          Description
spark.sas7bdat   Read in SAS data in parallel into Apache Spark.
rsparkling       Extension for using H2O machine learning algorithms against Spark Data Frames.
sparkhello       Simple example of including a custom JAR file within an extension package.
rddlist          Implements some methods of an R list as a Spark RDD (resilient distributed dataset).
sparkwarc        Load WARC files into Apache Spark with sparklyr.
sparkavro        Load Avro data into Spark with sparklyr. It is a wrapper of spark-avro.
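Extensions of this kind are built on sparklyr's invoke() function, which calls methods on Spark's JVM objects from R. The sketch below is an illustration under assumptions: count_lines is a hypothetical helper name, the text file is a temporary file created for the example, and the connection is to a local Spark installation where a plain local path can be read directly.

```r
library(sparklyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Hypothetical helper: call SparkContext.textFile(path, 1) and then RDD.count()
# directly through the Spark API, without any dplyr translation.
count_lines <- function(sc, path) {
  rdd <- invoke(spark_context(sc), "textFile", path, 1L)
  invoke(rdd, "count")
}

# Toy input file with three lines.
txt_path <- normalizePath(tempfile(fileext = ".txt"), winslash = "/", mustWork = FALSE)
writeLines(c("first line", "second line", "third line"), txt_path)

n_lines <- count_lines(sc, txt_path)
print(n_lines)

spark_disconnect(sc)
```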
18.
Sparklyr help: the RStudio cheat sheet
19.
SparkR vs Sparklyr
SparkR
  natively included in Spark since version 1.4
  documentation through R’s help
  Example:
    df <- createDataFrame(flights)
    head(select(df, df$distance, df$origin))
    or
    head(df[, c('distance', 'origin')])
    filter(df, df$distance > 3000)
Sparklyr
  developed by RStudio, available on CRAN and GitHub
  can download and install Spark for development purposes
  documentation through R’s help
  Example:
    df <- copy_to(sc2, flights)
    head(select(df, distance, origin))
    filter(df, distance > 3000)
21.
SparkR vs Sparklyr in Google Trends
22.
SparkR vs Sparklyr in Google Trends
23.
Demo on NYC taxi data
Data on about 1 billion NYC taxi trips
Based on the original analysis by Todd W. Schneider [1], November 2015, and an RStudio webinar [2], October 2016
77 GB of data stored in a Hive table
[1] http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
[2] https://www.rstudio.com/resources/webinars/using-spark-with-shiny-and-r-markdown/
24.
serena.signorelli@icteam.it
Thank you for your attention