SlideShare a Scribd company logo
1 of 21
Download to read offline
Heiko Korndorf, Wireframe
An Update on
Scaling Data Science
with SparkR
#DSSAIS18
Agenda
• About Me
• Spark & R
• Spark Architecture
• Spark DataFrames and SparkSQL
• Natively Distributed ML with Spark ML
• Big Compute: Parallelization with Spark UDFs
• ML & Data-in-motion: Spark Streaming
• Tips & Pitfalls
• What About Python?
• Summary & Outlook
2#DSSAIS18
About Me
3#DSSAIS18
• MSc in Computer Science, University of Zurich
• > 20 Years in Consulting
• Finance, Energy, Telco, Pharma, Manufacturing
• EAI, BI/Data Warehousing, CRM, ERP, Technology
• Speaker at SparkSummit, HadoopSummit, and others
• Founder & CEO Wireframe AG
• PatternFinder: Data Science Data Warehouse / Business Machine Intelligence
• PatternGenerator: Development Tool for Streaming Applications
• https://wireframe.digital
SparkR Architecture
4#DSSAIS18
Integration
Libraries
Core
• Execute R on cluster
• Master/Slave
• Out-Of-Memory Datasets
• Access Data Lake
• Powerful Libraries
• Machine Learning
• SQL
• Streaming
• R Integration through SparkR
Data (Lake) Access
5#DSSAIS18
• Ability to read Big Data File Formats
• HDFS, AWS S3, Azure WASB, …
• Parquet, CSV, JSON, ORC, ...
• Security
• Fine grained authorization
• Role-/Attribute-based Access Control
• Governance
• Metadata Management
• Lineage
DataFrame
SparkSQL
6#DSSAIS18
• Execute SQL against Spark DataFrame
• SELECT
• Specify Projection
• WHERE
• Filter criteria
• GROUPBY
• Group/Aggregate
• JOIN
• Join tables
• Alternatively, use
• select(), where(), groupBy(),
count(), etc.
Spark MLlib
7#DSSAIS18
SparkR & Streaming
8#DSSAIS18
Use R to process data streams
• Structured Streaming:
• DataFrames with streaming sources
• New data in stream is appended to an
unbounded table (i.e. DataFrame)
• Seamless integration:
• read.stream(“kafka”, ….)
Data Stream Unbounded Table
SparkR UDFs
9#DSSAIS18
SparkR Functions: gapply()/gapplyCollect(), dapply()/dapplyCollect()
• Apply a function to each group/partition of a Spark DataFrame
• Input: Grouping Key and DataFrame partition as R data.frame
• Output: R data.frame
SparkR UDFs
10#DSSAIS18
SparkR spark.lapply()
• Run a function over a list of elements and
• Distribute the computation with Spark
Big Compute
11#DSSAIS18
Areas where massively parallel computation is relevant:
• Ensemble Learning for Time Series
• Hyperparameter Sweeps
• High-Dimensional Problem/Search-Space
• Wireframe PatternFinder
• Shape/Motif Detection
• IoT Pattern/Shapes
• Monte-Carlo Simulation
• Value-at-Risk (Finance)
• Catastrophe Modeling (Reinsurance)
• Inventory Optimization (Oil & Gas, Manufacturing)
Big Compute
12#DSSAIS18
Massive Time Series Forecasting
• Sequential computation: > 22 hours
• Single-Server, parallelized: > 4.5 hours
• SparkR, Cluster w/ 25 nodes: ~ 12 minutes
Wireframe PatternFinder
• 15.500.000 Models to be computed
• 50 DataFrames x 5 Dependants x 10 Independants x 5 Models x 100
Segments
• 0.1 Sec. per Model
• Sequential: ~ 18 Days
• 1000 Cores: ~ 26 Minutes
Implications
• Minor refactoring of R code
• Massive cost reduction by using elastically scaling of Cloud resources
Tips & Pitfalls
13#DSSAIS18
• Generate diagrams in SparkR
• PDF, HTML, Rmarkdown
• Store in shared persistence or serialize into Spark DataFrame
• SPARK-21866 might be helpful?
• Store complex object (models) from SparkR
• saveRDS() saves to local storage
• Store in shared persistence or serialize into Spark DataFrame
• Run R on a YARN cluster w/o locally installed R
• / / -- - / .- . ../ . / -
• Mixing Scala & R
• Not supported by Oozie’s SparkAction
• Can be replaced with ShellAction
• Not supported by Apache Livy
• Only support for Scala, Java, and Python
And What About Python?
14#DSSAIS18
“Do I need to learn Python?”
Let’s compare (Spark)R & Py(Spark):
• Language: Interpreted Languages, R (1993), Python (1991)
• Libraries: CRAN (> 10.000 packages), Numpy, scikit-learn, Pandas
• Package Management
• IDEs/Notebooks: Rstudio/PyCharm, Jupyter, Zeppelin, Databricks Analytics Platform, IBM
Watson Studio, Cloudera Data Science Workbench, ….
And there’s more:
Market Momentum Deep Learning
Spark Support Spark Integration
Market Momentum
15#DSSAIS18
Redmonk (Jan 2018)
1. JavaScript
2. Java
3. Python
4. PHP
5. C#
.
.
10. Swift
11. Objective-C
12. R
TIOBE index (May 2018):
4. Python
11. R https://redmonk.com/sogrady/2018/03/07/language-rankings-1-18/
Deep Learning
16#DSSAIS18
R Python Other APIs
TensorFlow No Yes C++, Java, Go, Swift
Keras Yes Yes No
MXNet Yes Yes C++, Scala, Julia, Perl
PyTorch No Yes No
CNTK No Yes C++
• Python is a first-class citizen in the Deep Learning/Neural Network world
• Using R with these DL frameworks is possible but more complex
SparkR v PySpark
17#DSSAIS18
SparkR PySpark
Data Lake Integration Yes Yes
Spark SQL Yes Yes
Spark ML Yes Yes
UDFs Yes Yes
Streaming Yes Yes
• Both R and Python can access the same types of Spark APIs
Spark Integration
18#DSSAIS18
JVM vs Non-JVM
• Spark Executors run in JVM
• R & Python run in different processes
• Data must be moved between both
environments (SerDe)
• Low performance
Apache Arrow
19#DSSAIS18
In-Memory Columnar Data Format
• Cross-Language
• Optimized for numeric data
• Zero-copy reads (no serialization)
• Support
• Spark 2.3
• Python
• R Bindings not available yet
• And more: Parquet, HBase, Cassandra, …
• Will also improve R-Python-Integration (rpy2, reticulate)
See RStudio Blog (04/19/2018): https://blog.rstudio.com/2018/04/19/arrow-and-beyond/
DataFrame
Summary & Outlook
20#DSSAIS18
• Spark is the best option to scale R
• See also sparklyr, R Server for Spark
• Common Environment for Dev and Production
• “Looks like R to Data Science,
looks like Spark to Data Engineers”
• Security & Data Governance
• Row-/Column-Level Access Control
• Full Data Lineage (Up- and Downstream
• Shared Memory Format
• Apache Arrow!
• Mix and Match R, Python, and Scala
• Towards an Open Data Science Platform
Thank You!
21# DSSAIS18
Heiko Korndorf
heiko.korndorf@wireframe.digital

More Related Content

What's hot

"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
Pavel Hardak
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
Databricks
 

What's hot (20)

The Pursuit of Happiness: Building a Scalable Pipeline Using Apache Spark and...
The Pursuit of Happiness: Building a Scalable Pipeline Using Apache Spark and...The Pursuit of Happiness: Building a Scalable Pipeline Using Apache Spark and...
The Pursuit of Happiness: Building a Scalable Pipeline Using Apache Spark and...
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
 
The Evolution of the Fashion Retail Industry in the Age of AI with Kshitij Ku...
The Evolution of the Fashion Retail Industry in the Age of AI with Kshitij Ku...The Evolution of the Fashion Retail Industry in the Age of AI with Kshitij Ku...
The Evolution of the Fashion Retail Industry in the Age of AI with Kshitij Ku...
 
Flink London meetup 3 March 2016 - Flink basics
Flink London meetup 3 March 2016 - Flink basicsFlink London meetup 3 March 2016 - Flink basics
Flink London meetup 3 March 2016 - Flink basics
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Big data knolx
Big data knolxBig data knolx
Big data knolx
 
Automated Production Ready ML at Scale
Automated Production Ready ML at ScaleAutomated Production Ready ML at Scale
Automated Production Ready ML at Scale
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
Scaling Your Skillset with Your Data with Jarrett Garcia (Nielsen)
 
Data-Driven @ Netflix
Data-Driven @ NetflixData-Driven @ Netflix
Data-Driven @ Netflix
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
 
Data analytics at a petabyte scale final
Data analytics at a petabyte scale   finalData analytics at a petabyte scale   final
Data analytics at a petabyte scale final
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Build Your Own Recommendation Engine
Build Your Own Recommendation EngineBuild Your Own Recommendation Engine
Build Your Own Recommendation Engine
 

Similar to An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko Korndorf

Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 

Similar to An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko Korndorf (20)

Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
First impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithmFirst impressions of SparkR: our own machine learning algorithm
First impressions of SparkR: our own machine learning algorithm
 
Spark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko KorndorfSpark Summit EU talk by Heiko Korndorf
Spark Summit EU talk by Heiko Korndorf
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
 
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
Calum McCrea, Software Engineer at Kx Systems, "Kx: How Wall Street Tech can ...
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Nodes2020 | Graph of enterprise_metadata | NEO4J Conference
Nodes2020 | Graph of enterprise_metadata | NEO4J ConferenceNodes2020 | Graph of enterprise_metadata | NEO4J Conference
Nodes2020 | Graph of enterprise_metadata | NEO4J Conference
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
vexqp
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
vexqp
 

Recently uploaded (20)

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 

An Update on Scaling Data Science Applications with SparkR in 2018 with Heiko Korndorf

  • 1. Heiko Korndorf, Wireframe An Update on Scaling Data Science with SparkR #DSSAIS18
  • 2. Agenda • About Me • Spark & R • Spark Architecture • Spark DataFrames and SparkSQL • Natively Distributed ML with Spark ML • Big Compute: Parallelization with Spark UDFs • ML & Data-in-motion: Spark Streaming • Tips & Pitfalls • What About Python? • Summary & Outlook 2#DSSAIS18
  • 3. About Me 3#DSSAIS18 • MSc in Computer Science, University of Zurich • > 20 Years in Consulting • Finance, Energy, Telco, Pharma, Manufacturing • EAI, BI/Data Warehousing, CRM, ERP, Technology • Speaker at SparkSummit, HadoopSummit, and others • Founder & CEO Wireframe AG • PatternFinder: Data Science Data Warehouse / Business Machine Intelligence • PatternGenerator: Development Tool for Streaming Applications • https://wireframe.digital
  • 4. SparkR Architecture 4#DSSAIS18 Integration Libraries Core • Execute R on cluster • Master/Slave • Out-Of-Memory Datasets • Access Data Lake • Powerful Libraries • Machine Learning • SQL • Streaming • R Integration through SparkR
  • 5. Data (Lake) Access 5#DSSAIS18 • Ability to read Big Data File Formats • HDFS, AWS S3, Azure WASB, … • Parquet, CSV, JSON, ORC, ... • Security • Fine grained authorization • Role-/Attribute-based Access Control • Governance • Metadata Management • Lineage DataFrame
  • 6. SparkSQL 6#DSSAIS18 • Execute SQL against Spark DataFrame • SELECT • Specify Projection • WHERE • Filter criteria • GROUPBY • Group/Aggregate • JOIN • Join tables • Alternatively, use • select(), where(), groupBy(), count(), etc.
  • 8. SparkR & Streaming 8#DSSAIS18 Use R to process data streams • Structured Streaming: • DataFrames with streaming sources • New data in stream is appended to an unbounded table (i.e. DataFrame) • Seamless integration: • read.stream(“kafka”, ….) Data Stream Unbounded Table
  • 9. SparkR UDFs 9#DSSAIS18 SparkR Functions: gapply()/gapplyCollect(), dapply()/dapplyCollect() • Apply a function to each group/partition of a Spark DataFrame • Input: Grouping Key and DataFrame partition as R data.frame • Output: R data.frame
  • 10. SparkR UDFs 10#DSSAIS18 SparkR spark.lapply() • Run a function over a list of elements and • Distribute the computation with Spark
  • 11. Big Compute 11#DSSAIS18 Areas where massively parallel computation is relevant: • Ensemble Learning for Time Series • Hyperparameter Sweeps • High-Dimensional Problem/Search-Space • Wireframe PatternFinder • Shape/Motif Detection • IoT Pattern/Shapes • Monte-Carlo Simulation • Value-at-Risk (Finance) • Catastrophe Modeling (Reinsurance) • Inventory Optimization (Oil & Gas, Manufacturing)
  • 12. Big Compute 12#DSSAIS18 Massive Time Series Forecasting • Sequential computation: > 22 hours • Single-Server, parallelized: > 4.5 hours • SparkR, Cluster w/ 25 nodes: ~ 12 minutes Wireframe PatternFinder • 15.500.000 Models to be computed • 50 DataFrames x 5 Dependants x 10 Independants x 5 Models x 100 Segments • 0.1 Sec. per Model • Sequential: ~ 18 Days • 1000 Cores: ~ 26 Minutes Implications • Minor refactoring of R code • Massive cost reduction by using elastically scaling of Cloud resources
  • 13. Tips & Pitfalls 13#DSSAIS18 • Generate diagrams in SparkR • PDF, HTML, Rmarkdown • Store in shared persistence or serialize into Spark DataFrame • SPARK-21866 might be helpful? • Store complex object (models) from SparkR • saveRDS() saves to local storage • Store in shared persistence or serialize into Spark DataFrame • Run R on a YARN cluster w/o locally installed R • / / -- - / .- . ../ . / - • Mixing Scala & R • Not supported by Oozie’s SparkAction • Can be replaced with ShellAction • Not supported by Apache Livy • Only support for Scala, Java, and Python
  • 14. And What About Python? 14#DSSAIS18 “Do I need to learn Python?” Let’s compare (Spark)R & Py(Spark): • Language: Interpreted Languages, R (1993), Python (1991) • Libraries: CRAN (> 10.000 packages), Numpy, scikit-learn, Pandas • Package Management • IDEs/Notebooks: Rstudio/PyCharm, Jupyter, Zeppelin, Databricks Analytics Platform, IBM Watson Studio, Cloudera Data Science Workbench, …. And there’s more: Market Momentum Deep Learning Spark Support Spark Integration
  • 15. Market Momentum 15#DSSAIS18 Redmonk (Jan 2018) 1. JavaScript 2. Java 3. Python 4. PHP 5. C# . . 10. Swift 11. Objective-C 12. R TIOBE index (May 2018): 4. Python 11. R https://redmonk.com/sogrady/2018/03/07/language-rankings-1-18/
  • 16. Deep Learning 16#DSSAIS18 R Python Other APIs TensorFlow No Yes C++, Java, Go, Swift Keras Yes Yes No MXNet Yes Yes C++, Scala, Julia, Perl PyTorch No Yes No CNTK No Yes C++ • Python is a first-class citizen in the Deep Learning/Neural Network world • Using R with these DL frameworks is possible but more complex
  • 17. SparkR v PySpark 17#DSSAIS18 SparkR PySpark Data Lake Integration Yes Yes Spark SQL Yes Yes Spark ML Yes Yes UDFs Yes Yes Streaming Yes Yes • Both R and Python can access the same types of Spark APIs
  • 18. Spark Integration 18#DSSAIS18 JVM vs Non-JVM • Spark Executors run in JVM • R & Python run in different processes • Data must be moved between both environments (SerDe) • Low performance
  • 19. Apache Arrow 19#DSSAIS18 In-Memory Columnar Data Format • Cross-Language • Optimized for numeric data • Zero-copy reads (no serialization) • Support • Spark 2.3 • Python • R Bindings not available yet • And more: Parquet, HBase, Cassandra, … • Will also improve R-Python-Integration (rpy2, reticulate) See RStudio Blog (04/19/2018): https://blog.rstudio.com/2018/04/19/arrow-and-beyond/ DataFrame
  • 20. Summary & Outlook 20#DSSAIS18 • Spark is the best option to scale R • See also sparklyr, R Server for Spark • Common Environment for Dev and Production • “Looks like R to Data Science, looks like Spark to Data Engineers” • Security & Data Governance • Row-/Column-Level Access Control • Full Data Lineage (Up- and Downstream • Shared Memory Format • Apache Arrow! • Mix and Match R, Python, and Scala • Towards an Open Data Science Platform
  • 21. Thank You! 21# DSSAIS18 Heiko Korndorf heiko.korndorf@wireframe.digital