Designing a Machine Learning algorithm for Apache Spark
Marco Gaido
Software Engineer and Apache Spark contributor
2017-10-17
© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
Apache Spark and Machine Learning
What is Apache Spark?
▪ A fast and general-purpose cluster computing system
– Fast because it supports in-memory computing
▪ It was created with Machine Learning algorithms in mind
– They are very slow on MapReduce
– They are iterative
▪ Easy to use
– Users can implement their business logic with a high-level API
– Several APIs: Scala, Java, Python, SQL, R
▪ Four main modules are built on top of it:
– Spark Streaming
– Spark SQL
– MLlib
– GraphX
MLlib
▪ A complete ML library, which aims to cover all ML phases
– Featurization
– Training
– Evaluation
– Persistence
– Prediction
▪ High-level API
▪ Great performance
Agenda
Apache Spark and Machine Learning
Implementing an algorithm on Apache Spark
How to write an ML algorithm in MLlib?
▪ Spark is open source: anybody can contribute or create their own version
▪ It is as easy as rewriting the implementation using RDDs or DataFrames
▪ For many algorithms, a trivial implementation takes only a few lines of code
▪ Yet many well-known algorithms are still missing…
WHY?
DBSCAN
▪ DBSCAN is a widespread density-based clustering algorithm
– Two inputs: a radius (ε) and a number of points (minPts), used to decide whether an area is dense or sparse
▪ Naïve implementation:
– Find the ε (eps) neighbors of a point p
– If there are at least minPts of them
• If p already belongs to a cluster, then assign the neighbors to the same cluster
• Otherwise, create a new cluster containing p and its neighbors
– Repeat until all points have been processed
▪ Computational complexity: O(N²) in time or in memory
▪ A parallel (and reliable) implementation is not trivial at all
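To make the cost concrete, here is a minimal single-machine sketch of the classic sequential DBSCAN in plain Python (not Spark code). The names `region_query` and `dbscan`, the use of -1 for noise, and the convention that a point counts as its own neighbor are assumptions made for illustration. Every call to `region_query` scans all N points, which is where the O(N²) comes from.

```python
NOISE = -1  # hypothetical marker for points that belong to no cluster


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], itself included: O(N)."""
    return [j for j, q in enumerate(points)
            if squared_distance(points[i], q) <= eps * eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for outliers)."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already processed
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense too: keep growing
        cluster += 1
    return labels
```

On two tight groups of three points plus one far-away point, with eps=1.5 and min_pts=3, this sketch yields two clusters and marks the isolated point as noise. Cluster membership propagates through chains of neighbors, so a partition of the data cannot be labeled in isolation, which hints at why a parallel version is hard.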
Agenda
Apache Spark and Machine Learning
Implementing an algorithm on Apache Spark
Designing an algorithm for Apache Spark
Key points
▪ Shared state should be small (or absent altogether)
– It has to be kept in memory on all the executors
▪ The target computational complexity is O(N/W), where W is the number of executors
– This ensures infinite scalability
– O(N²) is not suitable for Big Data: 1M input records become 1T (10¹²) pairs to analyze, and 1T records become 1Y (10²⁴)
▪ Iterating multiple times over the same dataset is fine
– The dataset can be cached in memory
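A tiny simulation of the pattern these points describe, in plain Python rather than Spark: each of W workers reduces its own partition to a constant-size summary, and only the summaries are merged. The variance computation and all names (`partition`, `local_stats`, `merge`, `distributed_variance`) are hypothetical illustrations, not part of the talk.

```python
import functools


def partition(data, workers):
    """Split the data round-robin into one chunk per worker."""
    return [data[w::workers] for w in range(workers)]


def local_stats(chunk):
    """Work done by one worker: O(N/W) time, constant-size result."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a, b):
    """Combine two summaries; this is the only state that is shared."""
    return tuple(x + y for x, y in zip(a, b))


def distributed_variance(data, workers):
    """Population variance from per-worker (count, sum, sum of squares)."""
    partials = [local_stats(chunk) for chunk in partition(data, workers)]
    n, total, total_sq = functools.reduce(merge, partials)
    mean = total / n
    return total_sq / n - mean * mean
```

The result does not depend on the number of workers, and the merged state stays three numbers however large the dataset is. This is the property the Silhouette example relies on later.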
An example: Silhouette
▪ SPARK-14516: introduced in the next Apache Spark release (2.3.0)
▪ A measure of the quality of a clustering result
▪ Implementation of the Silhouette measure using the squared Euclidean distance
▪ References:
– Design document: https://goo.gl/7cJV64
– Code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
Silhouette
▪ Definition
– For each datum i, compute the average dissimilarity to all the other data in the same cluster: a(i)
– Compute the average dissimilarity of i to each of the other clusters and pick the smallest one: b(i)
– Then compute the Silhouette coefficient for i: $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
– Compute the average of the Silhouette coefficients over all points
▪ Computational complexity
– O(N²): for each point, we need to compute its distance to all the other points
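The definition translates almost literally into code. The following is a minimal plain-Python sketch (not the Spark implementation) using the squared Euclidean distance. The name `naive_silhouette` and the convention that a point alone in its cluster scores 0 are assumptions, and at least two clusters are expected. The nested loop over all pairs is the O(N²) cost.

```python
from collections import defaultdict


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def naive_silhouette(points, labels):
    """Mean Silhouette coefficient from all pairwise distances: O(N^2)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distance totals and member counts per cluster, excluding p itself.
        totals = defaultdict(float)
        counts = defaultdict(int)
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                totals[label] += squared_euclidean(p, q)
                counts[label] += 1
        if counts[own] == 0:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        a = totals[own] / counts[own]
        b = min(totals[c] / counts[c] for c in counts if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

For example, `naive_silhouette([(0, 0), (0, 2), (4, 0), (4, 2)], [0, 0, 1, 1])` gives a = 4 and b = 18 for every point, so the mean coefficient is 14/18.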
Squared Euclidean Silhouette
▪ The problem is computing the average distance of a point X to a cluster C:
$$\frac{\sum_{i=1}^{N_C} \sum_{j=1}^{D} (x_j - c_{ij})^2}{N_C}$$
… after some old but gold algebra …
$$\frac{N_C \, \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{Cj} \, x_j}{N_C}$$
Where $\xi_X$ is a constant which can be precomputed for each point X, and $\Psi_C$, $Y_C$, $N_C$ are constants (actually $Y_C$ is a vector) precomputed for each cluster
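The algebra in question is a plain expansion of the square. The slide does not spell out the constants, so the definitions used here are an assumption, chosen as the ones that make the two expressions agree: $\xi_X = \sum_{j=1}^{D} x_j^2$, $\Psi_C = \sum_{i=1}^{N_C} \sum_{j=1}^{D} c_{ij}^2$ and $Y_{Cj} = \sum_{i=1}^{N_C} c_{ij}$.

$$
\begin{aligned}
\sum_{i=1}^{N_C} \sum_{j=1}^{D} (x_j - c_{ij})^2
&= \sum_{i=1}^{N_C} \sum_{j=1}^{D} \left( x_j^2 - 2\, x_j c_{ij} + c_{ij}^2 \right) \\
&= N_C \sum_{j=1}^{D} x_j^2 + \sum_{i=1}^{N_C} \sum_{j=1}^{D} c_{ij}^2 - 2 \sum_{j=1}^{D} x_j \sum_{i=1}^{N_C} c_{ij} \\
&= N_C \, \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{Cj} \, x_j
\end{aligned}
$$

Dividing by $N_C$ gives the average distance. Only the dot product between $Y_C$ and X depends on both the point and the cluster, and it costs O(D).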
Squared Euclidean Silhouette (2)
▪ With the previous equation, the Silhouette coefficient of each point can be computed without computing its distance to all the other points
– We precompute the cluster values (i.e. the state)
– We apply the formula to each point, for all the clusters
– We compute the average of the Silhouette coefficients
▪ We can assume the number of clusters is rather small
– Then our shared state is small
▪ The overall complexity is O(N · C · D / W), where C is the number of clusters and D the number of dimensions
– We can assume that C and D are much smaller than N, hence O(N/W) → infinite scalability
$$\frac{N_C \, \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{Cj} \, x_j}{N_C}$$
[Figure: three clusters C1, C2 and C3, each with its precomputed values Ψ_C, Y_C and N_C]
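The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code of `ClusteringEvaluator`. The names `cluster_stats`, `avg_sq_distance` and `fast_silhouette` are hypothetical, the constants are taken as ξ_X = Σ x_j², Ψ_C = sum of the squared norms of the points in C, and Y_C = component-wise sum of the points in C. For the point's own cluster the average is rescaled by N_C / (N_C − 1) to exclude the zero distance to itself, and singleton clusters score 0. Both conventions are assumptions as well.

```python
def cluster_stats(points, labels):
    """One pass over the data: per-cluster N_C, Psi_C and Y_C (the shared state)."""
    stats = {}
    for p, label in zip(points, labels):
        s = stats.setdefault(label, {"n": 0, "psi": 0.0, "y": [0.0] * len(p)})
        s["n"] += 1
        s["psi"] += sum(x * x for x in p)
        s["y"] = [yj + xj for yj, xj in zip(s["y"], p)]
    return stats


def avg_sq_distance(p, xi, s):
    """(N_C * xi_X + Psi_C - 2 * <Y_C, X>) / N_C, computed in O(D)."""
    dot = sum(yj * xj for yj, xj in zip(s["y"], p))
    return (s["n"] * xi + s["psi"] - 2.0 * dot) / s["n"]


def fast_silhouette(points, labels):
    """Mean Silhouette coefficient in O(N * C * D), no pairwise distances."""
    stats = cluster_stats(points, labels)
    total = 0.0
    for p, own in zip(points, labels):
        n_own = stats[own]["n"]
        if n_own == 1:
            continue  # assumed convention: singleton clusters contribute 0
        xi = sum(x * x for x in p)
        # The own-cluster average includes the point itself, so rescale it.
        a = avg_sq_distance(p, xi, stats[own]) * n_own / (n_own - 1)
        b = min(avg_sq_distance(p, xi, s)
                for c, s in stats.items() if c != own)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)
```

The dictionary returned by `cluster_stats` holds C entries of size D + 2, which is the small shared state every worker would need. The loop in `fast_silhouette` touches each point once, so it splits across workers with no further coordination.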
Performance comparison
[Figure: single-thread tests on different datasets. Time in seconds (logarithmic axis, 1 to 10000) versus dataset cardinality N (0 to 160000), for the naïve Silhouette and the squared Euclidean Silhouette.]
Agenda
Apache Spark and Machine Learning
Implementing an algorithm on Apache Spark
Designing an algorithm for Apache Spark
Takeaways
Takeaways
▪ Think first: design your algorithms for Apache Spark
– Don’t just implement them with Spark
▪ In everything you do, consider parallelism
▪ Shared state and shared information are a bottleneck for scalability
– Keep them small!
▪ If your algorithm is O(N²), rethink it
Thank You, Q&A

More Related Content

What's hot

2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...DB Tsai
 
DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...
DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...
DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...Saikiran perfect
 
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...Saikiran Panjala
 
Transformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsTransformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsSujit Pal
 
DESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_edited
DESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_editedDESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_edited
DESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_editedShital Badaik
 
IEEE/RSJ IROS 2008 Real-time Tracker
IEEE/RSJ IROS 2008 Real-time TrackerIEEE/RSJ IROS 2008 Real-time Tracker
IEEE/RSJ IROS 2008 Real-time Trackerc.choi
 
Design and Verification of Area Efficient Carry Select Adder
Design and Verification of Area Efficient Carry Select AdderDesign and Verification of Area Efficient Carry Select Adder
Design and Verification of Area Efficient Carry Select Adderijsrd.com
 
32-bit unsigned multiplier by using CSLA & CLAA
32-bit unsigned multiplier by using CSLA &  CLAA32-bit unsigned multiplier by using CSLA &  CLAA
32-bit unsigned multiplier by using CSLA & CLAAGanesh Sambasivarao
 
Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)TanvirAhammed22
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaopenseesdays
 
Design and implementation of low power
Design and implementation of low powerDesign and implementation of low power
Design and implementation of low powerSurendra Bommavarapu
 
Extracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated applicationExtracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated applicationJônatas Paganini
 
SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...
SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...
SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...Mohamed Elhariry
 
Implementation of Low Power and Area Efficient Carry Select Adder
Implementation of Low Power and Area Efficient Carry Select AdderImplementation of Low Power and Area Efficient Carry Select Adder
Implementation of Low Power and Area Efficient Carry Select Adderinventionjournals
 
Karnaugh map or K-map method
Karnaugh map or K-map methodKarnaugh map or K-map method
Karnaugh map or K-map methodAbdullah Moin
 

What's hot (20)

2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...
DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...
DESIGN OF SIMULATION DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SAIKIR...
 
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
DESIGN AND SIMULATION OF DIFFERENT 8-BIT MULTIPLIERS USING VERILOG CODE BY SA...
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
Machine learning
Machine learningMachine learning
Machine learning
 
8 Bit A L U
8 Bit  A L U8 Bit  A L U
8 Bit A L U
 
Transformer Mods for Document Length Inputs
Transformer Mods for Document Length InputsTransformer Mods for Document Length Inputs
Transformer Mods for Document Length Inputs
 
DESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_edited
DESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_editedDESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_edited
DESIGN AND PERFORMANCE ANALYSIS OF BINARY ADDERS_edited
 
IEEE/RSJ IROS 2008 Real-time Tracker
IEEE/RSJ IROS 2008 Real-time TrackerIEEE/RSJ IROS 2008 Real-time Tracker
IEEE/RSJ IROS 2008 Real-time Tracker
 
Design and Verification of Area Efficient Carry Select Adder
Design and Verification of Area Efficient Carry Select AdderDesign and Verification of Area Efficient Carry Select Adder
Design and Verification of Area Efficient Carry Select Adder
 
32-bit unsigned multiplier by using CSLA & CLAA
32-bit unsigned multiplier by using CSLA &  CLAA32-bit unsigned multiplier by using CSLA &  CLAA
32-bit unsigned multiplier by using CSLA & CLAA
 
Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)Algorithm(BFS, PRIM, DIJKSTRA, LCS)
Algorithm(BFS, PRIM, DIJKSTRA, LCS)
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
Design and implementation of low power
Design and implementation of low powerDesign and implementation of low power
Design and implementation of low power
 
Extracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated applicationExtracting a Rails Engine to a separated application
Extracting a Rails Engine to a separated application
 
SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...
SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...
SMiLE: Design and Development of an ISS Payload for Liquid Behavior Study in ...
 
Implementation of Low Power and Area Efficient Carry Select Adder
Implementation of Low Power and Area Efficient Carry Select AdderImplementation of Low Power and Area Efficient Carry Select Adder
Implementation of Low Power and Area Efficient Carry Select Adder
 
Intro to Elixir
Intro to ElixirIntro to Elixir
Intro to Elixir
 
Keep Calm and Distributed Tracing
Keep Calm and Distributed TracingKeep Calm and Distributed Tracing
Keep Calm and Distributed Tracing
 
Karnaugh map or K-map method
Karnaugh map or K-map methodKarnaugh map or K-map method
Karnaugh map or K-map method
 

Similar to Designing a machine learning algorithm for Apache Spark

Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonDataWorks Summit
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...jsvetter
 
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...inside-BigData.com
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
 
Hao hsiang ma resume
Hao hsiang ma resumeHao hsiang ma resume
Hao hsiang ma resumeEliot Ma
 
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発Ryo 亮 Kawahara 河原
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systemsinside-BigData.com
 
[E-Dev-Day 2014][14/16] Adding vector graphics support to EFL
[E-Dev-Day 2014][14/16] Adding vector graphics support to EFL[E-Dev-Day 2014][14/16] Adding vector graphics support to EFL
[E-Dev-Day 2014][14/16] Adding vector graphics support to EFLEnlightenmentProject
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomFacultad de Informática UCM
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersIntel® Software
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsDataWorks Summit
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5Jeff Larkin
 
Scafi: Scala with Computational Fields
Scafi: Scala with Computational FieldsScafi: Scala with Computational Fields
Scafi: Scala with Computational FieldsRoberto Casadei
 

Similar to Designing a machine learning algorithm for Apache Spark (20)

Apache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with PythonApache Spark 2.3 boosts advanced analytics and deep learning with Python
Apache Spark 2.3 boosts advanced analytics and deep learning with Python
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
eScience Cluster Arch. Overview
eScience Cluster Arch. OvervieweScience Cluster Arch. Overview
eScience Cluster Arch. Overview
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...
Abstractions and Directives for Adapting Wavefront Algorithms to Future Archi...
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
Hao hsiang ma resume
Hao hsiang ma resumeHao hsiang ma resume
Hao hsiang ma resume
 
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
Apache Sparkを用いたスケーラブルな時系列データの異常検知モデル学習ソフトウェアの開発
 
High-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systems
 
[E-Dev-Day 2014][14/16] Adding vector graphics support to EFL
[E-Dev-Day 2014][14/16] Adding vector graphics support to EFL[E-Dev-Day 2014][14/16] Adding vector graphics support to EFL
[E-Dev-Day 2014][14/16] Adding vector graphics support to EFL
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science Project
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
A Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing ClustersA Library for Emerging High-Performance Computing Clusters
A Library for Emerging High-Performance Computing Clusters
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
DhevendranResume
DhevendranResumeDhevendranResume
DhevendranResume
 
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
GTC16 - S6410 - Comparing OpenACC 2.5 and OpenMP 4.5
 
Scafi: Scala with Computational Fields
Scafi: Scala with Computational FieldsScafi: Scala with Computational Fields
Scafi: Scala with Computational Fields
 

Recently uploaded

Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfSayantanBiswas37
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 

Recently uploaded (20)

Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...

Designing a machine learning algorithm for Apache Spark

  • 1. Designing a Machine Learning algorithm for Apache Spark. Marco Gaido, Software Engineer and Apache Spark contributor. 2017-10-17
  • 2. Agenda: Apache Spark and Machine Learning (© Hortonworks Inc. 2011 – 2017. All Rights Reserved)
  • 3. What is Apache Spark?
      • A fast and general-purpose cluster computing system
          – Fast because it allows in-memory computing
      • It was created for Machine Learning algorithms
          – They are very slow on MapReduce
          – They are iterative
      • Easy to use
          – Users can implement their business logic using a high-level API
          – Several APIs: Scala, Java, Python, SQL, R
      • 4 main modules built on top of it:
          – Spark Streaming
          – SparkSQL
          – MLLib
          – GraphX
  • 4. MLLib
      • A complete ML library, which aims to cover all ML phases
          – Featurization
          – Training
          – Evaluation
          – Persistence
          – Prediction
      • High-level API
      • Great performance
  • 5. Agenda: Apache Spark and Machine Learning; Implementing an algorithm on Apache Spark
  • 6. How to write an ML algorithm in MLLib?
      • Spark is open source: anybody can contribute or create their own version
      • As easy as rewriting the implementation using RDDs or DataFrames
      • Trivial implementations of many algorithms can be written in a few lines of code
      • Yet many well-known algorithms are still missing… WHY?
  • 7. DBSCAN
      • DBSCAN is a widespread density-based clustering algorithm
          – Two inputs: a radius (ε) and a number of points (minPts), used to decide whether an area is dense or sparse
      • Naïve implementation:
          – Find the ε (eps) neighbors of a point p
          – If there are at least minPts of them:
              • If p already belongs to a cluster, assign the neighbors to the same cluster
              • Otherwise, create a new cluster containing p and its neighbors
          – Repeat until all points have been processed
      • Computational complexity: O(N²) in computation or memory
      • A parallel (and reliable) implementation is not trivial at all
      • [Figure: diagram with labels 3, A, B, C]
  • 8. Agenda: Apache Spark and Machine Learning; Implementing an algorithm on Apache Spark; Designing an algorithm for Apache Spark
  • 9. Key points
      • Shared state should be small (or there should be no shared state at all)
          – It has to be kept in memory on all the executors
      • The target computational complexity is O(N/W), where W is the number of executors
          – This gives (ideally) infinite scalability: a larger N can be handled by adding executors
          – O(N²) is not suitable for Big Data (1M input records become 1T to be analyzed, 1T become 1Y)
      • Iterating multiple times over the same dataset is fine
          – The dataset can be cached in memory
  • 10. An example: Silhouette
      • SPARK-14516: to be introduced in the next Apache Spark release (2.3.0)
      • A measure of the quality of a clustering result
      • Implementation of the Silhouette algorithm using the squared Euclidean distance
      • References:
          – Design document: https://goo.gl/7cJV64
          – Code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
  • 11. Silhouette
      • Definition
          – For each datum i, compute the average dissimilarity to all the other data in the same cluster (a(i))
          – Compute the average dissimilarity to each of the other clusters and pick the smallest one (b(i))
          – Then compute the Silhouette coefficient for i: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
          – Compute the average of the Silhouette coefficients over all points
      • Computational complexity
          – O(N²): for each point, we need to compute its distance to all the other points
  • 12. Squared Euclidean Silhouette
      • The problem is computing the average distance of a point X to a cluster C, where $c_i$ are the $N_C$ points of C and D is the number of dimensions:
        $$\frac{\sum_{i=1}^{N_C} \sum_{j=1}^{D} (x_j - c_{ij})^2}{N_C}$$
      • … after some old but gold algebra …
        $$\frac{N_C \, \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j}{N_C}$$
      • Where $\xi_X$ is a constant which can be precomputed for each point X, and $\Psi_C$, $Y_C$, $N_C$ are constants (actually $Y_C$ is a vector) precomputed for each cluster
  • 13. Squared Euclidean Silhouette (2)
      • With the previous equation, each point's Silhouette coefficient can be computed without computing the distance to all the other points
          – We precompute the cluster values (i.e. the state)
          – We apply the formula below to each point, for all the clusters
          – We compute the average of the Silhouette coefficients
        $$\frac{N_C \, \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j}{N_C}$$
      • We can assume the number of clusters is rather small
          – Then, our shared state is small
      • The overall complexity is O(N C D / W)
          – We can assume that C and D are much lower than N, hence O(N/W) → infinite scalability
      • Precomputed state, one entry per cluster:
          – C1: $\Psi_{C_1}$, $Y_{C_1}$, $N_{C_1}$
          – C2: $\Psi_{C_2}$, $Y_{C_2}$, $N_{C_2}$
          – C3: $\Psi_{C_3}$, $Y_{C_3}$, $N_{C_3}$
  • 14. Performance comparison
      • [Figure: single-thread tests on different datasets. x-axis: dataset cardinality (N), 0 to 160000; y-axis: time (seconds), logarithmic scale from 1 to 10000; series: Naïve Silhouette, Squared Euclidean Silhouette]
  • 15. Agenda: Apache Spark and Machine Learning; Implementing an algorithm on Apache Spark; Designing an algorithm for Apache Spark; Takeaways
  • 16. Takeaways
      • Think: design your algorithms for Apache Spark
          – Don't just implement them with Spark
      • In everything you do, you must consider parallelism
      • Shared state and shared information are a bottleneck to scalability
          – Keep them small!
      • If your algorithm is O(N²), rethink it
  • 17. Thank You, Q&A
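
The "old but gold algebra" on slide 12 is not spelled out in the transcript. A sketch of it follows. The definitions of $\xi_X$, $\Psi_C$ and $Y_C$ used here are not given on the slide: they are inferred from expanding the square, so they should be read as assumptions and checked against the design document.

```latex
\begin{align*}
\frac{1}{N_C}\sum_{i=1}^{N_C}\sum_{j=1}^{D}(x_j - c_{ij})^2
&= \frac{1}{N_C}\left(
     \sum_{i=1}^{N_C}\sum_{j=1}^{D} x_j^2
   + \sum_{i=1}^{N_C}\sum_{j=1}^{D} c_{ij}^2
   - 2\sum_{j=1}^{D} x_j \sum_{i=1}^{N_C} c_{ij}
   \right) \\
&= \frac{N_C\,\xi_X + \Psi_C - 2\sum_{j=1}^{D} Y_{C_j}\,x_j}{N_C} \\[6pt]
\text{with}\quad
\xi_X &= \sum_{j=1}^{D} x_j^2, \qquad
\Psi_C = \sum_{i=1}^{N_C}\sum_{j=1}^{D} c_{ij}^2, \qquad
Y_{C_j} = \sum_{i=1}^{N_C} c_{ij}
\end{align*}
```

Under these definitions $\xi_X$ depends only on the point and $\Psi_C$, $Y_C$, $N_C$ only on the cluster, which is why the per-point cost drops from O(N D) to O(C D).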

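The naïve DBSCAN procedure on slide 7 can be sketched in plain Python as follows. This is a single-machine illustration, not Spark code, and it uses the usual cluster-expansion formulation rather than following the slide's wording step by step. The names `region_query` and `naive_dbscan`, the `NOISE` label and the choice of Euclidean distance are assumptions made for the example. The neighbor search scans all points for every point, which is where the O(N²) cost comes from.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return the indices of all points within distance eps of points[i] (O(N))."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def naive_dbscan(points, eps, min_pts):
    """Label every point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not processed yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # sparse area; may later turn out to be a border point
            continue
        cluster += 1  # dense area around an unassigned point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense area
            if labels[j] is not None:
                continue  # already assigned, nothing to expand
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is in a dense area too: keep expanding
    return labels
```

For example, `naive_dbscan([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)], eps=1.5, min_pts=3)` groups the first three points into one cluster, the next three into another, and marks the last point as noise. The `labels` list is exactly the kind of shared, mutable state that makes a parallel version non-trivial.
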
Editor's Notes

  1. High-level API
      – DataFrame abstraction
      – Designed to be used by data scientists without Spark knowledge
      – But Spark expertise is needed to get good performance
     Great performance
      – Parallel and scalable
      – In-memory caching and computing
  2. With the previous equation, each point's Silhouette coefficient can be computed without computing the distance to all the other points.
      – We precompute the needed values for the clusters (i.e. we precompute our state)
          • For each cluster we need to compute 2 constants and one vector
          • We can assume the number of clusters is rather small; then, our shared state is small
      – We compute the above formula for all the clusters for each point and, with these average distances, we compute the Silhouette coefficient for each point
      – The average of all the Silhouette coefficients is computed
     Thus, the computational complexity of these three steps is, respectively:
      – O(N D / W): it requires a one-pass aggregation over the entire dataset
      – O(N C D / W): for each point we compute the average distance to all the clusters
      – O(N / W): it requires a one-pass aggregation over the entire dataset
     We need 2 passes over the dataset:
      – one to precompute the state
      – one to compute the coefficients and their average
  3. The comparison is fair: no parallelism is exploited, so the speed-up is due only to the lower computational complexity. These tests use small datasets; our implementation also makes it possible to compute the Silhouette over larger ones.
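
The two passes described in note 2 can be sketched in plain Python as follows. This is a single-machine illustration of the idea, not the code of `ClusteringEvaluator`. The function names, the dictionary-based state and the definitions ξ_X = Σ_j x_j², Ψ_C = Σ_i Σ_j c_ij², Y_C = Σ_i c_i are assumptions chosen to be consistent with the formula on slide 12. For a point's own cluster the sketch divides by N_C − 1 so that the point is excluded from its own average, and it gives a coefficient of 0 to points in singleton clusters; both choices are assumptions as well. It expects at least two clusters and points that do not all coincide.

```python
from collections import defaultdict


def cluster_stats(points, labels):
    """First pass: build the per-cluster state (N_C, Psi_C, Y_C)."""
    dim = len(points[0])
    stats = defaultdict(lambda: {"n": 0, "psi": 0.0, "y": [0.0] * dim})
    for x, c in zip(points, labels):
        s = stats[c]
        s["n"] += 1                                   # N_C: number of points
        s["psi"] += sum(v * v for v in x)             # Psi_C: sum of squared norms
        s["y"] = [a + b for a, b in zip(s["y"], x)]   # Y_C: component-wise sum
    return dict(stats)


def total_sq_dist(x, xi, s):
    """Sum of squared distances from x to every point of one cluster, in O(D)."""
    dot = sum(yj * xj for yj, xj in zip(s["y"], x))
    return s["n"] * xi + s["psi"] - 2.0 * dot


def silhouette(points, labels):
    """Second pass: mean Silhouette coefficient under squared Euclidean distance."""
    stats = cluster_stats(points, labels)
    total = 0.0
    for x, c in zip(points, labels):
        own = stats[c]
        if own["n"] == 1:
            continue  # assumption: a singleton cluster contributes a coefficient of 0
        xi = sum(v * v for v in x)  # xi_X, computed once per point
        # x is part of its own cluster and adds distance 0, so divide by N_C - 1
        a = total_sq_dist(x, xi, own) / (own["n"] - 1)
        b = min(total_sq_dist(x, xi, s) / s["n"]
                for k, s in stats.items() if k != c)
        total += (b - a) / max(a, b)
    return total / len(points)
```

The per-cluster dictionary plays the role of the small shared state: in a distributed setting it is the only thing every worker would need, while each point is processed independently in the second pass.
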