Spark Under the Hood
Meetup @ Data Science London
Aug 27, 2015
Who are we?
Sameer Farooqui
•  Trainer @ Databricks
•  150+ trainings on Hadoop, C*, HBase, Couchbase, NoSQL, etc.
Doug Bateman
•  Dir of Training @ NewCircle
•  Spark Trainer for Databricks
•  800+ trainings on Java, Python, Android, Hibernate, Spring, etc.
Jon Bates
•  Data Scientist
•  Consultant for Databricks
•  EdX assistant instructor on Scalable ML w/ Spark
Agenda: Talks
Sameer Farooqui (15 mins):
•  Intro & Spark Overview
Doug Bateman (25 mins):
•  Power Plant Demo
•  ETL + Linear Regression
Jon Bates (25 mins):
•  Iris Flower Demo
•  Model Parallel w/ scikit-learn
Agenda: Q & A (30 mins)
Plus:
Lars Francke
•  Consulting Architect for Cloudera
•  Cluster setup, Security/Kerberos, Hive, Impala, HBase, Spark
•  Based in Germany
Stephane Rion
•  R, scikit-learn, Spark, Mahout, HBase, Hive, Pig
•  Senior Data Scientist @ Big Data Partnership + Spark Trainer for DB
•  Based in London
Who are you?
1) I have used Spark hands-on before…
2) I have more than 1 year of hands-on experience with ML…
Spark stack (diagram, built up over four slides):
•  Spark Core
•  Spark Streaming, Spark SQL, MLlib, GraphX
•  DataFrames, ML Pipelines
•  Data Sources: {JSON}
Goal: unified engine across data sources,
workloads and environments
Spark – 100% open source and mature
Used in production by over 500 organizations, from Fortune 100 companies to small innovators
Apache Spark: Large user community
[Chart: Commits in the past year for MapReduce, YARN, HDFS, Storm and Spark; axis from 0 to 4000]
[Chart: Contributors per Month to Spark, 2011 to 2015; axis from 0 to 140]
Most active project in big data
Large-Scale Usage
Largest cluster: 8000 nodes
Largest single job: 1 petabyte
Top streaming intake: 1 TB/hour
2014 on-disk 100 TB sort record
On-Disk Sort Record: Time to sort 100TB
•  2013 Record: Hadoop, 2100 machines, 72 minutes
•  2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
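The two records can be compared with a little arithmetic. The sketch below uses only the figures quoted on the slide; "machine-minutes" is an assumed, crude cost proxy that ignores hardware differences between the two clusters.

```python
# Figures as quoted on the slide (Daytona GraySort, 100 TB on disk).
hadoop_machines, hadoop_minutes = 2100, 72   # 2013 record
spark_machines, spark_minutes = 207, 23      # 2014 record

# Wall-clock speedup and cluster-size reduction.
speedup = hadoop_minutes / spark_minutes
machine_ratio = hadoop_machines / spark_machines

# Machine-minutes: a rough proxy for total resources consumed (assumption,
# not a metric used by the benchmark itself).
hadoop_cost = hadoop_machines * hadoop_minutes
spark_cost = spark_machines * spark_minutes
cost_ratio = hadoop_cost / spark_cost

print(f"speedup: {speedup:.2f}x, machines: {machine_ratio:.2f}x fewer, "
      f"machine-minutes: {cost_ratio:.2f}x fewer")
```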
2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500+ active production deployments
The Databricks team contributed more than 75% of the code added to Spark in the past year
Overview of ML Algorithms
Feature Transformation:
•  Tokenizer, HashingTF, IDF, Word2Vec, Normalizer, StandardScaler
Prediction (Regression, Classification):
•  LinearRegression, DecisionTree, SVM, LogisticRegression, NaiveBayes
Recommendation: ALS
Clustering: KMeans, GaussianMixtureEM, LDA
Overview of ML Algorithms
Other:
•  Statistics: Correlation, ChiSqTest, Statistics, MultivariateOnlineSummarizer
•  Linear Algebra: RowMatrix, EigenValueDecomposition, Matrix, Vector
•  Optimization: GradientDescent, LBFGS
 
Spark Physical Cluster
[Diagram: one Spark Driver and four Executors, each Executor running two Tasks]
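Spark Data Model
[Diagram: logLinesRDD, an RDD / DataFrame with 4 partitions of three log records each]
•  Partition 1: (Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1)
•  Partition 2: (Info, ts, msg8), (Warn, ts, msg2), (Info, ts, msg8)
•  Partition 3: (Error, ts, msg3), (Info, ts, msg5), (Info, ts, msg5)
•  Partition 4: (Error, ts, msg4), (Warn, ts, msg9), (Error, ts, msg1)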
Spark Data Model
[Diagram: item-1 through item-10 held as five RDD partitions spread across three executors (Ex)]
more partitions = more parallelism
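A rough sketch of why the partition count bounds parallelism, using a thread pool as a stand-in for executors. Nothing here is Spark code: `split_into_partitions` and `run_tasks` are hypothetical helpers, and the round-robin split is an assumption chosen for simplicity, not how Spark assigns records.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(items, num_partitions):
    """Round-robin split of a list into num_partitions chunks."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def run_tasks(partitions, task):
    """Run one task per partition; at most len(partitions) tasks can run at once."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(task, partitions))

items = [f"item-{i}" for i in range(1, 11)]

# Five partitions means up to five tasks in flight; one partition would
# force the same work through a single task.
partitions = split_into_partitions(items, 5)
sizes = run_tasks(partitions, len)

print(sizes)
```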
  
Power Plant Demo
Use Case: predict power output given a set of readings from various
sensors in a gas-fired power generation plant
Schema Definition:
AT = Atmospheric Temperature in C
V = Exhaust Vacuum Speed
AP = Atmospheric Pressure
RH = Relative Humidity
PE = Power Output (value we are trying to predict)
Steps:
1.  ETL
2.  Explore + Visualize Data
3.  Apply Machine Learning
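The three steps can be sketched end to end in plain Python. This is not the demo notebook: the rows below are made up for illustration (PE is generated as exactly 500 - 2 * AT), only AT is used as a predictor to keep the sketch short, and `fit_simple_ols` is a hypothetical helper rather than an MLlib call.

```python
import csv
import io
from statistics import mean

# Made-up rows for illustration only, NOT real plant readings.
# PE is constructed as exactly 500 - 2 * AT so the fit is easy to check.
RAW = """AT,V,AP,RH,PE
10.0,40.0,1010.0,70.0,480.0
20.0,50.0,1012.0,60.0,460.0
30.0,60.0,1008.0,50.0,440.0
15.0,45.0,1015.0,65.0,470.0
"""

# 1. ETL: parse the text into typed records.
rows = [{k: float(v) for k, v in r.items()}
        for r in csv.DictReader(io.StringIO(RAW))]

# 2. Explore: min / mean / max per column.
summary = {
    col: (min(r[col] for r in rows),
          mean(r[col] for r in rows),
          max(r[col] for r in rows))
    for col in rows[0]
}

# 3. Machine learning: least-squares fit of PE = intercept + slope * AT.
def fit_simple_ols(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_simple_ols([r["AT"] for r in rows],
                                  [r["PE"] for r in rows])
print(summary["AT"], intercept, slope)
```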
Iris Flower Demo
Use Case: Link legacy code to Spark
Different ways to parallelize ML
•  Model Parallelism
•  Divide & Conquer
•  Data Parallelism
Model Parallelism
•  Model stored across workers
•  Communicate data to all workers
•  Examples:
•  Grid search
•  Cross validation
•  Ensemble
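Grid search is a convenient illustration of model parallelism: every worker sees the full data set and trains one candidate model. The sketch below is an assumption-laden stand-in, using a thread pool instead of Spark and a hand-written 1-D ridge regression instead of scikit-learn; the data is made up (y = 3 * x) and `fit_ridge` / `evaluate` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up data for illustration: y = 3 * x exactly.
train = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
valid = [(4.0, 12.0), (5.0, 15.0)]

def fit_ridge(data, lam):
    """1-D ridge regression without intercept: w = sum(xy) / (sum(xx) + lam)."""
    return sum(x * y for x, y in data) / (sum(x * x for x, _ in data) + lam)

def evaluate(lam):
    """One task: train one candidate on the full data, score it on validation."""
    w = fit_ridge(train, lam)
    mse = sum((w * x - y) ** 2 for x, y in valid) / len(valid)
    return lam, mse

# The hyperparameter grid is what gets distributed; the data goes to every task.
grid = [0.0, 0.1, 1.0, 10.0]
with ThreadPoolExecutor(max_workers=len(grid)) as pool:
    results = list(pool.map(evaluate, grid))

best_lam, best_mse = min(results, key=lambda r: r[1])
print(best_lam, best_mse)
```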
Divide & Conquer
•  Minimizes communication
•  Leads to approximate solutions
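A small sketch of the divide-and-conquer trade-off: each partition fits its own model with no communication, and the local models are averaged at the end. The data is made up for illustration and `fit_slope` is a hypothetical helper; the point is only that the averaged result is close to, but not equal to, the fit on all the data.

```python
def fit_slope(data):
    """Least-squares slope through the origin: w = sum(xy) / sum(xx)."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

# Made-up, slightly noisy points split across two partitions.
partitions = [
    [(1.0, 2.0), (2.0, 4.2)],
    [(3.0, 5.7), (4.0, 8.4)],
]

# Divide: fit independently on each partition (no data exchanged).
local_slopes = [fit_slope(p) for p in partitions]

# Conquer: a single averaging step is the only communication.
averaged = sum(local_slopes) / len(local_slopes)

# Reference: the exact fit on all the data at once.
exact = fit_slope([pt for p in partitions for pt in p])
gap = abs(averaged - exact)

print(averaged, exact, gap)
```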
Data Parallelism
•  Data stored across workers
•  Communicate model to all workers
•  Examples:
•  MLlib linear models
•  Matrix outer products
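Data parallelism can be sketched with gradient descent on a linear model: the data stays in its partitions, the current model is sent to every partition, and only partial gradients come back. This is a plain-Python illustration under assumed, made-up data (y = 2 * x), not MLlib's implementation; `partial_gradient` is a hypothetical helper.

```python
# Made-up data for illustration: y = 2 * x, split across two partitions.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]
n = sum(len(p) for p in partitions)

def partial_gradient(w, partition):
    """Gradient contribution of one partition for squared error on y ~ w * x."""
    return sum((w * x - y) * x for x, y in partition)

w = 0.0              # the model: a single weight, broadcast each iteration
learning_rate = 0.1

for _ in range(50):
    # Each partition computes on its local data using the shared model w ...
    partials = [partial_gradient(w, p) for p in partitions]
    # ... and only these small partial results are combined for the update.
    w -= learning_rate * sum(partials) / n

print(w)
```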
Scalability Rules
1st Rule of thumb
Computation & Storage should be linear (in n, d )
2nd Rule of thumb
Perform parallel and in-memory computation
3rd Rule of thumb
Minimize Network Communication
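One way to read the three rules together is a mergeable one-pass summary: each partition is scanned once (linear computation, constant storage), partitions can be summarized in parallel, and only three numbers per partition cross the network. The sketch below uses Welford's update and the standard pairwise merge formula; the data is made up and `summarize` / `merge` are hypothetical names, not the MultivariateOnlineSummarizer API.

```python
from functools import reduce

def summarize(values):
    """One pass over a partition, keeping only (count, mean, M2)."""
    n, mean, m2 = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
    return n, mean, m2

def merge(a, b):
    """Combine two partition summaries without revisiting the data."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

# Made-up values for illustration, split across three partitions.
partitions = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]

count, mean, m2 = reduce(merge, (summarize(p) for p in partitions))
variance = m2 / count   # population variance

print(count, mean, variance)
```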
Agenda: Q & A (30 mins)
Stephane Rion
Lars Francke
Sameer Farooqui
Doug Bateman
Jon Bates
