Who are we?
Sameer Farooqui Doug Bateman Jon Bates
• Dir of Training @ NewCircle
• Spark Trainer for Databricks
• 800+ trainings on Java, Python, Android, Hibernate, Spring, etc.
• Trainer @ Databricks
• 150+ trainings on Hadoop, C*, HBase, Couchbase, NoSQL, etc.
• Data Scientist
• Consultant for Databricks
• EdX assistant instructor on Scalable ML w/ Spark
Agenda: Talks
Sameer Farooqui Doug Bateman Jon Bates
15 mins:
• Intro & Spark Overview
25 mins:
• Power Plant Demo
• ETL + Linear Regression
25 mins:
• Iris Flower Demo
• Model Parallel w/ scikit-learn
Agenda: Q & A (30 mins)
• Consulting Architect for Cloudera
• Cluster setup, Security/Kerberos, Hive, Impala, HBase, Spark
• Based in Germany
• R, scikit-learn, Spark, Mahout, HBase, Hive, Pig
• Senior Data Scientist @ Big Data Partnership + Spark Trainer for Databricks
• Based in London
Stephane Rion
Lars Francke
Who are you?
1) I have used Spark hands-on before…
2) I have more than 1 year of hands-on experience with ML…
Spark Data Model
[Figure: an RDD of ten items (item-1 through item-10) split into partitions that are spread across three executors (Ex)]
more partitions = more parallelism
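The idea that more partitions allow more parallelism can be sketched in plain Python. This is only an analogy, not the Spark API: the helper names `split_into_partitions` and `process_partition`, the choice of 5 partitions, and the 3-thread pool standing in for executors are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional item.
        size = base + (1 if i < extra else 0)
        partitions.append(items[start:start + size])
        start += size
    return partitions


def process_partition(partition):
    """Stand-in for the per-partition work a single task would do."""
    return [item.upper() for item in partition]


items = [f"item-{i}" for i in range(1, 11)]

# Assumed layout: 10 items in 5 partitions of 2 items each.
partitions = split_into_partitions(items, 5)

# A pool of 3 threads plays the role of 3 executors; each partition is one task.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_partition, partitions))
```

Each partition is an independent unit of work, so the number of partitions caps how many tasks can run at the same time. With a single partition, two of the three workers would sit idle.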
Use Case: predict power output given a set of readings from various
sensors in a gas-fired power generation plant
Schema Definition:
AT = Atmospheric Temperature in C
V = Exhaust Vacuum Speed
AP = Atmospheric Pressure
RH = Relative Humidity
PE = Power Output (value we are trying to predict)
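As a rough sketch of the ETL step for this schema, the snippet below parses one delimited row into a typed record and separates the four features from the label PE. The column order (AT, V, AP, RH, PE), the comma separator, the names `PowerPlantReading`, `parse_reading`, and `to_features_and_label`, and the sample values are all assumptions for illustration; the sample row is made up and not taken from the dataset.

```python
from typing import NamedTuple


class PowerPlantReading(NamedTuple):
    AT: float  # Atmospheric Temperature in C
    V: float   # Exhaust Vacuum Speed
    AP: float  # Atmospheric Pressure
    RH: float  # Relative Humidity
    PE: float  # Power Output (the label to predict)


def parse_reading(line, sep=","):
    """Parse one delimited text row into a PowerPlantReading."""
    fields = [float(value) for value in line.strip().split(sep)]
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}")
    return PowerPlantReading(*fields)


def to_features_and_label(reading):
    """Separate the four sensor readings (features) from the label PE."""
    return [reading.AT, reading.V, reading.AP, reading.RH], reading.PE


# Made-up row, for illustration only.
sample_line = "20.0,50.0,1010.0,70.0,450.0"
reading = parse_reading(sample_line)
features, label = to_features_and_label(reading)
```

Keeping the label separate from the features at this stage mirrors what a regression model expects as input: a feature vector per row plus the value to predict.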
Data Parallelism
• Data stored across workers
• Communicate model to all workers
• Examples:
• MLlib Linear models
• Matrix outer products
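A minimal sketch of data parallelism for a linear model, in plain Python rather than MLlib: the rows stay split across workers, the current weights are handed to every worker, each worker returns a partial gradient, and the partial gradients are summed to update the model. The function names, the learning rate, the iteration count, and the tiny synthetic data set (which follows y = 2x exactly) are assumptions for illustration.

```python
def partial_gradient(partition, weights):
    """Sum of squared-error gradients over the rows held by one worker."""
    grad = [0.0] * len(weights)
    for features, label in partition:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def data_parallel_step(partitions, weights, learning_rate):
    """One gradient-descent step: send weights out, sum partial gradients back."""
    # In a cluster, each call below would run on the worker that stores the partition.
    partials = [partial_gradient(p, weights) for p in partitions]
    n = sum(len(p) for p in partitions)
    total = [sum(g[j] for g in partials) for j in range(len(weights))]
    return [w - learning_rate * g / n for w, g in zip(weights, total)]


# Synthetic rows: features are [1.0, x] (1.0 acts as the intercept), label is 2 * x.
partitions = [
    [([1.0, 0.0], 0.0), ([1.0, 1.0], 2.0)],  # rows on worker 1
    [([1.0, 2.0], 4.0), ([1.0, 3.0], 6.0)],  # rows on worker 2
]

weights = [0.0, 0.0]
for _ in range(2000):
    weights = data_parallel_step(partitions, weights, learning_rate=0.1)
```

Only the weight vector and one gradient vector per worker cross the network in each step; the rows themselves never move.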
Scalability Rules
1st Rule of thumb
Computation & Storage should be linear (in n, d)
2nd Rule of thumb
Perform parallel and in-memory computation
3rd Rule of thumb
Minimize Network Communication
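The third rule can be illustrated with a small plain-Python sketch: computing a mean by reducing each partition to a fixed-size (sum, count) pair on the worker, so that only the pairs are sent over the network instead of every value. The names `local_summary` and `mean_from_summaries` and the three 1000-value partitions are assumptions for illustration.

```python
def local_summary(partition):
    """Reduce one partition to a fixed-size (sum, count) pair on the worker."""
    return (sum(partition), len(partition))


def mean_from_summaries(summaries):
    """Combine the per-partition pairs on the driver."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count


# Three partitions of 1000 values each, standing in for data on three workers.
partitions = [
    list(range(0, 1000)),
    list(range(1000, 2000)),
    list(range(2000, 3000)),
]

summaries = [local_summary(p) for p in partitions]
mean = mean_from_summaries(summaries)

# Values crossing the network under each strategy.
values_sent_naive = sum(len(p) for p in partitions)  # ship every value
values_sent_combined = 2 * len(summaries)            # ship one pair per partition
```

With local aggregation the traffic grows with the number of partitions rather than the number of records, which is the kind of saving the rule is after.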
Agenda: Q & A (30 mins)
Stephane Rion
Lars Francke
Sameer Farooqui
Doug Bateman
Jon Bates