Abstract: Most real-world data science workflows require more than the multiple cores of a single server to meet scale and speed demands, yet there is a general lack of understanding of what machine learning on distributed systems looks like in practice. Gartner and Forrester do not consider distributed execution when they score advanced analytics software solutions. Much formal machine learning training occurs on single-node machines with non-distributed algorithms. In this talk we discuss why an understanding of distributed architectures is important for anyone in the analytical sciences. We will cover the current distributed machine learning ecosystem, review common pitfalls when performing machine learning at scale, and discuss architectural considerations for a machine learning program, such as the role of storage and compute and under what circumstances they should be combined or separated.
3. THE MACHINE LEARNING AT SCALE LANDSCAPE
• MPP (Massively Parallel Processing) Environment
• Distributed Execution
• Different Math
• Pre-Built Machine Learning Functions (not just a dev environment)
• Able to build models on truly large datasets (>> 1B rows and >> 100 columns) without running out of memory or taking days to run
[Diagram: a cluster of nodes — Node 1, Node 2, …, Node n]
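The "Different Math" bullet above can be illustrated with a minimal, engine-agnostic sketch in plain Python: each node computes small partial aggregates over its shard of the data, and a coordinator combines only those aggregates, so raw rows never leave a node. All names here are illustrative and not tied to any particular product.

```python
# Sketch of distributed aggregation: per-node partial statistics are
# combined on a coordinator into exact global results, without moving rows.

def partial_stats(partition):
    """Per-node pass: count, sum, and sum of squares for one partition."""
    n = len(partition)
    s = sum(partition)
    ss = sum(x * x for x in partition)
    return n, s, ss

def combine(stats):
    """Coordinator pass: merge partial aggregates into global mean/variance."""
    n = sum(t[0] for t in stats)
    s = sum(t[1] for t in stats)
    ss = sum(t[2] for t in stats)
    mean = s / n
    var = ss / n - mean * mean
    return mean, var

# Three "nodes", each holding a shard of the data.
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean, var = combine([partial_stats(p) for p in partitions])
```

The same partial-aggregate reformulation underlies distributed versions of many algorithms (e.g. gradient computations for regression), which is why distributed implementations require genuinely different math, not just more cores.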
4. THE MACHINE LEARNING AT SCALE LANDSCAPE
• How many of these solutions offer distributed machine learning?
5. THE MACHINE LEARNING AT SCALE LANDSCAPE
• The machine learning at scale players
• Spark
• H2O
• Revolution (now part of Microsoft)
• MADLib (Greenplum)
• Vertica
• Fuzzy Logix (Netezza)
6. THE MACHINE LEARNING AT SCALE LANDSCAPE
• Distributed Analytical Compute Engines
• Spark
• H2O
• Revolution (now part of Microsoft)
• MPP Analytical Data Marts
• MADLib (Greenplum)
• Vertica
• Fuzzy Logix (Netezza)
[Diagram: CE = Compute Engine, DM = Data Mart]
8. DISTRIBUTED ANALYTIC COMPUTE ENGINE ARCHITECTURE
• Scaling architecture/Commodity hardware
• Adapts to any data storage location
• All data types welcome
• Custom ingest and data prep
• Built-in visual data discovery
• Customized analytics via programming APIs
• Deepest and widest distributed analytical libraries available
9. MPP ANALYTIC DATA MART ARCHITECTURE (VERTICA EXAMPLE)
[Diagram: user-defined loads; user-defined functions; BI & visualization; connectivity via ODBC, JDBC, OLEDB, and messaging; data transformation/ETL; user-defined storage; security; external tables to analyze in place; language support for R, Java, Python, and SQL; analytics including geospatial, real-time, text analytics, event series, pattern matching, time series, machine learning, and regression]
10. MPP ANALYTIC DATA MART ARCHITECTURE
• Scaling architecture/Commodity hardware or appliance
• Built-in data storage
• Advanced storage techniques for (semi-)structured data
• Fastest analytics at scale available (if optimized for data layer)
• Fastest streaming analytics/ingest available
• Efficient use of disk when needed so not memory bound
• Short development times and wide reach via SQL
• All enterprise features come out-of-the-box (security, HA, DR, resource management, ACID)
• High concurrency is built-in
12. QUICK MENTION OF OTHER TECHNOLOGIES
• NoSQL
• OLTP
• Cubes
• Batch vs Real Time and Lambda Architectures
• TensorFlow
• GPUs
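As a rough illustration of the Lambda-architecture idea mentioned above (batch vs. real time), here is a toy Python sketch in which queries merge a periodically recomputed batch view with an incrementally updated speed view. The metric names and view contents are made up for the example.

```python
# Toy Lambda-architecture serving layer: a batch view over historical data
# plus a speed view over recent events, merged at query time.

batch_view = {"clicks": 1000, "views": 5000}   # recomputed periodically over all history
speed_view = {"clicks": 7, "signups": 2}       # incremental, covers only recent events

def query(metric):
    """Serve a metric by merging the batch and speed views."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

result = query("clicks")  # historical count plus recent events
```

In a real deployment the batch view might live in an MPP data mart and the speed view in a streaming system; the point is that neither layer alone answers the query.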
13. COMMON ARCHITECTURE CONSIDERATIONS
[Diagram: the choice of Compute Engine or Data Mart is driven by four groups of considerations:]
• Data Demands: data size, data types, data location
• Organizational Resources: IT & data engineering, computer scientists, concurrency
• Accuracy Requirements: model agility, model depth, model variety
• Deployment Configuration: scoring volume & speed, landing results, embedding analytics
14. DATA DEMANDS
• Size
• Billions of rows. Hundreds of columns.
• Types
• Structured: RDBMS (feed or replacement), Business App Bulk Load
• Semi-Structured: Logs
• Unstructured: Text, Audio, Video
• Location
• EDW, Mainframe, RDBMS, Hard Drive
• HDFS, S3, ABS
15. ORGANIZATIONAL STRUCTURE
• IT & Data Engineering
• Control admin resource costs and complexity
• Control hardware costs
• Control software costs
• Control support costs
• Computer Scientists
• Control headcount
• Concurrency
• Built to support many analysts simultaneously
16. ACCURACY REQUIREMENTS
• Model Agility
• Iterate/experiment with data preparation strategies
• Deploying models quickly
• Model Depth
• Able to fine tune models
• Balance bias and variance
• Model Variety
• Test/ensemble a large diversity of modeling techniques
• Low-level model customization
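The model-variety point can be sketched in a few lines of Python: several diverse "models" (stand-in functions here, not any real library's API) score the same input, and a simple ensemble averages their predictions. All model names and coefficients are invented for the illustration.

```python
# Toy ensemble over a diverse set of modeling techniques: each model scores
# the input independently, and the ensemble averages the predictions.

def model_linear(x):
    return 0.5 * x + 1.0

def model_threshold(x):
    return 1.0 if x > 2.0 else 0.0

def model_quadratic(x):
    return 0.1 * x * x

def ensemble(models, x):
    """Average the predictions of all models in the list."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

score = ensemble([model_linear, model_threshold, model_quadratic], 3.0)
```

Testing and ensembling many techniques like this is where low-level model customization pays off: each component model can be tuned independently before the ensemble combines them.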
17. DEPLOYMENT CONFIGURATION
• Scoring Volume & Speed
• Mechanisms for row by row scoring
• Speed of micro-batching and simultaneous scoring
• Landing Results
• Results in traditional business systems
• Results in cloud or long term storage
• Embedding Analytics
• Low-touch embedded deployment and system duplication
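A minimal sketch of the micro-batching idea from the scoring bullets above, in plain Python with hypothetical function names (no vendor API is implied): incoming rows are grouped into small batches so a deployed model can amortize per-call overhead while still scoring in near real time.

```python
# Micro-batch scoring sketch: group a stream of rows into small batches,
# then score each batch; row-by-row scoring is the batch_size=1 case.

def score_row(row):
    """Stand-in for a deployed model's per-row scoring function."""
    return sum(row)

def micro_batches(rows, batch_size):
    """Yield successive micro-batches of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def score_stream(rows, batch_size=2):
    results = []
    for batch in micro_batches(rows, batch_size):
        # In practice this would be one call to the scoring service per batch.
        results.extend(score_row(r) for r in batch)
    return results

scores = score_stream([[1, 2], [3, 4], [5, 6]], batch_size=2)
```

The batch size is the tuning knob: larger batches raise throughput, smaller batches (down to one row) cut latency.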
18. ARCHITECTURE DECISION FLOW
[Decision-flow grid mapping each consideration to a Compute Engine (CE) and/or Data Mart (DM) recommendation:]
• Data Size: billions of rows and hundreds of columns
• Data Types: structured (RDBMS feed or replacement, business app bulk load); semi-structured (logs); unstructured (text, audio, video)
• Data Location: EDW, mainframe, RDBMS, hard drive; HDFS, S3, ABS
• IT & Data Engineering Resources: control admin resource costs and complexity and hardware costs; control support costs; control software costs
• Computer Scientists: control headcount
• Concurrency: built to support many analysts simultaneously
• Model Agility: iterating/experimenting with data preparation strategies and deploying models quickly
• Model Depth: able to fine-tune models and balance bias and variance
• Model Variety: test/ensemble a large diversity of modeling techniques and provide low-level model customization
• Scoring Volume & Speed: mechanisms for row-by-row scoring
• Landing Results: results in traditional business systems; results in cloud or long-term storage
• Embedding Analytics: low-touch embedded deployment and system duplication
19. LARGE CREDIT CARD REAL-TIME TRANSACTIONAL FRAUD SYSTEM
[Decision-flow grid from slide 18 repeated, with the considerations relevant to this use case highlighted]
20. MID-SIZED NETWORK ANALYTICS/CYBER SECURITY SYSTEM
[Decision-flow grid from slide 18 repeated, with the considerations relevant to this use case highlighted]
21. COMBINED ARCHITECTURE (HADOOP EXAMPLE)
• Most orgs have both already
• Evaluate your analytic needs
• Dedicate physical and personnel resources better
• Consider what data should be “hot” vs “cold”
• If the workload fits in a Distributed Analytical Data Mart you will speed up analytics, save at least 1/3 on hardware, free up …
[Diagram: Data Marts (Greenplum, Vertica, Netezza) alongside Compute Engines (Spark, H2O, Revolution)]
Editor's Notes
Josh Poduska is a Senior Data Scientist in HPE’s Big Data Software Group. Josh has 16 years of experience as a practitioner in the analytical sciences with an emphasis on machine learning and statistical applications. He spent the last six years focusing on advanced analytical solutions with MPP columnar databases. At HPE he is part of the Vertica team and uses Vertica and its machine learning library to help organizations solve their toughest data challenges.
Apache Spark is a cluster-computing framework that speeds up computation through in-memory processing and integrates easily thanks to the large Spark ecosystem. You can use a Spark cluster for tasks such as machine learning and graph computation by parallelizing them.
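The map-then-reduce pattern that Spark parallelizes across a cluster can be sketched in plain Python (run serially here; Spark would execute each map task on a different worker and keep intermediate data in memory). The function names are illustrative, not Spark's API.

```python
# Serial sketch of the map/reduce pattern a cluster framework distributes:
# apply a function to every element of each partition, then fold the results.

def map_phase(partitions, fn):
    """Apply fn to every element of every partition (parallel in a cluster)."""
    return [[fn(x) for x in part] for part in partitions]

def reduce_phase(mapped, op, initial):
    """Fold all mapped values into a single result."""
    result = initial
    for part in mapped:
        for x in part:
            result = op(result, x)
    return result

partitions = [[1, 2, 3], [4, 5], [6]]          # data sharded across "workers"
squared = map_phase(partitions, lambda x: x * x)
total = reduce_phase(squared, lambda a, b: a + b, 0)
```

In Spark the equivalent would be an RDD `map` followed by `reduce`; the framework's value is scheduling those map tasks across nodes and caching intermediates in memory.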
TensorFlow, in short, is a library developed by Google for improving the performance of numerical computation. It represents computation as a data-flow graph, where nodes denote operations and edges denote data arrays (tensors). Google recently released a distributed version of TensorFlow, so you can run TF in a distributed environment, including on Spark.
If I want to apply deep learning algorithms, I will use TensorFlow. For other data processing, I will use Spark.