Abstract: Most real-world data science workflows require more than the multiple cores of a single server to meet scale and speed demands, yet there is a general lack of understanding of what machine learning on distributed systems looks like in practice. Gartner and Forrester do not consider distributed execution when they score advanced analytics software solutions. Much formal machine learning training occurs on single-node machines with non-distributed algorithms. In this talk we discuss why an understanding of distributed architectures is important for anyone in the analytical sciences. We will cover the current distributed machine learning ecosystem, review common pitfalls when performing machine learning at scale, and discuss architectural considerations for a machine learning program, such as the role of storage and compute and under what circumstances they should be combined or separated.
3. THE MACHINE LEARNING AT SCALE LANDSCAPE
• MPP (Massively Parallel Processing) Environment
• Distributed Execution
• Different Math
• Pre-Built Machine Learning Functions (not just a dev environment)
• Able to build models on truly large datasets (>> 1B rows and >> 100 columns) without running out of memory or taking days to run
[Diagram: a cluster of nodes — Node 1, Node 2, …, Node n]
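The "Different Math" bullet above can be illustrated with a minimal, engine-agnostic sketch in plain Python: each node computes small partial aggregates over its shard of the data, and a coordinator combines only those aggregates, so raw rows never leave a node. All names here are illustrative and not tied to any particular product.

```python
# Sketch of distributed aggregation: per-node partial statistics are
# combined on a coordinator into exact global results, without moving rows.

def partial_stats(partition):
    """Per-node pass: count, sum, and sum of squares for one partition."""
    n = len(partition)
    s = sum(partition)
    ss = sum(x * x for x in partition)
    return n, s, ss

def combine(stats):
    """Coordinator pass: merge partial aggregates into global mean/variance."""
    n = sum(t[0] for t in stats)
    s = sum(t[1] for t in stats)
    ss = sum(t[2] for t in stats)
    mean = s / n
    var = ss / n - mean * mean
    return mean, var

# Three "nodes", each holding a shard of the data.
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean, var = combine([partial_stats(p) for p in partitions])
```

The same partial-aggregate reformulation underlies distributed versions of many algorithms (e.g. gradient computations for regression), which is why distributed implementations require genuinely different math, not just more cores.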
4. THE MACHINE LEARNING AT SCALE LANDSCAPE
• How many of these solutions offer distributed machine learning?
5. THE MACHINE LEARNING AT SCALE LANDSCAPE
• The machine learning at scale players
• Spark
• H2O
• Revolution (now part of Microsoft)
• MADLib (Greenplum)
• Vertica
• Fuzzy Logix (Netezza)
6. THE MACHINE LEARNING AT SCALE LANDSCAPE
• Distributed Analytical Compute Engines
• Spark
• H2O
• Revolution (now part of Microsoft)
• MPP Analytical Data Marts
• MADLib (Greenplum)
• Vertica
• Fuzzy Logix (Netezza)
[Diagram: CE = Compute Engine, DM = Data Mart]
8. DISTRIBUTED ANALYTIC COMPUTE ENGINE ARCHITECTURE
• Scaling architecture/Commodity hardware
• Adapts to any data storage location
• All data types welcome
• Custom ingest and data prep
• Built-in visual data discovery
• Customized analytics via programming APIs
• Deepest and widest distributed analytical libraries available
9. MPP ANALYTIC DATA MART ARCHITECTURE (VERTICA EXAMPLE)
[Diagram: user-defined loads; user-defined functions; BI & visualization; connectivity via ODBC, JDBC, OLEDB, and messaging; data transformation/ETL; user-defined storage; security; external tables to analyze in place; language support for R, Java, Python, and SQL; analytics including geospatial, real-time, text analytics, event series, pattern matching, time series, machine learning, and regression]
10. MPP ANALYTIC DATA MART ARCHITECTURE
• Scaling architecture/Commodity hardware or appliance
• Built-in data storage
• Advanced storage techniques for (semi-)structured data
• Fastest analytics at scale available (if optimized for data layer)
• Fastest streaming analytics/ingest available
• Efficient use of disk when needed so not memory bound
• Short development times and wide reach via SQL
• All enterprise features come out-of-the-box (security, HA, DR, resource management, ACID)
• High concurrency is built-in
12. QUICK MENTION OF OTHER TECHNOLOGIES
• NoSQL
• OLTP
• Cubes
• Batch vs Real Time and Lambda Architectures
• TensorFlow
• GPUs
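As a rough illustration of the Lambda-architecture idea mentioned above (batch vs. real time), here is a toy Python sketch in which queries merge a periodically recomputed batch view with an incrementally updated speed view. The metric names and view contents are made up for the example.

```python
# Toy Lambda-architecture serving layer: a batch view over historical data
# plus a speed view over recent events, merged at query time.

batch_view = {"clicks": 1000, "views": 5000}   # recomputed periodically over all history
speed_view = {"clicks": 7, "signups": 2}       # incremental, covers only recent events

def query(metric):
    """Serve a metric by merging the batch and speed views."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

result = query("clicks")  # historical count plus recent events
```

In a real deployment the batch view might live in an MPP data mart and the speed view in a streaming system; the point is that neither layer alone answers the query.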
13. COMMON ARCHITECTURE CONSIDERATIONS
[Diagram: the choice of Compute Engine or Data Mart is driven by four groups of considerations:]
• Data Demands: data size, data types, data location
• Organizational Resources: IT & data engineering, computer scientists, concurrency
• Accuracy Requirements: model agility, model depth, model variety
• Deployment Configuration: scoring volume & speed, landing results, embedding analytics
14. DATA DEMANDS
• Size
• Billions of rows. Hundreds of columns.
• Types
• Structured: RDBMS (feed or replacement), Business App Bulk Load
• Semi-Structured: Logs
• Unstructured: Text, Audio, Video
• Location
• EDW, Mainframe, RDBMS, Hard Drive
• HDFS, S3, ABS
15. ORGANIZATIONAL STRUCTURE
• IT & Data Engineering
• Control admin resource costs and complexity
• Control hardware costs
• Control software costs
• Control support costs
• Computer Scientists
• Control headcount
• Concurrency
• Built to support many analysts simultaneously
16. ACCURACY REQUIREMENTS
• Model Agility
• Iterate/experiment with data preparation strategies
• Deploying models quickly
• Model Depth
• Able to fine tune models
• Balance bias and variance
• Model Variety
• Test/ensemble a large diversity of modeling techniques
• Low-level model customization
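The model-variety point can be sketched in a few lines of Python: several diverse "models" (stand-in functions here, not any real library's API) score the same input, and a simple ensemble averages their predictions. All model names and coefficients are invented for the illustration.

```python
# Toy ensemble over a diverse set of modeling techniques: each model scores
# the input independently, and the ensemble averages the predictions.

def model_linear(x):
    return 0.5 * x + 1.0

def model_threshold(x):
    return 1.0 if x > 2.0 else 0.0

def model_quadratic(x):
    return 0.1 * x * x

def ensemble(models, x):
    """Average the predictions of all models in the list."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

score = ensemble([model_linear, model_threshold, model_quadratic], 3.0)
```

Testing and ensembling many techniques like this is where low-level model customization pays off: each component model can be tuned independently before the ensemble combines them.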
17. DEPLOYMENT CONFIGURATION
• Scoring Volume & Speed
• Mechanisms for row by row scoring
• Speed of micro-batching and simultaneous scoring
• Landing Results
• Results in traditional business systems
• Results in cloud or long term storage
• Embedding Analytics
• Low-touch embedded deployment and system duplication
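A minimal sketch of the micro-batching idea from the scoring bullets above, in plain Python with hypothetical function names (no vendor API is implied): incoming rows are grouped into small batches so a deployed model can amortize per-call overhead while still scoring in near real time.

```python
# Micro-batch scoring sketch: group a stream of rows into small batches,
# then score each batch; row-by-row scoring is the batch_size=1 case.

def score_row(row):
    """Stand-in for a deployed model's per-row scoring function."""
    return sum(row)

def micro_batches(rows, batch_size):
    """Yield successive micro-batches of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def score_stream(rows, batch_size=2):
    results = []
    for batch in micro_batches(rows, batch_size):
        # In practice this would be one call to the scoring service per batch.
        results.extend(score_row(r) for r in batch)
    return results

scores = score_stream([[1, 2], [3, 4], [5, 6]], batch_size=2)
```

The batch size is the tuning knob: larger batches raise throughput, smaller batches (down to one row) cut latency.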
18. ARCHITECTURE DECISION FLOW
[Decision-flow grid mapping each consideration to a Compute Engine (CE) and/or Data Mart (DM) recommendation:]
• Data Size: billions of rows and hundreds of columns
• Data Types: structured (RDBMS feed or replacement, business app bulk load); semi-structured (logs); unstructured (text, audio, video)
• Data Location: EDW, mainframe, RDBMS, hard drive; HDFS, S3, ABS
• IT & Data Engineering Resources: control admin resource costs and complexity and hardware costs; control support costs; control software costs
• Computer Scientists: control headcount
• Concurrency: built to support many analysts simultaneously
• Model Agility: iterating/experimenting with data preparation strategies and deploying models quickly
• Model Depth: able to fine-tune models and balance bias and variance
• Model Variety: test/ensemble a large diversity of modeling techniques and provide low-level model customization
• Scoring Volume & Speed: mechanisms for row-by-row scoring
• Landing Results: results in traditional business systems; results in cloud or long-term storage
• Embedding Analytics: low-touch embedded deployment and system duplication
19. LARGE CREDIT CARD REAL-TIME TRANSACTIONAL FRAUD SYSTEM
[Decision-flow grid from slide 18 repeated, with the considerations relevant to this use case highlighted]
20. MID-SIZED NETWORK ANALYTICS/CYBER SECURITY SYSTEM
[Decision-flow grid from slide 18 repeated, with the considerations relevant to this use case highlighted]
21. COMBINED ARCHITECTURE (HADOOP EXAMPLE)
• Most orgs have both already
• Evaluate your analytic needs
• Dedicate physical and personnel resources better
• Consider what data should be “hot” vs “cold”
• If the workload fits in a Distributed Analytical Data Mart you will speed up analytics, save at least 1/3 on hardware, free up …
[Diagram: Data Marts (Greenplum, Vertica, Netezza) alongside Compute Engines (Spark, H2O, Revolution)]
Editor's Notes
Josh Poduska is a Senior Data Scientist in HPE’s Big Data Software Group. Josh has 16 years of experience as a practitioner in the analytical sciences with an emphasis on machine learning and statistical applications. He spent the last six years focusing on advanced analytical solutions with MPP columnar databases. At HPE he is part of the Vertica team and uses Vertica and its machine learning library to help organizations solve their toughest data challenges.
Apache Spark is a cluster-computing framework that speeds up computation through in-memory processing and integrates easily thanks to the large Spark ecosystem. You can use a Spark cluster for tasks such as machine learning and graph computation by parallelizing them.
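The map-then-reduce pattern that Spark parallelizes across a cluster can be sketched in plain Python (run serially here; Spark would execute each map task on a different worker and keep intermediate data in memory). The function names are illustrative, not Spark's API.

```python
# Serial sketch of the map/reduce pattern a cluster framework distributes:
# apply a function to every element of each partition, then fold the results.

def map_phase(partitions, fn):
    """Apply fn to every element of every partition (parallel in a cluster)."""
    return [[fn(x) for x in part] for part in partitions]

def reduce_phase(mapped, op, initial):
    """Fold all mapped values into a single result."""
    result = initial
    for part in mapped:
        for x in part:
            result = op(result, x)
    return result

partitions = [[1, 2, 3], [4, 5], [6]]          # data sharded across "workers"
squared = map_phase(partitions, lambda x: x * x)
total = reduce_phase(squared, lambda a, b: a + b, 0)
```

In Spark the equivalent would be an RDD `map` followed by `reduce`; the framework's value is scheduling those map tasks across nodes and caching intermediates in memory.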
TensorFlow, in short, is a library developed by Google for improving the performance of numerical computation. It represents computation as a data-flow graph, where nodes denote operations and edges denote data arrays (tensors). Google recently released a distributed version of TensorFlow, so you can run TF in a distributed environment, including on Spark.
If I want to apply deep learning algorithms, I will use TensorFlow. For other data processing, I will use Spark.