MACHINE LEARNING ON
DISTRIBUTED SYSTEMS
JOSH PODUSKA
JUNE 2017
THE NEED FOR MACHINE LEARNING AT
SCALE
THE MACHINE LEARNING AT SCALE
LANDSCAPE
• MPP (Massively Parallel Processing) Environment
• Distributed Execution
• Different Math
• Pre-Built Machine Learning Functions (not just a dev environment)
• Able to build models on truly large datasets (>> 1B rows and >> 100 columns) without running out of memory or taking days to run
[Diagram: Node 1, Node 2, …, Node n]
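One way to read "Distributed Execution" and "Different Math" is that each node reduces its own rows to a few summary numbers, and only those summaries are combined. The sketch below assumes simple one-variable least squares on hypothetical data; the function names are illustrative, and it is not a description of how any product named in this deck implements regression.

```python
from functools import reduce

def partial_stats(rows):
    """Runs on one node: reduce its local (x, y) rows to five numbers."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Merge the summaries of two nodes by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(partitions):
    """Fit y = slope * x + intercept from the combined summaries only."""
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_stats, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data following y = 2x + 1, split across three "nodes".
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]
slope, intercept = fit_line(partitions)
```

Each node passes along five numbers no matter how many rows it holds, which is why this formulation does not need the full dataset in one machine's memory.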
THE MACHINE LEARNING AT SCALE
LANDSCAPE
• How many of these solutions offer distributed machine learning?
THE MACHINE LEARNING AT SCALE
LANDSCAPE
• The machine learning at scale players
• Spark
• H2O
• Revolution (now part of Microsoft)
• MADLib (Greenplum)
• Vertica
• Fuzzy Logix (Netezza)
THE MACHINE LEARNING AT SCALE
LANDSCAPE
• Distributed Analytical Compute Engines (CE)
• Spark
• H2O
• Revolution (now part of Microsoft)
• MPP Analytical Data Marts (DM)
• MADLib (Greenplum)
• Vertica
• Fuzzy Logix (Netezza)
DISTRIBUTED ANALYTIC COMPUTE ENGINE
ARCHITECTURE
(SPARK EXAMPLES)
DISTRIBUTED ANALYTIC COMPUTE ENGINE
ARCHITECTURE
• Scaling architecture/Commodity hardware
• Adapts to any data storage location
• All data types welcome
• Custom ingest and data prep
• Built-in visual data discovery
• Customized analytics via programming APIs
• Deepest and widest distributed analytical libraries available
MPP ANALYTIC DATA MART ARCHITECTURE
(VERTICA EXAMPLE)
[Diagram labels]
• User defined loads, User defined functions, User defined storage
• BI & visualization, ODBC, JDBC, OLEDB, Messaging, Data transformation, ETL
• Security
• External tables to analyze in place
• R, Java, Python, SQL
• Geospatial, Real-time, Text analytics, Event series, Pattern matching, Time series, Machine learning, Regression
MPP ANALYTIC DATA MART ARCHITECTURE
• Scaling architecture/Commodity hardware or appliance
• Built-in data storage
• Advanced storage techniques for (semi) structured data
• Fastest analytics at scale available (if optimized for data layer)
• Fastest streaming analytics/ingest available
• Efficient use of disk when needed so not memory bound
• Short development times and wide reach via SQL
• All enterprise features come out-of-the-box (security, HA, DR, resource management, ACID)
• High concurrency is built-in
[Figure: Vertica 8.0.1 vs. Spark 2.0 scalability benchmarks. Four panels, each plotting Runtime (minutes) against Rows (millions) at 10, 100, and 1000. Data labels are listed in the order they appear in the extraction.]
• KMeans Scalability (K=5, Col=100): 0.2, 0.7, 7.4, 3.3, 14.9, 139.2
• Linear Regression Scalability (Col=100): 0.4, 1, 36.1, 0.0, 1.1, 29.8
• Logistic Regression Scalability (Col=100): 0.7, 0.6, 8.7, 0.2, 6.1, 37.0
• Naïve Bayes Scalability (Col=100): 2.5, 11.2, 86.1, 1.1, 61.0, 418.1
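For readers unfamiliar with why an algorithm such as KMeans can be spread over many nodes, the following is a minimal sketch of one iteration, assuming each node holds a partition of the rows. The function names and the tiny two-cluster dataset are hypothetical; this is not the Vertica or Spark implementation that was benchmarked.

```python
def assign(point, centroids):
    """Index of the nearest centroid by squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )

def local_step(rows, centroids):
    """Runs on one node: per-cluster coordinate sums and row counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in rows:
        k = assign(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def global_step(partitions, centroids):
    """Combine the per-node sums and counts, then recompute the centroids."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for rows in partitions:
        sums, counts = local_step(rows, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    # A cluster that received no rows keeps its previous centroid.
    return [
        [s / total_counts[k] for s in total_sums[k]]
        if total_counts[k]
        else list(centroids[k])
        for k in range(len(centroids))
    ]

# Hypothetical data: two partitions, two starting centroids.
partitions = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.0, 2.0), (10.0, 12.0)],
]
new_centroids = global_step(partitions, [(1.0, 1.0), (9.0, 9.0)])
```

Only the per-cluster sums and counts cross node boundaries in each iteration, so the traffic depends on the number of clusters and columns, not on the number of rows.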
QUICK MENTION OF OTHER TECHNOLOGIES
• NoSQL
• OLTP
• Cubes
• Batch vs Real Time and Lambda Architectures
• TensorFlow
• GPUs
COMMON ARCHITECTURE CONSIDERATIONS
Compute Engine or Data Mart
• Data Demands: Data Size, Data Types, Data Location
• Organizational Resources: IT & Data Engineering, Computer Scientists, Concurrency
• Accuracy Requirements: Model Agility, Model Depth, Model Variety
• Deployment Configuration: Scoring Volume & Speed, Landing Results, Embedding Analytics
DATA DEMANDS
• Size
• Billions of rows. Hundreds of columns.
• Types
• Structured: RDBMS (feed or replacement), Business App Bulk Load
• Semi-Structured: Logs
• Unstructured: Text, Audio, Video
• Location
• EDW, Mainframe, RDBMS, Hard Drive
• HDFS, S3, ABS
Slide markers: CE, DM, =, DM, CE, DM
ORGANIZATIONAL STRUCTURE
• IT & Data Engineering
• Control admin resource costs and complexity
• Control hardware costs
• Control software costs
• Control support costs
• Computer Scientists
• Control headcount
• Concurrency
• Built to support many analysts simultaneously
Slide markers: DM, DM, DM, DM, CE, =
ACCURACY REQUIREMENTS
• Model Agility
• Iterate/experiment with data preparation strategies
• Deploy models quickly
• Model Depth
• Fine-tune models
• Balance bias and variance
• Model Variety
• Test/ensemble a large diversity of modeling techniques
• Low-level model customization
Slide markers: DM, CE, DM, CE, CE, CE
DEPLOYMENT CONFIGURATION
• Scoring Volume & Speed
• Mechanisms for row-by-row scoring
• Speed of micro-batching and simultaneous scoring
• Landing Results
• Results in traditional business systems
• Results in cloud or long-term storage
• Embedding Analytics
• Low-touch embedded deployment and system duplication
Slide markers: CE, DM, DM, DM, =
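The difference between row-by-row scoring and micro-batching can be sketched as follows. The logistic model, its coefficients, and the function names are hypothetical stand-ins for whatever fitted model is being deployed; the sketch shows only the grouping mechanics, not any product's scoring API.

```python
import math
from itertools import islice

# Hypothetical coefficients of an already fitted logistic model.
WEIGHTS = [0.8, -0.4]
BIAS = -0.1

def score_row(row):
    """Row-by-row scoring: one probability per incoming row."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, row))
    return 1.0 / (1.0 + math.exp(-z))

def micro_batches(rows, batch_size):
    """Group an incoming stream of rows into small fixed-size batches."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def score_stream(rows, batch_size):
    """Micro-batch scoring: score each small batch as a unit."""
    for batch in micro_batches(rows, batch_size):
        yield [score_row(r) for r in batch]

rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 1.0]]
row_by_row = [score_row(r) for r in rows]
batched = [s for batch in score_stream(rows, batch_size=2) for s in batch]
```

Both paths produce the same scores. What changes is how many rows travel together per call, which is where the latency and throughput trade-off behind "Scoring Volume & Speed" comes from.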
ARCHITECTURE DECISION FLOW
• Data Size: Billions of rows and hundreds of columns
• Data Types: Structured: RDBMS (feed or replacement), Business App Bulk Load; Semi-Structured: Logs; Unstructured: Text, Audio, Video
• Data Location: EDW, Mainframe, RDBMS, Hard Drive; HDFS, S3, ABS
• IT & Data Engineering Resources: Control admin resource costs and complexity and hardware costs; control support costs; control software costs
• Computer Scientists: Control headcount
• Concurrency: Built to support many analysts simultaneously
• Model Agility: Iterating/experimenting with data preparation strategies and deploying models quickly
• Model Depth: Able to fine-tune models and balance bias and variance
• Model Variety: Test/ensemble a large diversity of modeling techniques and provide low-level model customization
• Scoring Volume & Speed: Mechanisms for row-by-row scoring
• Landing Results: Results in traditional business systems; results in cloud or long-term storage
• Embedding Analytics: Low-touch embedded deployment and system duplication
Slide markers: DM, CE
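One hedged way to turn the CE / DM / = markers from the four consideration slides into a first-pass recommendation is a weighted tally. The markers below are transcribed per slide in extraction order. The weighting scheme, the reading of "=" as a tie, and the function names are assumptions made for illustration, not the speaker's stated method.

```python
from collections import Counter

# Markers per consideration slide, in the order they were extracted.
MARKERS = {
    "Data Demands": ["CE", "DM", "=", "DM", "CE", "DM"],
    "Organizational Structure": ["DM", "DM", "DM", "DM", "CE", "="],
    "Accuracy Requirements": ["DM", "CE", "DM", "CE", "CE", "CE"],
    "Deployment Configuration": ["CE", "DM", "DM", "DM", "="],
}

def tally(markers_by_area, weights=None):
    """Sum markers, scaling each area by how much the organization cares."""
    weights = weights or {}
    totals = Counter()
    for area, markers in markers_by_area.items():
        w = weights.get(area, 1.0)
        for marker in markers:
            totals[marker] += w
    return totals

def recommend(totals):
    """Return 'CE', 'DM', or '=' when the two totals are equal."""
    if totals["CE"] > totals["DM"]:
        return "CE"
    if totals["DM"] > totals["CE"]:
        return "DM"
    return "="

unweighted = tally(MARKERS)
# Hypothetical organization that values accuracy five times as much.
accuracy_heavy = tally(MARKERS, {"Accuracy Requirements": 5.0})
```

With equal weights the tally leans toward the data mart, while a heavy weight on accuracy requirements tips it toward the compute engine. The point of the sketch is that the outcome depends on which considerations an organization weights most.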
LARGE CREDIT CARD RT TRANSACTIONAL FRAUD SYSTEM
[Decision-flow diagram repeated for this use case, covering the same considerations: Data Size, Data Types, Data Location, IT & Data Engineering Resources, Computer Scientists, Concurrency, Model Agility, Model Depth, Model Variety, Scoring Volume & Speed, Landing Results, Embedding Analytics.]
Slide markers: DM, CE
MID-SIZED NETWORK ANALYTICS/CYBER SECURITY SYSTEM
[Decision-flow diagram repeated for this use case, covering the same considerations: Data Size, Data Types, Data Location, IT & Data Engineering Resources, Computer Scientists, Concurrency, Model Agility, Model Depth, Model Variety, Scoring Volume & Speed, Landing Results, Embedding Analytics.]
Slide markers: DM, CE
COMBINED ARCHITECTURE
(HADOOP EXAMPLE)
• Most orgs have both already
• Evaluate your analytic needs
• Dedicate physical and personnel resources better
• Consider what data should be “hot” vs “cold”
• If the workload fits in a Distributed Analytical Data Mart you will speed up analytics, save at least 1/3 on hardware, free up
[Diagram labels: Greenplum, Vertica, Netezza, Spark, H2O, Revolution]

Machine Learning on Distributed Systems by Josh Poduska

  • 1. MACHINE LEARNING ON DISTRIBUTED SYSTEMS JOSH PODUSKA JUNE 2017
  • 2. THE NEED FOR MACHINE LEARNING AT SCALE
  • 3. THE MACHINE LEARNING AT SCALE LANDSCAPE • MPP (Massively Parallel Processing) Environment • Distributed Execution • Different Math • Pre-Built Machine Learning Functions (not just a dev environment) • Able to build models on truly large datasets (>> 1B rows and >> 100 columns) without running out of memory or taking days to run Node 1 Node 2…. Node n
  • 4. THE MACHINE LEARNING AT SCALE LANDSCAPE • How many of these solutions offer distributed machine learning?
  • 5. THE MACHINE LEARNING AT SCALE LANDSCAPE • The machine learning at scale players • Spark • H2O • Revolution (now part of Microsoft) • MADLib (Greenplum) • Vertica • Fuzzy Logix (Netezza)
  • 6. THE MACHINE LEARNING AT SCALE LANDSCAPE • Distributed Analytical Compute Engines • Spark • H2O • Revolution (now part of Microsoft) • MPP Analytical Data Marts • MADLib (Greenplum) • Vertica • Fuzzy Logix (Netezza) CE DM
  • 7. DISTRIBUTED ANALYTIC COMPUTE ENGINE ARCHITECTURE (SPARK EXAMPLES)
  • 8. DISTRIBUTED ANALYTIC COMPUTE ENGINE ARCHITECTURE • Scaling architecture/Commodity hardware • Adapts to any data storage location • All data types welcome • Custom ingest and data prep • Built-in visual data discovery • Customized analytics via programming APIs • Deepest and widest distributed analytical libraries available
  • 9. MPP ANALYTIC DATA MART ARCHITECTURE (VERTICA EXAMPLE) User defined loads User defined functions BI & visualization ODBC JDBC OLEDBMessaging Data transformation ETL User defined storage Security External tables to analyze in place R Java Python SQL Geospatial Real-time Text analytics Event series Pattern matching Time series Machine learning Regression
  • 10. MPP ANALYTIC DATA MART ARCHITECTURE • Scaling architecture/Commodity hardware or appliance • Built-in data storage • Advanced storage techniques for (semi) structured data • Fastest analytics at scale available (if optimized for data layer) • Fastest streaming analytics/ingest available • Efficient use of disk when needed so not memory bound • Short development times and wide reach via SQL • All enterprise features come out-of-the-box (security, HA, DR, resource mngmt, ACID) • High concurrency is built-in
  • 11. 0.2 0.7 7.4 3.3 14.9 139.2 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 10 100 1000 Runtime(minutes) Rows (millions) KMeans Scalability (K=5, Col=100) Vertica 8.0.1 Spark 2.0 0.4 1 36.1 0.0 1.1 29.8 0.0 5.0 10.0 15.0 20.0 25.0 30.0 35.0 40.0 10 100 1000 Runtime(minutes) Rows (millions) Linear Regression Scalability (Col=100) Vertica 8.0.1 Spark 2.0 0.7 0.6 8.7 0.2 6.1 37.0 0.0 5.0 10.0 15.0 20.0 25.0 30.0 35.0 40.0 10 100 1000 Runtime(minutes) Rows (millions) Logistic Regression Scalability (Col=100) Vertica 8.0.1 Spark 2.0 2.5 11.2 86.1 1.1 61.0 418.1 0 100 200 300 400 500 10 100 1000 Runtime(minutes) Rows (millions) Naïve Bayes Scalability (Col=100) Vertica 8.0.1 Spark 2.0
  • 12. QUICK MENTION OF OTHER TECHNOLOGIES • NoSQL • OLTP • Cubes • Batch vs Real Time and Lambda Architectures • TensorFlow • GPUs
  • 13. COMMON ARCHITECTURE CONSIDERATIONS Compute Engine or Data Mart • Data Demands: Data Size, Data Types, Data Location • Organizational Resources: IT & Data Engineering, Computer Scientists, Concurrency • Accuracy Requirements: Model Agility, Model Depth, Model Variety • Deployment Configuration: Scoring Volume & Speed, Landing Results, Embedding Analytics
  • 14. DATA DEMANDS • Size • Billions of rows. Hundreds of columns. • Types • Structured: RDBMS (feed or replacement), Business App Bulk Load • Semi-Structured: Logs • Unstructured: Text, Audio, Video • Location • EDW, Mainframe, RDBMS, Hard Drive • HDFS, S3, ABS CE DM = DM CE DM
  • 15. ORGANIZATIONAL STRUCTURE • IT & Data Engineering • Control admin resource costs and complexity • Control hardware costs • Control software costs • Control support costs • Computer Scientists • Control headcount • Concurrency • Built to support many analysts simultaneously DM DM DM DM CE =
  • 16. ACCURACY REQUIREMENTS • Model Agility • Iterate/experiment with data preparation strategies • Deploy models quickly • Model Depth • Able to fine tune models • Balance bias and variance • Model Variety • Test/ensemble a large diversity of modeling techniques • Low-level model customization DM CE DM CE CE CE
  • 17. DEPLOYMENT CONFIGURATION • Scoring Volume & Speed • Mechanisms for row by row scoring • Speed of micro-batching and simultaneous scoring • Landing Results • Results in traditional business systems • Results in cloud or long term storage • Embedding Analytics • Low-touch embedded deployment and system duplication CE DM DM DM =
  • 18. ARCHITECTURE DECISION FLOW • Data Size: Billions of Rows and Hundreds of Columns • Data Types: Structured: RDBMS (feed or replacement), Business App Bulk Load; Semi-Structured: Logs; Unstructured: Text, Audio, Video • Data Location: EDW, Mainframe, RDBMS, Hard Drive; HDFS, S3, ABS • IT & Data Engineering Resources: Control admin resource costs and complexity and hardware costs; Control support costs; Control software costs • Computer Scientists: Control headcount • Concurrency: Built to support many analysts simultaneously • Model Agility: Iterating/experimenting with data preparation strategies and deploying models quickly • Model Depth: Able to fine tune models and balance bias and variance • Model Variety: Test/ensemble a large diversity of modeling techniques and provide low-level model customization • Scoring Volume & Speed: Mechanisms for row by row scoring • Landing Results: Results in traditional business systems; Results in cloud or long term storage • Embedding Analytics: Low-touch embedded deployment and system duplication DM CE
  • 19. LARGE CREDIT CARD RT TRANSACTIONAL FRAUD SYSTEM • Data Size: Billions of Rows and Hundreds of Columns • Data Types: Structured: RDBMS (feed or replacement), Business App Bulk Load; Semi-Structured: Logs; Unstructured: Text, Audio, Video • Data Location: EDW, Mainframe, RDBMS, Hard Drive; HDFS, S3, ABS • IT & Data Engineering Resources: Control admin resource costs and complexity and hardware costs; Control support costs; Control software costs • Computer Scientists: Control headcount • Concurrency: Built to support many analysts simultaneously • Model Agility: Iterating/experimenting with data preparation strategies and deploying models quickly • Model Depth: Able to fine tune models and balance bias and variance • Model Variety: Test/ensemble a large diversity of modeling techniques and provide low-level model customization • Scoring Volume & Speed: Mechanisms for row by row scoring • Landing Results: Results in traditional business systems; Results in cloud or long term storage • Embedding Analytics: Low-touch embedded deployment and system duplication DM CE
  • 20. MID-SIZED NETWORK ANALYTICS/CYBER SECURITY SYSTEM • Data Size: Billions of Rows and Hundreds of Columns • Data Types: Structured: RDBMS (feed or replacement), Business App Bulk Load; Semi-Structured: Logs; Unstructured: Text, Audio, Video • Data Location: EDW, Mainframe, RDBMS, Hard Drive; HDFS, S3, ABS • IT & Data Engineering Resources: Control admin resource costs and complexity and hardware costs; Control support costs; Control software costs • Computer Scientists: Control headcount • Concurrency: Built to support many analysts simultaneously • Model Agility: Iterating/experimenting with data preparation strategies and deploying models quickly • Model Depth: Able to fine tune models and balance bias and variance • Model Variety: Test/ensemble a large diversity of modeling techniques and provide low-level model customization • Scoring Volume & Speed: Mechanisms for row by row scoring • Landing Results: Results in traditional business systems; Results in cloud or long term storage • Embedding Analytics: Low-touch embedded deployment and system duplication DM CE
  • 21. COMBINED ARCHITECTURE (HADOOP EXAMPLE) • Most orgs have both already • Evaluate your analytic needs • Dedicate physical and personnel resources better • Consider what data should be “hot” vs “cold” • If the workload fits in a Distributed Analytical Data Mart you will speed up analytics, save at least 1/3 on hardware, free up Greenplum Vertica Netezza Spark H2O Revolution
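
Slide 11 benchmarks KMeans on two distributed engines. As a rough illustration of why an algorithm like this can run across many nodes, the sketch below performs one k-means update in which each node reduces its own rows to small per-cluster sums and counts, and a coordinator merges those summaries. This is plain Python written for illustration only; it is not how Vertica or Spark implement KMeans, and the function names (`partial_stats`, `merge`, `kmeans_step`) and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
# Per-cluster summary produced by one node: (coordinate sums, row count).
Partial = List[Tuple[List[float], int]]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def partial_stats(rows: Sequence[Point], centers: Sequence[Point]) -> Partial:
    """Work done on ONE node: local per-cluster sums and counts."""
    dim = len(centers[0])
    stats: Partial = [([0.0] * dim, 0) for _ in centers]
    for row in rows:
        k = nearest(row, centers)
        sums, count = stats[k]
        stats[k] = ([s + x for s, x in zip(sums, row)], count + 1)
    return stats


def merge(partials: Sequence[Partial], centers: Sequence[Point]) -> List[List[float]]:
    """Coordinator step: combine the small per-node summaries into new centers."""
    new_centers = []
    for k, old in enumerate(centers):
        total = [0.0] * len(old)
        n = 0
        for part in partials:
            sums, count = part[k]
            total = [t + s for t, s in zip(total, sums)]
            n += count
        # An empty cluster keeps its previous center.
        new_centers.append([t / n for t in total] if n else list(old))
    return new_centers


def kmeans_step(nodes: Sequence[Sequence[Point]], centers: Sequence[Point]) -> List[List[float]]:
    """One k-means update; each element of `nodes` stands in for one machine's rows."""
    return merge([partial_stats(rows, centers) for rows in nodes], centers)


# Hypothetical toy data: two "nodes" holding different rows of the same table.
nodes = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0), (2.0, 0.0)],
]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = kmeans_step(nodes, centers)
```

Only the per-cluster sums and counts leave each node, so the amount of data exchanged depends on the number of clusters and columns rather than on the number of rows.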

Editor's Notes

  1. Josh Poduska is a Senior Data Scientist in HPE’s Big Data Software Group. Josh has 16 years of experience as a practitioner in the analytical sciences with an emphasis on machine learning and statistical applications. He spent the last six years focusing on advanced analytical solutions with MPP columnar databases. At HPE he is part of the Vertica team and uses Vertica and its machine learning library to help organizations solve their toughest data challenges.
  2. Apache Spark is a cluster computing framework that speeds up computation through in-memory computing and integrates easily thanks to the large Spark ecosystem. You can use a Spark cluster for various tasks, such as machine learning and graph computation, by parallelizing them. TensorFlow, in short, is a library developed by Google for high-performance numerical computation; it represents a computation as a data flow graph in which nodes denote operations and edges denote data arrays. Google recently released a distributed version of TF, so you can run TF in a distributed environment and also on Spark. If I want to apply deep learning algorithms, I will use TensorFlow. If I want to do other data processing, I will use Spark.
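
To make the data flow graph idea in note 2 concrete, here is a minimal sketch in plain Python of a graph whose nodes are operations and whose edges carry arrays between them. It does not use the TensorFlow API; the `run` and `elementwise` helpers and the node names are hypothetical and exist only for this illustration.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Each node maps to (operation, names of its input nodes).
# The name references are the edges along which arrays flow.
Graph = Dict[str, Tuple[Callable[..., List[float]], List[str]]]


def elementwise(op: Callable[[float, float], float]) -> Callable[..., List[float]]:
    """Lift a scalar binary operation to same-length lists."""
    def apply(a: List[float], b: List[float]) -> List[float]:
        return [op(x, y) for x, y in zip(a, b)]
    return apply


def run(graph: Graph, feeds: Dict[str, List[float]], target: str) -> List[float]:
    """Evaluate `target` by first evaluating the nodes it depends on."""
    cache = dict(feeds)  # input arrays supplied by the caller

    def evaluate(name: str) -> List[float]:
        if name not in cache:
            op, inputs = graph[name]
            cache[name] = op(*(evaluate(i) for i in inputs))
        return cache[name]

    return evaluate(target)


# output = x * w + b, built as two operation nodes.
graph: Graph = {
    "product": (elementwise(operator.mul), ["x", "w"]),
    "output": (elementwise(operator.add), ["product", "b"]),
}
result = run(graph, {"x": [1.0, 2.0], "w": [3.0, 4.0], "b": [0.5, 0.5]}, "output")
```

Describing the computation as a graph before running it is what lets a runtime decide where each operation executes, which is the property a distributed version relies on.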