SlideShare a Scribd company logo
1 of 33
Download to read offline
Subhojit Banerjee
DataScientist/DataEngineer
@subbubanerjee
https://medium.com/@subhojit20_27731
Deploying Large ML models and
scoring in near-real time @scale
Running the last mile of the Data Science journey
2
Big Data* is not easy
Gartner found just 14% of companies
surveyed had big data projects in production
in 2015 and unchanged from the year before
and slowly inching towards 15% in year 2016
*Big Data ~ Large unstructured and structured data
3
Big Data is not easy
● Cloudera: $261M in revenue, $187M in losses (down from $205M the
year before, the only company to narrow its loss)
● Hortonworks: $184M in revenue, $251M in losses (up from $180M the
year before)
● Alteryx: $85M in revenue, $24M in losses (up from $21M)
● Splunk: $950M in revenue, $355M in losses (up from $279M)
● Tableau: $827M in revenue, $144M in losses (up from $84M)
*Big Data ~ Large unstructured and structured data
4
Big Data is not easy
Gartner’s top obstacles for big data success were:
● Determining how to get value from Big data
● Most companies are not set up culturally or oganizationally to suceed, wishing themselves
agile and hoping data silos will dissappear “magically”
● Technical expertise is lacking in areas like agile model development and deployment to
provide quick turnarounds.
5
Big Data is not easy
This talk is to address the last problem -
“Democratising” large scale machine learning
model deployment and model scoring (in
near real time)
01
02
03
04
05
Business case
First solution
Problems faced
Better solution
Demo
06
07
Conclusion
Questions
7
Business use case
But first, we need to collect data to act on it in
real time
8
Business use case
Is real time segmentation of the user into buy
vs defer clusters using GDPR compliant
features possible on the website?
9
10
11
● ~23 million events per month
● ~140G of granular event data collected
● cost of 6 - 12 euro per day
●
12
Business use case
So now that we have the real time pipeline can
we train and score our ML model ?
13
First Solution
Aws lambda based serverless architecture
14
DEMO
15
First solution
16
Problem faced
The first model was a markov chain on the
sequence of webpages visited
17
Problems faced
● Had to hack core R libraries to bring down the 212 MB ML model
library to fit the 50MB compressed AWS Lambda restriction
● Every time front end would change, our models needs to change and
old data cannot be used to retrain the model.
18
Better solution
Requirements:
● Decrease the time for the models to move from notebook to production
● Has to eliminate recoding of Spark feature pipelines and models from research to production i.e
my pyspark model should be deployable into a Scala pipeline with zero or minimal code changes
● Serving/inference has to be superfast for spark models
19
Better solution
Requirements technical analysis
● For model serving to super fast, it was clear that inference needed to be outside the JVM and
outside realms of Spark context.
● Has to be completely vendor neutral i.e. create a model in AWS and should be able to deploy the
model in a pipeline on GCP and vice-versa.
● True test of serialization/portability: can I zip the model and send it to my coworker in an email
20
Demo
What will you see:
● Build a Pyspark model on 100 Gigs of Data in a Jupyter notebook.
● Export the serialized model into a JSON and protobuf format (with just
addition of a few libraries)
● Load the serialized pipeline into a docker container for near real model
serving using a REST scala interface. (takes about 50ms for a model that
spark serves in 1.5 seconds)
21
Mleap
First things first:
Mleap was possible due to the good work of
Hollin Wilkins and Mikhail Semeniuk
22
Mleap
First things first:
Mleap was possible due to the good work of
Hollin Wilkins and Mikhail Semeniuk
23
Mleap
24
Mleap
Random Forest pipeline
Random Forest
25
Mleap
26
Mleap available transformers
Support for all available Spark transformers:
● Binarizer,BucketedRandomLSH,Bucketizer,ChisqSelector,
● Vectorizer,DCT,ElementwiseProduct,HashingTermFrequency
● PCA, onehotencoder,Ngram,QuantileDiscretizer,Tokenizer
● etc
27
Mleap available transformers
Support for all available Spark Classification transformers:
DecisionTreeClassifier,GradientBoostedTreeClassifier,LogisticRegression,
LogisticRegressionCv,NaiveBayesClassifier,OnevsRest,
RandomForestClassifier,SupportVectorMachines,MultilayerPerceptron
28
Mleap available transformers
Support for all available Spark Regression transformers:
AFTSurvivalRegression,DecisionTreeRegression,GeneralizedLinearRegres
sion,RandomForestRegression etc
29
Mleap available transformers
Support for all available Clustering transformers:
BisectingKmeans,Kmean,LDA etc
30
Mleap unavailable transformers
ALS, but its being worked on currently. Soon to be available
31
Conclusion
Dockerizing ML models can instantly leverage decades of effort spent on
productionizing software engineering. You automatically get benefits of
scaling, reliability due to CI/CD, fault tolerance, high availability, A/B
testing, automation and a rich ecosystem of metrics that the new
datascience discipline need not reinvent.
32
References
● Mleap: https://github.com/combust/mleap
● Neils talk in Pydata 2017 about productionizing ML:
https://www.youtube.com/watch?v=f3I0izerPvc
33
Thank you!

More Related Content

What's hot

SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkDatabricks
 
Scaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber AttacksScaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber AttacksDatabricks
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Databricks
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceDatabricks
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Kiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with ScyllaKiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with ScyllaScyllaDB
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkroutconfluent
 
Function as a service
Function as a serviceFunction as a service
Function as a serviceatomslide
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?confluent
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at AirbnbHao Wang
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesDatabricks
 
Apache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineTianlun Zhang
 
Architecture Blue Print
Architecture Blue PrintArchitecture Blue Print
Architecture Blue PrintBogdan Nedelcu
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkDatabricks
 
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate Ido Shilon
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouseVianney FOUCAULT
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotXiang Fu
 
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...Dataconomy Media
 
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...InfluxData
 

What's hot (20)

SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache Spark
 
Scaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber AttacksScaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber Attacks
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Kiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with ScyllaKiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with Scylla
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Function as a service
Function as a serviceFunction as a service
Function as a service
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
 
Apache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming Engine
 
Architecture Blue Print
Architecture Blue PrintArchitecture Blue Print
Architecture Blue Print
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache Spark
 
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache Pinot
 
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
 
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
 

Similar to Deploying Large Spark Models to production and model scoring in near real time

Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Edwin Poot
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to ProductionMostafa Majidpour
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemGyula Fóra
 
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019VMware Tanzu
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
Two way data sync between legacy and your brand new micro-service architecture
 Two way data sync between legacy and your brand new micro-service architecture Two way data sync between legacy and your brand new micro-service architecture
Two way data sync between legacy and your brand new micro-service architecturebleporini
 
Real-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeReal-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeSingleStore
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Embarcados
 
Using Streaming APIs in Production
Using Streaming APIs in ProductionUsing Streaming APIs in Production
Using Streaming APIs in ProductionLuca Mattia Ferrari
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo SanchezGoDataDriven
 
Confluent Partner Tech Talk with SVA
Confluent Partner Tech Talk with SVAConfluent Partner Tech Talk with SVA
Confluent Partner Tech Talk with SVAconfluent
 
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsGraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsNicola Molinari
 
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...VoltDB
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentationpunesparkmeetup
 
Delivering Carrier Grade OCP for Virtualized Data Centers
Delivering Carrier Grade OCP for Virtualized Data CentersDelivering Carrier Grade OCP for Virtualized Data Centers
Delivering Carrier Grade OCP for Virtualized Data CentersRadisys Corporation
 

Similar to Deploying Large Spark Models to production and model scoring in near real time (20)

Deploying spark ml models
Deploying spark ml models Deploying spark ml models
Deploying spark ml models
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Two way data sync between legacy and your brand new micro-service architecture
 Two way data sync between legacy and your brand new micro-service architecture Two way data sync between legacy and your brand new micro-service architecture
Two way data sync between legacy and your brand new micro-service architecture
 
Real-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeReal-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil Dahlke
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
 
Using Streaming APIs in Production
Using Streaming APIs in ProductionUsing Streaming APIs in Production
Using Streaming APIs in Production
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
 
Confluent Partner Tech Talk with SVA
Confluent Partner Tech Talk with SVAConfluent Partner Tech Talk with SVA
Confluent Partner Tech Talk with SVA
 
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsGraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
 
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
VoltDB and Flytxt Present: Building a Single Technology Platform for Real-Tim...
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentation
 
Delivering Carrier Grade OCP for Virtualized Data Centers
Delivering Carrier Grade OCP for Virtualized Data CentersDelivering Carrier Grade OCP for Virtualized Data Centers
Delivering Carrier Grade OCP for Virtualized Data Centers
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Deploying Large Spark Models to production and model scoring in near real time

  • 1. Subhojit Banerjee DataScientist/DataEngineer @subbubanerjee https://medium.com/@subhojit20_27731 Deploying Large ML models and scoring in near-real time @scale Running the last mile of the Data Science journey
  • 2. 2 Big Data* is not easy Gartner found just 14% of companies surveyed had big data projects in production in 2015 and unchanged from the year before and slowly inching towards 15% in year 2016 *Big Data ~ Large unstructured and structured data
  • 3. 3 Big Data is not easy ● Cloudera: $261M in revenue, $187M in losses (down from $205M the year before, the only company to narrow its loss) ● Hortonworks: $184M in revenue, $251M in losses (up from $180M the year before) ● Alteryx: $85M in revenue, $24M in losses (up from $21M) ● Splunk: $950M in revenue, $355M in losses (up from $279M) ● Tableau: $827M in revenue, $144M in losses (up from $84M) *Big Data ~ Large unstructured and structured data
  • 4. 4 Big Data is not easy Gartner’s top obstacles for big data success were: ● Determining how to get value from Big data ● Most companies are not set up culturally or oganizationally to suceed, wishing themselves agile and hoping data silos will dissappear “magically” ● Technical expertise is lacking in areas like agile model development and deployment to provide quick turnarounds.
  • 5. 5 Big Data is not easy This talk is to address the last problem - “Democratising” large scale machine learning model deployment and model scoring (in near real time)
  • 6. 01 02 03 04 05 Business case First solution Problems faced Better solution Demo 06 07 Conclusion Questions
  • 7. 7 Business use case But first, we need to collect data to act on it in real time
  • 8. 8 Business use case Is real time segmentation of the user into buy vs defer clusters using GDPR compliant features possible on the website?
  • 9. 9
  • 10. 10
  • 11. 11 ● ~23 million events per month ● ~140G of granular event data collected ● cost of 6 - 12 euro per day ●
  • 12. 12 Business use case So now that we have the real time pipeline can we train and score our ML model ?
  • 13. 13 First Solution Aws lambda based serverless architecture
  • 16. 16 Problem faced The first model was a markov chain on the sequence of webpages visited
  • 17. 17 Problems faced ● Had to hack core R libraries to bring down the 212 MB ML model library to fit the 50MB compressed AWS Lambda restriction ● Every time front end would change, our models needs to change and old data cannot be used to retrain the model.
  • 18. 18 Better solution Requirements: ● Decrease the time for the models to move from notebook to production ● Has to eliminate recoding of Spark feature pipelines and models from research to production i.e my pyspark model should be deployable into a Scala pipeline with zero or minimal code changes ● Serving/inference has to be superfast for spark models
  • 19. 19 Better solution Requirements technical analysis ● For model serving to super fast, it was clear that inference needed to be outside the JVM and outside realms of Spark context. ● Has to be completely vendor neutral i.e. create a model in AWS and should be able to deploy the model in a pipeline on GCP and vice-versa. ● True test of serialization/portability: can I zip the model and send it to my coworker in an email
  • 20. 20 Demo What will you see: ● Build a Pyspark model on 100 Gigs of Data in a Jupyter notebook. ● Export the serialized model into a JSON and protobuf format (with just addition of a few libraries) ● Load the serialized pipeline into a docker container for near real model serving using a REST scala interface. (takes about 50ms for a model that spark serves in 1.5 seconds)
  • 21. 21 Mleap First things first: Mleap was possible due to the good work of Hollin Wilkins and Mikhail Semeniuk
  • 22. 22 Mleap First things first: Mleap was possible due to the good work of Hollin Wilkins and Mikhail Semeniuk
  • 26. 26 Mleap available transformers Support for all available Spark transformers: ● Binarizer,BucketedRandomLSH,Bucketizer,ChisqSelector, ● Vectorizer,DCT,ElementwiseProduct,HashingTermFrequency ● PCA, onehotencoder,Ngram,QuantileDiscretizer,Tokenizer ● etc
  • 27. 27 Mleap available transformers Support for all available Spark Classification transformers: DecisionTreeClassifier,GradientBoostedTreeClassifier,LogisticRegression, LogisticRegressionCv,NaiveBayesClassifier,OnevsRest, RandomForestClassifier,SupportVectorMachines,MultilayerPerceptron
  • 28. 28 Mleap available transformers Support for all available Spark Regression transformers: AFTSurvivalRegression,DecisionTreeRegression,GeneralizedLinearRegres sion,RandomForestRegression etc
  • 29. 29 Mleap available transformers Support for all available Clustering transformers: BisectingKmeans,Kmean,LDA etc
  • 30. 30 Mleap unavailable transformers ALS, but its being worked on currently. Soon to be available
  • 31. 31 Conclusion Dockerizing ML models can instantly leverage decades of effort spent on productionizing software engineering. You automatically get benefits of scaling, reliability due to CI/CD, fault tolerance, high availability, A/B testing, automation and a rich ecosystem of metrics that the new datascience discipline need not reinvent.
  • 32. 32 References ● Mleap: https://github.com/combust/mleap ● Neils talk in Pydata 2017 about productionizing ML: https://www.youtube.com/watch?v=f3I0izerPvc