Subhojit Banerjee
Data Scientist / Data Engineer
@subbubanerjee
https://medium.com/@subhojit20_27731
Deploying Large ML models and
scoring in near-real time @scale
Running the last mile of the Data Science journey
2
Big Data* is not easy
Gartner found that just 14% of companies
surveyed had big data projects in production
in 2015, unchanged from the year before,
inching up slowly towards 15% in 2016
*Big Data ~ Large unstructured and structured data
3
Big Data is not easy
● Cloudera: $261M in revenue, $187M in losses (down from $205M the
year before, the only company to narrow its loss)
● Hortonworks: $184M in revenue, $251M in losses (up from $180M the
year before)
● Alteryx: $85M in revenue, $24M in losses (up from $21M)
● Splunk: $950M in revenue, $355M in losses (up from $279M)
● Tableau: $827M in revenue, $144M in losses (up from $84M)
4
Big Data is not easy
Gartner’s top obstacles for big data success were:
● Determining how to get value from big data
● Most companies are not set up culturally or organizationally to succeed, wishing themselves
agile and hoping data silos will disappear “magically”
● Technical expertise is lacking in areas like agile model development and deployment to
provide quick turnarounds.
5
Big Data is not easy
This talk addresses the last problem:
“democratising” large-scale machine learning
model deployment and model scoring (in
near real time)
01 Business case
02 First solution
03 Problems faced
04 Better solution
05 Demo
06 Conclusion
07 Questions
7
Business use case
Is real-time segmentation of users into buy
vs defer clusters, using GDPR-compliant
features, possible on the website?
8
Business use case
But first, we need to collect data to act on it in
real time
9
10
11
● ~23 million events per month
● ~140 GB of granular event data collected
● cost of 6 - 12 euro per day
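To put these figures in perspective, here is a rough back-of-the-envelope calculation. The event count and daily cost come from the slide; the 30-day month is an assumption, and the results are averages that say nothing about peak load.

```python
# Figures quoted on the slide.
EVENTS_PER_MONTH = 23_000_000
COST_EUR_PER_DAY = (6, 12)      # lower and upper bound

# Assumption for illustration: a 30-day month.
DAYS_PER_MONTH = 30
SECONDS_PER_DAY = 86_400

# Average ingest rate over the month.
events_per_second = EVENTS_PER_MONTH / (DAYS_PER_MONTH * SECONDS_PER_DAY)

# Monthly cost range and cost per million events.
monthly_cost_eur = tuple(c * DAYS_PER_MONTH for c in COST_EUR_PER_DAY)
cost_per_million_events = tuple(
    c / (EVENTS_PER_MONTH / 1_000_000) for c in monthly_cost_eur
)

print(f"average load: {events_per_second:.1f} events/s")
print(f"monthly cost: {monthly_cost_eur[0]}-{monthly_cost_eur[1]} EUR")
print(
    "cost per million events: "
    f"{cost_per_million_events[0]:.2f}-{cost_per_million_events[1]:.2f} EUR"
)
```

On average this is under ten events per second, which is why a pay-per-use pipeline can stay within a few euros a day.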
12
Business use case
So now that we have the real-time pipeline, can
we train and score our ML model?
13
First Solution
AWS Lambda-based serverless architecture
14
DEMO
15
First solution
16
Problems faced
The first model was a Markov chain on the
sequence of webpages visited
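A minimal sketch of what a Markov chain on page sequences can look like. The slides do not show the actual model, so everything here is an assumption for illustration: the page names, the sessions, one chain per cluster, and the rule of picking the cluster whose chain gives the visited sequence the higher probability.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Estimate first-order transition probabilities P(next_page | page)."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for page, followers in counts.items()
    }


def sequence_probability(transitions, pages, unseen=1e-6):
    """Probability of a page sequence; unseen transitions get a small floor."""
    prob = 1.0
    for current, nxt in zip(pages, pages[1:]):
        prob *= transitions.get(current, {}).get(nxt, unseen)
    return prob


# Hypothetical training sessions, one list of page names per visit.
buy_sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "search", "product", "cart", "checkout"],
]
defer_sessions = [
    ["home", "search", "home"],
    ["home", "product", "search", "home"],
]

buy_chain = fit_transitions(buy_sessions)
defer_chain = fit_transitions(defer_sessions)


def classify(pages):
    """Assign the session to the cluster whose chain explains it better."""
    p_buy = sequence_probability(buy_chain, pages)
    p_defer = sequence_probability(defer_chain, pages)
    return "buy" if p_buy > p_defer else "defer"


print(classify(["home", "product", "cart"]))
print(classify(["home", "search", "home"]))
```

Because the states are page names, any rename or restructuring of the site changes the state space, so a model of this kind is tightly coupled to the front end.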
17
Problems faced
● Had to hack core R libraries to bring the 212 MB ML model
library down to fit AWS Lambda's 50 MB compressed package limit
● Every time the front end changed, our models needed to change, and
old data could not be used to retrain the model.
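A size check like the one below can catch an oversized package before deployment. It is only a sketch: the 50 MB figure is the compressed limit quoted on the slide, and the helper names are hypothetical. It zips a directory with the standard library and compares the archive size against that limit.

```python
import os
import zipfile

# Compressed package limit as quoted on the slide.
LAMBDA_ZIP_LIMIT_BYTES = 50 * 1024 * 1024


def zipped_size_bytes(source_dir, archive_path):
    """Zip source_dir with DEFLATE and return the archive size in bytes.

    archive_path should live outside source_dir, otherwise the archive
    could be picked up while the directory is being walked.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return os.path.getsize(archive_path)


def fits_lambda_limit(source_dir, archive_path, limit=LAMBDA_ZIP_LIMIT_BYTES):
    """Return (fits, size_in_bytes) for the zipped package."""
    size = zipped_size_bytes(source_dir, archive_path)
    return size <= limit, size
```

Running this in CI turns the size restriction into a failing build rather than a failed deployment.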
18
Better solution
Requirements:
● Decrease the time for the models to move from notebook to production
● Has to eliminate recoding of Spark feature pipelines and models between research and production, i.e.
my PySpark model should be deployable into a Scala pipeline with zero or minimal code changes
● Serving/inference has to be super fast for Spark models
19
Better solution
Requirements technical analysis
● For model serving to be super fast, it was clear that inference needed to happen outside the JVM and
outside the realm of the Spark context.
● Has to be completely vendor neutral, i.e. a model created on AWS should be deployable
in a pipeline on GCP and vice versa.
● True test of serialization/portability: can I zip the model and send it to my coworker in an email?
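The "zip it and email it" test can be illustrated with a toy bundle. This sketch is not the MLeap bundle format; the file name `model.json`, the model fields and the feature names are all made up for illustration. It only shows the idea that a model reduced to plain data in a zip file can be loaded and scored anywhere, with no Spark context.

```python
import json
import math
import zipfile


def save_bundle(path, model):
    """Write the model parameters as JSON inside a zip archive."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("model.json", json.dumps(model))


def load_bundle(path):
    """Read the model parameters back from the zip archive."""
    with zipfile.ZipFile(path) as bundle:
        return json.loads(bundle.read("model.json"))


def score(model, features):
    """Logistic regression score computed from the stored parameters."""
    z = model["intercept"] + sum(
        weight * features[name] for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model: the coefficients and feature names are invented.
model = {
    "type": "logistic_regression",
    "intercept": -1.0,
    "weights": {"pages_viewed": 0.4, "cart_adds": 1.2},
}
```

A coworker who receives the archive needs only `load_bundle` and `score`, which is the property the portability test asks for.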
20
Demo
What will you see:
● Build a PySpark model on 100 GB of data in a Jupyter notebook.
● Export the serialized model into JSON and protobuf formats (with just
the addition of a few libraries)
● Load the serialized pipeline into a Docker container for near-real-time model
serving using a Scala REST interface (takes about 50 ms for a model that
Spark serves in 1.5 seconds)
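The latency comparison above can be reproduced for any scoring function with a small timing harness. This is a generic sketch, not the benchmark used in the talk: `latency_ms` is a hypothetical helper and the payloads are whatever the scoring function accepts. The speedup line uses only the two figures quoted on the slide.

```python
import statistics
import time


def latency_ms(score_fn, payloads, warmup=10):
    """Time score_fn on each payload and report latency in milliseconds."""
    # Warm-up calls are excluded from the measurement.
    for payload in payloads[:warmup]:
        score_fn(payload)
    timings = []
    for payload in payloads:
        start = time.perf_counter()
        score_fn(payload)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}


# Figures quoted on the slide: about 1.5 s in Spark vs about 50 ms served.
SPARK_LATENCY_MS = 1500
SERVED_LATENCY_MS = 50
SPEEDUP = SPARK_LATENCY_MS / SERVED_LATENCY_MS  # 30x
```

Reporting the median alongside the maximum helps separate typical request latency from occasional slow calls.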
21
Demo
What will you see:
● Using GDPR-compliant features, how does one get 88% on ROC and 94% on
precision-recall using a PySpark model?
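For readers unfamiliar with the two metrics, here is a small pure-Python sketch of how they can be computed from labels and scores. It assumes the slide's figures refer to the area under the ROC curve and the area under the precision-recall curve (computed here as average precision); the labels and scores are invented, and at least one positive and one negative example are assumed.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values taken at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Invented example: 1 = buy, 0 = defer.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(f"ROC AUC: {roc_auc(labels, scores):.3f}")
print(f"average precision: {average_precision(labels, scores):.3f}")
```

In a PySpark workflow the same quantities would normally come from the library's evaluator rather than hand-written code; the sketch is only meant to make the definitions concrete.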
22
MLeap
First things first:
MLeap was possible due to the good work of
Hollin Wilkins and Mikhail Semeniuk
23
24
MLeap
25
MLeap
[Figure: Random Forest pipeline]
26
MLeap
27
MLeap available transformers
Support for all available Spark transformers:
● Binarizer, BucketedRandomProjectionLSH, Bucketizer, ChiSqSelector
● Vectorizer, DCT, ElementwiseProduct, HashingTermFrequency
● PCA, OneHotEncoder, NGram, QuantileDiscretizer, Tokenizer
● etc.
28
MLeap available transformers
Support for all available Spark classification transformers:
DecisionTreeClassifier, GradientBoostedTreeClassifier, LogisticRegression,
LogisticRegressionCv, NaiveBayesClassifier, OneVsRest,
RandomForestClassifier, SupportVectorMachines, MultilayerPerceptron
29
MLeap available transformers
Support for all available Spark regression transformers:
AFTSurvivalRegression, DecisionTreeRegression, GeneralizedLinearRegression,
RandomForestRegression, etc.
30
MLeap available transformers
Support for all available clustering transformers:
BisectingKMeans, KMeans, LDA, etc.
31
MLeap unavailable transformers
ALS, but it's being worked on currently and should be available soon.
32
Conclusion
Dockerizing ML models can instantly leverage decades of effort spent on
productionizing software engineering. You automatically get benefits of
scaling, reliability due to CI/CD, fault tolerance, high availability, A/B
testing, automation and a rich ecosystem of metrics that the new
datascience discipline need not reinvent.
33
References
● MLeap: https://github.com/combust/mleap
● Neils' talk at PyData 2017 about productionizing ML:
https://www.youtube.com/watch?v=f3I0izerPvc
34
Thank you!

Deploying Spark ML models

  • 1. Subhojit Banerjee, Data Scientist/Data Engineer, @subbubanerjee, https://medium.com/@subhojit20_27731. Deploying large ML models and scoring in near-real time at scale: running the last mile of the data science journey
  • 2. Big Data* is not easy: Gartner found that just 14% of companies surveyed had big data projects in production in 2015, unchanged from the year before, with the figure slowly inching towards 15% in 2016. *Big Data ~ large unstructured and structured data
  • 3. Big Data is not easy ● Cloudera: $261M in revenue, $187M in losses (down from $205M the year before, the only company to narrow its loss) ● Hortonworks: $184M in revenue, $251M in losses (up from $180M the year before) ● Alteryx: $85M in revenue, $24M in losses (up from $21M) ● Splunk: $950M in revenue, $355M in losses (up from $279M) ● Tableau: $827M in revenue, $144M in losses (up from $84M)
  • 4. Big Data is not easy. Gartner’s top obstacles for big data success were: ● Determining how to get value from big data ● Most companies are not set up culturally or organizationally to succeed, wishing themselves agile and hoping data silos will disappear “magically” ● Technical expertise is lacking in areas like agile model development and deployment to provide quick turnarounds
  • 5. Big Data is not easy: This talk addresses the last problem, “democratising” large-scale machine learning model deployment and model scoring (in near real time)
  • 6. 01 Business case ● 02 First solution ● 03 Problems faced ● 04 Better solution ● 05 Demo ● 06 Conclusion ● 07 Questions
  • 7. Business use case: Is real-time segmentation of the user into buy vs defer clusters, using GDPR-compliant features, possible on the website?
  • 8. Business use case: But first, we need to collect the data in order to act on it in real time
  • 11. ● ~23 million events per month ● ~140 GB of granular event data collected ● Cost of 6 to 12 euro per day
  • 12. Business use case: Now that we have the real-time pipeline, can we train and score our ML model?
  • 13. First solution: AWS Lambda based serverless architecture
  • 16. Problem faced: The first model was a Markov chain on the sequence of web pages visited
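To make the Markov chain idea concrete, here is a minimal sketch of a first-order chain estimated from page sequences. The slides do not show the original model (which was built in R), so the page names, the sample sessions and both function names are hypothetical and only illustrate the technique.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Estimate first-order transition probabilities from page sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Count every observed (current page -> next page) step.
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    # Normalise each row of counts into probabilities.
    return {
        page: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for page, row in counts.items()
    }


def sequence_probability(transitions, pages):
    """Probability of a path given the table (the start state is ignored)."""
    prob = 1.0
    for current, nxt in zip(pages, pages[1:]):
        # Unseen transitions get probability 0.0 in this unsmoothed sketch.
        prob *= transitions.get(current, {}).get(nxt, 0.0)
    return prob


# Hypothetical sessions; real states would be the site's own page identifiers.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product", "home"],
    ["home", "search", "product", "cart"],
]
transitions = fit_transitions(sessions)
path_prob = sequence_probability(transitions, ["home", "product", "cart"])
```

One way to use such a table for segmentation would be to fit one chain per segment (for example buyers and deferrers) and compare the likelihood of a new session under each. Note that the states are page identifiers, so the table is only meaningful while those pages exist.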
  • 17. Problems faced ● We had to hack core R libraries to shrink the 212 MB ML model library to fit AWS Lambda's 50 MB compressed package limit ● Every time the front end changed, our models needed to change, and old data could not be used to retrain the model
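The size constraint can be checked before a deployment is attempted. The sketch below zips a package directory and compares the archive against the 50 MB compressed limit cited above. The function names and the idea of running this as a pre-deploy check are assumptions for illustration, not part of the original pipeline.

```python
import os
import tempfile
import zipfile

# 50 MB compressed package limit, as cited in the slide.
LAMBDA_ZIP_LIMIT_BYTES = 50 * 1024 * 1024


def zipped_size(directory):
    """Zip `directory` into a temporary archive and return its size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "package.zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for root, _dirs, files in os.walk(directory):
                for name in files:
                    path = os.path.join(root, name)
                    # Store paths relative to the package root.
                    zf.write(path, os.path.relpath(path, directory))
        return os.path.getsize(archive)


def fits_lambda(directory, limit=LAMBDA_ZIP_LIMIT_BYTES):
    """True if the zipped package is within the given size limit."""
    return zipped_size(directory) <= limit
```

A call such as `fits_lambda("build/package")` (a hypothetical path) would return `False` for a package like the 212 MB library described above unless it compresses to under the limit.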
  • 18. Better solution Requirements: ● Decrease the time it takes models to move from notebook to production ● Eliminate recoding of Spark feature pipelines and models between research and production, i.e. my PySpark model should be deployable into a Scala pipeline with zero or minimal code changes ● Serving/inference has to be super fast for Spark models
  • 19. Better solution Requirements technical analysis ● For model serving to be super fast, it was clear that inference needed to be outside the JVM and outside the realm of the Spark context ● Has to be completely vendor neutral, i.e. a model created on AWS should be deployable in a pipeline on GCP and vice versa ● True test of serialization/portability: can I zip the model and send it to my coworker in an email?
  • 20. Demo What you will see: ● Build a PySpark model on 100 GB of data in a Jupyter notebook ● Export the serialized model into JSON and protobuf formats (with the addition of just a few libraries) ● Load the serialized pipeline into a Docker container for near-real-time model serving through a Scala REST interface (takes about 50 ms for a model that Spark serves in 1.5 seconds)
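The last demo step, scoring against the containerised model over REST, might look roughly like the sketch below from a Python client. Everything specific here is an assumption: the feature names are hypothetical, and the JSON "leap frame" shape (a `schema` with `fields` plus `rows`) and the `http://localhost:65327/transform` endpoint follow the MLeap serving documentation as best recalled, so verify both against the MLeap version you deploy.

```python
import json
import urllib.request


def build_leap_frame(fields, rows):
    """Build a JSON-serialisable frame.

    fields: list of (name, type) pairs; rows: value lists in the same order.
    The schema/rows layout is an assumption based on MLeap's JSON frames.
    """
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("row length does not match schema")
    return {
        "schema": {"fields": [{"name": n, "type": t} for n, t in fields]},
        "rows": [list(r) for r in rows],
    }


def score(frame, url="http://localhost:65327/transform", timeout=2.0):
    """POST the frame to a serving endpoint (assumed URL) and parse the reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(frame).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical features for one user session; no network call is made here.
frame = build_leap_frame(
    [("pages_viewed", "double"), ("device", "string")],
    [[7.0, "mobile"]],
)
```

With a serving container running and a model loaded, `score(frame)` would return the transformed frame, including whatever prediction columns the pipeline adds. Keeping the client down to plain JSON over HTTP is what lets the caller stay independent of Spark.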
  • 21. Demo What you will see: ● How, using GDPR-compliant features, one gets 88% on ROC and 94% on precision-recall with a PySpark model
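The slide does not say how the two figures were computed. Assuming they are the area under the ROC curve and a summary of the precision-recall curve, the sketch below shows one plain-Python way to compute such metrics on a toy example. The labels and scores are made up, and average precision is only one of several ways to summarise a precision-recall curve.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Fraction of (positive, negative) pairs ranked in the right order.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_score, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Made-up labels (1 = buy, 0 = defer) and model scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)           # 7 of 9 pairs ordered correctly
ap = average_precision(labels, scores)  # (1 + 2/3 + 3/4) / 3
```

The pairwise AUC is quadratic in the number of examples, which is fine for illustration. On data at the scale described in the talk, a library evaluator would be the practical choice.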
  • 22. MLeap: First things first, MLeap was possible due to the good work of Hollin Wilkins and Mikhail Semeniuk
  • 27. MLeap available transformers Support for all available Spark transformers: ● Binarizer, BucketedRandomProjectionLSH, Bucketizer, ChiSqSelector ● CountVectorizer, DCT, ElementwiseProduct, HashingTermFrequency ● PCA, OneHotEncoder, NGram, QuantileDiscretizer, Tokenizer ● etc.
  • 28. MLeap available transformers Support for all available Spark classification transformers: DecisionTreeClassifier, GradientBoostedTreeClassifier, LogisticRegression, LogisticRegressionCv, NaiveBayesClassifier, OneVsRest, RandomForestClassifier, SupportVectorMachines, MultilayerPerceptron
  • 29. MLeap available transformers Support for all available Spark regression transformers: AFTSurvivalRegression, DecisionTreeRegression, GeneralizedLinearRegression, RandomForestRegression, etc.
  • 30. MLeap available transformers Support for all available clustering transformers: BisectingKMeans, KMeans, LDA, etc.
  • 31. MLeap unavailable transformers: ALS, but it is currently being worked on and should be available soon
  • 32. Conclusion: Dockerizing ML models instantly leverages decades of effort spent on productionizing software. You automatically get the benefits of scaling, reliability through CI/CD, fault tolerance, high availability, A/B testing, automation and a rich ecosystem of metrics, none of which the new data science discipline needs to reinvent.
  • 33. References ● MLeap: https://github.com/combust/mleap ● Neils talk at PyData 2017 about productionizing ML: https://www.youtube.com/watch?v=f3I0izerPvc