SlideShare a Scribd company logo
Subhojit Banerjee
DataScientist/DataEngineer
@subbubanerjee
https://medium.com/@subhojit20_27731
Deploying Large ML models and
scoring in near-real time @scale
Running the last mile of the Data Science journey
2
Big Data* is not easy
Gartner found just 14% of companies
surveyed had big data projects in production
in 2015 and unchanged from the year before
and slowly inching towards 15% in year 2016
*Big Data ~ Large unstructured and structured data
3
Big Data is not easy
● Cloudera: $261M in revenue, $187M in losses (down from $205M the
year before, the only company to narrow its loss)
● Hortonworks: $184M in revenue, $251M in losses (up from $180M the
year before)
● Alteryx: $85M in revenue, $24M in losses (up from $21M)
● Splunk: $950M in revenue, $355M in losses (up from $279M)
● Tableau: $827M in revenue, $144M in losses (up from $84M)
*Big Data ~ Large unstructured and structured data
4
Big Data is not easy
Gartner’s top obstacles for big data success were:
● Determining how to get value from Big data
● Most companies are not set up culturally or oganizationally to suceed, wishing themselves
agile and hoping data silos will dissappear “magically”
● Technical expertise is lacking in areas like agile model development and deployment to
provide quick turnarounds.
5
Big Data is not easy
This talk is to address the last problem -
“Democratising” large scale machine learning
model deployment and model scoring (in
near real time)
01
02
03
04
05
Business case
First solution
Problems faced
Better solution
Demo
06
07
Conclusion
Questions
7
Business use case
Is real time segmentation of the user into buy
vs defer clusters using GDPR compliant
features possible on the website?
8
Business use case
But first, we need to collect data to act on it in
real time
9
10
11
● ~23 million events per month
● ~140G of granular event data collected
● cost of 6 - 12 euro per day
●
12
Business use case
So now that we have the real time pipeline can
we train and score our ML model ?
13
First Solution
Aws lambda based serverless architecture
14
DEMO
15
First solution
16
Problem faced
The first model was a markov chain on the
sequence of webpages visited
17
Problems faced
● Had to hack core R libraries to bring down the 212 MB ML model
library to fit the 50MB compressed AWS Lambda restriction
● Every time front end would change, our models needs to change and
old data cannot be used to retrain the model.
18
Better solution
Requirements:
● Decrease the time for the models to move from notebook to production
● Has to eliminate recoding of Spark feature pipelines and models from research to production i.e
my pyspark model should be deployable into a Scala pipeline with zero or minimal code changes
● Serving/inference has to be superfast for spark models
19
Better solution
Requirements technical analysis
● For model serving to super fast, it was clear that inference needed to be outside the JVM and
outside realms of Spark context.
● Has to be completely vendor neutral i.e. create a model in AWS and should be able to deploy the
model in a pipeline on GCP and vice-versa.
● True test of serialization/portability: can I zip the model and send it to my coworker in an email
20
Demo
What will you see:
● Build a Pyspark model on 100 Gigs of Data in a Jupyter notebook.
● Export the serialized model into a JSON and protobuf format (with just
addition of a few libraries)
● Load the serialized pipeline into a docker container for near real model
serving using a REST scala interface. (takes about 50ms for a model that
spark serves in 1.5 seconds)
21
Demo
What will you see:
● Using GDPR compliant features how does one get 88% ROC and 94% on
precision recall using a Pyspark model
22
Mleap
First things first:
Mleap was possible due to the good work of
Hollin Wilkins and Mikhail Semeniuk
23
Mleap
First things first:
Mleap was possible due to the good work of
Hollin Wilkins and Mikhail Semeniuk
24
Mleap
25
Mleap
Random Forest pipeline
Random Forest
26
Mleap
27
Mleap available transformers
Support for all available Spark transformers:
● Binarizer,BucketedRandomLSH,Bucketizer,ChisqSelector,
● Vectorizer,DCT,ElementwiseProduct,HashingTermFrequency
● PCA, onehotencoder,Ngram,QuantileDiscretizer,Tokenizer
● etc
28
Mleap available transformers
Support for all available Spark Classification transformers:
DecisionTreeClassifier,GradientBoostedTreeClassifier,LogisticRegression,
LogisticRegressionCv,NaiveBayesClassifier,OnevsRest,
RandomForestClassifier,SupportVectorMachines,MultilayerPerceptron
29
Mleap available transformers
Support for all available Spark Regression transformers:
AFTSurvivalRegression,DecisionTreeRegression,GeneralizedLinearRegres
sion,RandomForestRegression etc
30
Mleap available transformers
Support for all available Clustering transformers:
BisectingKmeans,Kmean,LDA etc
31
Mleap unavailable transformers
ALS, but its being worked on currently. Soon to be available
32
Conclusion
Dockerizing ML models can instantly leverage decades of effort spent on
productionizing software engineering. You automatically get benefits of
scaling, reliability due to CI/CD, fault tolerance, high availability, A/B
testing, automation and a rich ecosystem of metrics that the new
datascience discipline need not reinvent.
33
References
● Mleap: https://github.com/combust/mleap
● Neils talk in Pydata 2017 about productionizing ML:
https://www.youtube.com/watch?v=f3I0izerPvc
34
Thank you!

More Related Content

What's hot

SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache Spark
Databricks
 
Scaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber AttacksScaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber Attacks
Databricks
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Databricks
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
Databricks
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
Josh Patterson
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Kiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with ScyllaKiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with Scylla
ScyllaDB
 
Function as a service
Function as a serviceFunction as a service
Function as a service
atomslide
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?
confluent
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
confluent
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
Databricks
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
Hao Wang
 
Apache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming Engine
Tianlun Zhang
 
Architecture Blue Print
Architecture Blue PrintArchitecture Blue Print
Architecture Blue Print
Bogdan Nedelcu
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
Vianney FOUCAULT
 
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate
Ido Shilon
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache Spark
Databricks
 
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
InfluxData
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache Pinot
Xiang Fu
 
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
Dataconomy Media
 

What's hot (20)

SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache Spark
 
Scaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber AttacksScaling ML-Based Threat Detection For Production Cyber Attacks
Scaling ML-Based Threat Detection For Production Cyber Attacks
 
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
Monitoring Half a Million ML Models, IoT Streaming Data, and Automated Qualit...
 
Willump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML InferenceWillump: Optimizing Feature Computation in ML Inference
Willump: Optimizing Feature Computation in ML Inference
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Kiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with ScyllaKiwi.com Reaches Cruising Altitude with Scylla
Kiwi.com Reaches Cruising Altitude with Scylla
 
Function as a service
Function as a serviceFunction as a service
Function as a service
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?
 
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron SchildkroutKafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
Kafka + Uber- The World’s Realtime Transit Infrastructure, Aaron Schildkrout
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
 
Spark at Airbnb
Spark at AirbnbSpark at Airbnb
Spark at Airbnb
 
Apache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming EngineApache Gearpump - Lightweight Real-time Streaming Engine
Apache Gearpump - Lightweight Real-time Streaming Engine
 
Architecture Blue Print
Architecture Blue PrintArchitecture Blue Print
Architecture Blue Print
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate BDX 2016 - Kevin lyons & yakir buskilla  @ eXelate
BDX 2016 - Kevin lyons & yakir buskilla @ eXelate
 
Filtering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache SparkFiltering vs Enriching Data in Apache Spark
Filtering vs Enriching Data in Apache Spark
 
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
How EnerKey Using InfluxDB Saves Customers Millions by Detecting Energy Usage...
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache Pinot
 
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...Kai Wähner, Technology Evangelist at Confluent: "Development of  Scalable Mac...
Kai Wähner, Technology Evangelist at Confluent: "Development of Scalable Mac...
 

Similar to Deploying spark ml models

Deploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real timeDeploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real time
subhojit banerjee
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
Lars Albertsson
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Edwin Poot
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
Etu Solution
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
Sundance Multiprocessor Technology Ltd.
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
Mostafa Majidpour
 
Real-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeReal-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil Dahlke
SingleStore
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
Gyula Fóra
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
Nicolas Poggi
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
GoDataDriven
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Embarcados
 
Two way data sync between legacy and your brand new micro-service architecture
 Two way data sync between legacy and your brand new micro-service architecture Two way data sync between legacy and your brand new micro-service architecture
Two way data sync between legacy and your brand new micro-service architecture
bleporini
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
Itzhak Kameli
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
Stratebi
 
Apache Kylin Meetup: Berlin - With OLX Group
Apache Kylin Meetup: Berlin - With OLX GroupApache Kylin Meetup: Berlin - With OLX Group
Apache Kylin Meetup: Berlin - With OLX Group
Tyler Wishnoff
 
Apache kylin meetup berlin olx v1.0
Apache kylin meetup berlin olx v1.0Apache kylin meetup berlin olx v1.0
Apache kylin meetup berlin olx v1.0
ssuser931288
 
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsGraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
Nicola Molinari
 
ICLR 2020 Recap
ICLR 2020 RecapICLR 2020 Recap
ICLR 2020 Recap
Sri Ambati
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
Stavros Kontopoulos
 

Similar to Deploying spark ml models (20)

Deploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real timeDeploying Large Spark Models to production and model scoring in near real time
Deploying Large Spark Models to production and model scoring in near real time
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Real-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil DahlkeReal-Time, Geospatial, Maps by Neil Dahlke
Real-Time, Geospatial, Maps by Neil Dahlke
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop EcosystemLarge-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
 
Two way data sync between legacy and your brand new micro-service architecture
 Two way data sync between legacy and your brand new micro-service architecture Two way data sync between legacy and your brand new micro-service architecture
Two way data sync between legacy and your brand new micro-service architecture
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Apache Kylin Meetup: Berlin - With OLX Group
Apache Kylin Meetup: Berlin - With OLX GroupApache Kylin Meetup: Berlin - With OLX Group
Apache Kylin Meetup: Berlin - With OLX Group
 
Apache kylin meetup berlin olx v1.0
Apache kylin meetup berlin olx v1.0Apache kylin meetup berlin olx v1.0
Apache kylin meetup berlin olx v1.0
 
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsGraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
 
ICLR 2020 Recap
ICLR 2020 RecapICLR 2020 Recap
ICLR 2020 Recap
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 

Recently uploaded

PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 

Recently uploaded (20)

PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 

Deploying spark ml models

  • 1. Subhojit Banerjee DataScientist/DataEngineer @subbubanerjee https://medium.com/@subhojit20_27731 Deploying Large ML models and scoring in near-real time @scale Running the last mile of the Data Science journey
  • 2. 2 Big Data* is not easy Gartner found just 14% of companies surveyed had big data projects in production in 2015 and unchanged from the year before and slowly inching towards 15% in year 2016 *Big Data ~ Large unstructured and structured data
  • 3. 3 Big Data is not easy ● Cloudera: $261M in revenue, $187M in losses (down from $205M the year before, the only company to narrow its loss) ● Hortonworks: $184M in revenue, $251M in losses (up from $180M the year before) ● Alteryx: $85M in revenue, $24M in losses (up from $21M) ● Splunk: $950M in revenue, $355M in losses (up from $279M) ● Tableau: $827M in revenue, $144M in losses (up from $84M) *Big Data ~ Large unstructured and structured data
  • 4. 4 Big Data is not easy Gartner’s top obstacles for big data success were: ● Determining how to get value from Big data ● Most companies are not set up culturally or oganizationally to suceed, wishing themselves agile and hoping data silos will dissappear “magically” ● Technical expertise is lacking in areas like agile model development and deployment to provide quick turnarounds.
  • 5. 5 Big Data is not easy This talk is to address the last problem - “Democratising” large scale machine learning model deployment and model scoring (in near real time)
  • 6. 01 02 03 04 05 Business case First solution Problems faced Better solution Demo 06 07 Conclusion Questions
  • 7. 7 Business use case Is real time segmentation of the user into buy vs defer clusters using GDPR compliant features possible on the website?
  • 8. 8 Business use case But first, we need to collect data to act on it in real time
  • 9. 9
  • 10. 10
  • 11. 11 ● ~23 million events per month ● ~140G of granular event data collected ● cost of 6 - 12 euro per day ●
  • 12. 12 Business use case So now that we have the real time pipeline can we train and score our ML model ?
  • 13. 13 First Solution Aws lambda based serverless architecture
  • 16. 16 Problem faced The first model was a markov chain on the sequence of webpages visited
  • 17. 17 Problems faced ● Had to hack core R libraries to bring down the 212 MB ML model library to fit the 50MB compressed AWS Lambda restriction ● Every time front end would change, our models needs to change and old data cannot be used to retrain the model.
  • 18. 18 Better solution Requirements: ● Decrease the time for the models to move from notebook to production ● Has to eliminate recoding of Spark feature pipelines and models from research to production i.e my pyspark model should be deployable into a Scala pipeline with zero or minimal code changes ● Serving/inference has to be superfast for spark models
  • 19. 19 Better solution Requirements technical analysis ● For model serving to super fast, it was clear that inference needed to be outside the JVM and outside realms of Spark context. ● Has to be completely vendor neutral i.e. create a model in AWS and should be able to deploy the model in a pipeline on GCP and vice-versa. ● True test of serialization/portability: can I zip the model and send it to my coworker in an email
  • 20. 20 Demo What will you see: ● Build a Pyspark model on 100 Gigs of Data in a Jupyter notebook. ● Export the serialized model into a JSON and protobuf format (with just addition of a few libraries) ● Load the serialized pipeline into a docker container for near real model serving using a REST scala interface. (takes about 50ms for a model that spark serves in 1.5 seconds)
  • 21. 21 Demo What will you see: ● Using GDPR compliant features how does one get 88% ROC and 94% on precision recall using a Pyspark model
  • 22. 22 Mleap First things first: Mleap was possible due to the good work of Hollin Wilkins and Mikhail Semeniuk
  • 23. 23 Mleap First things first: Mleap was possible due to the good work of Hollin Wilkins and Mikhail Semeniuk
  • 27. 27 Mleap available transformers Support for all available Spark transformers: ● Binarizer,BucketedRandomLSH,Bucketizer,ChisqSelector, ● Vectorizer,DCT,ElementwiseProduct,HashingTermFrequency ● PCA, onehotencoder,Ngram,QuantileDiscretizer,Tokenizer ● etc
  • 28. 28 Mleap available transformers Support for all available Spark Classification transformers: DecisionTreeClassifier,GradientBoostedTreeClassifier,LogisticRegression, LogisticRegressionCv,NaiveBayesClassifier,OnevsRest, RandomForestClassifier,SupportVectorMachines,MultilayerPerceptron
  • 29. 29 Mleap available transformers Support for all available Spark Regression transformers: AFTSurvivalRegression,DecisionTreeRegression,GeneralizedLinearRegres sion,RandomForestRegression etc
  • 30. 30 Mleap available transformers Support for all available Clustering transformers: BisectingKmeans,Kmean,LDA etc
  • 31. 31 Mleap unavailable transformers ALS, but its being worked on currently. Soon to be available
  • 32. 32 Conclusion Dockerizing ML models can instantly leverage decades of effort spent on productionizing software engineering. You automatically get benefits of scaling, reliability due to CI/CD, fault tolerance, high availability, A/B testing, automation and a rich ecosystem of metrics that the new datascience discipline need not reinvent.
  • 33. 33 References ● Mleap: https://github.com/combust/mleap ● Neils talk in Pydata 2017 about productionizing ML: https://www.youtube.com/watch?v=f3I0izerPvc