This talk was recorded in London on October 30th, 2018 and can be viewed here: https://youtu.be/CeOJFynB6BE
Real-Time AI: Designing for Low Latency and High Throughput
Bio: Dr. Sergei Izrailev is Chief Data Scientist at Beeswax, where he is responsible for data strategy and building AI applications powering the next generation of real-time bidding technology. Before Beeswax, Sergei led data science teams at Integral Ad Science and Collective, where he focused on architecture, development, and scaling of data science-based advertising technology products. Prior to advertising, Sergei was a quant/trader and developed trading strategies and portfolio optimization methodologies. Previously, he worked as a senior scientist at Johnson & Johnson, where he developed intelligent tools for structure-based drug discovery.
4. • One prediction at a time: request - response
• Limited time to retrieve the prediction (latency)
• Minimum number of predictions per unit time (throughput)
Defining “Real Time”
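The request-response pattern above can be sketched as a tiny scoring handler that times itself against a latency budget. This is an illustrative stand-in, not the speaker's system: the `predict` function and the 5 ms budget are placeholders.

```python
import time

# Hypothetical stand-in for a trained model's predict function.
def predict(features):
    return 0.5 * features["x"] + 0.1

def handle_request(features, latency_budget_ms=5.0):
    """Request-response scoring: one prediction per request,
    checked against a per-request latency budget."""
    start = time.perf_counter()
    score = predict(features)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return {
        "score": score,
        "latency_ms": elapsed_ms,
        "within_budget": elapsed_ms <= latency_budget_ms,
    }

response = handle_request({"x": 1.0})
```

Throughput then follows from how many such requests the system can serve per second across all workers, which is why both numbers are part of the "real time" definition.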
5. • Loan officer or financial advisor
• ideally within a few seconds, but waiting longer is OK
• Online travel site
• a couple of seconds overall response time
• thousands of predictions; sub-millisecond time per prediction
• Ad buying platform
• 5-10 millisecond response time
• High-frequency trading
• sub-microsecond response time
Scale of Real Time: End-User Experience
12. Trade-Offs of Batch and Real-Time Scoring
Consideration                             | Batch  | Real-Time | Hybrid
ML packages and languages                 | Any    | Some      | Both
ML algorithms and feature transformations | Any    | Some      | Both
Using complex features                    | Yes    | No        | Yes
Predictions for every request             | No     | Yes       | Yes
Combinatorial input dimensions            | No     | Yes       | Yes
System complexity                         | Medium | Medium    | High
Accuracy (depends)                        | Medium | Medium    | High
13. • Data only: linear coefficients, PMML, etc.
• scoring code is independent
• real-time: scoring code can be in C++ or Java for speed
• batch: any scoring engine can be used (even SQL)
• Code + Data: Serialized objects - Pickle, Spark, R
• reuse generated code in the same framework
• primarily good for batch
• May work in a real-time service, latency and throughput permitting
Model Deployment Options
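The "data only" option above can be illustrated with a minimal sketch: the model artifact is nothing but coefficients, and the scoring code is written independently of the training framework. The coefficient values and feature names here are hypothetical.

```python
import math

# Hypothetical coefficients exported from a trained logistic regression.
# In a data-only deployment, this dictionary is the entire model artifact.
MODEL = {
    "intercept": -1.2,
    "coefficients": {"age": 0.03, "income": 0.00001},
}

def score(features, model=MODEL):
    """Framework-independent scoring: a dot product plus a sigmoid.
    The same logic is easy to port to C++ or Java for speed, or even SQL."""
    z = model["intercept"] + sum(
        coef * features.get(name, 0.0)
        for name, coef in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))

p = score({"age": 40, "income": 50000})
```

Because only data crosses the boundary, the batch and real-time paths can use completely different scoring engines while producing identical predictions.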
14. • Code + Data: H2O's POJO and MOJO
• Generated code is used in a different environment
• Batch: load a generated jar in Spark as a UDF
• Real-time: load a generated jar in the real-time system (or wrap it into a REST service)
• MOJO 2: includes feature transformations
Model Deployment Options (continued)
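The "wrap it into a REST service" option can be sketched with the standard library alone. The `predict` function below is a hypothetical stand-in for a loaded MOJO/POJO model; a production service would add batching, monitoring, and a proper web framework.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread

# Hypothetical stand-in for a loaded model's scoring call.
def predict(features):
    return {"probability": 0.5 * features.get("x", 0.0)}

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

# Bind an ephemeral port and serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), ScoringHandler)
Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Client side: POST a feature vector, get a prediction back.
req = urllib.request.Request(
    f"http://127.0.0.1:{port}/score",
    data=json.dumps({"x": 2.0}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

The trade-off versus loading the jar in-process is an extra network hop, which matters when the latency budget is single-digit milliseconds.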
17. ML Use Case: Campaign Optimization
[Diagram: Event Probability, Value of Event, and Real-Time Metrics feed into an Optimization Algorithm, which outputs a Bid Price]
Example: spend the whole budget evenly over a given time period, while maximizing the number of events
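The even-pacing example can be sketched as a simple heuristic: compare actual spend with the spend expected under even delivery, and scale the bid accordingly. This is a toy illustration, not Beeswax's actual optimization algorithm; all names and clamp values are assumptions.

```python
def pacing_multiplier(spent, budget, elapsed, duration):
    """Even-pacing heuristic: compare actual spend with the spend
    expected if the budget were delivered evenly over the period."""
    target_spent = budget * (elapsed / duration)
    if target_spent <= 0:
        return 1.0
    # Ahead of schedule -> bid less; behind schedule -> bid more.
    ratio = target_spent / max(spent, 1e-9)
    return min(max(ratio, 0.1), 10.0)  # clamp to keep bids sane

def bid_price(event_probability, event_value,
              spent, budget, elapsed, duration):
    """Base bid = expected value of the event (probability x value),
    then adjusted by the pacing multiplier."""
    base_bid = event_probability * event_value
    return base_bid * pacing_multiplier(spent, budget, elapsed, duration)
```

For instance, halfway through the period with 40% of the budget spent, the multiplier is 1.25, nudging bids up to catch up with the even-delivery schedule.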
18. • Inputs: billions of records
• Latency: 5 ms
• Throughput: 100K requests per second
• Production stack: Python, C++, and Java
Real-Time Constraints
19. • Training pipelines: pyspark
• Scoring type: batch predictions + real-time cache
• ML training engine: H2O Driverless AI
• Infrastructure on AWS: “h2ostart”, “h2ostop”
• Transformations and ML scoring engine: pysparkling + mojo 2
Machine Learning Setup
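The "batch predictions + real-time cache" scoring type above can be sketched as: a batch job scores records and writes the results into a key-value store, and the real-time path is a fast lookup with a fallback default. A plain dict stands in for the cache; the model and keys are hypothetical.

```python
# Hypothetical trained model: maps a record to a score.
model = lambda record: 0.1 * record["clicks"]

def batch_score(records, model):
    """Batch job: precompute a prediction for every known key."""
    return {r["key"]: model(r) for r in records}

def realtime_lookup(cache, key, default=0.0):
    """Real-time path: no model evaluation, just a cache read,
    with a default for keys the batch job has not seen."""
    return cache.get(key, default)

cache = batch_score(
    [{"key": "campaign-1", "clicks": 3}, {"key": "campaign-2", "clicks": 7}],
    model,
)
```

The lookup easily meets a millisecond-scale budget, at the cost of predictions being only as fresh as the last batch run and only covering precomputed key combinations.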
20. • Manual feature engineering is very time consuming
• And there are higher value activities for data scientists
• Mojo 2: both feature transformations and prediction engines
• Option to switch to real-time scoring later
Driverless AI vs Other Options
21. • Provides an auto-pilot
• Someone still has to fuel, service, take off, and land the plane
• Still need to experiment, but setting it up is easy
• Needed a reasonably large machine
• p2.8xlarge: 8 GPUs, 32 vCPUs, 488 GB RAM
• Other constraints: mojo2 supports only XGBoost and GLM
Practicalities of Driverless AI
22. • Accuracy levels off, while complexity continues to increase
• Complexity leads to larger mojos (increased memory requirements) and slower scoring (increased CPU requirements)
• On AWS, literally, time IS money, so complexity = higher costs
Trade-Off: Accuracy vs Complexity
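The "time is money" point reduces to simple arithmetic: slower scoring means more instance-hours for the same workload. The instance price and throughput figures below are hypothetical, purely to show the shape of the calculation.

```python
def cost_per_million(instance_price_per_hour, predictions_per_second):
    """Back-of-the-envelope serving cost: the same workload on a
    slower (more complex) model needs proportionally more instance-hours."""
    predictions_per_hour = predictions_per_second * 3600
    return instance_price_per_hour / predictions_per_hour * 1_000_000

# Hypothetical numbers: same instance price, 5x slower scoring
# for the more complex model => 5x the cost per million predictions.
simple_model = cost_per_million(7.2, 100_000)
complex_model = cost_per_million(7.2, 20_000)
```

This is why the knee of the accuracy-vs-complexity curve matters: past the point where accuracy levels off, extra complexity buys only cost.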
24. • Define what "real time" means for your application
• The most important choice: batch or real-time predictions
• Driverless AI helps solve the feature engineering problem
• Mojo 2 includes feature transformation code ready for real-time applications
Takeaways
25. Yes, we are hiring…
sergei@beeswax.com
Questions?