Predictive Models at Scale

Predictive Models at Scale using Dumbo
Nikhil Ketkar

Motivation: Problem Space @ Indix
● 40k+ Brands
● 600k+ Sellers
● 700+ Million Products
● 7k+ Categories
● 10k+ Attributes

Developing Predictive Models
[Diagram: Unlabelled Data → Sample → Hand Label → Model → Predict → Data with Predicted Labels]

Predictive Models at Scale
[Diagram: HDFS and six Statistical Model boxes]

The Two Giants
PyData Ecosystem (Model)
● Native: C/C++, Fortran
● NumPy
● SciPy, Pandas, Matplotlib
● scikit-learn, scikit-image, statsmodels
Hadoop Ecosystem (Predict)
● JVM
● Java/Scala
● HDFS, Hadoop MapReduce
● Cascading/Scalding

The Standard Options
● Port to Java/Scala and use as a Library in the Mapper
○ Time Consuming
○ Need to port parts of the PyData Stack
○ Reduced Velocity
○ Error prone
● Write a REST API/Service for the model and call it from the Mapper
○ Slow due to Network Latency
○ Deployment is a nightmare
● Use Disco

Can we do better?
● Hadoop Streaming with Typedbytes Support
● Python Wrappers over Hadoop Streaming
○ Dumbo
○ MRJob
○ Hadoopy
○ Pydoop
Reference: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
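To make the wrapper idea concrete, here is a minimal sketch of a Dumbo-style mapper that applies a Python model to each record. The mapper is written as a class so the model is loaded once per map task rather than once per record. `ToyModel`, its `predict` rule and the label names are hypothetical stand-ins, not the model from the talk; in practice a trained scikit-learn model would be loaded instead. The hand-off to Dumbo is shown only as a comment and assumes Dumbo's `dumbo.run` entry point.

```python
class ToyModel(object):
    """Hypothetical stand-in for a trained model loaded from disk."""

    def predict(self, title):
        # Placeholder rule: flag titles containing a token that mixes
        # letters and digits. A real model would replace this.
        for token in title.split():
            has_digit = any(c.isdigit() for c in token)
            has_alpha = any(c.isalpha() for c in token)
            if has_digit and has_alpha:
                return "has_mpn"
        return "no_mpn"


class PredictMapper(object):
    def __init__(self):
        # Runs once per mapper instance, so the model is loaded once.
        self.model = ToyModel()

    def __call__(self, key, value):
        # key: record id or offset; value: one product title.
        yield value, self.model.predict(value)


# With Dumbo installed, the script would typically end with something like:
#     import dumbo
#     dumbo.run(PredictMapper)

# Local check without Hadoop: call the mapper directly.
mapper = PredictMapper()
out = list(mapper(0, "Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame"))
```

Because the mapper is a plain Python callable, it can be exercised locally like this before being submitted as a streaming job.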

Two Minute MapReduce Refresher
Reference: https://tarnbarford.net/journal/mapreduce-on-mongo
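As a quick illustration of the refresher, the sketch below simulates the map, shuffle and reduce phases in a single Python process using the classic word-count example. The helper `run_local` and the sample input are assumptions made for illustration; a real job would leave the shuffle to Hadoop.

```python
from collections import defaultdict


def mapper(key, value):
    # Emit (word, 1) for every word in the input line.
    for word in value.split():
        yield word.lower(), 1


def reducer(key, values):
    # Sum all counts emitted for one word.
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Simulate map -> shuffle -> reduce in a single process."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    results = []
    for out_key in sorted(groups):  # reduce each group, in key order
        results.extend(reducer(out_key, groups[out_key]))
    return results


lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The end")]
counts = dict(run_local(lines, mapper, reducer))
```

The mapper and reducer only ever see one record or one key group at a time, which is what lets the framework spread them across many machines.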

Sample Problem: Extract MPN from Product Titles
● 0.5 Billion Product Titles
● Many contain MPNs
● Humans can detect MPNs
● Can a model do the same?
● Use CRF on Full Title
● Use RF on Tokens
MPNs in Product Titles
Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame Corner Rosette from Mirrorscapes 000 Series Set of 4
Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet with Porcelain Lever Handle
Newport Brass 3 447/ORB Oil Rubbed Bronze Hand Relieved Diverter / Volume Control Handle from the Metropole Collection
Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25
Sterling 7214120 Ensemble 0" x 30" Shower Receptor with Right hand Drain Pack 6
U12 23252 KUB QUATRON INDX DRILL
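The token-level approach can be sketched as a per-token feature extractor, assuming "RF on Tokens" means classifying each token of a title as MPN or not with a random forest. The feature names below are hypothetical choices for illustration, not the features used in the talk; the resulting rows could be fed to any tabular classifier.

```python
def token_features(tokens, i):
    """Build a simple feature dict for the token at position i."""
    token = tokens[i]
    return {
        "position": i,                                       # index within the title
        "length": len(token),
        "has_digit": any(c.isdigit() for c in token),
        "has_alpha": any(c.isalpha() for c in token),
        "all_upper": token.isupper(),
        "has_symbol": any(not c.isalnum() for c in token),   # e.g. "/" or '"'
    }


title = 'Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25'
tokens = title.split()
rows = [token_features(tokens, i) for i in range(len(tokens))]
```

In this title the token `HCFC2044B` stands out as upper case and mixing letters and digits, which is the kind of signal a token classifier could pick up.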

Important Learnings
● Dumbo is Fairly Stable, Mature and Ready for Production
● Gets the two giants working together!
● Found just one issue over 6 months of usage (patch submitted)
● Support for Typedbytes is critical when making predictions over binary data (images etc.)
