Predictive Models at Scale using Dumbo
Nikhil Ketkar
Motivation: Problem Space @ Indix
● 40k+ Brands
● 600k+ Sellers
● 700+ Million Products
● 7k+ Categories
● 10k+ Attributes
Developing Predictive Models
Unlabelled Data → Sample → Hand Label → Model → Predict → Data with Predicted Labels
Predictive Models at Scale
[Diagram: HDFS alongside six "Statistical Model" boxes]
The Two Giants
PyData Ecosystem (Model)
● Native: C/C++, Fortran
● NumPy
● SciPy, Pandas, Matplotlib
● scikit-learn, scikit-image, statsmodels
Hadoop Ecosystem (Predict)
● JVM: Java/Scala
● HDFS, Hadoop MapReduce
● Cascading/Scalding
The Standard Options
● Port to Java/Scala and use as a library in the Mapper
○ Time Consuming
○ Need to port parts of the PyData Stack
○ Reduced Velocity
○ Error prone
● Write a REST API/Service for the model and call it from the Mapper
○ Slow due to Network Latency
○ Deployment is a nightmare
● Use Disco
Can we do better?
● Hadoop Streaming with Typedbytes Support
● Python Wrappers over Hadoop Streaming
○ Dumbo
○ MRJob
○ Hadoopy
○ Pydoop
Reference: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Two Minute MapReduce Refresher
Reference: https://tarnbarford.net/journal/mapreduce-on-mongo
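As a refresher in code, the map, shuffle, and reduce phases can be simulated in a single Python process. This is a minimal sketch, not part of the original slides: `run_local` is a hypothetical helper, and the sample titles are shortened for illustration. The `mapper(key, value)` and `reducer(key, values)` generator shapes follow the convention Dumbo uses.

```python
from collections import defaultdict


def mapper(key, value):
    # key: record id (unused here); value: one line of text.
    for word in value.split():
        yield word.lower(), 1


def reducer(key, values):
    # values: every count emitted for this key during the map phase.
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Hypothetical helper: simulate map -> shuffle -> reduce locally."""
    groups = defaultdict(list)
    for key, value in records:                      # map phase
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)       # shuffle: group by key
    results = []
    for out_key in sorted(groups):                  # reduce phase
        results.extend(reducer(out_key, groups[out_key]))
    return results


titles = [
    (0, "Polished Chrome Faucet"),
    (1, "Brushed Nickel Faucet"),
    (2, "Polished Chrome Handle"),
]
counts = dict(run_local(titles, mapper, reducer))
```

On a cluster, the same two functions would typically be handed to Dumbo with `dumbo.run(mapper, reducer)` instead of the local helper.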
Sample Problem: Extract MPN from Product Titles
● 0.5 Billion Product Titles
● Many contain MPNs
● Humans can detect MPNs
● Can a model do the same?
● Use CRF on Full Title
● Use RF on Tokens
MPNs in Product Titles:
Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame Corner Rosette from Mirrorscapes 000 Series Set of 4
Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet with Porcelain Lever Handle
Newport Brass 3 447/ORB Oil Rubbed Bronze Hand Relieved Diverter / Volume Control Handle from the Metropole Collection
Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25
Sterling 7214120 Ensemble 0" x 30" Shower Receptor with Right hand Drain Pack 6
U12 23252 KUB QUATRON INDX DRILL
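To make the "RF on Tokens" idea concrete, each token of a title can be turned into a small feature vector that a random forest could classify as MPN or not. The sketch below is an assumption for illustration only: the function name `token_features` and the specific features are hypothetical and are not taken from the talk.

```python
import re


def token_features(tokens, i):
    """Hypothetical per-token features for an MPN / not-MPN classifier."""
    tok = tokens[i]
    digits = sum(c.isdigit() for c in tok)
    return {
        "position": i,                        # MPNs often appear early in a title
        "length": len(tok),
        "has_digit": digits > 0,
        "has_alpha": any(c.isalpha() for c in tok),
        "is_upper": tok.isupper(),
        "digit_ratio": digits / len(tok),     # tokens from split() are never empty
        "has_separator": bool(re.search(r"[/\-.]", tok)),
    }


title = "Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame"
tokens = title.split()
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Here the second token, which looks like a part number, mixes upper-case letters and digits, while the brand name before it does not. A trained model would learn from signals of this kind rather than from hand-written rules.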
Code Walkthrough
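As an illustrative stand-in (not the author's code), a prediction job can be sketched as a mapper class that holds a model and applies it to every record. Everything named here is an assumption: `ThresholdModel` is a toy replacement for a trained scikit-learn-style model with a `predict` method, and `PredictMapper` is a hypothetical class written in the callable-mapper style that Dumbo accepts.

```python
class ThresholdModel:
    """Toy stand-in for a trained model exposing a predict(rows) method."""

    def predict(self, rows):
        # Flag a token as an MPN when more than 40% of its characters are digits.
        return [1 if row[0] > 0.4 else 0 for row in rows]


class PredictMapper:
    """Hypothetical mapper: build once per task, then call once per record."""

    def __init__(self, model):
        # In a real job the model would be loaded here from a file shipped
        # with the job, so the load cost is paid once rather than per record.
        self.model = model

    def __call__(self, key, value):
        for token in value.split():
            digit_ratio = sum(c.isdigit() for c in token) / len(token)
            label = self.model.predict([[digit_ratio]])[0]
            if label == 1:
                yield key, token


mapper = PredictMapper(ThresholdModel())
hits = list(mapper("title-1", "Bosch HCFC2044B SDS Plus X5L"))
```

Keeping the model as state on the mapper object is the design point: the native PyData model runs inside the Python process that Hadoop Streaming starts, so no port to the JVM and no network call is needed.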
Important Learnings
● Dumbo is fairly stable, mature, and ready for production
● Gets the two giants working together!
● Found just one issue over 6 months of usage (patch submitted)
● Typedbytes support is critical when making predictions over binary data (images etc.)
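The reason typedbytes matters for binary data is that plain-text streaming splits records on newlines and tabs, which raw bytes may contain. Typedbytes instead prefixes each value with a type code and, for variable-length values, a length. The following is a minimal sketch under stated assumptions: it covers only two types, using the codes documented for Hadoop's typedbytes package (0 for raw bytes, 7 for UTF-8 strings, each followed by a 32-bit big-endian length), and the function names are hypothetical. Verify the codes against the Hadoop documentation before relying on them.

```python
import struct

TYPE_BYTES = 0    # assumed typedbytes code for a raw byte sequence
TYPE_STRING = 7   # assumed typedbytes code for a UTF-8 string


def encode_bytes(payload):
    # 1-byte type code + 4-byte big-endian length + the raw payload.
    return struct.pack(">bi", TYPE_BYTES, len(payload)) + payload


def encode_string(text):
    data = text.encode("utf-8")
    return struct.pack(">bi", TYPE_STRING, len(data)) + data


def decode(buf, offset=0):
    """Decode one value; return (value, offset of the next value)."""
    code, length = struct.unpack_from(">bi", buf, offset)
    start = offset + 5
    raw = buf[start:start + length]
    if code == TYPE_STRING:
        return raw.decode("utf-8"), start + length
    if code == TYPE_BYTES:
        return raw, start + length
    raise ValueError("type code %d not handled in this sketch" % code)


# A key/value record whose value contains newline bytes.
image = b"\x89PNG\r\n\x1a\n\x00\n"
record = encode_string("img-1") + encode_bytes(image)
key, pos = decode(record)
value, end = decode(record, pos)
```

The payload round-trips intact even though it contains newline bytes. In a real job, an existing library such as the `typedbytes` Python package would be used rather than hand-written encoding.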
