These are slides presented at MLconf in San Francisco, November 14, 2014. I share the approach to real-time machine learning for recommender systems developed at if(we). We achieve rapid iterative cycles by adhering to a strict approach to structuring and accessing our data, as well as to building the online features that comprise our models. These developments support teams of data scientists and data engineers, who work together to solve complex recommendation problems. We also introduce the Antelope Realtime Events framework, an open-source demonstration application derived from our scalable proprietary software stack.
3. 1. Gain understanding of machine learning
2. Gain understanding of the product usage
3. See opportunity to make the product better
4. Create training data
5. Train predictive models
6. Put models in production
7. See improvements
5. 1. Gain understanding of machine learning
2. Gain understanding of the product usage
3. See opportunity to make the product better
4. Pull records from database to create interesting
features (usually aggregates)
5. Train predictive models
6. Go implement models for production
7. See improvements
6. 1. Gain understanding of machine learning
2. Gain understanding of the product usage
3. See opportunity to make the product better
4. Pull records from database to create interesting
features (usually aggregates)
5. Train predictive models
6. Go implement models for production
7. See improvements
3-6 months
7. 1. Gain understanding of machine learning
2. Gain understanding of the product usage
3. See opportunity to make the product better
4. Pull records from database to create interesting
features (usually aggregates)
5. Train predictive models
6. Go implement models for production
7. See improvements
Cool!
Was it worth it?
8. • Profitable startup actively pursuing big
opportunities in social apps
• Millions of users of existing brands
• Thousands of social contacts per second
10. Tagged dating feature
• >10 million candidates
to select from
• >1000 updates/sec
• Must be responsive to
current activity
• Users expect instant
query results
12. • Data scientist hands model description to
software engineer
• May need to translate features from SQL to Java
• Aggregate features require batch processing
• May need to adjust features and model to
achieve real-time updates
• Fast scoring requires high-performance in-memory
data structures
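The gap between batch aggregates and real-time features can be illustrated with a minimal sketch. It assumes a feature that is a per-key mean; the class name `OnlineAggregates` and the sample values are hypothetical and are not taken from the if(we) stack. The point is only that an aggregate kept in memory and updated per event gives the same answer as recomputing it in batch.

```python
from collections import defaultdict


class OnlineAggregates:
    """Hypothetical in-memory store: per-key count and sum, updated per event."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def update(self, key, value):
        # O(1) work per incoming event; no batch job needed.
        self._count[key] += 1
        self._total[key] += value

    def mean(self, key):
        n = self._count.get(key, 0)
        return self._total[key] / n if n else 0.0


def batch_mean(values):
    """The same aggregate, recomputed from all stored records."""
    return sum(values) / len(values) if values else 0.0


# Usage: feed values one at a time and compare with the batch result.
values = [2.0, 4.0, 6.0]
agg = OnlineAggregates()
for v in values:
    agg.update("user-1", v)
online_result = agg.mean("user-1")
batch_result = batch_mean(values)
```

Not every aggregate has such a simple incremental form, which is one reason features may need to be adjusted when moving to real-time updates.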
15. 4. Pull records from database to create interesting
features (usually aggregates)
5. Train predictive models
6. Go implement models for production
21. Bob registers
Alice registers
Alice updates profile
Bob opens app
Bob sees Alice in recommendations
Bob swipes yes on Alice
Alice receives push notification
Alice sees Bob swiped yes
Alice swipes yes
Alice sends message to Bob
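The timeline above can be written down as an ordered event log. The sketch below is illustrative only: the `Event` fields and the event type names (`Register`, `SwipeYes`, and so on) are assumptions made for this example, not the schema used at if(we) or in Antelope. It shows how derived state, here a mutual match, falls out of a single pass over the log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    ts: int                       # logical position in the timeline
    event_type: str               # hypothetical event type name
    actor: str                    # user the event is about
    target: Optional[str] = None  # other user involved, if any


TIMELINE = [
    Event(1, "Register", "bob"),
    Event(2, "Register", "alice"),
    Event(3, "ProfileUpdate", "alice"),
    Event(4, "AppOpen", "bob"),
    Event(5, "RecommendationShown", "bob", "alice"),
    Event(6, "SwipeYes", "bob", "alice"),
    Event(7, "PushNotification", "alice", "bob"),
    Event(8, "SwipeSeen", "alice", "bob"),
    Event(9, "SwipeYes", "alice", "bob"),
    Event(10, "Message", "alice", "bob"),
]


def mutual_matches(events):
    """Fold over the log; report when a yes-swipe is reciprocated."""
    likes = set()
    matches = []
    for e in sorted(events, key=lambda ev: ev.ts):
        if e.event_type == "SwipeYes":
            if (e.target, e.actor) in likes:
                matches.append((e.ts, frozenset((e.actor, e.target))))
            likes.add((e.actor, e.target))
    return matches


matches = mutual_matches(TIMELINE)
```

Replaying only a prefix of the log reproduces the state as it was at that moment, which is what makes the history usable for training.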
28. live demo
Kaggle competition
with Best Buy data
https://www.kaggle.com/c/acm-sf-chapter-hackathon-small
29. product update events
{
  "timestamp" : "2012-05-03 6:43:15",
  "eventType" : "ProductUpdate",
  "eventProperties" : {
    "sku" : "1032361",
    "regularPrice" : "19.99",
    "name" : "Need for Speed: Hot Pursuit",
    "description" : "Fasten your seatbelt and get ready to drive like your life depends on it...",
    ...
  }
}
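As a rough sketch of how such an event might be consumed, the Python below parses a trimmed copy of the event and applies it to an in-memory catalog. The function `apply_event` and the catalog layout are hypothetical, and the elided fields of the original event are simply left out; this is not the Antelope implementation.

```python
import json
from datetime import datetime

# Trimmed copy of the slide's event; elided fields are omitted.
RAW_EVENT = """
{
  "timestamp": "2012-05-03 6:43:15",
  "eventType": "ProductUpdate",
  "eventProperties": {
    "sku": "1032361",
    "regularPrice": "19.99",
    "name": "Need for Speed: Hot Pursuit"
  }
}
"""


def apply_event(catalog, event):
    """Update the in-memory catalog (a dict keyed by SKU) with one event."""
    if event["eventType"] != "ProductUpdate":
        return catalog  # other event types are ignored in this sketch
    props = event["eventProperties"]
    catalog[props["sku"]] = {
        "name": props["name"],
        "regular_price": float(props["regularPrice"]),
        "updated_at": datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S"),
    }
    return catalog


catalog = apply_event({}, json.loads(RAW_EVENT))
```

Because every value in the event arrives as a string, the sketch converts the price and timestamp at the point where state is updated.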
31. demo
Try it yourself, code and instructions at:
https://github.com/ifweco/antelope/blob/master/doc/demo.md
35. 1. Gain understanding of machine learning
2. Gain understanding of the product usage
3. See opportunity to make the product better
4. Create training data
5. Train predictive models
6. Put models in production
7. See improvements
Fast cycles!!
37. • All data in form of events – no exceptions!
• Roll through history to generate training examples
• Sample training data carefully to avoid feedback
• Model is static while features are live and personal
• Use interesting features with boring algorithms
• Expressiveness > performance > scalability
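"Roll through history to generate training examples" can be sketched as follows. This is a minimal illustration under stated assumptions: the `Click` event shape (with `shown` and `clicked` fields), the `State` class, and the single `prior_clicks` feature are all hypothetical and are not the Antelope API. The key detail is the ordering: features are read from the state as it stood before the event, and only then is the state updated, so no example sees its own outcome.

```python
# Hypothetical event history: (timestamp, event_type, properties)
HISTORY = [
    (1, "ProductUpdate", {"sku": "A"}),
    (2, "ProductUpdate", {"sku": "B"}),
    (3, "Click", {"shown": ["A", "B"], "clicked": "A"}),
    (4, "Click", {"shown": ["A", "B"], "clicked": "A"}),
    (5, "Click", {"shown": ["A", "B"], "clicked": "B"}),
]


class State:
    """Online feature state, advanced one event at a time."""

    def __init__(self):
        self.catalog = set()
        self.clicks = {}

    def update(self, event_type, props):
        if event_type == "ProductUpdate":
            self.catalog.add(props["sku"])
        elif event_type == "Click":
            sku = props["clicked"]
            self.clicks[sku] = self.clicks.get(sku, 0) + 1

    def features(self, sku):
        # The same function would serve live scoring in production.
        return {"prior_clicks": self.clicks.get(sku, 0)}


def training_examples(history):
    state = State()
    examples = []
    for ts, event_type, props in sorted(history, key=lambda e: e[0]):
        if event_type == "Click":
            # Read features BEFORE applying the event to the state.
            for sku in props["shown"]:
                label = 1 if sku == props["clicked"] else 0
                examples.append((ts, sku, state.features(sku), label))
        state.update(event_type, props)
    return examples


examples = training_examples(HISTORY)
```

Because training and serving share the same `features` function over the same kind of state, a model trained on these examples can stay fixed while the feature values keep changing with live events.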
github.com/ifwe/antelope
@jssmith