GDG-NYC (2018-06-04):
We are excited to welcome Shayak Banerjee from the Machine Learning team at Meetup, here to talk about how they are using machine learning to transform and simplify the way in which organizers and members connect around shared interests. (Psst. Meetup is hiring. If you are interested in open positions, catch Shayak over the breaks to follow up!)
8. Meetups By the Numbers
● 38 million members worldwide
● 320 thousand Meetup groups
● 192 countries meeting up
● 3+ million RSVPs every month
● 300 thousand Meetups every month
9. How Machine Learning adds value
● Surface relevance
○ Members see relevant groups
○ New groups get shown to relevant members
○ Members see relevant events
○ Highlight quality events
● Speed up manual tasks
○ New group approval process
10. Announcing New Groups
● The key problem:
○ New groups don’t have members and need promotion
○ Active member base looking for new communities to join
○ Meetup takes on the responsibility of initial promotion => first boost in membership
○ Promote via email
○ But don’t blast everybody
○ How do we find the relevant audience?
11. Announcing New Groups
● Organizer => Name + Location + Description + Topics
● Human + ML => Email announcement
Group Submitted => Policy Review => Age / Gender Filters => Topics Review => Group Approval => Identify Audience => Email
12. Announcing New Groups
● Primary outreach -> Email
● ~1M emails / day
● ML identifies recipients
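As a rough illustration of what "ML identifies recipients" could look like, the sketch below ranks members against a new group's topics and keeps the top k. The slides do not describe the actual model, so the Jaccard topic overlap used here is only a stand-in, and `select_recipients`, the user ids, and the topic names are all hypothetical.

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity between two topic sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_recipients(group_topics, user_topics, k):
    """Return up to k user ids whose topics best match the new group's topics."""
    # Stand-in scoring: the real ranking model is not described in the talk.
    scored = ((jaccard(group_topics, topics), user_id)
              for user_id, topics in user_topics.items())
    top = heapq.nlargest(k, scored)
    # Drop users with no overlap at all, so irrelevant members are never emailed.
    return [user_id for score, user_id in top if score > 0]

# Hypothetical members and topics, for illustration only.
users = {
    "u1": {"python", "machine-learning"},
    "u2": {"hiking", "photography"},
    "u3": {"machine-learning", "data-science", "python"},
}
recipients = select_recipients({"machine-learning", "python"}, users, k=2)
print(recipients)
```

Capping the list at k is one simple way to respect the "don't blast everybody" constraint from the earlier slide.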
13. ML Infrastructure - Pre 2017
[Architecture diagram. Components: “Interest” Server (in-memory maps of User’s Topics, User’s Groups, and others; Ranking Model), New Group Queue, Processing Cron, API Call, Ranked List of Users, Email Queue, New Group Email]
15. Challenge: Restricted Feature Complexity
16. Challenge: Trained offline, non-scheduled
[Architecture diagram: same components as slide 13, plus Offline Model Training]
17. Challenge: Training & Deployment Code Differences
19. Modernizing the ML Infrastructure
[Architecture diagram. Components: Data Lake, Features, New Group Queue, Processing Cron, Yarn Cluster (Worker 1, Worker 2), Distributed Ranking Model, Cache Results, Ranked List of Users, Email Queue, New Group Email]
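The general idea of spreading ranking work across workers can be sketched as follows: each worker scores one partition of users and keeps its local top k, and the partial results are merged at the end. This is only a toy, single-machine approximation that uses a thread pool in place of a Yarn cluster; `score`, `rank_users`, and the feature names are hypothetical and not taken from the talk.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def score(user_features, group_features):
    """Hypothetical stand-in for the ranking model: dot product over shared keys."""
    return sum(v * group_features.get(k, 0.0) for k, v in user_features.items())

def top_k_in_partition(partition, group_features, k):
    """Score one partition of users and keep only its k best (score, user_id) pairs."""
    return heapq.nlargest(
        k, ((score(f, group_features), uid) for uid, f in partition))

def rank_users(users, group_features, k, n_workers=2):
    """Split users across workers, then merge the per-partition winners."""
    items = list(users.items())
    partitions = [items[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(
            lambda p: top_k_in_partition(p, group_features, k), partitions)
        merged = heapq.nlargest(k, (pair for part in partials for pair in part))
    return [uid for _, uid in merged]

# Hypothetical feature vectors, for illustration only.
users = {
    "u1": {"tech": 0.9, "outdoors": 0.1},
    "u2": {"tech": 0.2, "outdoors": 0.8},
    "u3": {"tech": 0.7},
    "u4": {"outdoors": 0.5},
}
ranked = rank_users(users, {"tech": 1.0}, k=2)
print(ranked)
```

Because each partition only returns its own top k, the merge step handles at most `k * n_workers` candidates regardless of how many users were scored.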
20. Improvement: Richer Features
21. Improvement: Horizontal scalability
22. Improvement: Training + Deployment = Same Code
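One common way to get "same code" for training and deployment is to route both paths through a single feature-building function, so the two cannot drift apart. The sketch below assumes a simple linear scorer; `build_features`, the two features, and the weights are hypothetical and are not described in the slides.

```python
def build_features(user, group):
    """Single source of truth for features, used by both training and scoring."""
    shared = set(user["topics"]) & set(group["topics"])
    return [
        len(shared),                                     # topic overlap count
        1.0 if user["city"] == group["city"] else 0.0,   # same-city indicator
    ]

def make_training_rows(examples):
    """Training path: turn (user, group, joined) examples into (features, label) rows."""
    return [(build_features(u, g), joined) for u, g, joined in examples]

def score(user, group, weights):
    """Serving path: the same build_features call, followed by a linear score."""
    return sum(w * x for w, x in zip(weights, build_features(user, group)))

# Hypothetical records and weights, for illustration only.
user = {"topics": ["python", "ml"], "city": "NYC"}
group = {"topics": ["ml", "spark"], "city": "NYC"}
rows = make_training_rows([(user, group, 1)])
s = score(user, group, weights=[0.5, 2.0])
print(rows, s)
```

Any change to a feature definition then takes effect in training and serving together, which addresses the code-difference challenge listed on slide 17.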
23. How it Performed
Evaluation:
● 50/50 split test of which model was used to select who gets an email
Success measurement:
● Joins per Group
RESULT
● 30%+ Joins per Group
Other Learnings:
● Processing time < 24 hours ⇒ 24-48 hours
● Additional delay => manage expectations with organizers
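For reference, a relative lift in joins per group from a split test can be computed as below. The counts are made up purely to show the arithmetic for a 30% lift; they are not Meetup's data, and the function names are hypothetical.

```python
def joins_per_group(total_joins, n_groups):
    """Average number of joins attributed to each announced group."""
    return total_joins / n_groups

def relative_lift(control, treatment):
    """Relative change of the treatment metric over the control metric."""
    return (treatment - control) / control

# Illustrative numbers only, NOT results from the talk.
control = joins_per_group(total_joins=1000, n_groups=100)    # old model arm
treatment = joins_per_group(total_joins=1300, n_groups=100)  # new model arm
lift = relative_lift(control, treatment)
print(f"{lift:.0%}")
```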
24. Lessons Learned
● Keeping data in sync (rsync, sqoop, flume)
● Spark needs tuning
● Horizontal scaling not always an answer
● Airflow local testing
● Dealing with “random” failures
Needs Improvement
● Sampling
● Combining batch + online
● Sharing features
● Understanding model performance after deployment
● Predicting who’s going to “show up”