Eventbrite dataplatform and services - Interest graph based recommendations
2,104
views

Published in: Technology, Education

Transcript

  • 1. Data Platform and Services (Vipul Sharma and Eyal Reuveni)
  • 2. Agenda: Eventbrite, Data Products, Data Platform, Recommendations, Questions
  • 3.
      • A social event ticketing and discovery platform
      • 50 millionth ticket sold
      • Revenue doubled YoY
      • 180 employees in SoMa, SF
      • Solving significant engineering problems
          • Data, Infrastructure, Mobile, Web, Scale, Ops, QA
      • Firing on all cylinders and hiring blazing fast: www.eventbrite.com/jobs
  • 4. Data Products
  • 5. Analytics
      • Ad hoc queries by analysts
  • 6. Fraud and Spam
  • 7. Data Platform
  • 8. Hadoop Cluster
      • 30 persistent EC2 High-Memory instances
      • 30 TB disk with replication factor of 2, ext3 formatted
      • CDH3
      • Fair Scheduler
      • HBase
  • 9. Infrastructure
      • Search
          • Solr
          • Incremental updates, moving towards event-driven
      • Recommendation/Graph
          • Hadoop
          • Native Java MapReduce
          • Bash for workflow
      • Persistence
          • MySQL
          • HDFS
          • HBase
          • MongoDB (investigating Cassandra and Riak)
  • 10. Infrastructure
      • Stream
          • RabbitMQ
          • Internal firehose (investigating Kafka)
      • Offline
          • MapReduce
          • Streaming
          • Hive
          • Hue
  • 11. Infrastructure - Sqoozie
      • Workflow for MySQL imports to HDFS
          • Generate Sqoop commands
          • Run these imports in parallel
      • Transparent to schema changes
      • Include or exclude at the column, data type, or table level
      • Data type casting: tinyint(1) → Integer
      • Distributed table imports
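The Sqoozie slide describes generating Sqoop commands per table, casting tinyint(1) columns to Integer, and running the imports in parallel. Below is a minimal Python sketch of that idea, not Eventbrite's implementation. The table metadata, the JDBC URL, the `/imports` target directory and the exclusion set are all hypothetical; the Sqoop flags used (`--connect`, `--table`, `--columns`, `--target-dir`, `--map-column-java`) are standard `sqoop import` options.

```python
import shlex
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table metadata; a real workflow would read it from MySQL itself.
TABLES = {
    "orders": {"id": "int(11)", "event_id": "int(11)", "is_refunded": "tinyint(1)"},
    "events": {"id": "int(11)", "name": "varchar(255)", "is_public": "tinyint(1)"},
}
# Hypothetical column-level exclusion list.
EXCLUDED_COLUMNS = {("events", "name")}


def build_sqoop_command(table, columns, jdbc_url="jdbc:mysql://db.example.com/app"):
    """Build one `sqoop import` argument list for a table."""
    kept = [c for c in columns if (table, c) not in EXCLUDED_COLUMNS]
    # Cast tinyint(1) columns to Integer instead of the default Boolean mapping.
    casts = [f"{c}=Integer" for c in kept if columns[c] == "tinyint(1)"]
    cmd = ["sqoop", "import", "--connect", jdbc_url, "--table", table,
           "--columns", ",".join(kept), "--target-dir", f"/imports/{table}"]
    if casts:
        cmd += ["--map-column-java", ",".join(casts)]
    return cmd


def run_in_parallel(commands, runner=shlex.join, max_workers=4):
    """Apply `runner` to every command concurrently; the default is a dry run."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


commands = [build_sqoop_command(t, cols) for t, cols in sorted(TABLES.items())]
for line in run_in_parallel(commands):
    print(line)
```

Because the column list is rebuilt from metadata on every run, a newly added column is picked up without editing the workflow. To launch real imports, `runner` could be replaced with a function that passes each argument list to `subprocess.run`.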
  • 12. Infrastructure - Blammo
      • Raw logs are imported to HDFS via Flume
      • Almost real-time: 5 min latency
      • Logs are key-value pairs in JSON
      • Each log producer publishes its schema in YAML
      • Hive schema and schema YAML are kept in sync using Thrift
      • Control exclusion and inclusion
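One way to picture keeping a Hive schema in step with a producer's published schema is to derive the table definition from it. The sketch below does not use Thrift and is not Blammo's actual mechanism. It assumes the YAML has already been parsed into a dictionary, and the schema layout, the type mapping and the `/logs` location are all hypothetical.

```python
# Hypothetical parsed form of a producer's schema YAML.
PAGE_VIEW_SCHEMA = {
    "name": "page_view",
    "fields": {"user_id": "long", "url": "string", "ts": "timestamp",
               "debug_blob": "string"},
    "exclude": ["debug_blob"],  # fields kept out of the Hive table
}

# Assumed mapping from schema types to Hive column types.
HIVE_TYPES = {"long": "BIGINT", "int": "INT", "string": "STRING",
              "double": "DOUBLE", "timestamp": "TIMESTAMP"}


def hive_ddl(schema, location_root="/logs"):
    """Render a CREATE EXTERNAL TABLE statement from a parsed schema."""
    excluded = set(schema.get("exclude", []))
    columns = [f"  `{name}` {HIVE_TYPES[kind]}"
               for name, kind in schema["fields"].items()
               if name not in excluded]
    return (f"CREATE EXTERNAL TABLE IF NOT EXISTS {schema['name']} (\n"
            + ",\n".join(columns)
            + f"\n)\nLOCATION '{location_root}/{schema['name']}'")


print(hive_ddl(PAGE_VIEW_SCHEMA))
```

Reading JSON logs through such a table would also need a row format or SerDe clause, which is left out here because the slides do not say which one was used.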
  • 13. Recommendations
  • 14. You will like to attend this event
  • 15. Recommendation Engines
      • Interest Graph Based (Your friends who like rock music like you are attending Eric Clapton Event - Eventbrite)
      • Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK - Facebook, Linkedin)
      • Collaborative Filtering, Item-Item similarity (You like Godfather so you will like Scarface - Netflix)
      • Collaborative Filtering, User-User Similarity (People who bought camera also bought batteries - Amazon)
      • Item Hierarchy (You bought camera so you need batteries - Amazon)
  • 16. Why Interest?
      • Events are Social
      • Events are Interest
      • Dense Graph is Irrelevant
      • Interests are Changing
  • 17. How do we know your Interest?
      • We ask you
      • Based on your activity
          • Events Attended
          • Events Browsed
      • Facebook Interests
          • User Interest has to match Event category
          • Static
      • Machine Learning
          • Logistic Regression using maximum likelihood estimation (MLE)
          • Sparse Matrix is generated using MapReduce
          • A model for each interest
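The machine learning bullets above (logistic regression fitted by maximum likelihood, sparse inputs, one model per interest) can be illustrated with a small pure-Python sketch. It uses stochastic gradient ascent on the log-likelihood, which is one common way to compute the MLE and not necessarily the optimizer used in the talk. The feature indices, toy rows and labels are invented for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(model, row):
    """Probability that the user has the interest; `row` maps index -> value."""
    weights, bias = model
    return sigmoid(bias + sum(weights[j] * v for j, v in row.items()))


def train(rows, labels, n_features, lr=0.5, epochs=200):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            error = y - predict((weights, bias), row)  # d(log-likelihood)/dz
            bias += lr * error
            for j, v in row.items():  # only non-zero features are touched
                weights[j] += lr * error * v
    return weights, bias


# Hypothetical sparse rows: feature 0 = attended a concert, 1 = browsed a tech meetup.
ROWS = [{0: 1.0}, {0: 1.0, 1: 1.0}, {1: 1.0}, {}]
LABELS_BY_INTEREST = {"music": [1, 1, 0, 0], "tech": [0, 1, 1, 0]}

# One independent model per interest.
models = {interest: train(ROWS, labels, n_features=2)
          for interest, labels in LABELS_BY_INTEREST.items()}
print(round(predict(models["music"], {0: 1.0}), 3))
```

Storing each row as a dictionary of non-zero entries mirrors the sparse matrix mentioned on the slide: an update only costs as much as the number of features a user actually has.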
  • 18. Model Based vs Clustering
      • Item-Item vs User-User
      • Building Social Graph is Clustering Step
      • Social Graph Recommendation is a Ranking Problem
  • 19. Implicit Social Graph (diagram: users U1-U5 linked through events E1-E4)
  • 20. Mixed Social Graph (diagram: users U1-U5 linked through events E1-E3, FB and LI)
  • 21.
      • 15M * 260 * 260 ≈ 1.01 Trillion Edges
      • 4 Billion edges ranked
      • Each node is a feature vector representing a User
      • Each edge is a feature vector representing a Relationship
  • 22. Feature Generation
      • Mixed Features
      • A series of MapReduce jobs
      • Output on HDFS in flat files; input to subsequent jobs
      • Orders → Event → Attendees
          • MAP: eid: uid
          • REDUCE: eid:[uid]
      • Attendees → Social Graph
          • Input: eid:[uid]
          • MAP: uid_i:[uid]
          • REDUCE: uid:[neighbors]
      • Interest based features, user specific, graph mining etc.
      • Upload feature values to HBase
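The two chained jobs on this slide (orders to event attendee lists, then attendee lists to per-user neighbor lists) can be sketched in plain Python. The `run_job` helper below is a hypothetical in-memory stand-in for a Hadoop job, and the order records are invented; only the key/value shapes follow the slide.

```python
from collections import defaultdict

# Hypothetical order records as (event id, user id) pairs.
ORDERS = [("e1", "u1"), ("e1", "u2"), ("e1", "u3"), ("e2", "u3"), ("e2", "u4")]


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Job 1: Orders -> Event -> Attendees
def orders_mapper(order):
    eid, uid = order
    yield eid, uid                      # MAP: eid: uid


def attendees_reducer(eid, uids):
    return sorted(set(uids))            # REDUCE: eid:[uid]


# Job 2: Attendees -> Social Graph
def attendees_mapper(item):
    eid, uids = item                    # Input: eid:[uid]
    for uid in uids:                    # MAP: uid_i:[uid]
        yield uid, [other for other in uids if other != uid]


def neighbors_reducer(uid, neighbor_lists):
    # REDUCE: uid:[neighbors], merged across all events the user attended
    return sorted({n for lst in neighbor_lists for n in lst})


attendees = run_job(ORDERS, orders_mapper, attendees_reducer)
graph = run_job(attendees.items(), attendees_mapper, neighbors_reducer)
print(graph)
```

The output of the first job is passed directly as the input of the second, mirroring the flat files on HDFS that feed subsequent jobs.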
  • 23. (diagram: users U1, U2, U3)
  • 24. HBase
  • 25. HBase
      • Collect data from multiple MapReduce jobs
      • Stores entire social graph
      • Over one million writes per second
  • 26. HBase
      rowid   | neighbors | events | featureX
      2718282 | 101       | 3      | 0.3678795
  • 27. HBase
      rowid   | 314159:n | 314159:e | 314159:fx | 161803:n | 161803:e | 161803:fx
      2718282 | 31       | 1        | 0.3183    | 83       | 2        | 0.618
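The second HBase table appears to store one row per user, with each column qualifier made of a neighbor's id and a short feature name. Under that reading, the sketch below shows how per-edge features could be packed into and unpacked from such a wide row. It uses plain dictionaries and no HBase client, and taking `n`, `e` and `fx` to be per-edge features is an assumption, since the slides do not define them.

```python
def to_wide_row(edges):
    """Flatten {neighbor_id: {feature: value}} into {"neighbor_id:feature": value}."""
    row = {}
    for neighbor_id, features in edges.items():
        for short_name, value in features.items():
            row[f"{neighbor_id}:{short_name}"] = value
    return row


def edges_from_row(row):
    """Rebuild the per-neighbor feature dictionaries from a wide row."""
    edges = {}
    for qualifier, value in row.items():
        neighbor_id, short_name = qualifier.split(":", 1)
        edges.setdefault(neighbor_id, {})[short_name] = value
    return edges


# Values taken from the slide: row 2718282 with neighbors 314159 and 161803.
edges_of_2718282 = {
    "314159": {"n": 31, "e": 1, "fx": 0.3183},
    "161803": {"n": 83, "e": 2, "fx": 0.618},
}
wide_row = to_wide_row(edges_of_2718282)
print(wide_row)
```

With this layout, everything known about a user's edges comes back from a single row lookup, and different jobs can write different qualifiers into the same row independently.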
  • 28. Tips & Tricks
      • Distributed cache database
          • Sped up some MapReduce jobs by hours
          • Be sure to use counters!
  • 29. Tips & Tricks
      • Hive (ab)uses
          • Almost as many Hive jobs as custom ones
          • “flip join”
          • Statistical functions using Hive
          • UDF
  • 30. Tips & Tricks
      • Memory Memory Memory
      • LZO, WAL
      • Combiners are great until
      • Shuffle and Sorting stage
      • Hadoop ecosystem is still new
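A common use of the distributed cache is a map-side join: a small lookup table is shipped to every task and held in memory, so the large input never has to be shuffled for the join. The Python sketch below imitates that pattern, with counters tracking how many records joined. It is an illustration of the general technique, not the jobs from the talk; the tab-separated side file and the counter names are hypothetical.

```python
from collections import Counter


def load_side_table(lines):
    """Parse a small 'event_id<TAB>category' file into an in-memory lookup.

    In Hadoop, this file would be shipped to every task via the distributed cache.
    """
    return dict(line.split("\t") for line in lines)


def map_with_cache(records, event_category, counters):
    """Map-side join of (uid, eid) records against the cached lookup table."""
    for uid, eid in records:
        category = event_category.get(eid)
        if category is None:
            counters["missing_event"] += 1   # surfaces silent data loss
            continue
        counters["joined"] += 1
        yield uid, category


counters = Counter()
side_table = load_side_table(["e1\tmusic", "e2\ttech"])
records = [("u1", "e1"), ("u2", "e2"), ("u3", "e9")]
joined = list(map_with_cache(records, side_table, counters))
print(joined, dict(counters))
```

The counters matter because a lookup miss drops the record without raising an error; comparing `joined` against `missing_event` after a run is a cheap sanity check.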
  • 31. Questions?