Eventbrite dataplatform and services - Interest graph based recommendations

Published in: Technology, Education

  1. Data Platform and Services
     Vipul Sharma and Eyal Reuveni
  2. Agenda
     • Eventbrite
     • Data Products
     • Data Platform
     • Recommendations
     • Questions
  3. • A social event ticketing and discovery platform
     • 50 millionth ticket sold
     • Revenue doubled YOY
     • 180 Employees in SOMA SF
     • Solving significant engineering problems
        • Data
        • Data, Infrastructure, Mobile, Web, Scale, Ops, QA
     • Firing on all cylinders and hiring blazing fast: www.eventbrite.com/jobs
  4. Data Products
  5. Analytics
     • Ad hoc queries by Analysts
  6. Fraud and Spam
  7. Data Platform
  8. Hadoop Cluster
     • 30 persistent EC2 High-Memory Instances
     • 30TB disk with replication factor of 2, ext3 formatted
     • CDH3
     • Fair Scheduler
     • HBase
  9. Infrastructure
     • Search
        • Solr
        • Incremental updates towards event driven
     • Recommendation/Graph
        • Hadoop
        • Native Java MapReduce
        • Bash for workflow
     • Persistence
        • MySQL
        • HDFS
        • HBase
        • MongoDB (Investigating Cassandra and Riak)
  10. Infrastructure
      • Stream
         • RabbitMQ
         • Internal Fire hose (Investigating Kafka)
      • Offline
         • MapReduce
         • Streaming
         • Hive
         • Hue
  11. Infrastructure - Sqoozie
      • Workflow for MySQL imports to HDFS
         • Generate Sqoop commands
         • Run these imports in parallel
      • Transparent to schema changes
      • Include or exclude on column, data types, table level
      • Data Type Casting: tinyint(1) → Integer
      • Distributed Table Imports
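
The Sqoozie workflow above (generate Sqoop commands, apply include/exclude rules, cast tinyint(1) to Integer, run imports in parallel) might look roughly like the following. This is a minimal sketch, not Eventbrite's tool: the schema dictionary, JDBC URL, target directory, and all function names are hypothetical, and the flags are the ones documented for Sqoop 1.4.x `sqoop import`. The runner here only renders the command line; a real runner would launch the process.

```python
import shlex
from concurrent.futures import ThreadPoolExecutor


def build_sqoop_command(jdbc_url, table, columns, target_root, java_types=None):
    """Build the argument list for one `sqoop import` of a single table."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--columns", ",".join(columns),
        "--target-dir", f"{target_root}/{table}",
    ]
    if java_types:
        mapping = ",".join(f"{col}={jtype}" for col, jtype in sorted(java_types.items()))
        cmd += ["--map-column-java", mapping]
    return cmd


def plan_imports(schema, jdbc_url, target_root, exclude_columns=frozenset()):
    """Turn a {table: {column: mysql_type}} schema into one command per table."""
    commands = []
    for table, cols in sorted(schema.items()):
        # Column-level exclusion (hypothetical rule format: (table, column) pairs).
        kept = {c: t for c, t in cols.items() if (table, c) not in exclude_columns}
        # Data type casting from the slide: tinyint(1) is imported as Integer.
        casts = {c: "Integer" for c, t in kept.items() if t == "tinyint(1)"}
        commands.append(build_sqoop_command(jdbc_url, table, list(kept), target_root, casts))
    return commands


def run_parallel(commands, runner, max_workers=4):
    """Apply `runner` to every command concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


# Hypothetical schema, as it might be read from MySQL's information_schema.
schema = {
    "events": {"id": "int(11)", "is_public": "tinyint(1)", "secret": "varchar(64)"},
    "orders": {"id": "int(11)", "event_id": "int(11)"},
}
commands = plan_imports(
    schema,
    jdbc_url="jdbc:mysql://db.example/app",
    target_root="/data/mysql",
    exclude_columns={("events", "secret")},
)
# Dry run: render each command as a shell line instead of executing it.
dry_run = run_parallel(commands, shlex.join)
for line in dry_run:
    print(line)
```

Because the commands are regenerated from the live schema on every run, a new column shows up in the next import without editing the workflow, which is one plausible reading of "transparent to schema changes".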
  12. Infrastructure - Blammo
      • Raw logs are imported to HDFS via Flume
      • Almost real-time – 5 min latency
      • Logs are key-value pairs in JSON
      • Each log producer publishes its schema in YAML
      • Hive schema and schema YAML kept in sync using Thrift
      • Control exclusion and inclusion
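
To make the Blammo bullets concrete, here is a small sketch of how a published schema could drive both field inclusion/exclusion and the matching Hive column list. The schema layout, field names, and function names are all assumptions for illustration. The schema is written as a Python dictionary standing in for the parsed YAML, and the Flume and Thrift parts of the pipeline are not modeled.

```python
import json

# Hypothetical producer schema, as it might look after parsing its YAML file.
SCHEMA = {
    "table": "page_views",
    "fields": {"ts": "bigint", "uid": "bigint", "url": "string", "ip": "string"},
    "exclude": ["ip"],
}


def included_fields(schema):
    """Return the declared fields minus the ones marked for exclusion."""
    excluded = set(schema.get("exclude", []))
    return {name: typ for name, typ in schema["fields"].items() if name not in excluded}


def project_log_line(line, schema):
    """Parse one JSON log line and keep only the included, declared fields."""
    record = json.loads(line)
    return {name: record.get(name) for name in included_fields(schema)}


def hive_columns(schema):
    """Render the column list a Hive table for this log would need."""
    return ", ".join(f"{name} {typ.upper()}" for name, typ in included_fields(schema).items())


raw = '{"ts": 1, "uid": 42, "url": "/e/1", "ip": "10.0.0.1", "extra": "x"}'
print(project_log_line(raw, SCHEMA))
print(hive_columns(SCHEMA))
```

Deriving both the projection and the Hive columns from the same schema object is what keeps the two from drifting apart.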
  13. Recommendations
  14. You will like to attend this event
  15. Recommendation Engines
      • Interest Graph Based (Your friends who like rock music like you are attending Eric Clapton Event – Eventbrite)
      • Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK – Facebook, LinkedIn)
      • Collaborative Filtering – Item-Item similarity (You like Godfather so you will like Scarface - Netflix)
      • Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
      • Item Hierarchy (You bought camera so you need batteries - Amazon)
  16. Why Interest?
      • Events are Social
      • Events are Interest
      • Dense Graph is Irrelevant
      • Interests are Changing
  17. How do we know your Interest?
      • We ask you
      • Based on your activity
         • Events Attended
         • Events Browsed
      • Facebook Interests
         • User Interest has to match Event category
         • Static
      • Machine Learning
         • Logistic Regression using MLE
         • Sparse Matrix is generated using MapReduce
         • A model for each interest
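
The "Logistic Regression using MLE, a model for each interest" bullets can be illustrated with a small pure-Python sketch. The slides do not say which optimizer or features were used, so batch gradient ascent on the log-likelihood, the feature names, and the labels below are all assumptions. Each sparse row is a dictionary of non-zero features, standing in for one row of the sparse matrix that the MapReduce job would produce.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, row):
    """Probability that the user behind this sparse row has the interest."""
    return sigmoid(sum(weights.get(f, 0.0) * v for f, v in row.items()))


def fit_logistic(rows, labels, steps=500, lr=0.5):
    """Maximize the log-likelihood with plain batch gradient ascent."""
    weights = {}
    for _ in range(steps):
        gradient = {}
        for row, y in zip(rows, labels):
            error = y - predict(weights, row)  # d(log-likelihood)/dz for this row
            for f, v in row.items():
                gradient[f] = gradient.get(f, 0.0) + error * v
        for f, g in gradient.items():
            weights[f] = weights.get(f, 0.0) + lr * g / len(rows)
    return weights


def fit_per_interest(rows, labels_by_interest):
    """Train one independent binary model per interest."""
    return {
        interest: fit_logistic(rows, labels)
        for interest, labels in labels_by_interest.items()
    }


# Hypothetical activity features; only non-zero entries are stored.
rows = [
    {"bias": 1.0, "attended_concert": 1.0},
    {"bias": 1.0, "attended_concert": 1.0, "browsed_meetup": 1.0},
    {"bias": 1.0, "browsed_meetup": 1.0},
    {"bias": 1.0},
]
labels_by_interest = {"music": [1, 1, 0, 0], "tech": [0, 1, 1, 0]}
models = fit_per_interest(rows, labels_by_interest)
```

Training a separate binary model per interest keeps each one small and lets interests be added or retrained independently.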
  18. • Model Based vs Clustering
      • Item-Item vs User-User
      • Building Social Graph is Clustering Step
      • Social Graph Recommendation is a Ranking Problem
  19. Implicit Social Graph (diagram with user nodes U1–U5 and event nodes E1–E4)
  20. Mixed Social Graph (diagram with user nodes U1–U5, event nodes E1–E3, and FB and LI nodes)
  21. • 15M * 260 * 260 = 1.014 Trillion Edges
      • 4 Billion edges ranked
      • Each node is a feature vector representing a User
      • Each edge is a feature vector representing a Relationship
  22. Feature Generation
      • Mixed Features
      • A series of map-reduce jobs
      • Output on HDFS in flat files; Input to subsequent jobs
      • Orders = Event → Attendees
         • MAP: eid: uid
         • REDUCE: eid:[uid]
      • Attendees → Social Graph
         • Input: eid:[uid]
         • MAP: uid_i:[uid]
         • REDUCE: uid:[neighbors]
      • Interest based features, user specific, graph mining etc.
      • Upload feature values to HBase
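
The two chained jobs on this slide (orders to event attendees, then attendees to social graph) can be simulated in memory as follows. This is a sketch under stated assumptions: `run_mapreduce` is a hypothetical single-process stand-in for Hadoop, the order records are invented, and the real jobs were native Java MapReduce reading and writing flat files on HDFS.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: Orders -> Event Attendees (MAP: eid: uid, REDUCE: eid:[uid])
def orders_map(order):
    uid, eid = order
    yield eid, uid


def attendees_reduce(eid, uids):
    return eid, sorted(set(uids))


# Job 2: Attendees -> Social Graph (MAP: uid_i:[uid], REDUCE: uid:[neighbors])
def attendees_map(event):
    _eid, uids = event
    for uid in uids:
        # Every other attendee of the same event is a neighbor of this user.
        yield uid, [other for other in uids if other != uid]


def neighbors_reduce(uid, neighbor_lists):
    return uid, sorted(set(chain.from_iterable(neighbor_lists)))


# Hypothetical (user id, event id) order records.
orders = [("u1", "e1"), ("u2", "e1"), ("u2", "e2"), ("u3", "e2"), ("u4", "e3")]

attendees = run_mapreduce(orders, orders_map, attendees_reduce)
graph = dict(run_mapreduce(attendees, attendees_map, neighbors_reduce))
print(attendees)
print(graph)
```

The output of the first job is the input of the second, mirroring "Output on HDFS in flat files; Input to subsequent jobs". Note that the second mapper emits a neighbor list per attendee per event, which is quadratic in event size; that is why large events dominate the cost of this step.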
  23. (diagram with nodes U1, U2, U3)
  24. HBase
  25. HBase
      • Collect data from multiple Map Reduce jobs
         • Stores entire social graph
         • Over one million writes per second
  26. HBase
      rowid    neighbors  events  featureX
      2718282  101        3       0.3678795
  27. HBase
      rowid    314159:n  314159:e  314159:fx  161803:n  161803:e  161803:fx
      2718282  31        1         0.3183     83        2         0.618
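
The second table suggests a wide-row layout: one row per user, with one group of columns per neighbor, named `<neighbor id>:<feature>`. The sketch below shows how edge features could be flattened into, and read back from, such a row. Plain dictionaries stand in for an HBase row (no client library is used), the function names are hypothetical, and the meaning of the `n`, `e`, and `fx` suffixes is not stated on the slides.

```python
def edge_columns(edges):
    """Flatten {neighbor_id: {feature: value}} into '<neighbor_id>:<feature>' columns."""
    return {
        f"{neighbor_id}:{name}": value
        for neighbor_id, features in edges.items()
        for name, value in features.items()
    }


def neighbor_features(row, neighbor_id):
    """Recover one neighbor's edge features from a wide row."""
    prefix = f"{neighbor_id}:"
    return {
        qualifier[len(prefix):]: value
        for qualifier, value in row.items()
        if qualifier.startswith(prefix)
    }


# Edge features for user 2718282, using the values from the slide.
edges = {
    314159: {"n": 31, "e": 1, "fx": 0.3183},
    161803: {"n": 83, "e": 2, "fx": 0.618},
}
row = edge_columns(edges)
print(row)
print(neighbor_features(row, 161803))
```

With this layout, each MapReduce job can write its own feature column for an edge without reading the row first, and all edges of a user are fetched with a single row lookup.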
  28. Tips & Tricks
      • Distributed cache database
         • Sped up some Map Reduce jobs by hours
         • Be sure to use counters!
  29. Tips & Tricks
      • Hive (ab)uses
         • Almost as many Hive jobs as custom ones
         • “flip join”
         • Statistical functions using Hive
         • UDF
  30. Tips & Tricks
      • Memory Memory Memory
      • LZO, WAL
      • Combiners are great until
      • Shuffle and Sorting stage
      • Hadoop ecosystem is still new
  31. Questions?
