Data Platform and Services

  Vipul Sharma and Eyal Reuveni
Agenda


            Eventbrite
           Data Products
           Data Platform
         Recommendations
            Questions
•   A social event ticketing and discovery platform
•   50 Millionth Ticket Sold
•   Revenue doubled YOY
•   180 Employees in SOMA SF
•   Solving significant engineering problems
    • Data, Infrastructure, Mobile, Web, Scale, Ops, QA
• Firing on all cylinders and hiring blazing fast
www.eventbrite.com/jobs
Data Products
Analytics




            • Ad hoc queries by analysts
Fraud and Spam
Data Platform
Hadoop Cluster




•   30 persistent EC2 High-Memory Instances
•   30TB disk with replication factor of 2, ext3 formatted
•   CDH3
•   Fair Scheduler
•   HBase
Infrastructure

• Search
   • Solr
   • Incremental updates towards event driven
• Recommendation/Graph
   • Hadoop
   • Native Java MapReduce
   • Bash for workflow
• Persistence
   •   MySQL
   •   HDFS
   •   HBase
   •   MongoDB (Investigating Cassandra and Riak)
Infrastructure


• Stream
   • RabbitMQ
   • Internal Fire hose (Investigating Kafka)
• Offline
   •   MapReduce
   •   Streaming
   •   Hive
   •   Hue
Infrastructure - Sqoozie



• Workflow for MySQL imports to HDFS
    • Generate Sqoop commands
    • Run these imports in parallel
•   Transparent to schema changes
•   Include or exclude at column, data type, or table level
•   Data Type Casting: tinyint(1) → Integer
•   Distributed Table Imports
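A minimal sketch of the Sqoozie idea, assuming table metadata is already available as a dict (in practice it would presumably be read from MySQL so that schema changes are picked up). The table names, JDBC URL, target directory, and exclusion rules are all hypothetical; the Sqoop flags used (--connect, --table, --columns, --target-dir, --map-column-java) are standard Sqoop 1 import options. The runner is injectable so the commands can be printed instead of executed.

```python
import shlex
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table metadata: column name -> MySQL type.
TABLES = {
    "orders": {"id": "int(11)", "event_id": "int(11)",
               "is_paid": "tinyint(1)", "notes": "text"},
    "users": {"id": "int(11)", "email": "varchar(255)",
              "password": "varchar(64)"},
}

# Hypothetical include/exclude rules at table, column, and data type level.
EXCLUDE_TABLES = set()
EXCLUDE_COLUMNS = {("users", "password")}
EXCLUDE_TYPES = {"text"}

# Type casting rule from the slide: tinyint(1) is imported as Integer.
JAVA_CASTS = {"tinyint(1)": "Integer"}


def build_command(table, columns,
                  jdbc_url="jdbc:mysql://db.example.com/app",  # assumed URL
                  target_root="/imports"):                     # assumed path
    """Build one `sqoop import` argument list for a table."""
    kept = [c for c, t in columns.items()
            if (table, c) not in EXCLUDE_COLUMNS and t not in EXCLUDE_TYPES]
    casts = [f"{c}={JAVA_CASTS[columns[c]]}"
             for c in kept if columns[c] in JAVA_CASTS]
    cmd = ["sqoop", "import", "--connect", jdbc_url, "--table", table,
           "--columns", ",".join(kept),
           "--target-dir", f"{target_root}/{table}"]
    if casts:
        cmd += ["--map-column-java", ",".join(casts)]
    return cmd


def build_all(tables):
    """Generate commands for every table that is not excluded."""
    return [build_command(t, cols) for t, cols in sorted(tables.items())
            if t not in EXCLUDE_TABLES]


def run_parallel(commands, runner, max_workers=4):
    """Apply `runner` to each command concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Dry run: render each command as a shell line instead of executing it.
    for line in run_parallel(build_all(TABLES), runner=shlex.join):
        print(line)
```

To actually launch the imports, the runner could be swapped for something like subprocess.run; the dry run above only shows what would be executed.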
Infrastructure - Blammo



•   Raw logs are imported to HDFS via Flume
•   Almost real-time – 5 min latency
•   Logs are key-value pairs in JSON
•   Each log producer publishes its schema in YAML
•   Hive schema and YAML schema are kept in sync using Thrift
•   Control exclusion and inclusion
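A rough sketch of how a producer-published schema could drive both the Hive table definition and field inclusion. The schema is shown as the dict a YAML parser would return; its layout, the field names, and the "exclude" flag are assumptions. The slide says the real system syncs through Thrift; this sketch swaps in plain DDL text generation to stay self-contained, and it leaves out the SerDe clause a JSON-backed table would need.

```python
import json

# Hypothetical schema, as parsed from a producer's YAML file.
SCHEMA = {
    "table": "page_views",
    "fields": [
        {"name": "user_id", "type": "bigint"},
        {"name": "url", "type": "string"},
        {"name": "ip", "type": "string", "exclude": True},
    ],
}


def included_fields(schema):
    """Fields that survive the inclusion/exclusion rules."""
    return [f for f in schema["fields"] if not f.get("exclude", False)]


def hive_ddl(schema, location_root="/logs"):  # assumed HDFS root
    """Render a CREATE EXTERNAL TABLE statement from the schema."""
    cols = ",\n".join(f"  `{f['name']}` {f['type'].upper()}"
                      for f in included_fields(schema))
    return (f"CREATE EXTERNAL TABLE IF NOT EXISTS `{schema['table']}` (\n"
            f"{cols}\n)\n"
            f"LOCATION '{location_root}/{schema['table']}'")


def project(log_line, schema):
    """Keep only the included keys of one JSON log line."""
    record = json.loads(log_line)
    return {f["name"]: record.get(f["name"]) for f in included_fields(schema)}


if __name__ == "__main__":
    print(hive_ddl(SCHEMA))
    print(project('{"user_id": 7, "url": "/e/1", "ip": "10.0.0.1"}', SCHEMA))
```

Driving both the table definition and the log projection from one schema is what keeps the two from drifting apart when a producer adds or drops a field.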
Recommendations
You will like to attend this event
Recommendation Engines



• Item Hierarchy (You bought camera so you need batteries - Amazon)
• Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
• Collaborative Filtering – Item-Item similarity (You like Godfather so you will like Scarface - Netflix)
• Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK – Facebook, LinkedIn)
• Interest Graph Based (Your friends who like rock music like you are attending Eric Clapton Event – Eventbrite)
Why Interest?




• Events are Social
• Events are Interest
• Dense Graph is Irrelevant
• Interests are Changing
How do we know your Interest?


• We ask you
• Based on your activity
   • Events Attended
   • Events Browsed
• Facebook Interests
   • User Interest has to match Event category
   • Static
• Machine Learning
   • Logistic Regression using MLE
   • Sparse Matrix is generated using MapReduce
   • A model for each interest
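A toy sketch of the machine learning bullets: one logistic regression per interest, fitted by maximizing the log-likelihood with stochastic gradient ascent over sparse rows. The feature names, labels, learning rate, and epoch count are invented for illustration, and the rows are built in memory here rather than by MapReduce.

```python
import math

# Hypothetical sparse rows: one dict of non-zero features per user.
ROWS = [
    {"attended:rock_concert": 1.0, "browsed:music": 1.0},
    {"attended:jazz_night": 1.0},
    {"browsed:tech_meetup": 1.0},
    {"attended:hackathon": 1.0, "browsed:tech_meetup": 1.0},
]
# Hypothetical labels: 1 if the user has the interest, else 0.
LABELS = {
    "music": [1, 1, 0, 0],
    "tech": [0, 0, 1, 1],
}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, row):
    """Probability that the user behind `row` has the interest."""
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in row.items()))


def fit(rows, labels, epochs=200, lr=0.5):
    """Maximize the log-likelihood by stochastic gradient ascent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            err = y - predict(weights, bias, row)  # d(log-likelihood)/dz
            bias += lr * err
            for f, v in row.items():
                weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights, bias


# A model for each interest.
MODELS = {interest: fit(ROWS, y) for interest, y in LABELS.items()}

if __name__ == "__main__":
    for interest, (w, b) in MODELS.items():
        print(interest, [round(predict(w, b, r), 2) for r in ROWS])
```

Storing weights in a dict keyed by feature name means only non-zero features are touched per row, which is the point of the sparse representation.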
Model Based vs Clustering

            Item-Item vs User-User

     Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem
Implicit Social Graph


[Diagram: graph with user nodes U1, U2, U3, U4, U5 and event nodes E1, E2, E3, E4]
Mixed Social Graph


[Diagram: graph with user nodes U1, U2, U3, U4, U5 and event nodes E1, E2, E3, plus FB and LI connections]
15M * 260 * 260 = 1.014 Trillion Edges
               4 Billion edges ranked
   Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship
Feature Generation

•   Mixed Features
•   A series of map-reduce jobs
•   Output on HDFS in flat files; Input to subsequent jobs
•   Orders = Event → Attendees
    • MAP: eid: uid
    • REDUCE: eid:[uid]
• Attendees → Social Graph
    • Input: eid:[uid]
    • MAP: uid_i:[uid]
    • REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining, etc.
• Upload feature values to HBase
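The two jobs above can be simulated in memory as a sketch. The run_job helper is a hypothetical stand-in for Hadoop's map, shuffle, and reduce phases, and the order records are made up; the key and value shapes follow the slide (eid: uid, eid:[uid], uid:[neighbors]).

```python
from collections import defaultdict
from itertools import chain


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one map, shuffle, reduce pass."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return [reducer(k, vs) for k, vs in sorted(groups.items())]


# Job 1: Orders -> Event attendees.
def orders_map(order):
    yield order["eid"], order["uid"]           # MAP: eid: uid


def attendees_reduce(eid, uids):
    return eid, sorted(set(uids))              # REDUCE: eid:[uid]


# Job 2: Attendees -> Social graph.
def attendees_map(record):
    eid, uids = record                         # Input: eid:[uid]
    for uid in uids:
        # MAP: each attendee paired with the other attendees of the event.
        yield uid, [u for u in uids if u != uid]


def neighbors_reduce(uid, lists):
    # REDUCE: uid:[neighbors], merged across all events.
    return uid, sorted(set(chain.from_iterable(lists)))


# Made-up orders for illustration.
ORDERS = [
    {"eid": "E1", "uid": "U1"}, {"eid": "E1", "uid": "U2"},
    {"eid": "E2", "uid": "U2"}, {"eid": "E2", "uid": "U4"},
    {"eid": "E3", "uid": "U2"}, {"eid": "E3", "uid": "U5"},
    {"eid": "E4", "uid": "U1"}, {"eid": "E4", "uid": "U3"},
]

if __name__ == "__main__":
    attendees = run_job(ORDERS, orders_map, attendees_reduce)
    graph = run_job(attendees, attendees_map, neighbors_reduce)
    for uid, neighbors in graph:
        print(uid, neighbors)
```

With these orders, U2 ends up with neighbors U1, U4, and U5, because U2 shares one event with each of them. The output of the first job is the input of the second, mirroring the flat files passed between jobs on HDFS.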
[Diagram: graph of users U1, U2 and U3]
HBase
HBase




• Collect data from multiple Map Reduce jobs
   • Stores entire social graph
   • Over one million writes per second
HBase




    rowid     neighbors   events   featureX
    2718282   101         3        0.3678795
HBase




rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
2718282   31         1          0.3183      83         2          0.618
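The second layout can be read as one wide row per user, with each neighbor's features stored under column names of the form neighborid:feature. A small sketch of that flattening, using the values from the table above; reading n, e, and fx as short names for neighbors, events, and featureX is an assumption, and plain dicts stand in for HBase rows.

```python
def to_wide_row(user_id, per_neighbor):
    """Flatten {neighbor_id: {feature: value}} into one wide row.

    Returns (row key, {"<neighbor_id>:<feature>": value}).
    """
    row = {}
    for neighbor_id, feats in per_neighbor.items():
        for name, value in feats.items():
            row[f"{neighbor_id}:{name}"] = value
    return str(user_id), row


def neighbor_features(row, neighbor_id):
    """Recover one neighbor's features by scanning a column prefix."""
    prefix = f"{neighbor_id}:"
    return {q[len(prefix):]: v for q, v in row.items() if q.startswith(prefix)}


# Values taken from the table above.
EXAMPLE = {
    314159: {"n": 31, "e": 1, "fx": 0.3183},
    161803: {"n": 83, "e": 2, "fx": 0.618},
}

if __name__ == "__main__":
    key, row = to_wide_row(2718282, EXAMPLE)
    print(key, row)
    print(neighbor_features(row, 161803))
```

Under this reading, a single row read returns every edge of a user, and the number of columns grows with the number of neighbors rather than being fixed up front.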
Tips & Tricks




• Distributed cache database
   • Sped up some Map Reduce jobs by hours
   • Be sure to use counters!
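One possible reading of this tip, sketched under assumptions: a small lookup database is shipped to every mapper so joins happen map-side instead of in a reduce step, and counters record how often the lookup hits or misses. The table, column, and counter names are hypothetical; an in-memory sqlite3 database stands in for the cached file, and collections.Counter stands in for Hadoop counters.

```python
import sqlite3
from collections import Counter


def build_lookup_db(rows):
    """Build the small lookup database each mapper would load from cache."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE event_category (eid TEXT PRIMARY KEY, category TEXT)")
    db.executemany("INSERT INTO event_category VALUES (?, ?)", rows)
    return db


def map_with_lookup(records, db, counters):
    """Map-side join: enrich (uid, eid) pairs with the event's category."""
    for uid, eid in records:
        hit = db.execute(
            "SELECT category FROM event_category WHERE eid = ?", (eid,)
        ).fetchone()
        if hit is None:
            counters["lookup_miss"] += 1   # visible signal of bad or stale data
            continue
        counters["lookup_hit"] += 1
        yield uid, hit[0]


if __name__ == "__main__":
    db = build_lookup_db([("E1", "music"), ("E2", "tech")])
    counters = Counter()
    records = [("U1", "E1"), ("U2", "E9"), ("U3", "E2")]
    print(list(map_with_lookup(records, db, counters)))
    print(dict(counters))
```

Without the counters, records dropped by a failed lookup would disappear silently; with them, an unexpected miss count shows up as soon as the job finishes.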
Tips & Tricks




• Hive (ab)uses
   •   Almost as many Hive jobs as custom ones
   •   “flip join”
   •   Statistical functions using Hive
   •   UDF
Tips & Tricks


•   Memory Memory Memory
•   LZO, WAL
•   Combiners are great until
•   Shuffle and Sorting stage
•   Hadoop ecosystem is still new
Questions?
