Eventbrite Data Platform Talk for SFDM
 


Slides for Eventbrite's data platform talk at SF data mining meetup.


Usage Rights

© All Rights Reserved


    Presentation Transcript

    • Data Platform
      • Vipul Sharma – vipul@eventbrite.com
    • A social event ticketing and discovery platform
    • $1B total sales, 68M tickets sold, 1.4M events hosted, 0.5M organizers served, 23M attendees served, 12 countries
    • Event Lifecycle (diagram): Post Event, Conception, Organization, Creation, Sale, Discovery
    • Frictionless is the mantra!
    • Data Platform and Discovery
    • Discovery: Search, Recommendation, Social
      Analytics: Data warehouse and Metrics, Internal and External reporting, Real Time and Batch Analytics
      Abuse Prevention: Spam, Fraud, TOS
    • Analytics
      • Ad hoc queries by Analysts
    • Fraud and Spam
    • Data Platform
    • Hadoop Cluster
      • 30 persistent EC2 High-Memory Instances
      • 30TB disk with replication factor of 2, ext3 formatted
      • CDH3
      • Fair Scheduler
      • HBase
    • Infrastructure
      • Search
        • Solr
        • Incremental updates towards event driven
      • Recommendation/Graph
        • Hadoop
        • Native Java MapReduce
        • Bash for workflow
      • Social
        • Cassandra
        • Denormalized view
      • Persistence
        • MySQL
        • HDFS
        • HBase
        • MongoDB (Moving to Cassandra)
    • Infrastructure
      • Stream
        • RabbitMQ
        • Internal firehose
        • Storm
      • Offline
        • MapReduce
        • Streaming
        • Hive
        • Hue
    • Discovery: Social, Interest, Local
    • Attendees, Events, Organizers
    • Categorization - Prism Tech Conference Music Sports
    • Prism - Features
      • Supervised Learning
      • Logistic Regression using MLE
      • Pairwise classification into 20 categories
      • High precision, lower recall
      • Use MapReduce for feature extraction
      • Use for clustering as well
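The slide names logistic regression fitted by maximum likelihood but gives no implementation details. As a minimal sketch, assuming plain batch gradient ascent on the log-likelihood and sparse inputs stored as `{feature_index: value}` dicts, one binary classifier could be trained like this. The function names, learning rate, epoch count, and toy data are all illustrative assumptions, not Prism's actual code.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(rows, labels, n_features, lr=0.5, epochs=500):
    """Fit weights by batch gradient ascent on the log-likelihood.

    rows: list of sparse vectors {feature_index: value}; labels: 0 or 1.
    """
    w = [0.0] * n_features
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(b + sum(w[i] * v for i, v in x.items()))
            err = y - p  # derivative of the log-likelihood w.r.t. the logit
            grad_b += err
            for i, v in x.items():
                grad_w[i] += err * v
        b += lr * grad_b / n
        w = [wi + lr * g / n for wi, g in zip(w, grad_w)]
    return w, b

def predict_proba(w, b, x):
    return sigmoid(b + sum(w[i] * v for i, v in x.items()))

# Toy data (hypothetical): feature 0 = "conference" present, feature 1 = "ball" present.
rows = [{0: 1}, {0: 1, 1: 1}, {1: 1}, {}]
labels = [1, 1, 0, 0]
w, b = train_logistic(rows, labels, n_features=2)
```

A "high precision, lower recall" operating point would then come from thresholding `predict_proba` above 0.5; the exact threshold is not stated in the talk.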
    • Prism – Training Data
      • Binary classification for each category
      • Training data needed for positive and negative
        • Conference and not Conference
        • Sports and not Sports
      • Samasource and CrowdFlower
      • Stem words to create initial set
      • Positive, negative, negative with stem words
    • Prism - Features
      • Convert Event and Organizer data into a feature vector
      • Event details, Organizer details, Ticket details
      • Boolean representation of predefined attributes
        • Words – tf-idf, dictionaries
        • Phrases
        • Domains
        • Rules – regular expression
        • Functions – business logic e.g. ticket price between $10-$20
        • Compounds – boolean combination of features
          • & and || rules
          • <COMPOUND1>: techcrunch & disrupt & techcrunch.com
          • <COMPOUND2>: COMPOUND2 && after && party
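The feature types listed above (words, domains, rules, functions, compounds) can be sketched as a single extraction function. The event schema, feature names, and the year regex below are hypothetical, chosen only to give each feature type one concrete instance; the talk does not show Prism's real definitions.

```python
import re

def extract_features(event):
    """Return boolean features for one event (hypothetical schema)."""
    text = (event["title"] + " " + event["description"]).lower()
    words = set(re.findall(r"[a-z0-9]+", text))
    feats = {
        # Words: presence of a token in the event text.
        "word:techcrunch": "techcrunch" in words,
        "word:disrupt": "disrupt" in words,
        # Domain: organizer URL mentions a known domain.
        "domain:techcrunch.com": "techcrunch.com" in event["organizer_url"].lower(),
        # Rule: a regular expression over the text (here: a 20xx year).
        "rule:year": re.search(r"\b20\d\d\b", text) is not None,
        # Function: business logic, e.g. ticket price between $10 and $20.
        "func:price_10_20": 10 <= event["ticket_price"] <= 20,
    }
    # Compound: boolean combination of other features.
    feats["compound:techcrunch_disrupt"] = (
        feats["word:techcrunch"]
        and feats["word:disrupt"]
        and feats["domain:techcrunch.com"]
    )
    return feats

disrupt = extract_features({
    "title": "TechCrunch Disrupt SF 2012",
    "description": "The startup conference.",
    "organizer_url": "http://techcrunch.com/events",
    "ticket_price": 15.0,
})
picnic = extract_features({
    "title": "Neighborhood picnic",
    "description": "Bring a ball.",
    "organizer_url": "http://example.org",
    "ticket_price": 0.0,
})
```

Keeping every feature boolean means a compound is just an `and`/`or` expression over earlier entries, which matches the slide's description of compounds as boolean combinations.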
    • Prism - Features
      • Each feature is represented in various contexts
        • Event Title, Event Description, Organizer Title, Organizer Description
      • Each feature has meta info – Termclass
        • <LANG_EN>, <CONF_LANG_EN>, <ADULT_LANG_EN>
        • <SPORTS_LANG_EN>:<EVENT_TITLE>ball
      • Feature vector is represented as a sparse vector
        +1 391158:1 401814:1 410526:1 411489:1 411606:2 413910:1 427659:1 438369:1 449735:1 449736:2 455478:1 456741:1 463188:1693|||||warrior spirits 3rd annual fundraising auction|||||1:<DESC>again,1:<NAME>annual,1:<DESC>annual,2:<DESC>approaching,2:<NAME>auction,4:<DESC>auction,2:<DESC>auctions,2:<DESC>bring
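The numeric part of the sample vector follows the familiar `label index:value ...` sparse layout. A minimal sketch of a parser for that part is below; it assumes the line has already been cut before the `|||||` debug fields, and the function names are illustrative rather than taken from the talk.

```python
def parse_sparse_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = int(tokens[0])  # int() accepts a leading '+' or '-'
    vector = {}
    for tok in tokens[1:]:
        index, value = tok.split(":")
        vector[int(index)] = float(value)
    return label, vector

def dot(weights, vector):
    """Sparse dot product; features missing from weights count as 0."""
    return sum(weights.get(i, 0.0) * v for i, v in vector.items())

label, vec = parse_sparse_line("+1 391158:1 401814:1 411606:2")
score = dot({391158: 0.5, 411606: 0.25}, vec)
```

Only non-zero features are stored, so memory and scoring cost scale with the number of features present in an event rather than with the size of the full feature set.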
    • Prism - Training
      • Binary classifier
      • Multiclass less accurate
      • Each event gets classified into 20 categories
      • MapReduce for creating sparse matrix
      • MapReduce for batch classification
        • Distributed cache for feature set and models
      • We can use the same sparse matrix for clustering
    • Attendee
      • What are your interests? - Prism
      • Who are your friends? – Explicit and Implicit
      • What are the interests of your friends? - Prism
      • Which of your friends share your interests? – IBG
      • Location of users and events
        • Purchase events location
        • Facebook location
        • Our database
        • Other signals – IP, mobile app etc.
    • You will like to attend this event
    • Recommendation Engines
      • Interest Graph Based (Your friends who like rock music like you are attending Eric Clapton Event – Eventbrite)
      • Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK – Facebook, LinkedIn)
      • Collaborative Filtering – Item-Item Similarity (You like Godfather so you will like Scarface - Netflix)
      • Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
      • Item Hierarchy (You bought camera so you need batteries - Amazon)
    • Why Interest?
      • Events are Social
      • Events are Interest
      • Dense Graph is Irrelevant
      • Interests are Changing
    • How do we know your Interest?
      • We ask you
      • Based on your activity
        • Events Attended
        • Events Browsed (In Future)
      • Facebook Interests
        • User Interest has to match Event category
        • Static
      • Prism
    • Model Based vs Clustering
      • Item-Item vs User-User
      • Building Social Graph is Clustering Step
      • Social Graph Recommendation is a Ranking Problem
    • Implicit Social Graph (diagram: users U1 to U5 connected through events E1 to E4)
    • Mixed Social Graph (diagram: users U1 to U5 connected through events E1 to E3 and through FB and LI)
    • 23M * 260 * 260 = 1.5 Trillion Edges
      • 6 Billion edges ranked
      • Each node is a feature vector representing a User
      • Each edge is a feature vector representing a Relationship
    • Feature Generation
      • Mixed Features
      • A series of MapReduce jobs
      • Output on HDFS in flat files; input to subsequent jobs
      • Orders = Event → Attendees
        • MAP: eid: uid
        • REDUCE: eid:[uid]
      • Attendees → Social Graph
        • Input: eid:[uid]
        • MAP: uid_i:[uid]
        • REDUCE: uid:[neighbors]
      • Interest based features, user specific, graph mining etc.
      • Upload feature values to HBase
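The two jobs on this slide (orders to event attendees, then attendees to social graph) can be simulated in memory to see what each MAP and REDUCE step emits. This is only a sketch: `run_mapreduce` is a hypothetical stand-in for a Hadoop job, the sample orders are made up, and the real pipeline was native Java MapReduce writing flat files to HDFS.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job (not Hadoop itself)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: orders -> event attendees (MAP: eid: uid, REDUCE: eid:[uid])
def map_order(order):
    uid, eid = order
    yield eid, uid

def reduce_attendees(eid, uids):
    return eid, sorted(set(uids))

# Job 2: event attendees -> social graph
# (MAP: uid_i:[uid], REDUCE: uid:[neighbors])
def map_attendees(item):
    eid, uids = item
    for uid in uids:
        yield uid, [other for other in uids if other != uid]

def reduce_neighbors(uid, neighbor_lists):
    neighbors = set()
    for lst in neighbor_lists:
        neighbors.update(lst)
    return uid, sorted(neighbors)

# Hypothetical (uid, eid) order records.
orders = [("u1", "e1"), ("u2", "e1"), ("u3", "e1"), ("u3", "e2"), ("u4", "e2")]
attendees = run_mapreduce(orders, map_order, reduce_attendees)
graph = dict(run_mapreduce(attendees, map_attendees, reduce_neighbors))
```

Here `u3` attended both events, so it ends up connected to everyone else, while `u4` is connected only to `u3`.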
    • HBase
      • Why HBase?
        • To process 6B edges, look up features for each node and each edge
        • 6B/1000 /86400 = 70 days!!
        • 1M/sec = 1.5 hrs
        • Processing 1.3 TB of data with MapReduce
      • Collect data from multiple MapReduce jobs
        • Stores entire social graph
        • Features for each node and edge
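The throughput argument on this slide is plain division, worked through below. The sketch assumes one lookup per edge, which is a simplification since the slide mentions lookups for both nodes and edges. The 1,000 and 1M lookups-per-second rates are the slide's own figures.

```python
EDGES = 6_000_000_000      # 6B edges, from the slide
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600

def days_needed(lookups_per_sec):
    return EDGES / lookups_per_sec / SECONDS_PER_DAY

def hours_needed(lookups_per_sec):
    return EDGES / lookups_per_sec / SECONDS_PER_HOUR

slow_days = days_needed(1_000)         # the slide's "70 days" case
fast_hours = hours_needed(1_000_000)   # the slide's "1M/sec" case
```

At 1,000 lookups per second this gives roughly 69.4 days, matching the slide's "70 days". At 1M per second the division gives about 1.7 hours, slightly above the 1.5 hrs quoted, so the slide figure is presumably rounded or based on a somewhat higher rate.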
    • Data Model
        Rowkey   U          UU
        uid1     f1 f2 f3   uid2:f4 uid2:f5 uid3:f4

        rowid     neighbors   events   featureX
        2718282   101         3        0.3678795

        rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
        2718282   31         1          0.3183      83         2          0.618
    • (Diagram: users U1, U2, U3)
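One way to read this data model is as a single wide row per user: plain columns hold node features, and columns named `<neighbor uid>:<feature>` hold edge features. The sketch below mimics that layout with an ordinary Python dict using the sample values from the slide. It does not talk to HBase, and the helper names are hypothetical.

```python
# In-memory stand-in for one wide row; a real table would live in HBase.
row = {
    # Node features for user 2718282.
    "neighbors": 101,
    "events": 3,
    "featureX": 0.3678795,
    # Edge features: column name is "<neighbor uid>:<feature name>".
    "314159:n": 31, "314159:e": 1, "314159:fx": 0.3183,
    "161803:n": 83, "161803:e": 2, "161803:fx": 0.618,
}

def edge_features(row, neighbor_uid):
    """Collect the columns whose name starts with '<neighbor_uid>:'."""
    prefix = f"{neighbor_uid}:"
    return {name[len(prefix):]: value
            for name, value in row.items() if name.startswith(prefix)}

def neighbor_ids(row):
    """List the neighbor uids that appear in edge-feature column names."""
    return sorted({int(name.split(":")[0]) for name in row if ":" in name})

edge = edge_features(row, 314159)
ids = neighbor_ids(row)
```

With this layout, one row read returns a user's own features together with the features of every edge leaving that user, which fits the slide's goal of looking up features for each node and each edge.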
    • HBase
    • Hadoop Tips & Tricks
      • Joins
        • Distributed cache
        • Hive map side joins
      • Hive
        • Nice set of statistical functions
        • Lots of Hive queries
      • HBase
        • Lots of memory
        • WAL
        • LZO
        • Proper configs
        • Avoid hot region servers
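The join tip (distributed cache, map-side joins) rests on one idea: if one side of the join is small enough to fit in memory, every mapper can hold a copy and join records as they stream past, with no shuffle. A minimal sketch of that idea in plain Python follows. The table contents and function name are hypothetical, and in Hadoop the small table would be shipped to each mapper through the distributed cache.

```python
def map_side_join(big_records, small_table):
    """Join each (key, value) record against an in-memory lookup table.

    small_table plays the role of the file shipped via the distributed
    cache; no shuffle or reduce phase is needed.
    """
    for key, value in big_records:
        if key in small_table:  # inner join: drop unmatched keys
            yield key, value, small_table[key]

event_category = {"e1": "Music", "e2": "Sports"}     # small side
orders = [("e1", "u1"), ("e2", "u3"), ("e9", "u7")]  # large side
joined = list(map_side_join(orders, event_category))
```

The trade-off is memory: this only works while the small side fits comfortably in each mapper's heap.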
    • Hadoop Tips & Tricks
      • Combiners did not work
      • Shuffle and Merge
    • More Innovation
      • Rethink everything
      • Add social to search
      • Add time series features
      • Real time updates using firehose and Storm
      • Various sorts of data
    • Developers! Developers! Developers!
      • Interested in scaling, messaging, data, machine learning, mobile, services
      • We will continue to push the boundaries of hard problems
      • jobs@eventbrite.com
      • vipul@eventbrite.com
    • Storm at Eventbrite
      • Tuesday August 21, 2012 at Eventbrite HQ
      • How we are using Storm for real time processing of our data
      • http://www.eventbrite.com/event/4010290888
      • Andrew Whang, whang@eventbrite.com
    • Questions?