Testtting

Data Platform and Services

Vipul Sharma and Eyal Reuveni

Agenda

Eventbrite
Data Products
Data Platform
Recommendations
Questions

• A social event ticketing and discovery platform
• 50 millionth ticket sold
• Revenue doubled YOY
• 180 Employees in SOMA SF
• Solving significant engineering problems
• Data
• Data, Infrastructure, Mobile, Web, Scale, Ops, QA
• Firing on all cylinders and hiring blazing fast
www.eventbrite.com/jobs

Analytics

• Ad hoc queries by analysts

Hadoop Cluster

• 30 persistent EC2 High-Memory Instances
• 30TB disk with replication factor of 2, ext3 formatted
• CDH3
• Fair Scheduler
• HBase

Infrastructure

• Search
• Solr
• Incremental updates, moving towards event-driven
• Recommendation/Graph
• Hadoop
• Native Java MapReduce
• Bash for workflow
• Persistence
• MySQL
• HDFS
• HBase
• MongoDB (Investigating Cassandra and Riak)

Infrastructure

• Stream
• RabbitMQ
• Internal firehose (Investigating Kafka)
• Offline
• MapReduce
• Streaming
• Hive
• Hue

Infrastructure - Sqoozie

• Workflow for MySQL imports to HDFS
• Generate Sqoop commands
• Run these imports in parallel
• Transparent to schema changes
• Include or exclude on column, data types, table level
• Data type casting: tinyint(1) → Integer
• Distributed Table Imports
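
The workflow described above can be sketched roughly as follows. Sqoozie is an internal tool and its code is not shown in the slides, so this is only an illustration under assumptions: the function names, the `schema` dictionary, the JDBC URL, and the target path are all hypothetical. The Sqoop flags used (`--connect`, `--table`, `--columns`, `--target-dir`, `--map-column-java`) are standard `sqoop import` options.

```python
from concurrent.futures import ThreadPoolExecutor


def build_sqoop_command(jdbc_url, table, columns, target_root):
    """Build one `sqoop import` command (as an argument list) for a table.

    `columns` maps column name -> MySQL type for the columns to import.
    """
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--columns", ",".join(columns),
        "--target-dir", f"{target_root}/{table}",
    ]
    # The slide's cast: import tinyint(1) columns as Integer.
    casts = [f"{name}=Integer" for name, sql_type in columns.items()
             if sql_type.lower() == "tinyint(1)"]
    if casts:
        cmd += ["--map-column-java", ",".join(casts)]
    return cmd


def plan_imports(jdbc_url, schema, target_root,
                 exclude_tables=(), exclude_columns=(), exclude_types=()):
    """Generate one command per table, honoring table/column/type exclusions."""
    commands = []
    for table, columns in sorted(schema.items()):
        if table in exclude_tables:
            continue
        kept = {name: sql_type for name, sql_type in columns.items()
                if name not in exclude_columns and sql_type not in exclude_types}
        commands.append(build_sqoop_command(jdbc_url, table, kept, target_root))
    return commands


def run_in_parallel(commands, runner, max_workers=4):
    """Apply `runner` to every command concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


# Hypothetical schema; re-reading it from the database on each run is what
# would make the imports follow schema changes.
schema = {
    "users": {"id": "int(11)", "email": "varchar(255)",
              "is_active": "tinyint(1)", "password_hash": "varchar(64)"},
    "events": {"id": "int(11)", "title": "varchar(255)", "logo": "blob"},
    "django_session": {"session_key": "varchar(40)"},
}
commands = plan_imports(
    "jdbc:mysql://db.example.com/core", schema, "/warehouse/mysql",
    exclude_tables={"django_session"},
    exclude_columns={"password_hash"},
    exclude_types={"blob"},
)
# Dry run: format each command as a string instead of executing it.
dry_run = run_in_parallel(commands, runner=" ".join)
```

To actually launch the imports, the `runner` could be swapped for something like `subprocess.run`; the dry run keeps the sketch safe to execute without a cluster.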

Infrastructure - Blammo

• Raw logs are imported to HDFS via Flume
• Almost real-time – 5 min latency
• Logs are key-value pairs in JSON
• Each log producer publishes its schema in YAML
• Hive schema and YAML schema are kept in sync using Thrift
• Control exclusion and inclusion

You will like to attend this event

Recommendation Engines

• Interest Graph Based (Your friends who like rock music, like you, are attending Eric Clapton Event – Eventbrite)
• Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK – Facebook, LinkedIn)
• Collaborative Filtering – Item-Item Similarity (You like Godfather so you will like Scarface - Netflix)
• Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
• Item Hierarchy (You bought camera so you need batteries - Amazon)

Why Interest?

Events are Social
Events are Interest

Dense Graph is Irrelevant
Interests are Changing

How do we know your Interest?

• We ask you
• Based on your activity
• Events Attended
• Events Browsed
• Facebook Interests
• User Interest has to match Event category
• Static
• Machine Learning
• Logistic Regression using MLE
• Sparse Matrix is generated using MapReduce
• A model for each interest
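
A minimal sketch of the machine-learning bullets above: one logistic regression model per interest, fitted by maximizing the log-likelihood with stochastic gradient ascent over sparse rows. The slides do not show the training code or the features, so the feature meanings, the toy data, the learning rate, and the function names are all assumptions, and the sparse matrix is built by hand here rather than by MapReduce.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, features):
    """Probability of the interest; `features` is a sparse row {index: value}."""
    z = sum(weights.get(i, 0.0) * v for i, v in features.items())
    return sigmoid(z)


def train(rows, labels, epochs=200, learning_rate=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for features, y in zip(rows, labels):
            # Gradient of the log-likelihood for one example is (y - p) * x.
            error = y - predict(weights, features)
            for i, v in features.items():
                weights[i] = weights.get(i, 0.0) + learning_rate * error * v
    return weights


# Toy sparse matrix: one row per user. Hypothetical features:
# 0 = bias term, 1 = attended a concert, 2 = browsed tech meetups.
rows = [
    {0: 1.0, 1: 1.0},
    {0: 1.0, 1: 1.0, 2: 1.0},
    {0: 1.0, 2: 1.0},
    {0: 1.0},
]
labels_by_interest = {
    "music": [1, 1, 0, 0],
    "tech": [0, 1, 1, 0],
}

# A model for each interest.
models = {interest: train(rows, labels)
          for interest, labels in labels_by_interest.items()}
```

Scoring a user is then a matter of calling `predict(models[interest], user_row)` for every interest and keeping the interests whose probability clears a chosen threshold.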

Model Based vs Clustering

Item-Item vs User-User

Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem

Implicit Social Graph

[Diagram: user nodes U1 to U5 linked through event nodes E1 to E4]

Mixed Social Graph

[Diagram: user nodes U1 to U5 linked through event nodes E1 to E3, plus FB and LI links]

15M * 260 * 260 ≈ 1.01 Trillion Edges
4 billion edges ranked
Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship

Feature Generation

• Mixed Features
• A series of map-reduce jobs
• Output on HDFS in flat files; Input to subsequent jobs
• Orders = Event → Attendees
• MAP: eid: uid
• REDUCE: eid:[uid]
• Attendees → Social Graph
• Input: eid:[uid]
• MAP: uid_i:[uid]
• REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining etc
• Upload feature values to HBase
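
The two jobs listed above (orders to attendees, then attendees to social graph) can be sketched in memory as follows. This is not the talk's Java MapReduce code: `run_mapreduce` is a hypothetical stand-in for the framework, and the order records are made up.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: Orders = Event -> Attendees
def orders_map(order):
    # MAP: eid: uid
    yield order["eid"], order["uid"]


def attendees_reduce(eid, uids):
    # REDUCE: eid: [uid]
    return eid, sorted(set(uids))


# Job 2: Attendees -> Social Graph
def attendees_map(event):
    # MAP: uid_i: [uid], every attendee paired with the other attendees
    _eid, uids = event
    for uid in uids:
        yield uid, [other for other in uids if other != uid]


def neighbors_reduce(uid, neighbor_lists):
    # REDUCE: uid: [neighbors], merged across all events the user attended
    neighbors = set()
    for neighbor_list in neighbor_lists:
        neighbors.update(neighbor_list)
    return uid, sorted(neighbors)


orders = [
    {"eid": "E1", "uid": "U1"},
    {"eid": "E1", "uid": "U2"},
    {"eid": "E1", "uid": "U3"},
    {"eid": "E2", "uid": "U2"},
    {"eid": "E2", "uid": "U4"},
]

# The output of job 1 is the input to job 2, as with flat files on HDFS.
attendees = run_mapreduce(orders, orders_map, attendees_reduce)
graph = dict(run_mapreduce(attendees, attendees_map, neighbors_reduce))
```

Here U2 attended both events, so it ends up with neighbors from E1 and E2, while U1 and U4 never become neighbors of each other.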

HBase

• Collect data from multiple MapReduce jobs
• Stores entire social graph
• Over one million writes per second

HBase

rowid    neighbors  events  featureX
2718282  101        3       0.3678795

HBase

rowid    314159:n  314159:e  314159:fx  161803:n  161803:e  161803:fx
2718282  31        1         0.3183     83        2         0.618
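
One way to read the second table is as a wide row: a single row per user, with one group of column qualifiers per neighbor. The sketch below models that layout with plain dictionaries rather than an HBase client. The interpretation is an assumption: the slides do not say that the numeric prefixes are neighbor user IDs, nor what `n`, `e`, and `fx` stand for, and the helper names are hypothetical.

```python
def wide_row(edges):
    """Flatten {neighbor_id: {feature: value}} into qualifier -> value pairs."""
    row = {}
    for neighbor_id, features in sorted(edges.items()):
        for name, value in features.items():
            row[f"{neighbor_id}:{name}"] = value
    return row


def edge_features(row, neighbor_id):
    """Recover the feature values stored for one neighbor from a wide row."""
    prefix = f"{neighbor_id}:"
    return {qualifier[len(prefix):]: value
            for qualifier, value in row.items()
            if qualifier.startswith(prefix)}


# Values taken from the table above for row 2718282.
edges = {
    314159: {"n": 31, "e": 1, "fx": 0.3183},
    161803: {"n": 83, "e": 2, "fx": 0.618},
}
row = wide_row(edges)
```

With this layout, all edge feature vectors for a user can be read with a single row lookup, and `edge_features(row, 161803)` pulls out one relationship.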

Tips & Tricks

• Distributed cache database
• Sped up some MapReduce jobs by hours
• Be sure to use counters!

Tips & Tricks

• Hive (ab)uses
• Almost as many Hive jobs as custom ones
• “flip join”
• Statistical functions using Hive
• UDF

Tips & Tricks

• Memory Memory Memory
• LZO, WAL
• Combiners are great until
• Shuffle and Sorting stage
• Hadoop ecosystem is still new
