COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS from Structure:Data 2013

COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS

SPEAKER: Vipul Sharma
Director of Data Engineering
Eventbrite

Monday, April 1, 13

Real Time Data Processing at Scale
Vipul Sharma – Director of Data Engineering

Eventbrite by the Numbers

1.5 million events
80 million tickets sold
$1 billion in gross ticket sales
Events in 179 countries

Who am I?

Director of Data Engineering at Eventbrite
Infrastructure, Data Science, Analytics, Spam and Fraud

linkedin.com/in/vipulsharma3
@vipulsharma
vipul@eventbrite.com

Real Time

• Definition of real time varies with use case
• Real time at scale is a challenge
• Active learning requires real time data processing
    • Spam/Fraud
    • Discovery
    • Search
• Analytics
    • Real time analytics
• Data Changes
    • Changes in inventory, user settings, etc.

Scaling for Growth

• Decouple Services
    • Decouple services based on CAP, Size and Growth
    • NoSQL attractive for out of the box sharding, replication and multi data center support along with high write speeds
    • Multiple data stores pose the challenge of data flow between services in real time
• Batch Processing
    • Batch processing for big data, e.g. data science, analytics, etc.
    • MapReduce is not built for real time
    • Data locality requires data to be stored on HDFS
    • Data sync to Hadoop in real time is a challenge

Challenges with Real Time
• Data Flow
    • How to transfer data captured in logs to services in real time
    • How to transfer data captured in database to services in real time
• Data Processing
    • How to process significant data in real time
    • Distributed data processing for real time

Data Flow

• Database polling
    • Rather than each application polling the database, build a single polling service
    • Downstream applications poll this service
    • Built for consistency and read scalability
    • Example: Event Cache
    • Excited about LinkedIn’s Databus - http://data.linkedin.com/projects/databus
• Persisted Queues
    • Transfer logs via a distributed persisted message queue
    • Downstream applications subscribe to these queues and receive a stream of data
    • Example: Firehose
    • Excited about LinkedIn’s Kafka - http://kafka.apache.org/index.html
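
The persisted-queue idea above can be sketched in a few lines. This is a minimal in-memory illustration, not Eventbrite's Firehose and not the Kafka API: the class name `PersistedLog`, its methods, and the subscriber names are all hypothetical, and a Python list stands in for durable, distributed storage. It shows the one property the slide relies on: records are appended once, and each downstream application reads at its own pace by tracking its own offset.

```python
from collections import defaultdict


class PersistedLog:
    """Hypothetical append-only log; each subscriber keeps its own read offset."""

    def __init__(self):
        self._entries = []                # stand-in for durable storage
        self._offsets = defaultdict(int)  # subscriber name -> next index to read

    def publish(self, record):
        """Append a record and return the offset it was stored at."""
        self._entries.append(record)
        return len(self._entries) - 1

    def poll(self, subscriber, max_records=100):
        """Return unread records for this subscriber and advance its offset."""
        start = self._offsets[subscriber]
        batch = self._entries[start:start + max_records]
        self._offsets[subscriber] = start + len(batch)
        return batch


log = PersistedLog()
for line in ["GET /events/1", "POST /orders", "GET /events/2"]:
    log.publish(line)

# Two independent consumers read the same stream without affecting each other.
fraud_batch = log.poll("fraud", max_records=2)
search_batch = log.poll("search")
fraud_rest = log.poll("fraud")
```

Because the log keeps the records rather than deleting them on delivery, a slow or newly added subscriber can catch up from its own offset without the producer knowing it exists.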

Data Processing

• Denormalization
    • Write data ready to serve
    • NoSQL built for denormalization
    • Example: See who’s visiting
• Distributed Data Processing
    • Complex business logic needs more than denormalization
    • Example: API stats using Storm
    • http://storm-project.net/
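
As a rough illustration of the kind of computation an "API stats" stream job performs, the sketch below counts API calls per endpoint in fixed time windows. It is plain single-process Python, not Storm, and does not reflect how Eventbrite built its topology; the `ApiStats` class, the 60-second window, and the sample endpoints are assumptions made for the example. In a distributed setting the same per-event update would be partitioned across workers, for instance by endpoint.

```python
from collections import Counter, defaultdict


def window_start(timestamp, window_seconds):
    """Round a timestamp (in seconds) down to the start of its window."""
    return timestamp - (timestamp % window_seconds)


class ApiStats:
    """Hypothetical per-window call counter, updated one event at a time."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.counts = defaultdict(Counter)  # window start -> endpoint -> calls

    def record(self, timestamp, endpoint):
        """Count one API call in the window its timestamp falls into."""
        bucket = window_start(timestamp, self.window_seconds)
        self.counts[bucket][endpoint] += 1

    def top(self, window, n=3):
        """Return the n most-called endpoints for a given window start."""
        return self.counts.get(window, Counter()).most_common(n)


stats = ApiStats(window_seconds=60)
calls = [(0, "/events"), (10, "/events"), (59, "/orders"),
         (60, "/events"), (125, "/orders")]
for ts, endpoint in calls:
    stats.record(ts, endpoint)

first_minute = stats.top(0)
```

Each event updates a counter as it arrives, so results are available while the window is still open, which is the contrast with a batch MapReduce job over the same logs.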

Questions?

See it in action. Download our app:
eventbrite.com/eventbriteapp

Thank You!
@vipulsharma / vipul@eventbrite.com