COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS from Structure:Data 2013

Presentation from Vipul Sharma, Eventbrite

#dataconf
More at http://event.gigaom.com/structuredata/

    Presentation Transcript

    • COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS
      SPEAKER: Vipul Sharma, Director of Data Engineering, Eventbrite
      Monday, April 1, 13
    • Real Time Data Processing at Scale
      Vipul Sharma – Director of Data Engineering
    • Eventbrite by the Numbers
        • 1.5 million events
        • 80 million tickets sold
        • $1 billion in gross ticket sales
        • Events in 179 countries
    • Who am I?
        • Director of Data Engineering at Eventbrite
        • Infrastructure, Data Science, Analytics, Spam and Fraud
        • linkedin.com/in/vipulsharma3
        • @vipulsharma
        • vipul@eventbrite.com
    • Real Time
        • Definition of real time varies with use case
        • Real time at scale is a challenge
        • Active learning requires real time data processing
            • Spam/Fraud
            • Discovery
            • Search
        • Analytics
            • Real time analytics
        • Data Changes
            • Changes in inventory, user settings, etc.
    • Scaling for Growth
        • Decouple Services
            • Decouple services based on CAP, size, and growth
            • NoSQL is attractive for out-of-the-box sharding, replication, and multi-data-center support, along with high write speeds
            • Multiple data stores pose the challenge of moving data between services in real time
        • Batch Processing
            • Batch processing for big data, e.g. data science, analytics, etc.
            • MapReduce is not built for real time
            • Data locality requires data to be stored on HDFS
            • Syncing data to Hadoop in real time is a challenge
    • Challenges with Real Time
        • Data Flow
            • How to transfer data captured in logs to services in real time
            • How to transfer data captured in the database to services in real time
        • Data Processing
            • How to process significant data in real time
            • Distributed data processing for real time
    • Data Flow
        • Database polling
            • Rather than having each application poll the database, build a single polling service
            • Downstream applications poll from this service
            • Built for consistency and read scalability
            • Example: Event Cache
            • Excited about LinkedIn's Databus - http://data.linkedin.com/projects/databus
        • Persisted Queues
            • Transfer logs via a distributed persisted message queue
            • Downstream applications subscribe to these queues, getting a stream of data
            • Example: Firehose
            • Excited about LinkedIn's Kafka - http://kafka.apache.org/index.html
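
The "single polling service" idea on this slide can be illustrated with a minimal in-memory sketch. This is not Eventbrite's Event Cache or LinkedIn's Databus; `ChangeSource`, `PollingService`, and the `(version, key, payload)` row shape are hypothetical names invented for illustration, and a plain list stands in for a database table with an increasing version column.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical change record: (version, key, payload)
Row = Tuple[int, str, dict]


@dataclass
class ChangeSource:
    """Stand-in for a database table with a monotonically increasing version."""
    rows: List[Row] = field(default_factory=list)

    def write(self, key: str, payload: dict) -> None:
        # Each write gets the next version number.
        self.rows.append((len(self.rows) + 1, key, payload))

    def changes_since(self, version: int) -> List[Row]:
        # The one query the polling service runs against the source.
        return [row for row in self.rows if row[0] > version]


@dataclass
class PollingService:
    """Polls the source once and fans each change out to all subscribers."""
    source: ChangeSource
    last_version: int = 0
    subscribers: List[Callable[[Row], None]] = field(default_factory=list)

    def subscribe(self, callback: Callable[[Row], None]) -> None:
        self.subscribers.append(callback)

    def poll_once(self) -> int:
        changes = self.source.changes_since(self.last_version)
        for row in changes:
            for callback in self.subscribers:
                callback(row)
            # Advance the watermark only after every subscriber saw the row.
            self.last_version = row[0]
        return len(changes)
```

The design point is that the source is queried once per poll no matter how many downstream applications subscribe, which is where the read scalability on the slide would come from. A real service would also need durable storage for the watermark and per-subscriber error handling, both omitted here.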
    • Data Processing
        • Denormalization
            • Write data ready to serve
            • NoSQL built for denormalization
            • Example: See who's visiting
        • Distributed Data Processing
            • Complex business logic needs more than denormalization
            • Example: API stats using Storm
            • http://storm-project.net/
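
The slide does not show how the API stats are computed, so the following is only a rough sketch of the kind of per-key counting a stream processor performs. It is plain Python rather than the Storm API, and the `(timestamp, endpoint)` record shape and the `count_per_minute` function are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical log record: (unix timestamp in seconds, API endpoint)
Request = Tuple[float, str]


def count_per_minute(stream: Iterable[Request]) -> Dict[int, Counter]:
    """Count requests per endpoint in one-minute buckets.

    Records are consumed one at a time, so the input can be an
    unbounded iterator as long as old buckets are drained elsewhere.
    """
    buckets: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, endpoint in stream:
        minute = int(timestamp // 60)  # bucket index since the epoch
        buckets[minute][endpoint] += 1
    return buckets
```

In a distributed setting the same logic would typically be partitioned by endpoint so that each worker owns the counters for its own keys; that partitioning is left out of this single-process sketch.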
    • Questions? See it in action. Download our app: eventbrite.com/eventbriteapp
    • Thank You! @vipulsharma / vipul@eventbrite.com