Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
1. Hadoop and Vertica
The Data Analytics Platform at Twitter
Bill Graham - @billgraham
Data Systems Engineer, Analytics Infrastructure
Hadoop Summit, June 2012
4. We count things
• 140 characters
• 140M active users
• 400M tweets per day
• 80-100 TB ingested daily (uncompressed)
• Tens of thousands of Hadoop jobs daily
6. Data flow: Analytics
[Diagram, built up incrementally across slides 6-12. Production hosts log application events via Scribe to aggregators on a staging Hadoop cluster; a Log Mover ships logs into the main Hadoop DW (HDFS + HBase). Third-party imports and a distributed crawler also feed HDFS. Crane transfers data between MySQL/Gizzard (tweets, user profiles, social graph) and the warehouse; Oink runs Pig jobs over the warehouse; Rasvelg builds aggregates; HCatalog registers warehouse data. Results land in the analytics stores (Vertica, MySQL) behind web tools, serving analysts, engineers, PMs, and sales.]
14. System concepts
• Loose coupling
• Job coordination as a service
• Resource management as a service
• Idempotence
15. Loose coupling
• Multiple job frameworks
• Right tool for the job
• Common dependency management
16. Job coordination
• Shared batch table for job state:
  (id, description, state, start_time, end_time, job_start_time, job_end_time)
• Access via client libraries
• Jobs & data are time-based
• 3 types of preconditions
  1. other job success (e.g., predecessor job complete)
  2. existence of data (e.g., HDFS input exists)
  3. user-defined (e.g., MySQL slave lag)
• Failed jobs get retried (usually)
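The three precondition types above can be sketched as predicate functions that a coordinator evaluates before releasing a job. This is an illustrative Python sketch under assumed names (Job, batch_table, etc.), not Twitter's actual client library:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Job:
    name: str
    preconditions: List[Callable[[], bool]] = field(default_factory=list)

    def ready(self) -> bool:
        # A job may run only when every precondition holds.
        return all(check() for check in self.preconditions)

# 1. Other job success: predecessor marked SUCCESS in the shared batch table.
batch_table = {"log_mover_2012_06_14": "SUCCESS"}
def predecessor_complete():
    return batch_table.get("log_mover_2012_06_14") == "SUCCESS"

# 2. Existence of data: the HDFS input path exists (stubbed with a set here).
hdfs_paths = {"/processed/web_events/2012/06/14"}
def hdfs_input_exists():
    return "/processed/web_events/2012/06/14" in hdfs_paths

# 3. User-defined: e.g. MySQL slave lag is below a threshold.
slave_lag_seconds = 30
def slave_lag_ok():
    return slave_lag_seconds < 300

job = Job("active_users", [predecessor_complete, hdfs_input_exists, slave_lag_ok])
print(job.ready())  # True when all three preconditions hold
```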
21. Resource management
• Analytics Resource Manager - ARM!
• Library on top of ZooKeeper
• Throttles jobs and workers
  • Only 1 job of this name may run at once
  • Only N jobs may be run by this app at once
  • Only M mappers may write to Vertica at once
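The three throttle rules could be modeled roughly as follows. In-process semaphores and locks stand in for the ZooKeeper recipes the real library uses across hosts; all class and method names here are hypothetical:

```python
import threading
from contextlib import contextmanager, ExitStack

class ResourceManager:
    """Illustrative stand-in for ARM; not the actual Twitter library."""
    def __init__(self, app_limit: int, vertica_writer_limit: int):
        self._job_locks = {}                  # one lock per job name
        self._lock = threading.Lock()
        self.app_slots = threading.BoundedSemaphore(app_limit)
        self.vertica_slots = threading.BoundedSemaphore(vertica_writer_limit)

    def _job_lock(self, name):
        with self._lock:
            return self._job_locks.setdefault(name, threading.Lock())

    @contextmanager
    def run_job(self, name, writes_vertica=False):
        with ExitStack() as stack:
            # Rule 1: only one job of this name may run at once.
            stack.enter_context(self._job_lock(name))
            # Rule 2: at most N concurrent jobs for this app.
            self.app_slots.acquire()
            stack.callback(self.app_slots.release)
            # Rule 3: at most M concurrent writers to Vertica.
            if writes_vertica:
                self.vertica_slots.acquire()
                stack.callback(self.vertica_slots.release)
            yield

arm = ResourceManager(app_limit=4, vertica_writer_limit=2)
with arm.run_job("active_users", writes_vertica=True):
    pass  # job body runs here with all three constraints held
```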
24. Job DAG & state transition
“Local View”
• Is it time for me to run yet?
• Are my dependencies satisfied?
• Any resource constraints?
[State diagram: a job sits in Idle until its resource request is granted (denied requests loop back to Idle); on grant it inserts an entry into the batch table and moves to Execution; once execution completes it transitions to Completion, otherwise it stays in Execution. Batch table schema: (id, description, state, start_time, end_time, job_start_time, job_end_time).]
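The Idle/Execution/Completion loop can be sketched as a small state machine; the structure and names below are illustrative, not the actual scheduler code:

```python
import enum

class State(enum.Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"

class JobRunner:
    """Toy per-job state machine mirroring the slide's diagram."""
    def __init__(self, preconditions_ok, resource_granted, batch_table):
        self.preconditions_ok = preconditions_ok
        self.resource_granted = resource_granted
        self.batch_table = batch_table        # shared table of job runs
        self.state = State.IDLE

    def tick(self, job_id, run):
        if self.state is State.IDLE:
            # Stay idle until preconditions hold and ARM grants the resource.
            if self.preconditions_ok() and self.resource_granted():
                self.batch_table[job_id] = "RUNNING"   # insert batch entry
                self.state = State.EXECUTION
        elif self.state is State.EXECUTION:
            if run():                                   # complete?
                self.batch_table[job_id] = "SUCCESS"
                self.state = State.COMPLETION
            # on failure, remain in EXECUTION and retry on the next tick

table = {}
runner = JobRunner(lambda: True, lambda: True, table)
runner.tick("active_users_2012_06_14", run=lambda: True)  # Idle -> Execution
runner.tick("active_users_2012_06_14", run=lambda: True)  # Execution -> Completion
```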
27. Example: active users
[Job DAG, built up incrementally across slides 27-32. Production hosts write web_events and sms_events via Scribe; the Log Mover ships them to the main Hadoop DW via the staging cluster. An Oink/Pig job (cleanse, filter, transform, geo lookup, union, distinct) produces user_sessions. Crane imports user_profiles from MySQL/Gizzard. Rasvelg then joins, groups, and counts to build aggregations such as active_by_geo, active_by_device, and active_by_client, which land in the analytics stores (Vertica, MySQL) behind dashboards.]
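The aggregation chain in this example (union, distinct, join with profiles, then group and count) can be mimicked in a few lines of Python; the field names and data below are made up for illustration:

```python
from collections import Counter

web_events = [{"user": 1, "client": "web"}, {"user": 2, "client": "web"},
              {"user": 1, "client": "web"}]            # duplicate session
sms_events = [{"user": 3, "client": "sms"}]
user_profiles = {1: "US", 2: "JP", 3: "US"}            # user -> geo

# Union + distinct: one (user, client) session record per user/client pair.
sessions = {(e["user"], e["client"]) for e in web_events + sms_events}

# Join with profiles, then group-and-count actives per geo
# (the Rasvelg join/group/count step from the slide).
active_by_geo = Counter(user_profiles[user] for user, _ in sessions)
print(dict(active_by_geo))  # {'US': 2, 'JP': 1} (key order may vary)
```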
34. Vertica or Hadoop?
• Vertica
  • Loads hundreds of thousands of rows per second
  • Aggregates hundreds of millions of rows in seconds
  • Used for low-latency queries and aggregations
  • Keeps a sliding window of data
• Hadoop
  • Excels when data size is massive
  • Flexible and powerful
  • Great with nested data structures and unstructured data
  • Used for complex functions and ML
35. Vertica import options
• Direct import via Crane
  • Loads into the destination table, single thread
• Atomic import via Crane/Rasvelg
  • Crane loads into a temp table, single thread
  • Rasvelg moves rows to the destination table
• Parallel import via Oink/Pig
  • Pig job via VerticaStorer
• ARM throttles active DB connections
[Diagram: Crane, Rasvelg, and Oink paths from MySQL/Gizzard and the main Hadoop DW into Vertica]
36. Vertica imports - pros/cons
• Crane & Rasvelg
  • Good for smaller datasets, DB-to-DB transfers
  • Single threaded
  • Easy on Vertica
  • Hadoop not required
• Pig
  • Great for larger datasets
  • More complex, not atomic
  • DDoS potential
[Diagram: Crane, Rasvelg, and Oink paths from MySQL/Gizzard and the main Hadoop DW into Vertica]
37. VerticaStorer
• Pig StoreFunc implementation
• From Vertica’s Hadoop connector suite
• Out of the box
  • Easy to get Hello World working
  • Well documented
  • Pig/Vertica data bindings work well
  • Fast!
  • Transaction-aware tasks
  • No bugs found
• Open source?
38. Pig VerticaStorage
• Our enhancements
  • Connection credential management
  • Truncate-before-load option
  • Throttle concurrent writers via ZK
• Future features
  • Counters for rows inserted/rejected
  • Name-based tuple-column bindings
  • Atomic load via temp table

SET mapred.map.tasks.speculative.execution false;

user_sessions = LOAD '/processed/user_sessions/2012/06/14';
STORE user_sessions INTO '{db_schema.user_sessions}' USING
  com.twitter.twadoop.pig.store.VerticaStorage(
    'config/db.yml', 'db_name', 'arm_resource_name');
40. Gotcha #1
• MR data load is not atomic
• Avoid partial reads
• Option 1: load to a temp table, then insert into the destination
• Option 2: add a job dependency concept
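Option 1 can be sketched end to end. SQLite stands in for Vertica here and the table names are illustrative; the point is that readers only ever see the destination table before or after the single publishing transaction, never a partial load:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_sessions (user_id INTEGER, day TEXT)")
conn.execute("CREATE TABLE user_sessions_staging (user_id INTEGER, day TEXT)")

# Phase 1: many writers (e.g. map tasks) append to the staging table.
rows = [(1, "2012-06-14"), (2, "2012-06-14")]
conn.executemany("INSERT INTO user_sessions_staging VALUES (?, ?)", rows)

# Phase 2: a single atomic INSERT..SELECT publishes the whole batch at once.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO user_sessions SELECT * FROM user_sessions_staging")
    conn.execute("DELETE FROM user_sessions_staging")

print(conn.execute("SELECT COUNT(*) FROM user_sessions").fetchone()[0])  # 2
```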
41. Gotcha #2
• Speculative execution is not always your friend
• Launches more tasks than needed, just in case
• For non-idempotent jobs, extra tasks == BAD
42. Gotcha #3
• isIdempotent() must be a first-class concept
• Loader jobs will fail
• Failure after first task success == not good
• Can’t automate retry without cleanup
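A retry loop that treats idempotence as a first-class concept might look like the following sketch; all names (run_with_retry, FlakyLoader, etc.) are hypothetical:

```python
def run_with_retry(job, attempts=3):
    for _ in range(attempts):
        try:
            return job.run()
        except Exception:
            if not job.is_idempotent():
                # Re-running would duplicate already-loaded rows, so the
                # partial load must be cleaned up before it is safe to retry.
                job.cleanup()
    raise RuntimeError("job failed after %d attempts" % attempts)

class FlakyLoader:
    """Fails once, succeeds on retry; tracks rows 'loaded' into the sink."""
    def __init__(self):
        self.sink, self.calls = [], 0
    def is_idempotent(self):
        return False
    def cleanup(self):
        self.sink.clear()                 # undo the partial load
    def run(self):
        self.calls += 1
        self.sink.append("row")
        if self.calls == 1:
            raise IOError("task failed after first insert")
        return len(self.sink)

loader = FlakyLoader()
print(run_with_retry(loader))  # 1: the partial first attempt was cleaned up
```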
43. Gotcha #4
• Vendor code only gets you so far
• Nice-to-haves == have to write
• Favor the decorator pattern
• Pig’s StoreFuncWrapper can help
• Vendor open-sourcing is ideal
44. Future work
• More VerticaStorer features
• Multiple Vertica clusters
• Atomic DB loads with Pig/Oink
• Better DAG visibility
• Better job history visibility
• MR job optimizations via historic stats
• HCatalog data registry
• Job push events