Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
1. Hadoop and Vertica
The Data Analytics Platform at Twitter
Bill Graham - @billgraham
Data Systems Engineer, Analytics Infrastructure
Hadoop Summit, June 2012
4. We count things
• 140 characters
• 140M active users
• 400M tweets per day
• 80-100 TB ingested daily (uncompressed)
• Tens of thousands of Hadoop jobs daily
6. Data flow: Analytics
[Diagram, built up incrementally across slides 6-12. Production hosts log application events via Scribe to aggregators on a staging Hadoop cluster; a Log Mover ships logs into the main Hadoop DW (HDFS + HBase). Third-party imports and a distributed crawler also feed HDFS. Crane transfers data between MySQL/Gizzard (tweets, user profiles, social graph) and the warehouse; Oink runs Pig jobs over the warehouse; Rasvelg builds aggregates; HCatalog registers warehouse data. Results land in the analytics stores (Vertica, MySQL) behind web tools, serving analysts, engineers, PMs, and sales.]
14. System concepts
• Loose coupling
• Job coordination as a service
• Resource management as a service
• Idempotence
15. Loose coupling
• Multiple job frameworks
• Right tool for the job
• Common dependency management
16. Job coordination
• Shared batch table for job state:
  (id, description, state, start_time, end_time, job_start_time, job_end_time)
• Access via client libraries
• Jobs & data are time-based
• 3 types of preconditions
  1. other job success (e.g., predecessor job complete)
  2. existence of data (e.g., HDFS input exists)
  3. user-defined (e.g., MySQL slave lag)
• Failed jobs get retried (usually)
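The three precondition types above can be sketched as predicate functions that a coordinator evaluates before releasing a job. This is an illustrative Python sketch under assumed names (Job, batch_table, etc.), not Twitter's actual client library:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Job:
    name: str
    preconditions: List[Callable[[], bool]] = field(default_factory=list)

    def ready(self) -> bool:
        # A job may run only when every precondition holds.
        return all(check() for check in self.preconditions)

# 1. Other job success: predecessor marked SUCCESS in the shared batch table.
batch_table = {"log_mover_2012_06_14": "SUCCESS"}
def predecessor_complete():
    return batch_table.get("log_mover_2012_06_14") == "SUCCESS"

# 2. Existence of data: the HDFS input path exists (stubbed with a set here).
hdfs_paths = {"/processed/web_events/2012/06/14"}
def hdfs_input_exists():
    return "/processed/web_events/2012/06/14" in hdfs_paths

# 3. User-defined: e.g. MySQL slave lag is below a threshold.
slave_lag_seconds = 30
def slave_lag_ok():
    return slave_lag_seconds < 300

job = Job("active_users", [predecessor_complete, hdfs_input_exists, slave_lag_ok])
print(job.ready())  # True when all three preconditions hold
```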
21. Resource management
• Analytics Resource Manager - ARM!
• Library on top of ZooKeeper
• Throttles jobs and workers
  • Only 1 job of this name may run at once
  • Only N jobs may be run by this app at once
  • Only M mappers may write to Vertica at once
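The three throttle rules could be modeled roughly as follows. In-process semaphores and locks stand in for the ZooKeeper recipes the real library uses across hosts; all class and method names here are hypothetical:

```python
import threading
from contextlib import contextmanager, ExitStack

class ResourceManager:
    """Illustrative stand-in for ARM; not the actual Twitter library."""
    def __init__(self, app_limit: int, vertica_writer_limit: int):
        self._job_locks = {}                  # one lock per job name
        self._lock = threading.Lock()
        self.app_slots = threading.BoundedSemaphore(app_limit)
        self.vertica_slots = threading.BoundedSemaphore(vertica_writer_limit)

    def _job_lock(self, name):
        with self._lock:
            return self._job_locks.setdefault(name, threading.Lock())

    @contextmanager
    def run_job(self, name, writes_vertica=False):
        with ExitStack() as stack:
            # Rule 1: only one job of this name may run at once.
            stack.enter_context(self._job_lock(name))
            # Rule 2: at most N concurrent jobs for this app.
            self.app_slots.acquire()
            stack.callback(self.app_slots.release)
            # Rule 3: at most M concurrent writers to Vertica.
            if writes_vertica:
                self.vertica_slots.acquire()
                stack.callback(self.vertica_slots.release)
            yield

arm = ResourceManager(app_limit=4, vertica_writer_limit=2)
with arm.run_job("active_users", writes_vertica=True):
    pass  # job body runs here with all three constraints held
```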
24. Job DAG & state transition
“Local View”
• Is it time for me to run yet?
• Are my dependencies satisfied?
• Any resource constraints?
[State diagram: a job sits in Idle until its resource request is granted (denied requests loop back to Idle); on grant it inserts an entry into the batch table and moves to Execution; once execution completes it transitions to Completion, otherwise it stays in Execution. Batch table schema: (id, description, state, start_time, end_time, job_start_time, job_end_time).]
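The Idle/Execution/Completion loop can be sketched as a small state machine; the structure and names below are illustrative, not the actual scheduler code:

```python
import enum

class State(enum.Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"

class JobRunner:
    """Toy per-job state machine mirroring the slide's diagram."""
    def __init__(self, preconditions_ok, resource_granted, batch_table):
        self.preconditions_ok = preconditions_ok
        self.resource_granted = resource_granted
        self.batch_table = batch_table        # shared table of job runs
        self.state = State.IDLE

    def tick(self, job_id, run):
        if self.state is State.IDLE:
            # Stay idle until preconditions hold and ARM grants the resource.
            if self.preconditions_ok() and self.resource_granted():
                self.batch_table[job_id] = "RUNNING"   # insert batch entry
                self.state = State.EXECUTION
        elif self.state is State.EXECUTION:
            if run():                                   # complete?
                self.batch_table[job_id] = "SUCCESS"
                self.state = State.COMPLETION
            # on failure, remain in EXECUTION and retry on the next tick

table = {}
runner = JobRunner(lambda: True, lambda: True, table)
runner.tick("active_users_2012_06_14", run=lambda: True)  # Idle -> Execution
runner.tick("active_users_2012_06_14", run=lambda: True)  # Execution -> Completion
```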
27. Example: active users
[Job DAG, built up incrementally across slides 27-32. Production hosts write web_events and sms_events via Scribe; the Log Mover ships them to the main Hadoop DW via the staging cluster. An Oink/Pig job (cleanse, filter, transform, geo lookup, union, distinct) produces user_sessions. Crane imports user_profiles from MySQL/Gizzard. Rasvelg then joins, groups, and counts to build aggregations such as active_by_geo, active_by_device, and active_by_client, which land in the analytics stores (Vertica, MySQL) behind dashboards.]
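The aggregation chain in this example (union, distinct, join with profiles, then group and count) can be mimicked in a few lines of Python; the field names and data below are made up for illustration:

```python
from collections import Counter

web_events = [{"user": 1, "client": "web"}, {"user": 2, "client": "web"},
              {"user": 1, "client": "web"}]            # duplicate session
sms_events = [{"user": 3, "client": "sms"}]
user_profiles = {1: "US", 2: "JP", 3: "US"}            # user -> geo

# Union + distinct: one (user, client) session record per user/client pair.
sessions = {(e["user"], e["client"]) for e in web_events + sms_events}

# Join with profiles, then group-and-count actives per geo
# (the Rasvelg join/group/count step from the slide).
active_by_geo = Counter(user_profiles[user] for user, _ in sessions)
print(dict(active_by_geo))  # {'US': 2, 'JP': 1} (key order may vary)
```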
34. Vertica or Hadoop?
• Vertica
  • Loads hundreds of thousands of rows per second
  • Aggregates hundreds of millions of rows in seconds
  • Used for low-latency queries and aggregations
  • Keeps a sliding window of data
• Hadoop
  • Excels when data size is massive
  • Flexible and powerful
  • Great with nested data structures and unstructured data
  • Used for complex functions and ML
35. Vertica import options
• Direct import via Crane
  • Loads into the destination table, single thread
• Atomic import via Crane/Rasvelg
  • Crane loads into a temp table, single thread
  • Rasvelg moves rows to the destination table
• Parallel import via Oink/Pig
  • Pig job via VerticaStorer
• ARM throttles active DB connections
[Diagram: Crane, Rasvelg, and Oink paths from MySQL/Gizzard and the main Hadoop DW into Vertica]
36. Vertica imports - pros/cons
• Crane & Rasvelg
  • Good for smaller datasets, DB-to-DB transfers
  • Single threaded
  • Easy on Vertica
  • Hadoop not required
• Pig
  • Great for larger datasets
  • More complex, not atomic
  • DDoS potential
[Diagram: Crane, Rasvelg, and Oink paths from MySQL/Gizzard and the main Hadoop DW into Vertica]
37. VerticaStorer
• Pig StoreFunc implementation
• From Vertica’s Hadoop connector suite
• Out of the box
  • Easy to get Hello World working
  • Well documented
  • Pig/Vertica data bindings work well
  • Fast!
  • Transaction-aware tasks
  • No bugs found
• Open source?
38. Pig VerticaStorage
• Our enhancements
  • Connection credential management
  • Truncate-before-load option
  • Throttle concurrent writers via ZK
• Future features
  • Counters for rows inserted/rejected
  • Name-based tuple-column bindings
  • Atomic load via temp table

SET mapred.map.tasks.speculative.execution false;

user_sessions = LOAD '/processed/user_sessions/2012/06/14';
STORE user_sessions INTO '{db_schema.user_sessions}' USING
  com.twitter.twadoop.pig.store.VerticaStorage(
    'config/db.yml', 'db_name', 'arm_resource_name');
40. Gotcha #1
• MR data load is not atomic
• Avoid partial reads
• Option 1: load to a temp table, then insert into the destination
• Option 2: add a job dependency concept
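Option 1 can be sketched end to end. SQLite stands in for Vertica here and the table names are illustrative; the point is that readers only ever see the destination table before or after the single publishing transaction, never a partial load:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_sessions (user_id INTEGER, day TEXT)")
conn.execute("CREATE TABLE user_sessions_staging (user_id INTEGER, day TEXT)")

# Phase 1: many writers (e.g. map tasks) append to the staging table.
rows = [(1, "2012-06-14"), (2, "2012-06-14")]
conn.executemany("INSERT INTO user_sessions_staging VALUES (?, ?)", rows)

# Phase 2: a single atomic INSERT..SELECT publishes the whole batch at once.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO user_sessions SELECT * FROM user_sessions_staging")
    conn.execute("DELETE FROM user_sessions_staging")

print(conn.execute("SELECT COUNT(*) FROM user_sessions").fetchone()[0])  # 2
```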
41. Gotcha #2
• Speculative execution is not always your friend
• Launches more tasks than needed, just in case
• For non-idempotent jobs, extra tasks == BAD
42. Gotcha #3
• isIdempotent() must be a first-class concept
• Loader jobs will fail
• Failure after first task success == not good
• Can’t automate retry without cleanup
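A retry loop that treats idempotence as a first-class concept might look like the following sketch; all names (run_with_retry, FlakyLoader, etc.) are hypothetical:

```python
def run_with_retry(job, attempts=3):
    for _ in range(attempts):
        try:
            return job.run()
        except Exception:
            if not job.is_idempotent():
                # Re-running would duplicate already-loaded rows, so the
                # partial load must be cleaned up before it is safe to retry.
                job.cleanup()
    raise RuntimeError("job failed after %d attempts" % attempts)

class FlakyLoader:
    """Fails once, succeeds on retry; tracks rows 'loaded' into the sink."""
    def __init__(self):
        self.sink, self.calls = [], 0
    def is_idempotent(self):
        return False
    def cleanup(self):
        self.sink.clear()                 # undo the partial load
    def run(self):
        self.calls += 1
        self.sink.append("row")
        if self.calls == 1:
            raise IOError("task failed after first insert")
        return len(self.sink)

loader = FlakyLoader()
print(run_with_retry(loader))  # 1: the partial first attempt was cleaned up
```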
43. Gotcha #4
• Vendor code only gets you so far
• Nice-to-haves == have to write
• Favor the decorator pattern
• Pig’s StoreFuncWrapper can help
• Vendor open-sourcing is ideal
44. Future work
• More VerticaStorer features
• Multiple Vertica clusters
• Atomic DB loads with Pig/Oink
• Better DAG visibility
• Better job history visibility
• MR job optimizations via historic stats
• HCatalog data registry
• Job push events