A Bird’s-Eye View of Pig and Scalding
with hRaven
a tale by @gario and @joep
Hadoop Summit 2013
v1.2

A Bird’s-Eye View of Pig and Scalding Jobs with hRaven


As Twitter's use of MapReduce rapidly expands, tracking usage on our clusters grows correspondingly more difficult. With an ever-increasing job load and a reliance on higher-level abstractions such as Pig and Scalding, the utility of existing tools for viewing job history decreases rapidly, and extracting insights becomes a challenge. At Twitter, we created hRaven to fill this gap. hRaven archives the full history and metrics from all MapReduce jobs on our clusters, and strings together each job from a Pig or Scalding script execution into a combined flow. From this archive, we can easily derive aggregate resource utilization by user, pool, or application, while the historical trending of an individual application allows us to perform runtime optimization of resource scheduling. We will cover how hRaven provides a rich historical archive of MapReduce job execution, and how the data is structured into higher-level flows representing the job sequence for frameworks such as Pig, Scalding, and Hive. We will then explore how we mine hRaven data to account for Hadoop resource utilization, to optimize runtime scheduling, and to identify common anti-patterns in user jobs. Finally, we will look at the end user experience, including Ambrose integration for flow visualization.

Published in: Technology, Business

A Bird’s-Eye View of Pig and Scalding Jobs with hRaven

  1. A Bird’s-Eye View of Pig and Scalding with hRaven
      a tale by @gario and @joep
      Hadoop Summit 2013
      v1.2
  2. About the authors
      • Apache HBase PMC member and Committer
      • Software Engineer @ Twitter
      • Core Storage Team - Hadoop/HBase

      • Software Engineer @ Twitter
      • Engineering Manager Hadoop/HBase team @ Twitter
      @Twitter #HadoopSummit2013
  3. Table of Contents
      • Chapter 1: The Problem
      • Chapter 2: Why hRaven?
      • Chapter 3: How Does it Work?
          • 3a: Loading
          • 3b: Table structure / querying
      • Chapter 4: Current Uses
      • Appendix: Future Work
  4. Chapter 1: The Problem
      Illustration by Sirxlem (CC BY-NC-ND 3.0)
  5. Chapter 1: Mismatched Abstractions
      • Most users run Pig and Scalding scripts, not straight map reduce
      • JobTracker UI shows jobs, not DAGs of jobs generated by Pig and Scalding
  6. Chapter 1: A Problem of Scale
  7. Chapter 1: Questions
      • How many Pig versus Scalding jobs do we run?
      • What cluster capacity do jobs in my pool take?
      • How many jobs do we run each day?
      • What % of jobs have > 30k tasks?
      • Why do I need to hand-tune these (hundreds of) jobs? Can’t the cluster learn?
  8. Chapter 1: Questions #Nevermore
      • How many Pig versus Scalding jobs do we run?
      • What cluster capacity do jobs in my pool take?
      • How many jobs do we run each day?
      • What % of jobs have > 30k tasks?
      • Why do I need to hand-tune these (hundreds of) jobs? Can’t the cluster learn?
  9. Chapter 2: Why hRaven?
      Photo by DAVID ILIFF. License: CC-BY-SA 3.0
  10. Chapter 2: Why hRaven?
      • Stores stats, configuration and timing for every map reduce job on every cluster
      • Structured around the full DAG of jobs from a Pig or Scalding application
      • Easily queryable for historical trending
      • Allows for Pig reducer optimization based on historical run stats
      • Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)
  11. Chapter 2: Key Concepts
      • cluster - each cluster has a unique name mapping to the Job Tracker
      • user - map reduce jobs are run as a given user
      • application - a Pig or Scalding script (or plain map reduce job)
      • flow - the combined DAG of jobs executed from a single run of an application
      • version - changes impacting the DAG are recorded as a new version of the same application
  12. Chapter 2: Application Flows
      Edgar
  13. Chapter 2: Application Flows
      Edgar
  14. Chapter 2: Flow Storage
      • All jobs in a flow are ordered together
  15. Chapter 2: Flow Storage
      • Most recent flow is ordered first
  16. Chapter 2: Key Features
      • All jobs in a flow are ordered together
      • Per-job metrics stored
          • Total map and reduce tasks
          • HDFS bytes read / written
          • File bytes read / written
          • Total map and reduce slot milliseconds
      • Easy to aggregate stats for an entire flow
      • Easy to scan the timeseries of each application’s flows
  17. Chapter 3: How Does it Work?
  18. Chapter 3: ETL - Step 1: JobFilePreprocessor
  19. Chapter 3: ETL - Step 2: JobFileRawLoader
  20. Chapter 3: ETL - Step 3: JobFileProcessor
  21. Chapter 3: ETL - Step 3: JobFileProcessor
      Jobs finish out of order with respect to job_id
  22. Chapter 3: Tables
      • job_history_raw
      • job_history
      • job_history_task
      • job_history_app_version
  23. Chapter 3: job_history_raw
      Row key: cluster!jobID
      Columns:
      • jobconf - stores serialized raw job_*_conf.xml file
      • jobhistory - stores serialized raw job history log file
      • job_processed_success - indicates whether job has been processed
  24. Chapter 3: job_history
      Row key: cluster!user!application!timestamp!jobID
      • cluster - unique cluster name (e.g. “cluster1@dc1”)
      • user - user running the application (“edgar”)
      • application - application ID derived from job configuration:
          • uses “batch.desc” property if set
          • otherwise parses a consistent ID from “mapred.job.name”
      • timestamp - inverted (Long.MAX_VALUE - value) value of submission time
      • jobID - stored as Job Tracker start time (long), concatenated with job sequence number
          • job_201306271100_0001 -> [1372352073732L][1L]
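The inverted timestamp in the job_history row key is what lets the most recent flow sort first. A minimal sketch of the idea in Python, assuming keys are compared as raw bytes in ascending order (as HBase row keys are) and that the value is written as an 8-byte big-endian long. The helper name `inverted_timestamp` is hypothetical, and the sample values are the epoch-millisecond numbers that appear on the slides, reused here only as inputs:

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def inverted_timestamp(submit_time_ms: int) -> bytes:
    """Encode (Long.MAX_VALUE - submit time) as an 8-byte big-endian long.

    Hypothetical helper for illustration; not hRaven's actual code.
    """
    return struct.pack(">q", LONG_MAX - submit_time_ms)


# Submission times in epoch milliseconds, listed oldest first.
submit_times = [1369585634000, 1372263813000, 1372352073732]

# Sort by the encoded bytes, the way a byte-ordered store would.
by_key = sorted(submit_times, key=inverted_timestamp)
print(by_key)  # the newest submission comes out first
```

Because a later submission time yields a smaller inverted value, an ascending scan over the keys returns the newest flows of an application before the older ones.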
  25. Chapter 3: job_history_task
      Row key: cluster!user!application!timestamp!jobID!taskID
      • same components as job_history key (same ordering)
      • taskID - (e.g. “m_00001”) uniquely identifies individual task/attempt in job
      Two row types:
      • Task - “meta” row
          cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001
      • Task Attempt - individual execution on a Task Tracker
          cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001_1
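Since task rows reuse the job_history key as a prefix, a sorted table keeps each job's tasks next to each other, with every task "meta" row directly ahead of its attempts. A small sketch of that property in Python, using plain strings with the "!" notation from the slide. The `<timestamp>` and `<jobID>` placeholders stand in for the components the slide elides, `task_key` is a hypothetical helper, and the extra task IDs are made up for illustration:

```python
def task_key(job_key: str, task_id: str) -> str:
    """Append a task or attempt ID to a job key (illustrative notation only)."""
    return f"{job_key}!{task_id}"


# Placeholders for the elided timestamp and jobID components.
job_key = "cluster1@dc1!edgar!wordcount!<timestamp>!<jobID>"

# Rows written in arbitrary order: two tasks and their attempts.
rows = [
    task_key(job_key, "m_00002_1"),
    task_key(job_key, "m_00001_1"),
    task_key(job_key, "m_00002"),
    task_key(job_key, "m_00001"),
    task_key(job_key, "m_00001_2"),
]

# Lexical order, as a sorted key-value store would keep them.
ordered = sorted(rows)
for row in ordered:
    print(row)

# A prefix scan on the job key picks up every task row for that job.
job_rows = [r for r in ordered if r.startswith(job_key + "!")]
```

The "meta" row sorts first because its key is a strict prefix of each attempt key for the same task.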
  26. Chapter 3: job_history_app_version
      Row key: cluster!user!application
      Example: cluster1@dc1!edgar!wordcount
      Columns:
          v1=1369585634000
          v2=1372263813000
  27. Chapter 3: Querying hRaven
      • Using Pig’s HBaseStorage (or direct HBase APIs)
      • Through Client API
      • Through REST API
  28. Chapter 4: Current Uses
  29. Chapter 4: Current Uses
      • Pig reducer optimizations
      • Cluster utilization / capacity planning
      • Application performance trending over time
      • Identifying common job anti-patterns
      • Ad-hoc analysis troubleshooting cluster problems
  30. Chapter 4: Cluster reads-writes
  31. Chapter 4: Pool / Application reads/writes
      • Pool view
          • Spike in File size read
          • Indicates jobs spilling
      • Application view
          • Spike in HDFS size read
          • Indicates spiking input
  32. Chapter 4: Pool usage: Used vs. Allocated
  33. Chapter 4: Compute cost
  34. Appendix: Future Work
  35. Appendix: Future Work
      • Real-time data loading from Job Tracker / Application Master
      • Full flow-centric UI (Job Tracker UI replacement)
      • Hadoop 2.0 compatibility (in-progress)
      • Ambrose integration
  36. Additional Resources
      • hRaven on GitHub
          https://github.com/twitter/hraven
      • hRaven Mailing Lists
          • hraven-user@googlegroups.com
          • hraven-dev@googlegroups.com
  37. Afterword
      Now will thou drop your job data on the floor?
      Quoth the hRaven, 'Nevermore.'
  38. #TheEnd
      @gario and @joep
      Come visit us at booth #26 to continue the story
  39. Sort order: variable length job_id
      • Desired order
          job_201306271100_9999
          job_201306271100_10000
          ...
          job_201306271100_99999
          job_201306271100_100000
          ...
          job_201306271100_999999
          job_201306271100_1000000
      • Lexical order
          job_201306271100_10000
          job_201306271100_100000
          job_201306271100_1000000
          job_201306271100_9999
          job_201306271100_99999
          job_201306271100_999999
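The last slide contrasts the desired numeric order with the lexical order of variable-length job_id strings. One way to get the desired order out of a byte-sorted store is a fixed-width encoding, in the spirit of the job_history jobID component (a long for the Job Tracker start time followed by the sequence number). A sketch in Python under stated assumptions: the middle part of the job_id is parsed as a plain integer, both parts are written as 8-byte big-endian longs, and `encode_job_id` is a hypothetical helper. hRaven's actual encoding may differ.

```python
import struct


def encode_job_id(job_id: str) -> bytes:
    """Encode 'job_<start>_<sequence>' as two 8-byte big-endian longs.

    Hypothetical helper for illustration; not hRaven's actual code.
    """
    _, start, sequence = job_id.split("_")
    return struct.pack(">qq", int(start), int(sequence))


# The job IDs from the slide, listed in the desired order.
job_ids = [
    "job_201306271100_9999",
    "job_201306271100_10000",
    "job_201306271100_99999",
    "job_201306271100_100000",
    "job_201306271100_999999",
    "job_201306271100_1000000",
]

# Sorting the raw strings compares character by character.
lexical = sorted(job_ids)

# Sorting the fixed-width bytes follows the numeric sequence number.
fixed_width = sorted(job_ids, key=encode_job_id)

print(lexical)
print(fixed_width)
```

With a fixed width, every sequence number occupies the same number of bytes, so byte-wise comparison agrees with numeric comparison for non-negative values.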