A Bird's-Eye View of Pig and Scalding Jobs with hRaven


As Twitter's use of MapReduce rapidly expands, tracking usage on our clusters grows correspondingly more difficult. With an ever-increasing job load and a reliance on higher-level abstractions such as Pig and Scalding, the utility of existing tools for viewing job history decreases rapidly, and extracting insights becomes a challenge. At Twitter, we created hRaven to fill this gap. hRaven archives the full history and metrics from all MapReduce jobs on our clusters, and strings together each job from a Pig or Scalding script execution into a combined flow. From this archive, we can easily derive aggregate resource utilization by user, pool, or application, while the historical trending of an individual application allows us to perform runtime optimization of resource scheduling. We will cover how hRaven provides a rich historical archive of MapReduce job execution, and how the data is structured into higher-level flows representing the job sequence for frameworks such as Pig, Scalding, and Hive. We will then explore how we mine hRaven data to account for Hadoop resource utilization, to optimize runtime scheduling, and to identify common anti-patterns in user jobs. Finally, we will look at the end-user experience, including Ambrose integration for flow visualization.


Transcript

  • 1. A Bird’s-Eye View of Pig and Scalding with hRaven a tale by @gario and @joep Hadoop Summit 2013 v1.2
  • 2. @Twitter#HadoopSummit2013 2 About the authors • Apache HBase PMC member and Committer; Software Engineer @ Twitter, Core Storage Team - Hadoop/HBase • Software Engineer @ Twitter; Engineering Manager, Hadoop/HBase team @ Twitter
  • 3. @Twitter#HadoopSummit2013 3 Table of Contents • Chapter 1: The Problem • Chapter 2: Why hRaven? • Chapter 3: How Does it Work? (3a: Loading, 3b: Table structure / querying) • Chapter 4: Current Uses • Appendix: Future Work
  • 4. Chapter 1: The Problem Illustration by Sirxlem (CC BY-NC-ND 3.0)
  • 5. @Twitter#HadoopSummit2013 5 Chapter 1: Mismatched Abstractions • Most users run Pig and Scalding scripts, not straight map reduce • JobTracker UI shows jobs, not DAGs of jobs generated by Pig and Scalding
  • 6. @Twitter#HadoopSummit2013 Chapter 1: A Problem of Scale 6
  • 7. @Twitter#HadoopSummit2013 7 Chapter 1: Questions • How many Pig versus Scalding jobs do we run? • What cluster capacity do jobs in my pool take? • How many jobs do we run each day? • What % of jobs have > 30k tasks? • Why do I need to hand-tune these (hundreds of) jobs, can’t the cluster learn?
  • 8. @Twitter#HadoopSummit2013 8 Chapter 1: Questions #Nevermore • How many Pig versus Scalding jobs do we run? • What cluster capacity do jobs in my pool take? • How many jobs do we run each day? • What % of jobs have > 30k tasks? • Why do I need to hand-tune these (hundreds of) jobs, can’t the cluster learn?
  • 9. Chapter 2: Why hRaven? Photo by DAVID ILIFF. License: CC-BY-SA 3.0
  • 10. @Twitter#HadoopSummit2013 10 Chapter 2: Why hRaven? • Stores stats, configuration and timing for every map reduce job on every cluster • Structured around the full DAG of jobs from a Pig or Scalding application • Easily queryable for historical trending • Allows for Pig reducer optimization based on historical run stats • Keeps data online forever (12.6M jobs, 4.5B tasks + attempts)
  • 11. @Twitter#HadoopSummit2013 11 Chapter 2: Key Concepts • cluster - each cluster has a unique name mapping to the Job Tracker • user - map reduce jobs are run as a given user • application - a Pig or Scalding script (or plain map reduce job) • flow - the combined DAG of jobs executed from a single run of an application • version - changes impacting the DAG are recorded as a new version of the same application
  • 12. @Twitter#HadoopSummit2013 12 Chapter 2: Application Flows Edgar
  • 13. @Twitter#HadoopSummit2013 13 Chapter 2: Application Flows Edgar
  • 14. @Twitter#HadoopSummit2013 14 Chapter 2: Flow Storage • All jobs in a flow are ordered together
  • 15. @Twitter#HadoopSummit2013 15 Chapter 2: Flow Storage • Most recent flow is ordered first
  • 16. @Twitter#HadoopSummit2013 16 Chapter 2: Key Features • All jobs in a flow are ordered together • Per-job metrics stored: total map and reduce tasks, HDFS bytes read / written, file bytes read / written, total map and reduce slot milliseconds • Easy to aggregate stats for an entire flow • Easy to scan the timeseries of each application’s flows
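The flow-level aggregation the slide describes can be sketched in plain Java. The class and field names below (FlowStats, JobMetrics) are illustrative stand-ins, not hRaven's actual API; the point is that because every job of a flow sorts together under one key prefix, a single scan yields a list whose counters can simply be summed.

```java
// Minimal sketch of rolling per-job counters up into flow totals.
// JobMetrics is a hypothetical holder, not one of hRaven's real classes.
import java.util.List;

public class FlowStats {
    static class JobMetrics {
        final long mapTasks, reduceTasks, hdfsBytesRead, slotMillis;
        JobMetrics(long m, long r, long h, long s) {
            mapTasks = m; reduceTasks = r; hdfsBytesRead = h; slotMillis = s;
        }
    }

    // Sum the metrics of all jobs in one flow: [totalTasks, hdfsBytes, slotMillis].
    static long[] aggregate(List<JobMetrics> jobsInFlow) {
        long tasks = 0, hdfs = 0, slots = 0;
        for (JobMetrics j : jobsInFlow) {
            tasks += j.mapTasks + j.reduceTasks;
            hdfs  += j.hdfsBytesRead;
            slots += j.slotMillis;
        }
        return new long[] { tasks, hdfs, slots };
    }
}
```

The same shape works for trending: aggregating each flow of an application over time gives the timeseries the next bullet refers to.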
  • 17. Chapter 3: How Does it Work?
  • 18. @Twitter#HadoopSummit2013 18 Chapter 3: ETL - Step 1: JobFilePreprocessor
  • 19. @Twitter#HadoopSummit2013 19 Chapter 3: ETL - Step 2: JobFileRawLoader
  • 20. @Twitter#HadoopSummit2013 20 Chapter 3: ETL - Step 3: JobFileProcessor
  • 21. @Twitter#HadoopSummit2013 21 Chapter 3: ETL - Step 3: JobFileProcessor Jobs finish out of order with respect to job_id
  • 22. @Twitter#HadoopSummit2013 22 job_history_raw job_history job_history_task job_history_app_version • • • • Chapter 3: Tables
  • 23. @Twitter#HadoopSummit2013 23 Chapter 3: job_history_raw • Row key: cluster!jobID • Columns: jobconf - stores serialized raw job_*_conf.xml file; jobhistory - stores serialized raw job history log file; job_processed_success - indicates whether job has been processed
  • 24. @Twitter#HadoopSummit2013 24 Chapter 3: job_history • Row key: cluster!user!application!timestamp!jobID • cluster - unique cluster name (e.g. “cluster1@dc1”) • user - user running the application (“edgar”) • application - application ID derived from job configuration: uses the “batch.desc” property if set, otherwise parses a consistent ID from “mapred.job.name” • timestamp - inverted (Long.MAX_VALUE - value) value of the submission time • jobID - stored as Job Tracker start time (long) concatenated with the job sequence number: job_201306271100_0001 -> [1372352073732L][1L]
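The key layout above can be sketched as follows. The method names are mine, not hRaven's (the real encoding lives in hRaven's key converter classes); what the sketch shows is the two tricks the slide names: inverting the submission timestamp so the newest flow sorts first under HBase's ascending byte order, and storing the jobID as two fixed-width longs.

```java
// Sketch of the job_history row key: cluster!user!app!invertedTs!jobID.
// Names are illustrative; hRaven's actual converters differ in detail.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class JobHistoryKey {
    // Inverting the submission time makes the most recent flow sort first.
    static long invert(long submitTimeMillis) {
        return Long.MAX_VALUE - submitTimeMillis;
    }

    // "job_201306271100_0001" is stored as two fixed-width longs:
    // the Job Tracker start time and the job sequence number.
    static byte[] rowKey(String cluster, String user, String app,
                         long submitTimeMillis, long jtStartTime, long seq) {
        byte[] prefix = (cluster + "!" + user + "!" + app + "!")
                .getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + 8 + 1 + 16)
                .put(prefix)
                .putLong(invert(submitTimeMillis))
                .put((byte) '!')
                .putLong(jtStartTime)   // e.g. 1372352073732L
                .putLong(seq)           // e.g. 1L
                .array();
    }
}
```

Because the key starts with cluster!user!application, one prefix scan retrieves every flow of an application, newest first.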
  • 25. @Twitter#HadoopSummit2013 25 Chapter 3: job_history_task • Row key: cluster!user!application!timestamp!jobID!taskID • same components as the job_history key (same ordering) • taskID (e.g. “m_00001”) uniquely identifies an individual task/attempt in the job • Two row types: Task - “meta” row: cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001; Task Attempt - individual execution on a Task Tracker: cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001_1
  • 26. @Twitter#HadoopSummit2013 26 Chapter 3: job_history_app_version • Row key: cluster!user!application • Example: cluster1@dc1!edgar!wordcount • Columns: v1=1369585634000, v2=1372263813000
  • 27. @Twitter#HadoopSummit2013 27 Chapter 3: Querying hRaven • Using Pig’s HBaseStorage (or direct HBase APIs) • Through the Client API • Through the REST API
  • 28. Chapter 4: Current Uses
  • 29. @Twitter#HadoopSummit2013 29 Chapter 4: Current Uses • Pig reducer optimizations • Cluster utilization / capacity planning • Application performance trending over time • Identifying common job anti-patterns • Ad-hoc analysis troubleshooting cluster problems
  • 30. @Twitter#HadoopSummit2013 30 Chapter 4: Cluster reads-writes
  • 31. @Twitter#HadoopSummit2013 31 Chapter 4: Pool / Application reads/writes • Pool view: a spike in file bytes read indicates jobs spilling • Application view: a spike in HDFS bytes read indicates spiking input
  • 32. @Twitter#HadoopSummit2013 Chapter 4: Pool usage: Used vs. Allocated 32
  • 33. @Twitter#HadoopSummit2013 33 Chapter 4: Compute cost
  • 34. Appendix: Future Work
  • 35. @Twitter#HadoopSummit2013 35 Appendix: Future Work • Real-time data loading from Job Tracker / Application Master • Full flow-centric UI (Job Tracker UI replacement) • Hadoop 2.0 compatibility (in progress) • Ambrose integration
  • 36. @Twitter#HadoopSummit2013 36 Additional Resources • hRaven on Github: https://github.com/twitter/hraven • hRaven Mailing Lists: hraven-user@googlegroups.com, hraven-dev@googlegroups.com
  • 37. @Twitter#HadoopSummit2013 37 Afterword • Now wilt thou drop your job data on the floor? Quoth the hRaven, ‘Nevermore.’
  • 38. #TheEnd @gario and @joep Come visit us at booth #26 to continue the story
  • 39. @Twitter#HadoopSummit2013 39 Sort order with variable-length job_ids • Desired order: job_201306271100_9999, job_201306271100_10000, ..., job_201306271100_99999, job_201306271100_100000, ..., job_201306271100_999999, job_201306271100_1000000 • Lexical order: job_201306271100_10000, job_201306271100_100000, job_201306271100_1000000, job_201306271100_9999, job_201306271100_99999, job_201306271100_999999
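The mismatch on this slide is easy to reproduce, and any fixed-width encoding of the sequence number restores numeric order. The sketch below uses a zero-pad (with an assumed 12-digit width) purely as a readable stand-in for hRaven's binary two-long encoding shown on the job_history slide.

```java
// Variable-length job_ids sort wrong lexically; a fixed-width encoding
// (zero-pad here, fixed-width longs in hRaven itself) fixes the order.
import java.util.Arrays;

public class JobIdSort {
    // Pad the sequence number to a fixed width so lexical order == numeric order.
    // The 12-digit width is an arbitrary choice for this sketch.
    static String sortable(String jobId) {
        String[] p = jobId.split("_");   // ["job", jtStartTime, sequence]
        return String.format("%s_%s_%012d", p[0], p[1], Long.parseLong(p[2]));
    }

    public static void main(String[] args) {
        String[] raw = { "job_201306271100_10000", "job_201306271100_9999" };
        Arrays.sort(raw);                // lexical: _10000 sorts before _9999
        String[] fixed = { sortable(raw[0]), sortable(raw[1]) };
        Arrays.sort(fixed);              // now _9999 correctly sorts first
        System.out.println(Arrays.toString(fixed));
    }
}
```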