Adam Kawa
Data Engineer @ Spotify
Hadoop Operations Powered By … Hadoop
Labels, Advertisers, Partners
1. How many times has Coldplay been streamed this month?
2. How many times was “Get Lucky” streamed during the first 24h?
3. Who was the most popular artist in NYC last week?
Data Scientists
1. What song to recommend to Jay-Z when he wakes up?
2. Is Adam Kawa bored with Coldplay today?
3. How to get Arun to subscribe to Spotify Premium?
(Big) Data At Spotify
■ Data generated by +24M monthly active users, and for users!
- 2.2 TB of compressed data from users per day
- 64 TB of data generated in Hadoop each day (triplicated)
Data Infrastructure At Spotify
■ Apache Hadoop YARN
■ Many other systems, including
- Kafka, Cassandra, Storm, Luigi in production
- Giraph, Tez, Spark in evaluation mode
Apache Hadoop
■ Probably the largest commercial Hadoop cluster in Europe!
- 694 heterogeneous nodes
- 14.25 PB of data consumed
- ~12,000 jobs each day
March 2013
Tricky questions were asked!
Finance Department
1. How many servers do you need to buy to survive one year?
2. What will you do to use them efficiently?
3. If we agree, don’t come back to us this year! OK?
Adam Kawa
■ One of the Data Engineers responsible for answering these questions!
The Topic Of This Talk
■ Examples of how to analyze various metrics, logs and files
- generated by Hadoop
- using Hadoop
- to understand Hadoop
- to avoid guesstimates!
What To Use It For
■ This knowledge can be useful to
- measure how fast HDFS is growing
- define an empirical retention policy
- measure the performance of jobs
- optimize the scheduler
- and more
Agenda
1. Analyzing HDFS
2. Analyzing MapReduce and YARN
HDFS
Garbage Collection On The NameNode
“We don’t have any full GC pauses on the NN.
Our GC stops the NN for less than 100 msec, on average! :)”
Adam Kawa @ Hadoop User Mailing List, December 16th, 2013
“Today, between 12:05 and 13:00, we had 5 full GC pauses on the NN.
They stopped the NN for 34min 47sec in total! :(”
Adam Kawa @ Spotify office, Stockholm, January 13th, 2014
What happened between 12:05 and 13:00?
Quick Answer!
The NameNode was receiving the block reports from all the DataNodes
Detailed Answer
1. We started the NN when the DNs were running
2. 502 DNs immediately registered to the NN
■ Within 1.2 sec (based on logs from the DNs)
3. 502 DNs started sending the block reports
■ dfs.blockreport.initialDelay = 30 minutes
■ 17 block reports per minute (on average)
■ +831K blocks in each block report (on average)
4. This generated high memory pressure on the NN
■ The NN ran into Full GC !!!
Hadoop told us everything!
Collecting The GC Stats
■ Enable GC logging for the NameNode
■ Visualize it, e.g. with GCViewer
■ Analyze memory usage patterns, GC pauses, misconfiguration
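For reference, a minimal sketch of enabling NameNode GC logging in hadoop-env.sh; the log path is illustrative and the exact set of flags depends on your JVM version:

    export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hadoop/namenode-gc.log"

The resulting namenode-gc.log is what a tool like GCViewer visualizes.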
[GC timeline: the blue line shows the heap used by the NN over time]
- Loading FsImage
- Start replaying edit logs
- First block report processed
- 25 block reports processed
- 131 block reports processed
- 5min 39sec of Full GC
- 40 block reports processed
- Next Full GC
- Next Full GC !!!
The CMS collector starts at 98.5% of heap… We fixed that!
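The slides do not show the exact change, but a common fix when CMS starts this late is to make it begin collecting earlier; a sketch with an illustrative threshold, again in hadoop-env.sh:

    export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
      -XX:CMSInitiatingOccupancyFraction=75 \
      -XX:+UseCMSInitiatingOccupancyOnly"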
What happened in HDFS between mid-December 2013 and mid-January 2014?
HDFS
HDFS Metadata
HDFS FsImage File
■ A persistent checkpoint of HDFS metadata
■ It contains information about files and directories
■ A binary file
HDFS Offline Image Viewer
■ Converts the content of FsImage to text formats
- e.g. a tab-separated file or XML
■ The output is easily analyzed with any tool
- e.g. Pig, Hive
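For example, producing such dumps might look like this (paths are illustrative; the available processors and flags vary across Hadoop versions):

    hdfs oiv -p Delimited -i /dfs/nn/current/fsimage -o fsimage.tsv
    hdfs oiv -p XML -i /dfs/nn/current/fsimage -o fsimage.xml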
50% of the data was created during the last 3 months
Anything interesting?
1. NO data added that day
2. Many more files added after
The migration to YARN
Where did the small files come from?
Twitter's HDFS-DU
■ An interactive visualization of data in HDFS
/app-logs: avg. file size = 253 KB, no. of dirs = 595K, no. of files = 60.6M
More Uses Of FsImage File
■ Statistics broken down by user/group name
■ Candidates for duplicate datasets
■ Inefficient MapReduce jobs
- Small files
- Skewed files
Advanced HDFS Capacity Planning
■ You can analyze FsImage to learn how fast HDFS grows
■ You can combine it with “external” datasets
- number of daily/monthly active users
- total size of logs generated by users
- number of queries per day run by data analysts
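A minimal sketch of such a growth analysis over a Delimited OIV dump; the column layout is an assumption and must be checked against the output of your OIV version:

    # growth.py - sum bytes by creation month from a Delimited FsImage dump
    import sys
    from collections import defaultdict

    PATH, MTIME, SIZE = 0, 2, 6              # assumed column layout
    bytes_per_month = defaultdict(int)
    for line in open(sys.argv[1]):
        cols = line.rstrip('\n').split('\t')
        if len(cols) <= SIZE or not cols[SIZE].isdigit():
            continue                         # skip directories / malformed rows
        month = cols[MTIME][:7]              # '2014-01' from '2014-01-18 15:16'
        bytes_per_month[month] += int(cols[SIZE])
    for month in sorted(bytes_per_month):
        print(month, round(bytes_per_month[month] / 1024 ** 4, 2), 'TB')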
Simplified HDFS Capacity Planning
■ You can also use the “trend” button in Ganglia
If we do NOTHING, we might fill the cluster in September …
What will we do to survive longer than September?
HDFS
Retention
Empirical Retention Policy
Question
How many days after creation is a dataset not accessed anymore?
Possible Solution
■ You can use modification_time and access_time from FsImage
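A sketch of deriving such an empirical policy from the same dump: for each dataset, compare the earliest modification_time with the latest access_time. Column positions, the timestamp format, and the two-path-components dataset convention are all assumptions:

    # retention.py - days between creation and last access, per dataset
    import sys
    from datetime import datetime

    PATH, MTIME, ATIME = 0, 2, 3             # assumed column layout
    FMT = '%Y-%m-%d %H:%M'                   # assumed timestamp format
    created, accessed = {}, {}
    for line in open(sys.argv[1]):
        cols = line.rstrip('\n').split('\t')
        if len(cols) <= ATIME:
            continue
        dataset = '/'.join(cols[PATH].split('/')[:3])  # e.g. /metadata/artist
        try:
            mtime = datetime.strptime(cols[MTIME], FMT)
            atime = datetime.strptime(cols[ATIME], FMT)
        except ValueError:
            continue
        created[dataset] = min(created.get(dataset, mtime), mtime)
        accessed[dataset] = max(accessed.get(dataset, atime), atime)
    for dataset in sorted(created):
        print(dataset, (accessed[dataset] - created[dataset]).days, 'days')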
Our Retention Facts
■ Logs and core datasets are accessed even many years after creation
■ Many reports are not accessed even an hour after creation
■ Most intermediate datasets are needed for less than a week
■ 10% of data has not been accessed for a year
HDFS
Hot Datasets
Hot Dataset
■ Some files/directories will be accessed more often than others, e.g.:
- fresh logs, core datasets, dictionary files
Idea
■ To process it faster, increase its replication factor while it’s “hot”
■ To save disk space, decrease its replication factor when it becomes “cold”
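Replication can be adjusted with the standard HDFS CLI; the paths and factors below are purely illustrative:

    hdfs dfs -setrep -w 5 /logs/2014-03-20   # hot: raise replication, wait for it
    hdfs dfs -setrep 2 /logs/2013-01-01      # cold: reclaim disk space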
How to find them?
HDFS Audit Log
■ Logs all filesystem access requests sent to the NN
■ Easy to parse and aggregate
- a tab-separated line for each request, e.g.:
2014-01-18 15:16:12,023 INFO FSNamesystem.audit: allowed=true ugi=kawaa (auth:SIMPLE) ip=/10.254.28.4 cmd=open src=/metadata/artist/2013-11-27/part-00061.avro dst=null perm=null
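Mining the audit log for hot datasets can be as simple as counting open requests per path prefix; bucketing by the first two path components is an assumption about how datasets are laid out:

    # hot_datasets.py - count 'open' requests per dataset in an audit log (stdin)
    import re
    import sys
    from collections import Counter

    OPEN_SRC = re.compile(r'cmd=open\s+src=(\S+)')
    hot = Counter()
    for line in sys.stdin:
        match = OPEN_SRC.search(line)
        if match:
            hot['/'.join(match.group(1).split('/')[:3])] += 1
    for dataset, opens in hot.most_common(20):
        print(opens, dataset)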
Our Hot Datasets
■ JAR files stored in HDFS and used by Pig scripts
■ A dictionary file with metadata about log messages
■ Core datasets: playlists, users, top tracks
YARN
MapReduce Jobs Autotuning
Recurring MapReduce Jobs
■ There are jobs that we schedule regularly
- e.g. top lists for each country
Idea
■ Before submitting a job next time, use statistics from its previous executions
- to learn about its historical performance
- to tweak its configuration settings
Jobs Autotuning
We implemented
■ A pre-execution hook that automatically sets
- the maximum size of an input split
- the number of Reduce tasks
■ More settings can be tweaked
- Memory
- Combiner
A Small PoC ;)
■ Here, the goal is that a task runs approx. 10 min on average
- Inspired by LinkedIn at Hadoop Summit 2013
- Helpful in extreme cases (short- or long-running tasks)
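A toy version of such a hook; the statistics schema is hypothetical, while mapreduce.input.fileinputformat.split.maxsize is a real MRv2 property:

    # autotune.py - choose a max split size so an average map task runs ~10 min
    TARGET_TASK_SECONDS = 10 * 60

    def tuned_settings(history):
        """history: past runs of the same job, each a dict with
        'input_bytes' and 'total_map_seconds' (assumed fields)."""
        total_bytes = sum(run['input_bytes'] for run in history)
        total_seconds = sum(run['total_map_seconds'] for run in history)
        bytes_per_second = total_bytes / max(total_seconds, 1)
        return {
            'mapreduce.input.fileinputformat.split.maxsize':
                int(bytes_per_second * TARGET_TASK_SECONDS),
            # the number of reduce tasks can be derived analogously
        }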
Another Example - Job Optimized Over Time
Even perfect manual settings may become outdated when an input dataset grows!
YARN
MapReduce Statistics
Zlatanitor (Zlatan + Monitor)
■ Extracts statistics from historical MapReduce jobs
- Supports MRv1 and YARN
■ Stores them as Avro files
- Enables easy analysis using e.g. Pig and Hive
■ Similar projects
- Replephant, hRaven
[Chart: per-node throughput distribution, bucketed Low / Medium / High]
A Slow Node
- 40% lower throughput than the average
- The cause: its NIC negotiated 100MbE instead of 1GbE
Repeat Offenders
According to Facebook
■ ”Small percentage of machines are responsible for large percentage of failures”
- Worse performance
- More alerts
- More manual intervention
Adding nodes to the cluster increases performance.
Sometimes, removing (crappy) nodes does too!
Fixing slow and failing tasks helps as well!
YARN
Application Logs
Location Of Application Logs
■ With YARN, application logs can be moved to HDFS (log aggregation)
- They are stored as TFiles … :(
- Small, and many of them!
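Despite the TFile format, aggregated logs are easy to read back with the yarn CLI (the application ID below is made up):

    yarn logs -applicationId application_1389113666393_0001 | less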
What Might Be Checked
■ Frequent exceptions and bugs
- Just looking at the last line of stderr shows a lot! For example:
a) AttributeError: 'int' object has no attribute 'iteritems'
b) ValueError: invalid literal for int() with base 10: 'spotify'
c) ValueError: Expecting , delimiter: line 1 column 3257 (char 3257)
d) ImportError: No module named db_statistics
■ Possible optimizations
- Memory and the size of the map input buffer
YARN
The Capacity Scheduler
Our Initial Capacities
■ We specified capacities and elasticity based on a combination of
- “some” data
- intuition
- a desire to shape future usage (!)
Overutilization And Underutilization
■ Basic information is available on the Scheduler Web UI
■ Take screenshots!
- Otherwise, you will lose the history of what you saw :(
Visualizing Utilization Of Queues
■ The Capacity Scheduler exposes these metrics via JMX
■ Ganglia does NOT display the metrics related to utilization of queues (by default)
Jmxtrans
■ It collects JMX metrics from Java processes
■ It can send metrics to multiple destinations
- Graphite, cacti/rrdtool, Ganglia
- a tab-separated text file
- STDOUT
- and more
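A jmxtrans query for queue metrics might look roughly like this; the hostnames, queue name and attribute list are illustrative, and the exact MBean names depend on the Hadoop version:

    {
      "servers": [{
        "host": "resourcemanager.example.com", "port": 9999,
        "queries": [{
          "obj": "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=production",
          "attr": ["AppsRunning", "AllocatedMB", "PendingMB"],
          "outputWriters": [{
            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
            "settings": { "host": "graphite.example.com", "port": 2003 }
          }]
        }]
      }]
    }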
Overutilization And Underutilization
■ Our Production queue often borrows resources
- usually from the Queue3 and Queue4 queues
The Best Time For The Downtime?
Three Crowns = Sweden
BONUS
Some Cool Stuff From The Community
LinkedIn's White Elephant
■ Aggregates and visualizes Hadoop cluster utilization across users
Twitter's hRaven
■ Collects run-time statistics from MR jobs
- Stores them in HBase
■ Does not provide a built-in visualization layer
- The visualization shown in the talk comes from Twitter's blog
That’s all!
Summary
■ Analyzing Hadoop is also a “business” problem
- Save money
- Iterate faster
- Avoid downtimes
Thank you!
More Thanks
■ To my awesome colleagues for a great technical review:
Piotr Krewski, Josh Baer, Rafal Wojdyla, Anna Dackiewicz, Magnus Runesson, Gustav Landén, Guido Urdaneta, Uldis Barbans
Questions?
Want to join the band?
Check out spotify.com/jobs or @Spotifyjobs for more information
kawaa@spotify.com
Check out my blog: HakunaMapData.com
Backup
Benchmarking
■ Tricky question!
■ Use production jobs that represent your workload
■ Use a metric that is independent of the size of the data you process
■ Optimize one setting at a time
Abstract
At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use a scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner we use … also Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, when to buy new nodes, how to calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets) and more.
