Apache Hadoop In Theory And Practice


A presentation about Hadoop internals, and a couple of issues that we have seen in practice.



  1. 1. Apache Hadoop In Theory And Practice Adam Kawa Data Engineer @ Spotify
  2. 2. Why Data? Get insights to offer a better product “More data usually beats better algorithms” Get insights to make better decisions Avoid “guesstimates” Gain a competitive advantage
  3. 3. What Is Challenging? Store data reliably Analyze data quickly Do it in a cost-effective way Use an expressive, high-level language
  4. 4. Fundamental Ideas A big system of machines, not a big machine Failures will happen Move computation to data, not data to computation Write complex code only once, but right A system of multiple animals
  5. 5. Apache Hadoop Open-source Java software Storing and processing of very large data sets Clusters of commodity machines A simple programming model
  6. 6. Apache Hadoop Two main components: HDFS - a distributed file system MapReduce – a distributed processing layer Many other tools belong to “Apache Hadoop Ecosystem”
  7. 7. Component HDFS
  8. 8. The Purpose Of HDFS Store large datasets in a distributed, scalable and fault-tolerant way High throughput Very large files Streaming reads and writes (no edits) Write once, read many times It is like a big truck to move heavy stuff (not a Ferrari)
  9. 9. HDFS Mis-Usage Do NOT use it, if you have Low-latency requests Random reads and writes Lots of small files Then better to consider RDBMSs, file servers, HBase or Cassandra...
  10. 10. Splitting Files And Replicating Blocks Split a very large file into smaller (but still large) blocks Store them redundantly on a set of machines
  11. 11. Splitting Files Into Blocks The default block size is 64MB Today, 128MB or 256MB is recommended Minimizes the overhead of a disk seek operation (less than 1%) A file is just “sliced” into chunks after each 64MB (or so) It does NOT matter whether it is text/binary, compressed or not It does matter later (when reading the data)
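The slicing above is simple arithmetic; as a sketch (plain Python, not Hadoop code, with an illustrative 200 MB file), splitting a file into 64 MB blocks looks like this:

```python
# Slice a file into fixed-size blocks, the way HDFS "slices" it
# regardless of content. 64 MB is the classic default block size.
BLOCK_SIZE = 64 * 1024 * 1024

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file becomes three full 64 MB blocks plus an 8 MB tail block
blocks = split_into_blocks(200 * 1024 * 1024)
```

Note that the last block is usually shorter than the block size and occupies only its actual length on the DataNode's disk.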
  12. 12. Replicating Blocks The default replication factor is 3 It can be changed per file or directory It can be increased for “hot” datasets (temporarily or permanently) Trade-off between Reliability, availability, performance Disk space
  13. 13. Master And Slaves The Master node keeps and manages all metadata information The Slave nodes store blocks of data and serve them to the client Master node (called NameNode) Slave nodes (called DataNodes)
  14. 14. Classical* HDFS Cluster *no NameNode HA, no HDFS Federation NameNode manages metadata Secondary NameNode does some “house-keeping” operations for the NameNode DataNodes store and retrieve blocks of data
  15. 15. HDFS NameNode Performs all the metadata-related operations Keeps information in RAM (for fast lookup) The filesystem tree Metadata for all files/directories (e.g. ownership, permissions) Names and locations of blocks Metadata (not all) is additionally stored on disk(s) (for reliability) The filesystem snapshot (fsimage) + editlog (edits) files
  16. 16. HDFS DataNode Stores and retrieves blocks of data Data is stored as regular files on a local filesystem (e.g. ext4) e.g. blk_-992391354910561645 (+ checksums in a separate file) A block itself does not know which file it belongs to! Sends a heartbeat message to the NN to say that it is still alive Sends a block report to the NN periodically
  17. 17. HDFS Secondary NameNode NOT a failover NameNode Periodically merges a prior snapshot (fsimage) and editlog(s) (edits) Fetches current fsimage and edits files from the NameNode Applies edits to fsimage to create the up-to-date fsimage Then sends the up-to-date fsimage back to the NameNode We can configure the frequency of this operation Reduces the NameNode startup time Prevents edits from becoming too large
  18. 18. Example HDFS Commands hadoop fs -ls -R /user/kawaa hadoop fs -cat /toplist/2013-05-15/poland.txt | less hadoop fs -put logs.txt /incoming/logs/user hadoop fs -count /toplist hadoop fs -chown kawaa:kawaa /toplist/2013-05-15/poland.avro It is distributed, but it gives you a beautiful abstraction!
  19. 19. Reading A File From HDFS Block data is never sent through the NameNode The NameNode redirects a client to an appropriate DataNode The NameNode chooses a DataNode that is as “close” as possible $ hadoop fs -cat /toplist/2013-05-15/poland.txt Block locations come from the NameNode; lots of data comes from DataNodes to the client
  20. 20. HDFS Network Topology Network topology is defined by an administrator in a supplied script Converts an IP address into a path to a rack (e.g. /dc1/rack1) A path is used to calculate the distance between nodes Image source: “Hadoop: The Definitive Guide” by Tom White
  21. 21. HDFS Block Placement Pluggable (default in BlockPlacementPolicyDefault.java) 1st replica on the same node where the writer is located Otherwise a “random” (but not too “busy” or almost full) node is used 2nd and 3rd replicas on two different nodes in a different rack The rest are placed on random nodes No DataNode with more than one replica of any block No rack with more than two replicas of the same block (if possible)
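As a rough sketch of the three-replica rule (a simplification in plain Python, not the actual BlockPlacementPolicyDefault logic; node and rack names are made up, and the real policy also skips busy or nearly full nodes):

```python
import random

def place_replicas(writer_node, topology, rng=random):
    """Pick 3 nodes for a block: the writer's own node, then two
    different nodes in one different rack. topology: {rack: [nodes]}."""
    writer_rack = next(r for r, nodes in topology.items()
                       if writer_node in nodes)
    # 1st replica: on the node where the writer is located
    first = writer_node
    # 2nd and 3rd replicas: two different nodes in a different rack
    remote_racks = [r for r in topology
                    if r != writer_rack and len(topology[r]) >= 2]
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)
    return [first, second, third]

topology = {"/dc1/rack1": ["dn1", "dn2"], "/dc1/rack2": ["dn3", "dn4"]}
replicas = place_replicas("dn1", topology)
```

This layout survives the loss of a whole rack while keeping two of the three replicas rack-local to each other, which limits cross-rack write traffic.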
  22. 22. HDFS Balancer Moves blocks from over-utilized DNs to under-utilized DNs Stops when HDFS is balanced: the utilization of every DN differs from the utilization of the cluster by no more than a given threshold Maintains the block placement policy
  23. 23. Questions HDFS
  24. 24. HDFS Block Question Why does a block itself NOT know which file it belongs to?
  25. 25. HDFS Block Question Why does a block itself NOT know which file it belongs to? Answer Design decision → simplicity, performance Filename, permissions, ownership etc. might change It would require updating all block replicas that belong to a file
  26. 26. HDFS Metadata Question Why does the NN NOT store information about block locations on disk?
  27. 27. HDFS Metadata Question Why does the NN NOT store information about block locations on disk? Answer Design decision → simplicity They are sent by DataNodes as block reports periodically Locations of block replicas may change over time A change in the IP address or hostname of a DataNode Balancing the cluster (moving blocks between nodes) Moving disks between servers (e.g. failure of a motherboard)
  28. 28. HDFS Replication Question How many files represent a block replica in HDFS?
  29. 29. HDFS Replication Question How many files represent a block replica in HDFS? Answer Actually, two files: The first file for the data itself The second file for the block’s metadata Checksums for the block data (by default less than 1% of the actual data) The block’s generation stamp
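The “less than 1%” figure follows directly from the checksum layout: a 4-byte CRC-32 per checksummed chunk of data (512 bytes by default, controlled by io.bytes.per.checksum). A quick check of the arithmetic:

```python
# 4 bytes of CRC-32 stored per 512 bytes of block data
CHECKSUM_BYTES = 4
BYTES_PER_CHECKSUM = 512  # io.bytes.per.checksum default

overhead = CHECKSUM_BYTES / BYTES_PER_CHECKSUM
# 4 / 512 = 0.0078125, i.e. roughly 0.8% of the block data
```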
  30. 30. HDFS Block Placement Question Why does the default block placement strategy NOT take disk space utilization (%) into account? It only checks whether a node a) has enough disk space to write a block, and b) does not serve too many clients ...
  31. 31. HDFS Block Placement Question Why does the default block placement strategy NOT take disk space utilization (%) into account? Answer Otherwise some DataNodes might become overloaded by incoming data e.g. a node newly added to the cluster
  32. 32. Facts HDFS
  33. 33. HDFS And Local File System Runs on top of a native file system (e.g. ext3, ext4, xfs) HDFS is simply a Java application that uses a native file system
  34. 34. HDFS Data Integrity HDFS detects corrupted blocks When writing Client computes the checksums for each block Client sends checksums to a DN together with data When reading Client verifies the checksums when reading a block If verification fails, NN is notified about the corrupt replica Then a DN fetches a different replica from another DN
  35. 35. HDFS NameNode Scalability Stats based on Yahoo! clusters An average file ≈ 1.5 blocks (block size = 128 MB) An average file ≈ 600 bytes in RAM (1 file and 2 block objects) 100M files ≈ 60 GB of metadata 1 GB of metadata ≈ 1 PB of physical storage (but usually less*) *Sadly, based on practical observations, the block to file ratio tends to decrease during the lifetime of the cluster Dekel Tankel, Yahoo!
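These figures are internally consistent; reproducing the rule-of-thumb arithmetic (the averages are the slide's illustrative Yahoo! numbers, not universal constants):

```python
BYTES_PER_FILE_IN_RAM = 600      # ~1 file object + 2 block objects
BLOCKS_PER_FILE = 1.5
BLOCK_SIZE = 128 * 1024**2       # 128 MB
REPLICATION = 3

# 100M files -> NameNode RAM needed for metadata
ram_bytes = 100_000_000 * BYTES_PER_FILE_IN_RAM        # 60 GB

# 1 GB of metadata -> physical storage, replication included
files = 10**9 / BYTES_PER_FILE_IN_RAM                  # ~1.67M files
physical = files * BLOCKS_PER_FILE * BLOCK_SIZE * REPLICATION  # ~1 PB
```

Note that the ~1 PB figure only works out if the 3x replication is counted as physical storage; a falling block-to-file ratio (more small files) means less storage addressed per byte of metadata, which is why the slide calls the trend “sad”.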
  36. 36. HDFS NameNode Performance Read/write operations throughput limited by one machine ~120K read ops/sec ~6K write ops/sec MapReduce tasks are also HDFS clients Internal load increases as the cluster grows More block reports and heartbeats from DataNodes More MapReduce tasks sending requests Bigger snapshots transferred from/to Secondary NameNode
  37. 37. HDFS Main Limitations Single NameNode Keeps all metadata in RAM Performs all metadata operations Becomes a single point of failure (SPOF)
  38. 38. HDFS Main Improvements Introduce multiple NameNodes in form of: HDFS Federation HDFS High Availability (HA) Find More: http://slidesha.re/15zZlet
  39. 39. In practice HDFS
  40. 40. Problem A DataNode cannot start on a server for some reason
  41. 41. Usually it means some kind of disk failure: $ ls /disk/hd12/ ls: reading directory /disk/hd12/: Input/output error org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 11, volumes configured: 12, volumes failed: 1, Volume failures tolerated: 0 org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: Increase dfs.datanode.failed.volumes.tolerated to avoid expensive block replication when a disk fails (and just monitor failed disks)
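The corresponding setting lives in hdfs-site.xml; a hedged example (the value 1 is an illustration — pick one that matches your per-node disk count and replication headroom):

```xml
<!-- hdfs-site.xml: let a DataNode keep running after one volume fails,
     instead of shutting down and triggering re-replication of all its
     blocks (the default is 0: any volume failure stops the DataNode). -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
```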
  42. 42. It was exciting to see stuff breaking!
  43. 43. In practice HDFS
  44. 44. Problem A user cannot run resource-intensive Hive queries It happened immediately after expanding the cluster
  45. 45. Description The queries are valid The queries are resource-intensive The queries run successfully on a small dataset But they fail on a large dataset Surprisingly, they run successfully through other user accounts The user has the right permissions to HDFS directories and Hive tables
  46. 46. The NameNode is throwing thousands of warnings and exceptions 14,592 times in only 8 min (4,768/min at peak)
  47. 47. Normally Hadoop is a very trusty elephant The username comes from the client machine (and is not verified) The groupname is resolved on the NameNode server Using the shell command ''id -Gn <username>'' If a user does not have an account on the NameNode server An ExitCodeException is thrown
  48. 48. Possible Fixes Create a user account on the NameNode server (dirty, insecure) Use AD/LDAP for user-group resolution hadoop.security.group.mapping.ldap.* settings If you also need full authentication, deploy Kerberos
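For reference, the LDAP route is configured in core-site.xml roughly like this (a sketch — the URL, base DN and filter below are placeholders, not our production settings):

```xml
<!-- core-site.xml: resolve user groups via LDAP on the NameNode -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
<property>
  <!-- the filter class must be one the mapping supports -->
  <name>hadoop.security.group.mapping.ldap.search.filter.group</name>
  <value>(objectClass=group)</value>
</property>
```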
  49. 49. Our Fix We decided to use LDAP for user-group resolution However, the LDAP settings in Hadoop did not work for us Because posixGroup is not a supported group filter class in hadoop.security.group.mapping.ldap.search.filter.group We found a workaround using nsswitch.conf
  50. 50. Lesson Learned Know who is going to use your cluster Know who is abusing the cluster (HDFS access and MR jobs) Parse the NameNode logs regularly Look for FATAL, ERROR, Exception messages Especially before and after expanding the cluster
  51. 51. Component MapReduce
  52. 52. MapReduce Model Programming model inspired by functional programming map() and reduce() functions processing <key, value> pairs Useful for processing very large datasets in a distributed way Simple, but very expressible
  53. 53. Map And Reduce Functions
  54. 54. Map And Reduce Functions: Word Counting
  55. 55. MapReduce Job Input data is divided into splits and converted into <key, value> pairs Invokes the map() function multiple times Invokes the reduce() function multiple times Keys are sorted, values are not (but could be)
  56. 56. MapReduce Example: ArtistCount Artist, Song, Timestamp, User Key is the offset of the line from the beginning of the file We could specify which artist goes to which reducer (HashPartitioner is the default)
  57. 57. MapReduce Example: ArtistCount map(Integer key, EndSong value, Context context): context.write(value.artist, 1) reduce(String key, Iterator<Integer> values, Context context): int count = 0 for each v in values: count += v context.write(key, count) Pseudo-code in a non-existent language ;)
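The same logic can be made runnable; a minimal Python simulation (a stand-in for the Hadoop job, not the real thing, with a toy EndSong tuple) of map, shuffle and reduce:

```python
from collections import defaultdict

def map_fn(offset, end_song):
    artist, song, timestamp, user = end_song
    yield artist, 1                      # one play for this artist

def reduce_fn(artist, counts):
    yield artist, sum(counts)            # total plays per artist

def run_job(records):
    shuffled = defaultdict(list)
    for offset, record in enumerate(records):
        for key, value in map_fn(offset, record):
            shuffled[key].append(value)  # shuffle: group values by key
    result = {}
    for key in sorted(shuffled):         # keys reach reducers sorted
        for k, total in reduce_fn(key, shuffled[key]):
            result[k] = total
    return result

plays = [("Coldplay", "Yellow", 1, "u1"),
         ("Abba", "SOS", 2, "u2"),
         ("Coldplay", "Fix You", 3, "u1")]
counts = run_job(plays)                  # {'Abba': 1, 'Coldplay': 2}
```

The framework does the shuffle/sort for you; the programmer writes only the two functions at the top.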
  58. 58. MapReduce Combiner Make sure that the Combiner combines quickly and aggressively enough (otherwise it only adds overhead)
  59. 59. Data Locality in HDFS and MapReduce By default, three replicas should be available somewhere on the cluster Ideally, Mapper code is sent to a node that has a replica of this block
  60. 60. MapReduce Implementation Batch processing system Automatic parallelization and distribution of computation Fault-tolerance Deals with all messy details related to distributed processing Relatively easy to use for programmers Java API, Streaming (Python, Perl, Ruby …) Apache Pig, Apache Hive, (S)Crunch Status and monitoring tools
  61. 61. “Classical” MapReduce Daemons JobTracker keeps track of TTs and schedules jobs and task executions TaskTrackers run map and reduce tasks and report to the JobTracker
  62. 62. JobTracker Responsibilities Manages the computational resources Available TaskTrackers, map and reduce slots Schedules all user jobs Schedules all tasks that belong to a job Monitors task executions Restarts failed tasks and speculatively runs slow tasks Calculates job counter totals
  63. 63. TaskTracker Responsibilities Runs map and reduce tasks Reports to the JobTracker Heartbeats saying that it is still alive Number of free map and reduce slots Task progress, status, counters etc.
  64. 64. Apache Hadoop Cluster It can consist of 1, 5, 100 or 4000 nodes
  65. 65. MapReduce Job Submission Job resources are copied with a higher replication factor (by default, 10) Tasks are started in a separate JVM to isolate user code from Hadoop code Image source: “Hadoop: The Definitive Guide” by Tom White
  66. 66. MapReduce: Sort And Shuffle Phase Map phase Reduce phase Other map tasks Other reduce tasks Image source: “Hadoop: The Definitive Guide” by Tom White
  67. 67. MapReduce: Partitioner Specifies which Reducer should get a given <key, value> pair Aim for an even distribution of the intermediate data Skewed data may overload a single reducer And make the job run much longer
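The default HashPartitioner simply hashes the key modulo the number of reducers (Java uses key.hashCode(); the CRC-32 below is a stable Python stand-in, since Python's built-in hash() is randomized per process):

```python
import zlib

def partition(key, num_reducers):
    """Reducer index for a key: hash(key) mod num_reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# An even distribution only holds if keys are diverse: if one "hot"
# key dominates the data, all its values land on a single reducer,
# and that one skewed reducer dictates the job's running time.
r = partition("Coldplay", 3)
```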
  68. 68. Speculative Execution Schedules a redundant copy of a remaining, long-running task The output from the one that finishes first is used The other one is killed, since it is no longer needed An optimization, not a feature to make jobs run more reliably
  69. 69. Speculative Execution Enable it, if tasks often experience “external” problems e.g. hardware degradation (disk, network card), system problems, memory unavailability... Otherwise speculative execution can reduce overall throughput Redundant tasks run at a similar speed to non-redundant ones They might help one job, but all the others have to wait longer for slots Redundantly running reduce tasks transfer all intermediate data over the network and write their output redundantly (for a moment) to a directory in HDFS
  70. 70. Facts MapReduce
  71. 71. Java API Very customizable Input/Output Format, Record Reader/Writer, Partitioner, Writable, Comparator ... Unit testing with MRUnit The HPROF profiler can give a lot of insights Reuse objects (especially keys and values) when possible Split Strings efficiently e.g. StringUtils instead of String.split More about the Hadoop Java API: http://slidesha.re/1c50IPk
  72. 72. MapReduce Job Configuration Tons of configuration parameters Input split size (~implies the number of map tasks) Number of reduce tasks Available memory for tasks Compression settings Combiner Partitioner and more...
  73. 73. Questions MapReduce
  74. 74. MapReduce Input <Key, Value> Pairs Question Why is each line in a text file, by default, converted to <offset, line> instead of <line_number, line>?
  75. 75. MapReduce Input <Key, Value> Pairs Question Why is each line in a text file, by default, converted to <offset, line> instead of <line_number, line>? Answer If your lines are not fixed-width, you need to read the file from the beginning to the end to find the line_number of each line (thus it cannot be parallelized).
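A small sketch makes this concrete (plain Python; the split boundaries here fall on newlines for simplicity, whereas real input formats also handle lines that straddle splits): each split can emit globally valid <offset, line> pairs knowing only its own start position, which is exactly what line numbers cannot do.

```python
def lines_with_offsets(data, split_start, split_end):
    """Emit (byte_offset, line) pairs for one split, independently
    of every other split."""
    pairs = []
    pos = split_start
    for line in data[split_start:split_end].splitlines(keepends=True):
        pairs.append((pos, line.rstrip("\n")))
        pos += len(line)
    return pairs

data = "aaa\nbb\nc\n"
# Two splits processed in parallel still yield correct global keys:
left = lines_with_offsets(data, 0, 4)    # [(0, 'aaa')]
right = lines_with_offsets(data, 4, 9)   # [(4, 'bb'), (7, 'c')]
```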
  76. 76. In practice MapReduce
  77. 77. How to hard-code the number of map and reduce slots efficiently? “I noticed that a bottleneck seems to be coming from the map tasks. Is there any reason that we can't open any of the allocated reduce slots to map tasks?” Regards, Chris
  78. 78. We initially started with 60/40 But today we are closer to something like 70/30 This may change again soon ... (charts: time spent in occupied slots; occupied map and reduce slots)
  79. 79. We are currently introducing a new feature to Luigi Automatic setting of Maximum input split size (~implies the number of map tasks) Number of reduce tasks More settings soon (e.g. size of the map output buffer) The goal is each task running 5-15 minutes on average Because even a perfect manual setting may become outdated as the input size grows
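The idea can be sketched in a few lines (a hypothetical helper, not Luigi's actual code; the throughput figure is an assumed average, to be measured per job):

```python
BLOCK = 128 * 1024**2                    # align splits to 128 MB blocks

def split_size_for(bytes_per_sec, target_task_secs):
    """Split size so one map task runs for roughly the target time."""
    ideal = bytes_per_sec * target_task_secs
    # round down to whole blocks, but never below one block
    return max(BLOCK, (ideal // BLOCK) * BLOCK)

# e.g. maps that stream ~10 MB/s, aiming at ~10-minute tasks:
size = split_size_for(10 * 1024**2, 10 * 60)
```

Recomputing this from the current input size at submission time is what keeps the setting from going stale as the data grows.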
  80. 80. The current PoC ;)
type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_1 | 4826  | 25       | 46sec        | 1hrs, 52mins, 14sec | 2hrs, 52mins, 16sec
new_1 | 391   | 294      | 4mins, 46sec | 8mins, 24sec        | 23mins, 12sec
old_2 | 4936  | 800      | 7mins, 30sec | 22mins, 18sec       | 5hrs, 20mins, 1sec
new_2 | 4936  | 1893     | 8mins, 52sec | 7mins, 35sec        | 1hrs, 18mins, 29sec
It should help in extreme cases: short-living maps, short-living and long-living reduces
  81. 81. In practice MapReduce
  82. 82. Problem Surprisingly, Hive queries are running extremely long Thousands of tasks are constantly being killed
  83. 83. Only 1 task failed, but 2x more tasks were killed than completed
  84. 84. Logs show that the JobTracker gets a request to kill the tasks Who can actually send a kill request? A user (using e.g. mapred job -kill-task) The JobTracker (a speculative duplicate, or when a whole job fails) The Fair Scheduler Diplomatically, it's called “preemption”
  85. 85. Key Observations Killed tasks came from ad-hoc and resource-intensive Hive queries Tasks are usually killed quickly after they start Surviving tasks run fine for a long time Hive queries are running in their own Fair Scheduler pool
  86. 86. Eureka! FairScheduler has a license to kill! Preempt the newest tasks in an over-share pool to forcibly make some room for starving pools
  87. 87. The Hive pool was running over its minimum and fair shares Other pools were running under their minimum and fair shares So the Fair Scheduler was (legally) killing Hive tasks from time to time Fair Scheduler can kill to be KIND...
  88. 88. Possible Fixes Disable the preemption Tune minimum shares based on your workload Tune preemption timeouts based on your workload Limit the number of map/reduce tasks in a pool Limit the number of jobs in a pool Switch to Capacity Scheduler
  89. 89. Lessons Learned A scheduler should NOT be treated as a “black box” It is so easy to implement a long-running Hive query
  90. 90. More About Hadoop Adventures http://slidesha.re/1ctbTHT
  91. 91. In reality Hadoop is fun!
  92. 92. Questions?
  93. 93. Stockholm and Sweden
  94. 94. Want to join the band? Check out spotify.com/jobs or @Spotifyjobs for more information kawaa@spotify.com HakunaMapData.com
  95. 95. Thank you!