BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal Software Engineer, SociableLabs


Sharing the experience of deploying a Hadoop cluster in production from scratch on real hardware. Brief overview of the Hadoop stack and its components, major deployment and configuration challenges, and performance and application tuning experience. Some “war stories” about the issues we faced in operation, and the benefits of a DevOps approach to running Hadoop apps.

Published in: Software

  1. 1. Hadoop – The War Stories Running Hadoop in a large enterprise environment Nikolai Grigoriev (@nikgrig) Principal Software Engineer
  2. 2. Agenda ● Why Hadoop? ● Planning Hadoop deployment ● Hadoop and real hardware ● Understanding the software stack ● Tuning HDFS, MapReduce and HBase ● Troubleshooting examples ● Testing your applications Disclaimer: this presentation is based on the combined work experience from more than one company and represents the author's personal point of view on the problems discussed in it.
  3. 3. Why Hadoop (and why have we decided to use it)? ● Need to store hundreds of TB of info ● Need to process it in parallel ● Desire to have both storage and processing horizontally scalable ● Having an open-source platform with commercial support
  4. 4. Our application Application servers (many :) ) Log processors “ETL process”
  5. 5. Our application in numbers ● Thousands of user sessions per second ● Average session log size: ~30 KB, 3-7 events per log ● Target retention period – at least ~90 days ● Redundancy and HA everywhere ● Pluggable “ETL” modules for additional data processing
  6. 6. Main problem Team had no practical knowledge of Hadoop, HDFS and HBase… ...and there was nobody at the company to help
  7. 7. But we did not realize... It was not THE ONLY problem we were about to face!
  8. 8. First fight – capacity planning ● Tons of articles have been written about Hadoop capacity planning ● Architects may spend months making educated guesses ● Capacity planning is really about finding the amount of $$$ to spend on your cluster for the target workload – If we had an infinite amount of $$$, why would we bother at all? ;)
  9. 9. Hadoop performance limiting factors
  10. 10. It is all about the balance ● Your Hadoop cluster and your apps use all these resources at different times ● Over-provisioning one resource usually leads to a shortage of another – wasted $$$
  11. 11. What can we say about an app? ● It is going to store X TB of data – Amount of storage (not forgetting the replication factor, RF!) – Allow for growth and failures ● It is going to ingest the data at Y MB/s – Your network speed and number of nodes ● Latency – More HDDs and faster HDDs – More RAM – More nodes
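The storage and ingest bullets above can be turned into a first-pass spreadsheet formula. The sketch below is only an illustration: the function names, the growth factor, the maximum fill ratio and the sample numbers are all assumptions, not figures from the talk.

```python
import math


def nodes_needed(raw_data_tb, replication_factor, disk_tb_per_node,
                 growth_factor=1.5, max_fill=0.75):
    """Estimate how many data nodes are needed to hold the data set.

    growth_factor and max_fill are hypothetical planning knobs: the first
    leaves room for growth, the second keeps disks from filling up so the
    cluster can re-replicate blocks after a node failure.
    """
    # HDFS stores every block replication_factor times.
    stored_tb = raw_data_tb * replication_factor
    planned_tb = stored_tb * growth_factor
    usable_tb_per_node = disk_tb_per_node * max_fill
    return math.ceil(planned_tb / usable_tb_per_node)


def write_mb_per_s_per_node(ingest_mb_per_s, replication_factor, nodes):
    """Average write traffic per node, counting every replica."""
    return ingest_mb_per_s * replication_factor / nodes


if __name__ == "__main__":
    # Hypothetical workload: 300 TB of data, RF 3, 24 TB of disk per node.
    n = nodes_needed(raw_data_tb=300, replication_factor=3, disk_tb_per_node=24)
    print(n, write_mb_per_s_per_node(200, 3, n))
```

A model this small ignores CPU, RAM and latency entirely, so it is a starting point for the buffer discussion rather than a sizing answer.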
  12. 12. We are a big enterprise... ● Geeky Hadoop developer: many “commodity+” hosts; good but inexpensive networking; more regular HDDs; lots of RAM; I also love cloud…; my recent OS; my software configuration; simple network ● Old School Senior IT Guy: SANs, RAIDs, SCSI, racks, Blades, redundancy, Cisco, HP, fiber optics, 4-year-old rock-solid RHEL, SNMP monitoring… what? I am the Boss...
  13. 13. Hadoop cluster vs. old school application servers ● Mostly identical “commodity+” machines – Probably with the exception of NN, JT ● Better to have more, simpler machines than fewer monster ones ● No RAID, just JBOD! ● Ethernet: depending on the storage density, bonded 1Gbit may be enough ● Hadoop achieves with software what used to be achievable only with [expensive!] hardware
  14. 14. But still, your application is the driver, not the IT guy! From Cloudera website – Hadoop machine configuration according to workload
  15. 15. Your job is: ● Educate your IT, get them on your side or at least earn their trust ● Try to build a capacity planning spreadsheet based on what you do know ● Apply common sense to guess what you do not know ● ...and plan a decent buffer ● Set reasonable performance targets for your application
  16. 16. Fight #2 – OMG, our application is slow!!! ● The main part of our application was the MR job merging the logs ● We had committed to delivering X logs/sec on a target test cluster with a sample workload ● We were delivering about 30% of that ● ...weeks before release :) ● ...and we had run out of other excuses :( ● It was clearly our software and/or configuration
  17. 17. Wait a second – we have a support contract from a Hadoop vendor! ● I mean no disrespect to the vendors! ● But they do not know your application ● And they do not know your hardware ● And they do not know your exact OS ● And they do not know your network equipment ● They can help you with some tuning, they can help you with bugs and crashes – but they won't be able (or sometimes simply are not qualified) to do your job!
  18. 18. We are on our own :( ● We realized that our testing methods were not adequate for a Hadoop-based ETL process ● Testing the product end-to-end was too difficult, tracking changes was impossible ● Turnaround was too long, we could not try something quickly and revert ● Observing and monitoring the live system with dummy incoming data was not productive enough
  19. 19. Key to successful testing ● Representative data set ● Ability to repeat the same operation as many times as needed with quick turnaround ● Each engineer has to be able to run the tests and try something ● Establishing the key metrics you monitor and try to improve ● Methodical approach – analyze, change, test, be ready to roll back
  20. 20. Our “reference runner” ● Large sample dataset ● “Reset” tool: recreates HBase tables (predefined regions), cleans HDFS etc. ● Runner tool: injects the test data, prepares the environment, launches the MR job like the real application, allows quickly rebuilding and redeploying part of the application ● Statistics: any improvements since last run? ● Manager
  21. 21. Tuning results ● In two weeks we had a job that ran about 3 times faster ● Tuning was done everywhere – from OS to Hadoop/HBase and our code ● We were confident that the software was ready to go to production ● Over the following 2 years we realized how bad our design was and how it should have been done ;)
  22. 22. Hadoop MapReduce DOs ● Think processes, not threads ● Reusable objects, lower GC overhead ● Snappy data compression is generally good ● Reasonable use of counters provides important information ● For frequently running jobs, distributed cache helps a lot ● Minimize disk I/O (spills etc), RAM is cheap ● Avoid unnecessary serialization/deserialization
  23. 23. Hadoop MapReduce DONTs ● Small files in HDFS ● Multithreaded programming inside mapper/reducer ● Fat tasks using too much heap ● Any I/O in M-R other than HDFS, ZK or HBase ● Over-complicated code (simple things work better)
  24. 24. Fight #3 – Going Production! ● Remember the slide about engineer vs. IT God preferences ;) ● Production hardware was slightly different from the test cluster ● The cluster had been deployed by people who did not know Hadoop ● The first attempt to run the software resulted in a major failure and the cluster was finally handed over to the developers for fixing ;)
  25. 25. Production hardware ● HP blade servers, 32 core, 128GB of RAM ● Emulex dual-port 10G Ethernet NICs ● 14 HDDs per machine ● OEL 6.3 ● 10G switch modules ● Company hosting center with dedicated networking and operations staff
  26. 26. Step back – 10,000 ft look at Hadoop stack ● Layers: Hardware, BIOS/Firmware(s), BIOS/Firmware settings, OS (Linux), Java (JVM), Hadoop services, Your application(s); plus Network ● Hadoop is not just a bunch of Java apps ● It is a data and application platform ● It can run well, just run, barely run or cause constant headaches – depending on how much love it receives :)
  27. 27. Hadoop stack (continued) ● In Hadoop a small problem, sometimes even on a single node, can be a major pain ● Isolating and finding that small problem may be difficult ● Symptoms are often obvious only at a high level (e.g. application) ● Complex hardware (like HP) adds more potential problems
  28. 28. Example of one of the problems we had initially ● Jobs were failing because of timeouts ● Numerous I/O errors observed in job and HDFS logs ● This simple test was failing: $ dd if=/dev/zero of=test8Gb.bin bs=1M count=8192 $ time hdfs dfs -copyFromLocal test8Gb.bin / Zzz..zzz...zzz...5min...zzz… real 4m10.002s user 0m15.130s sys 0m4.094s ● IT was clueless but did not really bother ● In fact, 8192 MB / (4 * 60 + 10) s ≈ 32.8 MB/s (!?!?!) ● 10Gb network transfers to HDFS at ~160Mb/s
  29. 29. Role of HDFS in Hadoop ● In Hadoop HDFS is the key layer that provides the distributed filesystem services for other components ● Health of HDFS directly (and drastically) affects the health of other components HDFS Map-Reduce Data HBase
  30. 30. So, clearly HDFS was the problem ● But what was the problem with HDFS?? ● How exactly does an HDFS write work?
  31. 31. Chasing it down ● Due to node-to-node streaming it was difficult to understand who was responsible ● The theory of “one bad node in the pipeline” was ruled out as results were consistently bad across the cluster of 14 nodes ● Idea (isolating the problem is good): $ time hdfs dfs -Ddfs.replication=1 -copyFromLocal test8Gb.bin / real 0m42.002s $ time hdfs dfs -Ddfs.replication=2 -copyFromLocal test8Gb.bin / real 2m53.184s $ time hdfs dfs -Ddfs.replication=3 -copyFromLocal test8Gb.bin / real 3m41.072s ● 8192 MB / 42 s ≈ 195 MB/s – hmmm….
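The arithmetic on this slide can be repeated for all three replication settings. This is a minimal sketch that reads the slide's "Mb" as megabytes and takes the `real` values exactly as quoted above; the helper names are hypothetical.

```python
import re


def parse_real_time(text):
    """Convert a `time` value such as '3m41.072s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def throughput_mb_per_s(size_mb, real_time):
    """Effective client-side write throughput for one copy operation."""
    return size_mb / parse_real_time(real_time)


# `real` timings quoted on the slide for the 8192 MB test file,
# keyed by the dfs.replication value used for the copy.
MEASURED = {1: "0m42.002s", 2: "2m53.184s", 3: "3m41.072s"}

if __name__ == "__main__":
    for replication, real in sorted(MEASURED.items()):
        rate = throughput_mb_per_s(8192, real)
        print(f"replication={replication}: {rate:.1f} MB/s")
```

The sharp drop as soon as a second replica is written points at the node-to-node transfer path rather than at local disk writes, which is consistent with the network findings listed next.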
  32. 32. Discoveries ● To make a long story short... – Bug in the “cubic” TCP congestion control algorithm in the Linux kernel – NIC firmware was too old – Kernel driver for Emulex 10G NICs was too old – Only one out of 8 NIC RX queues was enabled on some hosts – A number of network settings were not appropriate for a 10G network – “irqbalance” process (due to a kernel bug) was locking NIC RX queues by “losing” NIC IRQ handlers – ...
  33. 33. More discoveries – Nodes were set up multi-homed, even though HDFS at that time did not support that – Misconfigured DNS and reverse DNS ● On the disk I/O side – Bad filesystem parameters – Read-ahead settings were wrong – Disk controller firmware was old
  34. 34. HDFS “litmus” test - TestDFSIO 13/03/13 16:30:02 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write 13/03/13 16:30:02 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:30:02 UTC 2013 13/03/13 16:30:02 INFO fs.TestDFSIO: Number of files: 16 13/03/13 16:30:02 INFO fs.TestDFSIO: Total MBytes processed: 160000.0 13/03/13 16:30:02 INFO fs.TestDFSIO: Throughput mb/sec: 103.42190773343779 13/03/13 16:30:02 INFO fs.TestDFSIO: Average IO rate mb/sec: 103.61066436767578 13/03/13 16:30:02 INFO fs.TestDFSIO: IO rate std deviation: 4.513343367320971 13/03/13 16:30:02 INFO fs.TestDFSIO: Test exec time sec: 114.876 13/03/13 16:31:31 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read 13/03/13 16:31:31 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:31:31 UTC 2013 13/03/13 16:31:31 INFO fs.TestDFSIO: Number of files: 16 13/03/13 16:31:31 INFO fs.TestDFSIO: Total MBytes processed: 160000.0 13/03/13 16:31:31 INFO fs.TestDFSIO: Throughput mb/sec: 586.8243268024676 13/03/13 16:31:31 INFO fs.TestDFSIO: Average IO rate mb/sec: 648.8555908203125 13/03/13 16:31:31 INFO fs.TestDFSIO: IO rate std deviation: 267.0954600161208 13/03/13 16:31:31 INFO fs.TestDFSIO: Test exec time sec: 33.683 13/03/13 16:31:31 INFO fs.TestDFSIO:
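To track a baseline between tuning rounds it helps to pull the numbers out of this output automatically. The parser below is a sketch that assumes the log has one entry per line, as TestDFSIO normally prints it; the function name and the shortened sample are illustrative only.

```python
import re

ENTRY = re.compile(r"INFO fs\.TestDFSIO:\s*(.*)$")

SAMPLE_LOG = """\
13/03/13 16:30:02 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/03/13 16:30:02 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:30:02 UTC 2013
13/03/13 16:30:02 INFO fs.TestDFSIO: Throughput mb/sec: 103.42190773343779
13/03/13 16:30:02 INFO fs.TestDFSIO: Test exec time sec: 114.876
13/03/13 16:31:31 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/03/13 16:31:31 INFO fs.TestDFSIO: Throughput mb/sec: 586.8243268024676
13/03/13 16:31:31 INFO fs.TestDFSIO: Test exec time sec: 33.683
13/03/13 16:31:31 INFO fs.TestDFSIO:
"""


def parse_testdfsio(log_text):
    """Return {'write': {...}, 'read': {...}}; numeric fields become floats."""
    results, current = {}, None
    for line in log_text.splitlines():
        match = ENTRY.search(line)
        if not match:
            continue
        body = match.group(1).strip()
        header = re.fullmatch(r"-+ TestDFSIO -+ : (\w+)", body)
        if header:
            # A header line starts a new section (write or read).
            current = results.setdefault(header.group(1), {})
            continue
        if current is None or ":" not in body:
            continue
        key, _, value = body.partition(":")
        try:
            current[key.strip()] = float(value)
        except ValueError:
            # Non-numeric fields such as the date are kept as text.
            current[key.strip()] = value.strip()
    return results


if __name__ == "__main__":
    for mode, fields in parse_testdfsio(SAMPLE_LOG).items():
        print(mode, fields["Throughput mb/sec"])
```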
  35. 35. Fight #4 – tuning Hadoop ● Why do people tune things (IT was not interested ;) )? ● With your own expensive hardware you want the maximum IOPS and CPU power for the $$$ you have paid ● Not to mention that you simply want your apps to run faster ● Tuning is an endless process but the 80/20 rule works perfectly
  36. 36. Even before you have something to tune…. ● Pick reasonably good hardware but do not go high-end ● Same for network equipment ● Hadoop scales well and the redundancy is achieved by software ● More nodes is almost always better than going for extra node power and/or storage space ● Simpler systems are easier to tune, maintain and troubleshoot ● Different machines for master nodes
  37. 37. Tuning the hardware and BIOS ● Updating BIOS and firmware to recent versions ● Disabling dynamic CPU frequency scaling ● Tuning memory speed, power profile ● Disk controller, tune disk cache
  38. 38. OS Tuning ● Pick the filesystem (ext3, ext4, XFS...), parameters (reserved blocks 0%) and mount options (noatime, nodiratime, barriers etc) ● I/O scheduler depending on your disks and tasks ● Read-ahead settings ● Disable swap! ● irqbalance for big machines ● Tune other parameters (number of FDs, sockets) ● Install major troubleshooting tools (iostat, iotop, tcpdump, strace…) on every node
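A few of these settings can be checked per node with a short script. The sketch below reads the standard Linux locations `/sys/block/<device>/queue/scheduler`, `/sys/block/<device>/queue/read_ahead_kb` and `/proc/swaps`; the function names and the default device name `sda` are assumptions, and the right target values depend on your disks and workload.

```python
import re
from pathlib import Path


def active_scheduler(scheduler_text):
    """Extract the active I/O scheduler, which sysfs marks with brackets."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    # Some kernels list a single scheduler without brackets.
    return match.group(1) if match else scheduler_text.strip()


def swap_devices(proc_swaps_text):
    """List active swap devices from /proc/swaps (the first line is a header)."""
    lines = proc_swaps_text.strip().splitlines()[1:]
    return [line.split()[0] for line in lines if line.strip()]


def report(device="sda"):
    """Print scheduler, read-ahead and swap status for one block device."""
    queue = Path("/sys/block") / device / "queue"
    print("scheduler:", active_scheduler((queue / "scheduler").read_text()))
    print("read_ahead_kb:", (queue / "read_ahead_kb").read_text().strip())
    print("swap devices:", swap_devices(Path("/proc/swaps").read_text()) or "none")
```

Calling `report("sda")` on each data node (through whatever automation you already use) gives a quick way to spot a host that drifted from the intended settings.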
  39. 39. Network tuning ● Test your TCP performance with iperf, ttcp or any other tools you like ● Know your NICs well, install the right firmware and kernel modules ● Tune your TCP and IP parameters (work harder if you have an expensive 10G network) ● If your NIC supports TCP offload and it works – use it ● txqueuelen, MTU 9000 (if appropriate), HDFS is chatty ● Learn ethtool and see what it can do for you ● Basic IP networking set-up (DNS etc) has to be 100% perfect
  40. 40. JVM tuning ● Hadoop allows you to set JVM options for all processes ● Your Data Node, Name Node and HBase Region Servers are going to work hard and you need to help them deal with your workload ● If your MR code is well designed you will most likely NOT need to tune JVM for MR tasks ● Your main enemy will be GC – until you become at least allies, if not friends :)
  41. 41. Tuning Hadoop services ● NameNode deals with many connections and needs ~150 bytes per HDFS block ● NameNode and DataNode are highly concurrent, the latter needs many threads ● Use HDFS short-circuit reads if appropriate ● ZooKeeper needs to handle enough connections ● HBase uses LOTS of heap ● Reuse JVMs for MR jobs if appropriate
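The ~150 bytes per block figure gives a rough lower bound on the NameNode heap consumed by block metadata. In this sketch the 128 MB block size, the 300 TB data volume and the function names are assumptions for illustration, and the estimate treats every file as filling whole blocks.

```python
import math

BYTES_PER_BLOCK = 150  # rule of thumb quoted on the slide


def estimate_blocks(data_tb, block_size_mb=128):
    """Lower bound on the block count: assumes files fill whole blocks."""
    data_mb = data_tb * 1024 * 1024
    return math.ceil(data_mb / block_size_mb)


def namenode_block_heap_gb(data_tb, block_size_mb=128):
    """Heap used by block metadata alone, in GB."""
    return estimate_blocks(data_tb, block_size_mb) * BYTES_PER_BLOCK / 1024 ** 3


if __name__ == "__main__":
    # Hypothetical 300 TB of data in well-packed files vs. 1 MB files.
    print(round(namenode_block_heap_gb(300), 2))
    print(round(namenode_block_heap_gb(300, block_size_mb=1), 1))
```

The second call models the same data stored as 1 MB files, each occupying its own block, which multiplies the block count by 128. That is one reason small files appear on the DON'Ts slide.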
  42. 42. Tuning MapReduce tasks (that means tuning for your code and data) ● If you run different MR jobs, consider tuning parameters for each of them, not once and for all of them ● Configure the job scheduler to enforce the SLAs ● Estimate the resources needed for each job ● Plan how you are going to run your jobs
  43. 43. Tuning your own code ● Test and profile your complex MR code outside of Hadoop (your savings will scale too!) ● Check for GC overhead ● Use reusable objects ● Avoid using expensive formats like JSON and XML ● Anything you waste is multiplied by the number of rows and the number of tasks! ● Evaluate the need for intermediate data compression
  44. 44. Tuning HBase ● That requires a separate presentation ● You will need to fight hard to reduce GC pauses and overhead ● Pre-splitting regions may be a good idea to better balance the load ● Understand HBase compactions and deal with major compactions your way
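Pre-splitting only balances the load if the split points match the row key distribution. The sketch below assumes row keys start with a uniformly distributed hexadecimal prefix (for example a hash), which is an assumption and not something stated in the talk; the function name is hypothetical.

```python
def hex_split_keys(num_regions, width=4):
    """Split points dividing the hex key space [0, 16**width) evenly."""
    if num_regions < 2:
        return []  # a single region needs no split points
    space = 16 ** width
    return [format(i * space // num_regions, f"0{width}x")
            for i in range(1, num_regions)]


if __name__ == "__main__":
    # Four regions over two-character hex prefixes.
    print(hex_split_keys(4, width=2))
```

The resulting list can be supplied as the split points when the table is created, for instance through the `SPLITS` option of the HBase shell `create` command.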
  45. 45. Set up your monitoring (and alarming) ● You cannot improve what you cannot see! ● Monitor OS, Hadoop and your app metrics ● Ganglia, Graphite, LogStash, even Cloudera Manager are your friends ● Set the baseline, track your changes, observe the outcome
  46. 46. Fight #5 - Operations ● A real hand-over to the Operations people never actually happened ● When problems occurred, they were either ignored or escalated to engineers within about 1 minute ● Neither NOC nor Operations staff wanted to acquire enough knowledge of Hadoop and the apps ● Monitoring was nearly non-existent ● Same for appropriate alarms
  47. 47. If you are serious... ● Send your Ops for Hadoop training (or buy them books and have them read those!) ● Have them automate everything ● Ops have to understand your applications, not just the platform they are running on ● Your Ops need to be decent Linux admins ● ...and it would be great if they are also OK programmers (scripting, Java…) ● Of course, the motivation is the key
  48. 48. Plan and train for disaster ● Train your Ops how to help your system to survive till Monday morning ● Decide what sort of loss you will tolerate (BigData is not always so precious) ● Design your system for resilience, async processing, queuing etc
  49. 49. Fight #6 - evolution ● Sooner or later you will need to increase your capacity – Unless your business is stagnating ● Technically, you will either – Run out of storage space – Start hitting the wall on IOPS or CPU and fail to respect your SLAs (even if only internal ones) – Be unable to deploy new applications
  50. 50. Understand your application - again ● Even if your apps run fine you need to monitor the performance factors ● Build spreadsheets reflecting your current numbers ● Plan for the business growth ● Translate this into the number of additional nodes and networking equipment ● Especially important if your hardware purchase cycle takes months
  51. 51. Conclusions ● Not all companies are ready for BigData – often because of conservative people in key positions ● Traditional IT/Ops/NOC organizations are often unable to support these platforms ● Engineers have to be given more power to control how the things they build are run (DevOps) ● Hadoop is a complex platform and has to be taken seriously for serious applications ● If you really depend on Hadoop you do need to build in-house expertise
  52. 52. Questions? Thanks for listening! Nikolai Grigoriev