Sharing of Hadoop cluster deployment experience in production from scratch on real hardware. Brief overview of the Hadoop stack and its components, major deployment and configuration challenges, and performance and application tuning experience. Some “war stories” about the issues we faced in operation, and the benefits of a DevOps approach to running Hadoop apps.
The document discusses data ingestion and storage in Hadoop. It covers topics like ingesting data into Hadoop, using Hadoop as a data warehouse, Pig scripting, using Flume to ingest Twitter and web server logs, Hive as a query layer, HBase as a NoSQL database, and setting up high availability for HBase. It also discusses differences between Hadoop 1.0 and 2.0, how to set up a Hadoop 2.0 cluster including configuration files, and demonstrates upgrading Hadoop.
The document discusses developing a comprehensive monitoring approach for Hadoop clusters. It recommends starting with basic monitoring of nodes using Nagios and Cacti for metrics like CPU usage, disk usage, and network traffic. It then suggests adding Hadoop-specific checks like monitoring DataNodes and graphing NameNode operations using JMX. Finally, it proposes setting alarms based on JMX metrics and regularly reviewing filesystem growth and utilization.
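The JMX angle above is easy to script against: Hadoop daemons expose their MBeans over HTTP through a /jmx servlet on the web UI port. Below is a minimal polling sketch in Python; the hostname, port (50070 on older NameNodes, 9870 on Hadoop 3.x) and the 85% alarm threshold are illustrative assumptions, not values from the talk.

```python
import json
import urllib.request

# Hypothetical NameNode address; the /jmx servlet lives on the web UI port.
NAMENODE_JMX = "http://namenode.example.com:50070/jmx"

def fetch_bean(url, bean_name):
    """Fetch a single JMX MBean as a dict from the Hadoop /jmx servlet."""
    with urllib.request.urlopen(f"{url}?qry={bean_name}") as resp:
        beans = json.load(resp)["beans"]
    return beans[0] if beans else {}

fs = fetch_bean(NAMENODE_JMX, "Hadoop:service=NameNode,name=FSNamesystemState")
used = fs.get("CapacityUsed", 0)
total = fs.get("CapacityTotal", 1)
live = fs.get("NumLiveDataNodes", 0)

print(f"DataNodes alive: {live}")
print(f"HDFS utilization: {100.0 * used / total:.1f}%")

# A Nagios-style check would exit non-zero past a threshold, e.g.:
if 100.0 * used / total > 85:
    raise SystemExit("CRITICAL: HDFS utilization above 85%")
```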
This document provides an overview of five steps to improve PostgreSQL performance: 1) hardware optimization, 2) operating system and filesystem tuning, 3) configuration of postgresql.conf parameters, 4) application design considerations, and 5) query tuning. The document discusses various techniques for each step such as selecting appropriate hardware components, spreading database files across multiple disks or arrays, adjusting memory and disk configuration parameters, designing schemas and queries efficiently, and leveraging caching strategies.
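For the query-tuning step, the standard tool is PostgreSQL's EXPLAIN ANALYZE. Here is a minimal sketch using psycopg2; the connection string, table and query are hypothetical stand-ins, not examples from the document.

```python
import psycopg2

# Hypothetical DSN and schema; illustrates step 5 (query tuning) by asking
# the planner for an actual execution plan with timings and buffer usage.
conn = psycopg2.connect("dbname=appdb user=app host=localhost")
cur = conn.cursor()

cur.execute("""
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT customer_id, count(*)
    FROM orders
    WHERE created_at >= now() - interval '7 days'
    GROUP BY customer_id
""")
for (line,) in cur.fetchall():
    print(line)  # look for seq scans on large tables and row-estimate mismatches

cur.close()
conn.close()
```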
HBase Operations: Best Practices outlines key topics for operating HBase clusters effectively including replication for disaster recovery, backups using snapshots or export, monitoring systems, automation of deployments, hardware recommendations, and useful diagnostic tools. The document provides an overview of HBase internals and discusses solutions for common problems like total cluster failure or accidental data deletion through replication and backup strategies.
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera, Inc.
Attend this session and walk away armed with solutions to the most common customer problems. Learn proactive configuration tweaks and best practices to keep your cluster free of fetch failures, job tracker hangs, and the like.
The document provides summaries of several workshops and presentations at an HPC conference:
1. The rasdaman workshop discussed adding array support to SQL queries, array query operators, and storage techniques for large arrays, such as tiled storage.
2. The energy efficient HPC talk discussed optimization techniques to improve energy efficiency, with information provided in slides.
3. The data-aware networking workshop included discussions of techniques for improving data transfer performance over networks, like pipelining and parallelism in GridFTP.
Administering a Hadoop cluster isn't easy. Many Hadoop clusters suffer from Linux configuration problems that can negatively impact performance. With vast and sometimes confusing config/tuning options, it can be tempting (and scary) for a cluster administrator to make changes to Hadoop when cluster performance isn't as expected. Learn how to improve Hadoop cluster performance and eliminate common problem areas, applicable across use cases, using a handful of simple Linux configuration changes.
Bharath Mundlapudi presented on Disk Fail Inplace in Hadoop. He discussed how a single disk failure currently causes an entire node to be blacklisted. With newer hardware trends of more disks per node, this wastes significant resources. His team developed a Disk Fail Inplace approach in which Hadoop tolerates disk failures up to a threshold. This included separating critical and user files, handling failures at startup and runtime in the DataNode and TaskTracker, and rigorous testing of the new approach.
This document provides an overview of big data and Hadoop. It discusses what big data is, why it has become important recently, and common use cases. It then describes how Hadoop addresses challenges of processing large datasets by distributing data and computation across clusters. The core Hadoop components of HDFS for storage and MapReduce for processing are explained. Example MapReduce jobs like wordcount are shown. Finally, higher-level tools like Hive and Pig that provide SQL-like interfaces are introduced.
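For reference, the word-count job mentioned above is often demonstrated with Hadoop Streaming, which lets any executable act as mapper and reducer over stdin/stdout. Below is a minimal Python sketch (an illustration, not code from the document):

```python
#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sum counts per word; streaming delivers keys sorted, so all
# records for a word arrive contiguously
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current and current is not None:
        print(f"{current}\t{total}")
        total = 0
    current = word
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

A typical launch passes both scripts to the streaming jar (its path varies by distribution), e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /in -output /out.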
Optimizing your Infrastructure and Operating System for Hadoop - DataWorks Summit
Apache Hadoop is clearly one of the fastest growing big data platforms to store and analyze arbitrarily structured data in search of business insights. However, applicable commodity infrastructures have advanced greatly in recent years, and there is not a lot of accurate, current information to assist the community in optimally designing and configuring Hadoop platforms (infrastructure and OS). In this talk we'll present guidance on Linux and infrastructure deployment, configuration and optimization from both Red Hat and HP (derived from actual performance data), for clusters optimized for single workloads or balanced clusters that host multiple concurrent workloads.
This presentation surveys different ways one can geographically distribute PostgreSQL, including master-slave and multi-master solutions. It discusses pitfalls and emphasizes understanding requirements. The presentation covers some of the existing tools that are available in the community. It also touches upon upcoming PostgreSQL solutions.
Impetus provides expert consulting services around Hadoop implementations, including R&D, assessment, deployment (on private and public clouds), optimizations for enhanced static shared data implementations.
This presentation covers advanced Hadoop tuning and optimisation.
Hadoop is an open source framework for distributed storage and processing of large datasets across commodity hardware. It has two main components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing. HDFS stores data across clusters in a redundant and fault-tolerant manner. MapReduce allows distributed processing of large datasets in parallel using map and reduce functions. The architecture aims to provide reliable, scalable computing using commodity hardware.
This document provides an overview of Apache Hadoop and its two main components - HDFS and MapReduce. It describes the fundamental ideas behind Hadoop such as storing data reliably across commodity hardware and moving computation to data. It then discusses HDFS in more detail, explaining how it stores very large files reliably through data replication and partitioning files into blocks. It also covers the roles of the NameNode and DataNodes and common HDFS commands. Finally, it discusses some challenges encountered when using HDFS in practice and potential solutions.
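The "common HDFS commands" mentioned above are usually driven from the hdfs dfs CLI. A small Python wrapper sketch is below; the paths and file names are made up for illustration.

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and return its stdout."""
    out = subprocess.run(["hdfs", "dfs", *args],
                         check=True, capture_output=True, text=True)
    return out.stdout

hdfs("-mkdir", "-p", "/user/alice/logs")        # create a directory
hdfs("-put", "access.log", "/user/alice/logs")  # upload a local file
print(hdfs("-ls", "/user/alice/logs"))          # list directory contents
print(hdfs("-du", "-h", "/user/alice"))         # space used, human-readable
```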
The document discusses using Hadoop for scientific workloads and summarizes early results from benchmarking Hadoop. It explores using Hadoop and MapReduce for data-intensive scientific applications like BLAST sequence analysis. Performance results show that Hadoop can provide comparable performance to existing parallel file systems. Challenges include lack of turn-key solutions, managing data formats, and performance tuning. The research aims to understand the unique needs of science clouds and how to effectively support data-intensive scientific applications on cloud platforms.
Hadoop Institutes: Kelly Technologies is one of the best Hadoop training institutes in Hyderabad, providing Hadoop training by real-time faculty in Hyderabad.
Architectural Overview of MapR's Apache Hadoop Distribution - mcsrivas
Describes the thinking behind MapR's architecture. MapR's Hadoop achieves better reliability on commodity hardware compared to anything on the planet, including custom, proprietary hardware from other vendors. Apache HDFS and Cassandra replication is also discussed, as are SAN and NAS storage systems like NetApp and EMC.
Improving Hadoop Cluster Performance via Linux Configuration - Alex Moundalexis
Administering a Hadoop cluster isn't easy. Many Hadoop clusters suffer from Linux configuration problems that can negatively impact performance. With vast and sometimes confusing config/tuning options, it can be tempting (and scary) for a cluster administrator to make changes to Hadoop when cluster performance isn't as expected. Learn how to improve Hadoop cluster performance and eliminate common problem areas, applicable across use cases, using a handful of simple Linux configuration changes.
This document provides an overview of Hadoop, an open source framework for distributed storage and processing of large datasets. It discusses:
- The background and architecture of Hadoop, including its core components HDFS and MapReduce.
- How Hadoop is used to process diverse large datasets across commodity hardware clusters in a scalable and fault-tolerant manner.
- Examples of use cases for Hadoop including ETL, log processing, and recommendation engines.
- The Hadoop ecosystem including related projects like Hive, HBase, Pig and Zookeeper.
- Basic installation, security considerations, and monitoring of Hadoop clusters.
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
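The copy-to-secondary-site recommendation is typically implemented with DistCp, Hadoop's bulk inter-cluster copy tool. A hedged sketch of a nightly sync follows; the cluster URIs, source path and bandwidth cap are hypothetical, not taken from the document.

```python
import subprocess

# Nightly sync of a warehouse directory to the DR cluster via DistCp.
# "primary-nn", "dr-nn", the path and the 50 MB/s cap are placeholders.
subprocess.run([
    "hadoop", "distcp",
    "-update",           # only copy files that changed since the last run
    "-bandwidth", "50",  # MB/s per map task, to protect the WAN link
    "hdfs://primary-nn:8020/data/warehouse",
    "hdfs://dr-nn:8020/data/warehouse",
], check=True)
```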
Distributed Computing with Apache Hadoop is a technology overview that discusses:
1) Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of commodity hardware.
2) Hadoop addresses limitations of traditional distributed computing with an architecture that scales linearly by adding more nodes, moves computation to data instead of moving data, and provides reliability even when hardware failures occur.
3) Core Hadoop components include the Hadoop Distributed File System for storage, and MapReduce for distributed processing of large datasets in parallel on multiple machines.
With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making Hadoop admin skills a path to better career, salary, and job opportunities.
Learn how to set up a Hadoop cluster with HDFS high availability here: www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
This document discusses the integration of Apache Pig with Apache Tez. Pig provides a procedural scripting language for data processing workflows, while Tez is a framework for executing directed acyclic graphs (DAGs) of tasks. Migrating Pig to use Tez as its execution engine provides benefits like reduced resource usage, improved performance, and container reuse compared to Pig's default MapReduce execution. The document outlines the design changes needed to compile Pig scripts to Tez DAGs and provides examples and performance results. It also discusses ongoing work to achieve full feature parity with MapReduce and further optimize performance.
Hadoop is a distributed processing framework for large datasets. It stores data across clusters of commodity hardware in a Hadoop Distributed File System (HDFS) and provides tools for distributed processing using MapReduce. HDFS uses a master-slave architecture with a namenode managing metadata and datanodes storing data blocks. Data is replicated across nodes for reliability. MapReduce allows distributed processing of large datasets in parallel across clusters.
Improving Hadoop Cluster Performance via Linux Configuration - DataWorks Summit
1. The document provides 7 simple Linux configuration tips to improve Hadoop cluster performance. The tips include disabling swapping, mounting data disks with noatime, disabling root reserved space, enabling nscd, increasing file handle limits, using a dedicated OS disk, and ensuring proper name resolution (a rough check script follows this list).
2. It also discusses additional optional tips like checking disk I/O, disabling transparent huge pages, enabling jumbo frames, and monitoring systems.
3. The document recommends reading the Hadoop Operations book and taking questions.
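By way of illustration, the script below checks a few of those settings from Python; the /data mount-point prefix and the thresholds in the comments are assumptions for the sketch (and on some older Red Hat kernels the THP switch lives under a redhat_transparent_hugepage path instead).

```python
import resource

def read(path):
    with open(path) as f:
        return f.read().strip()

# Tip: disable (or minimize) swapping
swappiness = int(read("/proc/sys/vm/swappiness"))
print(f"vm.swappiness = {swappiness} (Hadoop guides often suggest 0-10)")

# Tip: mount data disks with noatime ("/data" prefix is a placeholder)
for line in read("/proc/mounts").splitlines():
    dev, mnt, fstype, opts = line.split()[:4]
    if mnt.startswith("/data") and "noatime" not in opts.split(","):
        print(f"WARNING: {mnt} mounted without noatime")

# Tip: raise file handle limits for the daemon user
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard} (often raised to 64k+)")

# Optional tip: transparent huge pages are frequently disabled for Hadoop
thp = read("/sys/kernel/mm/transparent_hugepage/enabled")
print(f"transparent_hugepage: {thp}")
```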
Hadoop Adventures At Spotify (Strata Conference + Hadoop World 2013) - Adam Kawa
Adam Kawa shares his experiences working with a large, rapidly growing Hadoop cluster at Spotify. He details five "adventures" where various problems broke the cluster or made it unstable. These included issues with user permissions causing NameNode instability, DataNodes becoming blocked in deadlocks, Hive jobs being killed by the Fair Scheduler, and the JobTracker becoming slow due to overly large jobs. Each time, the problems were troubleshot and lessons were learned about proper cluster management, testing changes, and making data-driven decisions.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) - VMware Tanzu
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at:
http://www.gopivotal.com/big-data/pivotal-hd
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17 - Allen Day, PhD
This document discusses how genomics and DNA sequencing data can be analyzed using big data technologies like Hadoop. It describes how early DNA sequencing efforts faced bottlenecks but scaling out storage and computing using Hadoop helped overcome these issues. Large-scale analysis of sequencing data allows for variant calling and genome-wide association studies to identify disease causes at a large scale. Recent efforts like India's biometric identity system Aadhaar show how similar techniques can be applied to very large biological data sets to gain health insights.
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with... - Hakka Labs
New DNA sequencing technologies are revolutionizing the life sciences by generating extremely large data sets. Traditional tools for processing this data will have difficulty scaling to the coming deluge of genomics data. We discuss how the innovations of Hadoop and Spark are solving core problems that enable scientists to address questions that were previously out of reach.
The SKA Project - The World's Largest Streaming Data Processor - inside-BigData.com
In this presentation from the 2014 HPC Advisory Council Europe Conference, Paul Calleja from University of Cambridge presents: The SKA Project - The World's Largest Streaming Data Processor.
"The Square Kilometre Array Design Studies is an international effort to investigate and develop technologies which will enable us to build an enormous radio astronomy telescope with a million square meters of collecting area."
Watch the video presentation: http://wp.me/p3RLHQ-cot
The document discusses the Square Kilometre Array (SKA) radio telescope project and its significance for Africa. It makes three key points:
1) Building the SKA in Africa would be a major breakthrough that could change perceptions of the continent and attract young Africans into STEM fields. It would enable African scientists to conduct "big science" and fundamental research.
2) The SKA presents major economic opportunities for Africa in areas like infrastructure development, scientific skills training, and spin-off industries in fields like computing and engineering. International partnerships on the project could also benefit African universities and industries.
3) South Africa is well-positioned to host the SKA through the progress made on precursors
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015 - Codemotion
Codemotion Rome 2015 - Big Data is undoubtedly one of the hottest topics in today's technology landscape. To date, roughly 5 exabytes of data have been produced worldwide, a potential source of "intelligence" that recent technologies make it possible to exploit in fields ranging from medicine to sociology to marketing. Through a virtual trip into space, the talk introduces the concepts, techniques and tools that let you start tapping the potential of Big Data in everyday work.
Making Hadoop Realtime by Dr. William Bain of ScaleOut Software - Data Con LA
Hadoop has been widely embraced for its ability to economically store and analyze large data sets. Using parallel computing techniques like MapReduce, Hadoop can reduce long computation times to hours or minutes. This works well for mining large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live operational data. Still, the idea of using Hadoop for real-time data analytics on live data is appealing because it leverages existing programming skills and infrastructure – and the parallel architecture of Hadoop itself. This presentation will describe how real-time analytics using Hadoop can be performed by combining an in-memory data grid (IMDG) with an integrated, stand-alone Hadoop MapReduce execution engine. This new technology delivers fast results for live data and also accelerates the analysis of large, static data sets.
Casterbridge Tours offers customized educational tours to destinations around the world. They design itineraries based on student and teacher interests, and include admissions to cultural sites. Tour guides accompany each group 24/7 to provide expertise and ensure a safe, rewarding experience. Since 1979 they have expanded globally and now serve over 200 schools and organizations annually.
High-Performance Networking Use Cases in Life Sciences - Ari Berman
Big data has arrived in the life science research domain and has driven the need for optimized high-performance networks in these research environments. Many petabytes of data transfer, storage and analytics are now a reality due to the fact that data is being produced cheaply and rapidly at unprecedented rates in academic, commercial and clinical laboratories. These data flows are complicated by the combination of high-frequency mouse flows as well as high-volume elephant flows, sometimes from the same application operating in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that utilize both common and disparate compute resources, the need for high-performance data encryption in-flight to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development from an IO standpoint throughout the industry. This presentation will cover representative advanced networking use cases from life sciences research, the challenges that they present in networking environments, some solutions that are being deployed with in both small and large institutions, and an overview of a few of the unresolved problems to date.
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret... - Cloudera, Inc.
PRGX is the world's leading provider of accounts payable audit services and works with leading global retailers. As new forms of data started to flow into their organization, standard RDBMS systems were not allowing them to scale. Now, by using Talend with Cloudera Enterprise, they are able to achieve a 9-10x performance benefit in processing data, reduce errors, and provide more innovative products and services to end customers.
Watch this webinar to learn how PRGX worked with Cloudera and Talend to create a high-performance computing platform for data analytics and discovery that rapidly allows them to process, model, and serve massive amount of structured and unstructured data.
Hadoop as a Platform for Genomics - Strata 2015, San Jose - Allen Day, PhD
Personalized medicine holds much promise to improve the quality of human life.
However, personalizing medicine depends on genome analysis software that does not scale well. Given the potential impact on society, genomics takes first place among fields of science that can benefit from Hadoop.
A single human genome contains about 3 billion base pairs. This is less than 1 gigabyte of data but the intermediate data produced by a DNA sequencer, required to produce a sequenced human genome, is many hundreds of times larger. Beyond the huge storage requirement, deep genomic analysis across large populations of humans requires enormous computational capacity as well.
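A quick back-of-the-envelope check of those figures (the 2-bits-per-base encoding and the 300x multiplier are illustrative assumptions, not numbers from the talk):

```python
# 3 billion base pairs at 2 bits per base (A/C/G/T) versus raw sequencer output.
base_pairs = 3_000_000_000
packed_gb = base_pairs * 2 / 8 / 1e9          # 2-bit encoding
print(f"packed genome: ~{packed_gb:.2f} GB")  # ~0.75 GB, under a gigabyte

# Intermediate data is "many hundreds of times larger"; e.g. at 300x:
print(f"intermediate data at 300x: ~{packed_gb * 300:.0f} GB")
```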
Interestingly enough, while genome scientists have adopted the concept of MapReduce for parallelizing I/O, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally.
How to Crunch Petabytes with Hadoop and Big Data using InfoSphere BigInsights... - Vladimir Bacvanski, PhD
This document discusses how to analyze large datasets using Hadoop and BigInsights. It describes how IBM's Watson uses Hadoop to distribute its workload and load information into memory from sources like 200 million pages of text, CRM data, POS data, and social media to provide distilled insights. The document provides two use case examples of how energy companies and global media firms could use big data analytics to analyze weather data and identify unauthorized streaming content.
Jasper Horrell - SKA and Big Data: Up in Space and on the Ground - Saratoga
Jasper Horrell manages the Science Computing and Innovation sector at SKA SA, focusing on the science and engineering of the telescope. Jasper has a vision for Africa as a leader in knowledge-based activity.
Slides from talks presented at Mammoth BI in Cape Town on 17 November 2014.
Visit www.mammothbi.co.za for details on the event. Follow @MammothBI on Twitter.
Big Data Analytics with Hadoop: Customer Stories - Yellowfin
Why watch?
Looking to analyze your growing data assets to unlock real business benefits today? But are you sick of all the Big Data hype and hoopla?
Watch this on-demand Webinar from Actian and Yellowfin – Big Data Analytics with Hadoop – to discover how we’re making Big Data Analytics fast and easy:
Learn how a telecommunications provider has already transformed its business using Big Data Analytics with Hadoop.
Hold on as we go from data in Hadoop to predictive analytics in just 40 minutes.
Learn how to combine Hadoop with the most advanced Big Data technologies, and the world's easiest BI solution, to quickly generate real business value from Big Data Analytics.
What will you learn?
Discover how Actian’s market-leading Big Data Analytics technologies, combined with Yellowfin’s consumer-oriented platform for reporting and analytics, makes generating value from Big Data Analytics faster and easier than you thought possible.
Join us as we demonstrate how to:
• Connect to, prepare and optimize Big Data in Hadoop for reporting and analytics.
• Perform predictive analytics on streaming Big Data: Learn how to empower all your analytics stakeholders to move from historical reports to predictive analytics and gain a sustainable competitive advantage.
• Communicate insights attained from Big Data: Optimize the value of your Big Data insights by learning how to effectively communicate analytical information to defined user groups and types.
This Webinar is ideal if…
• You want to act on more data and data types in shorter timeframes
• You want to understand the steps involved in achieving Big Data success – both front and back end
• You want to see how market leaders are leveraging Big Data to become data-driven organizations today
Looking to analyze and exploit Big Data assets stored in Hadoop? Then this Webinar is a must.
Architecting for Massive Scalability - St. Louis Day of .NET 2011 - Aug 6, 2011 - Eric D. Boyd
The Cloud now makes seemingly infinite amounts of computing power accessible to everyone. However, to maximize this power, your applications need to scale. In this session, we will explore patterns that enable massive scalability. We will examine Brewer’s CAP Theorem and contrast it to the ACID principles that guide traditional LOB applications. And finally, we will explore how to apply these patterns when building applications for the Cloud using Windows Azure.
The document discusses how big data and Hadoop technologies are rapidly growing in popularity and adoption across organizations. While increased data volumes and cheaper storage costs have enabled this trend, these factors alone do not fully explain why adoption is happening so quickly and widely among both large and small companies. The key reasons are that analytics scaling laws have changed to allow for much better returns through linear scaling on commodity hardware, and data practices have evolved to embrace flexible structures and denormalized data. These developments have created a tipping point where big data approaches provide radically improved value compared to traditional methods.
Hadoop ecosystem framework and Hadoop in a live environment - Delhi/NCR HUG
The document provides an overview of the Hadoop ecosystem and how several large companies such as Google, Yahoo, Facebook, and others use Hadoop in production. It discusses the key components of Hadoop including HDFS, MapReduce, HBase, Pig, Hive, Zookeeper and others. It also summarizes some of the large-scale usage of Hadoop at these companies for applications such as web indexing, analytics, search, recommendations, and processing massive amounts of data.
Scaling up Business Intelligence from the scratch and to 15 countries worldwi... - Sergii Khomenko
The talk describes our experience of setting up data reporting and Business Intelligence processes for an international company: starting with an Excel file and a bunch of SQL queries, then switching from an in-house reporting solution to centralised hosted reports to build a flexible system for monitoring the company's KPIs.
Attendees will learn from our experience how to integrate Tableau into the processes of a company, how to build independent ETL subsystems that scale to petabyte size and other useful learnings.
We will cover our early days with cloud solutions that do not provide a DWH platform, where you cannot expect production-grade guarantees. In the talk, we will go through the process of automatically duplicating our Tableau data sources to Amazon Redshift. That enables us to be more flexible with scaling data, be confident about backup strategies, and much more. We will introduce our Python toolchain that helps us in the daily management of our BI.
Petabyte scale on commodity infrastructure - elliando dias
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It describes how Hadoop addresses the need to reliably process huge datasets using a distributed file system and MapReduce processing on commodity hardware. It also provides details on how Hadoop has been implemented and used at Yahoo to process petabytes of data and support thousands of jobs weekly on large clusters.
Big Data and Hadoop in Cloud - Leveraging Amazon EMR - Vijay Rayapati
This document discusses big data, Hadoop, and using Hadoop in the cloud via Amazon EMR. It provides an overview of big data and what Hadoop is, explains how Hadoop works and how it can help store and process large datasets. It then discusses how Amazon EMR can be used to deploy Hadoop clusters in the cloud without having to manage the underlying infrastructure, and provides instructions on setting up and using EMR. Finally, it discusses debugging, profiling, and performance tuning Hadoop jobs and EMR clusters.
The document provides an agenda for a Hadoop/Big Data introductory session. The agenda covers introductions to big data concepts and Hadoop components like HDFS, MapReduce, Hive, HBase and Sqoop. It discusses working with HDFS and MapReduce, including file reads/writes in HDFS and MapReduce architecture, jobs and execution. Hands-on demos and code samples are proposed to supplement the theoretical content. The goal is to develop an understanding of big data theory and practice.
Andrew Ryan describes how Facebook operates Hadoop to provide access as a shared resource between groups.
More information and video at:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/
Architecting and productionising data science applications at scale - samthemonad
This document discusses architecting and productionizing data science applications at scale. It covers topics like parallel processing with Spark, streaming platforms like Kafka, and scalable machine learning approaches. It also discusses architectures for data pipelines and productionizing models, with a focus on automation, avoiding SQL databases, and using Kafka streams and Spark for batch and streaming workloads.
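As a concrete taste of the Spark side of such a pipeline, here is a minimal PySpark word count; the HDFS input path is a made-up placeholder, and the same DataFrame logic could be pointed at a streaming source such as Kafka via spark.readStream.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Hypothetical input location; split lines into words and count them.
lines = spark.read.text("hdfs:///data/events/*.txt")
counts = (lines
          .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
          .where(F.col("word") != "")
          .groupBy("word")
          .count())
counts.orderBy(F.desc("count")).show(20)

spark.stop()
```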
The document discusses Hadoop infrastructure at TripAdvisor including:
1) TripAdvisor uses Hadoop across multiple clusters to analyze large amounts of data and power analytics jobs that were previously too large for a single machine.
2) They implement high availability for the Hadoop infrastructure including automatic failover of the NameNode using DRBD, Corosync and Pacemaker to replicate the NameNode across two servers.
3) Monitoring of the Hadoop clusters is done through Ganglia and Nagios to track hardware, jobs and identify issues. Regular backups of HDFS and Hive metadata are also performed for disaster recovery.
Redis Developers Day 2014 - Redis Labs Talks - Redis Labs
These are the slides the Redis Labs team used to accompany the session we gave during the first ever Redis Developers Day on October 2nd, 2014, in London. They include some of the ideas we've come up with to tackle operational challenges in the hyper-dense, multi-tenant Redis deployments that our service - Redis Cloud - consists of.
Big data refers to large, complex datasets that are difficult to process using traditional methods. This document discusses three examples of real-world big data challenges and their solutions. The challenges included storage, analysis, and processing capabilities given hardware and time constraints. Solutions involved switching databases, using Hadoop/MapReduce, and representing complex data structures to enable analysis of terabytes of ad serving data. Flexibility and understanding domain needs were key to feasible versus theoretical solutions.
Apache Hadoop, HDFS and MapReduce Overview - Nisanth Simon
This document provides an overview of Apache Hadoop, HDFS, and MapReduce. It describes how Hadoop uses a distributed file system (HDFS) to store large amounts of data across commodity hardware. It also explains how MapReduce allows distributed processing of that data by allocating map and reduce tasks across nodes. Key components discussed include the HDFS architecture with NameNodes and DataNodes, data replication for fault tolerance, and how the MapReduce engine works with a JobTracker and TaskTrackers to parallelize jobs.
This document discusses challenges in large scale machine learning. It begins by discussing why distributed machine learning is necessary when data is too large for one computer to store or when models have too many parameters. It then discusses various challenges that arise in distributed machine learning including scalability issues, class imbalance, the curse of dimensionality, overfitting, and algorithm complexities related to data loading times. Specific examples are provided of distributing k-means clustering and spectral clustering algorithms. Distributed implementations of support vector machines are also discussed. Throughout, it emphasizes the importance of understanding when and where distributed approaches are suitable compared to single machine learning.
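To make the k-means point concrete: the algorithm distributes naturally because each worker needs only the current centroids, and per-cluster (sum, count) pairs combine associatively across data partitions. The pure-Python sketch below runs one such iteration; the data and cluster count are invented for illustration.

```python
import math
from collections import defaultdict

def assign(point, centroids):
    """Map step: index of the nearest centroid."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(partitions, centroids):
    sums = defaultdict(lambda: [0.0] * len(centroids[0]))
    counts = defaultdict(int)
    for partition in partitions:          # in reality, one task per partition
        for p in partition:               # map + local combine
            c = assign(p, centroids)
            sums[c] = [s + x for s, x in zip(sums[c], p)]
            counts[c] += 1
    # Reduce step: new centroid = per-cluster mean
    return [[s / counts[c] for s in sums[c]] if counts[c] else centroids[c]
            for c, _ in enumerate(centroids)]

parts = [[(0.0, 0.1), (0.2, 0.0)], [(5.0, 5.1), (4.8, 5.0)]]
print(kmeans_iteration(parts, [(0.0, 0.0), (5.0, 5.0)]))
```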
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C... - Reynold Xin
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
This document summarizes the roles of servers in a Hadoop cluster, including manager, name nodes, edge nodes, and data nodes. It discusses hardware considerations for Hadoop cluster design, like CPU-to-memory-to-disk ratios for different use cases. It also provides an overview of Dell's Hadoop solutions that integrate PowerEdge servers, Dell Networking switches, and support from Etu for analytic software and Dell Professional Services for implementation. It briefly discusses future directions around in-memory processing and virtualized Hadoop deployments.
This document provides an introduction to Hadoop, including:
- An overview of big data and the challenges it poses for data storage and processing.
- How Hadoop addresses these challenges through its distributed, scalable architecture based on MapReduce and HDFS.
- Descriptions of key Hadoop components like MapReduce, HDFS, Hive, and Sqoop.
- Examples of how to perform common data processing tasks like word counting and friend recommendations using MapReduce.
- Some best practices, limitations, and other tools in the Hadoop ecosystem.
The document discusses big data and distributed computing. It provides examples of the large amounts of data generated daily by organizations like the New York Stock Exchange and Facebook. It explains how distributed computing frameworks like Hadoop use multiple computers connected via a network to process large datasets in parallel. Hadoop's MapReduce programming model and HDFS distributed file system allow users to write distributed applications that process petabytes of data across commodity hardware clusters.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems posed by large and complex datasets that cannot be processed by traditional systems. Hadoop uses HDFS for storage and MapReduce for distributed processing of data in parallel. Hadoop clusters can scale to thousands of nodes and petabytes of data, providing low-cost and fault-tolerant solutions for big data problems faced by internet companies and other large organizations.
The document discusses big data and distributed computing. It explains that big data refers to large, unstructured datasets that are too large for traditional databases. Distributed computing uses multiple computers connected via a network to process large datasets in parallel. Hadoop is an open-source framework for distributed computing that uses MapReduce and HDFS for parallel processing and storage across clusters. HDFS stores data redundantly across nodes for fault tolerance.
Web-scale data processing: practical approaches for low-latency and batch - Edward Capriolo
The document is a slide deck presentation about batch processing, stream processing, and relational and NoSQL databases. It introduces the speaker and their experience with Hadoop, Cassandra, and Hive. It then covers batch processing using Hadoop, describing common architectures and use cases like processing web server logs. It discusses limitations of batch processing and then introduces stream processing concepts like Kafka and Storm. It provides an example of using Storm to perform word counting on streams of text data and discusses storing streaming results. Finally, it covers temporal databases and storing streaming results incrementally in Cassandra.
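The streaming word count described above decomposes into a stateless "split" stage feeding a stateful "count" stage. The generator sketch below mimics that topology in plain Python - it is not Storm's API, just the shape of the computation, with the flush interval standing in for periodic writes to a store like Cassandra.

```python
from collections import Counter

def split_bolt(lines):
    """Stateless stage: tokenize each line into lowercase words."""
    for line in lines:
        for word in line.split():
            yield word.lower()

def count_bolt(words, flush_every=1000):
    """Stateful stage: keep running totals, emitting periodic snapshots."""
    counts, seen = Counter(), 0
    for word in words:
        counts[word] += 1
        seen += 1
        if seen % flush_every == 0:
            yield dict(counts)   # stand-in for a write to the backing store

stream = iter(["the quick brown fox", "the lazy dog"] * 1000)
for snapshot in count_bolt(split_bolt(stream), flush_every=2000):
    print("flush:", len(snapshot), "distinct words")
```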
Chicago Data Summit: Keynote - Data Processing with Hadoop: Scalable and Cost... - Cloudera, Inc.
Hadoop is a new paradigm for data processing that scales near linearly to petabytes of data. Commodity hardware running open source software provides unprecedented cost effectiveness. It is affordable to save large, raw datasets, unfiltered, in Hadoop's file system. Together with Hadoop's computational power, this facilitates operations such as ad hoc analysis and retroactive schema changes. An extensive open source tool-set is being built around these capabilities, making it easy to integrate Hadoop into many new application areas.
Yahoo is the largest corporate contributor, tester, and user of Hadoop. They have 4000+ node clusters and contribute all their Hadoop development work back to Apache as open source. They use Hadoop for large-scale data processing and analytics across petabytes of data to power services like search and ads optimization. Some challenges of using Hadoop at Yahoo's scale include unpredictable user behavior, distributed systems issues, and the difficulties of collaboration in open source projects.
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010 - Bhupesh Bansal
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100 GB to TB of data.
Similar to BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal Software Engineer, SociableLabs
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
Need for Speed: Removing speed bumps from your Symfony projects ⚡️ - Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
Hand Rolled Applicative User Validation Code Kata - Philip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather to provide a small, rough-and-ready exercise to reinforce your muscle memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
Unveiling the Advantages of Agile Software Development.pdf - brainerhub1
Learn about the advantages of Agile software development and simplify your workflow to spur quicker innovation. Jump right in!
What is Augmented Reality Image Tracking - pavan998932
Augmented Reality (AR) Image Tracking is a technology that enables AR applications to recognize and track images in the real world, overlaying digital content onto them. This enhances the user's interaction with their environment by providing additional information and interactive elements directly tied to physical images.
E-commerce Development Services - Hornet Dynamics
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
GraphSummit Paris - The art of the possible with Graph Technology - Neo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Workshop – Innovating with Generative AI and knowledge graphsNeo4j
Go beyond the hype around AI and discover practical techniques for using AI responsibly with your organization's data. Explore how knowledge graphs can increase accuracy, transparency and explainability in generative AI systems. You will leave with hands-on experience combining data relationships and LLMs to bring domain-specific context and improve reasoning.
Bring your laptop and we will walk you through setting up your own generative AI stack, with practical, coded examples to get you started in minutes.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
8 Best Automated Android App Testing Tool and Framework in 2024.pdfkalichargn70th171
Two major players dominate the mobile operating system landscape: Android and iOS. With Android leading the market, software development companies are focused on delivering apps compatible with this OS. Ensuring an app's functionality across various Android devices, OS versions, and hardware specifications is critical, making Android app testing essential.
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j
Dr. Jesús Barrasa, Head of Solutions Architecture for EMEA, Neo4j
Discover the latest innovations from Neo4j, including the newest cloud integrations and product improvements that make Neo4j an essential choice for developers building applications with interconnected data and generative AI.
What is Master Data Management by PiLog Groupaymanquadri279
PiLog Group's Master Data Record Manager (MDRM) is a sophisticated enterprise solution designed to ensure data accuracy, consistency, and governance across various business functions. MDRM integrates advanced data management technologies to cleanse, classify, and standardize master data, thereby enhancing data quality and operational efficiency.
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
BDM37: Hadoop in production – the war stories by Nikolaï Grigoriev, Principal Software Engineer, SociableLabs
1. Hadoop – The War Stories
Running Hadoop in a large enterprise environment
Nikolai Grigoriev (ngrigoriev@gmail.com, @nikgrig)
Principal Software Engineer, http://sociablelabs.com
2. Agenda
● Why Hadoop?
● Planning Hadoop deployment
● Hadoop and real hardware
● Understanding the software stack
● Tuning HDFS, MapReduce and HBase
● Troubleshooting examples
● Testing your applications
Disclaimer: this presentation is based on the combined work experience from more than
one company and represents the author's personal point of view on the problems discussed in it.
3. Why Hadoop (and why we decided to use it)?
● Need to store hundreds of TB of data
● Need to process it in parallel
● Desire to have both storage and processing
horizontally scalable
● Having an open-source platform with commercial support
5. Our application in numbers
● Thousands of user sessions per second
● Average session log size: ~30 KB, 3-7 events
per log
● Target retention period – at least ~90 days
● Redundancy and HA everywhere
● Pluggable “ETL” modules for additional data
processing
6. Main problem
The team had no practical knowledge of Hadoop, HDFS and HBase…
...and there was nobody at the
company to help
7. But we did not realize...
It was not THE ONLY problem we
were about to face!
8. First fight – capacity planning
● Tons of articles are written about Hadoop
capacity planning
● Architects may be spending months making
educated guesses
● Capacity planning is really about finding the amount of $$$ to be spent on your cluster for a target workload
– If we had an infinite amount of $$$, why would we bother at all? ;)
10. It is all about the balance
● Your Hadoop cluster and your apps use all these resources at different times
● Over-provisioning one resource usually leads to a shortage of another – wasted $$$
11. What can we say about an app?
● It is going to store X Tb of data
– Amount of storage (don't forget the replication factor!)
– Accommodate for growth and failures
● It is going to ingest the data at Y Mb/s
– Your network speed and number of nodes
● Latency
– More HDDs and faster HDDs
– More RAM
– More nodes
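A back-of-the-envelope illustration (the numbers here are purely hypothetical, not ours): to store 100 TB of logical data with RF=3 and ~25% headroom for growth and re-replication after failures, you need roughly 100 × 3 × 1.25 ≈ 375 TB of raw capacity. At 14 × 2 TB disks per node (~28 TB raw each), that is about 14 data nodes as a floor – before ingest bandwidth or latency targets push the count higher.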
12. We are big enterprise...
Geeky Hadoop developer wants:
- many "commodity+" hosts
- good but inexpensive networking
- more regular HDDs
- lots of RAM
- "I also love cloud…"
- my recent OS
- my software configuration
- a simple network
Old School Senior IT Guy wants:
- SANs, RAIDs, SCSI, racks, blades, redundancy, Cisco, HP, fiber optics
- 4-year-old rock-solid RHEL, SNMP monitoring…
- "what? I am the Boss..."
13. Hadoop cluster vs. old school
application servers
● Mostly identical “commodity+” machines
– Probably with the exception of NN, JT
● Better to have more, simpler machines than fewer monster ones
● No RAID, just JBOD!
● Ethernet: depending on the storage density, bonded 1 Gbit may be enough
● Hadoop achieves with software what used to be
achievable with [expensive!] hardware
14. But still, your application is the
driver, not the IT guy!
From Cloudera website – Hadoop machine configuration according to workload
15. Your job is:
● Educate your IT, get them on your side or at
least earn their trust
● Try to build a capacity planning spreadsheet
based on what you do know
● Apply common sense to guess what you do not
know
● ...and plan a decent buffer
● Set reasonable performance targets for your
application
16. Fight #2 – OMG, our application is
slow!!!
● The main part of our application was the MR job merging the logs
● We had committed to delivering X logs/sec on a target test cluster with a sample workload
● We were delivering only ~30% of that
● ...weeks before release :)
● ...and we had run out of other excuses :(
● It was clearly our software and/or
configuration
17. Wait a second – we have support
contract from Hadoop vendor!
● I mean no disrespect to the vendors!
● But they do not know your application
● And they do not know your hardware
● And they do not know exactly your OS
● And they do not know your network equipment
● They can help you with some tuning, and they can help you with bugs and crashes – but they won't be able (or sometimes simply not qualified) to do your job!
18. We are on our own :(
● We realized that our testing methods were not adequate for a Hadoop-based ETL process
● Testing the product end-to-end was too difficult, and tracking changes was impossible
● Turnaround was too long; we could not try something quickly and revert
● Observing and monitoring the live system with dummy incoming data was not productive enough
19. Key to successful testing
● Representative data set
● Ability to repeat the same operation as many
times as needed with quick turnaround
● Each engineer had to be able to run the tests
and try something
● Establishing the key metrics you monitor and try
to improve
● A methodical approach – analyze, change, test, be ready to roll back
20. Our “reference runner”
Large sample dataset → "Reset" tool → Runner tool → Statistics → Manager
● "Reset" tool: recreates HBase tables (with predefined regions), cleans HDFS, etc.
● Runner tool: injects the test data, prepares the environment, launches the MR job like the real application, and allows quickly rebuilding and redeploying parts of the application
● Statistics: any improvements since the last run?
21. Tuning results
● In two weeks we had a job that ran about 3 times faster
● Tuning was done everywhere – from OS to
Hadoop/HBase and our code
● We were confident that the software was ready
to go to production
● Over the following 2 years we realized how bad our design was and how it should have been done ;)
22. Hadoop MapReduce DOs
● Think processes, not threads
● Reusable objects, lower GC overhead
● Snappy data compression is generally good
● Reasonable use of counters provides important
information
● For frequently running jobs, distributed cache helps a
lot
● Minimize disk I/O (spills etc), RAM is cheap
● Avoid unnecessary serialization/deserialization
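To make the compression and spill bullets concrete, here is a sketch of a job submission (the jar, class and paths are made up; the property names are the MR1-era ones – Hadoop 2 renamed them to their mapreduce.* equivalents – and the -D options assume the job uses ToolRunner/GenericOptionsParser):
# Snappy map output + a bigger sort buffer = fewer and cheaper spills
$ hadoop jar our-etl.jar com.example.LogMergeJob \
    -Dmapred.compress.map.output=true \
    -Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -Dio.sort.mb=256 -Dio.sort.factor=64 \
    /input/logs /output/merged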
23. Hadoop MapReduce DONTs
● Small files in HDFS
● Multithreaded programming inside
mapper/reducer
● Fat tasks using too much heap
● Any I/O in M-R other than HDFS, ZK or HBase
● Over-complicated code (simple things work
better)
24. Fight #3 – Going Production!
● Remember the slide about engineer vs. IT God
preferences ;)
● Production hardware was slightly different from
the test cluster
● The cluster was deployed by people who did not know Hadoop
● The first attempt to run the software resulted in a major failure, and the cluster was finally handed over to the developers for fixing ;)
25. Production hardware
● HP blade servers, 32 cores, 128 GB of RAM
● Emulex dual-port 10G Ethernet NICs
● 14 HDDs per machine
● OEL 6.3
● 10G switch modules
● Company hosting center with dedicated
networking and operations staff
26. Step back – a 10,000 ft look at the Hadoop stack
The stack, bottom to top:
● Hardware
● BIOS/Firmware(s)
● BIOS/Firmware settings
● OS (Linux)
● Java (JVM)
● Hadoop services
● Your application(s)
...all tied together by the network.
- Hadoop is not just a bunch of Java apps
- It is a data and application platform
- It can run well, just run, barely run, or cause constant headaches – depending on how much love it receives :)
27. Hadoop stack (continued)
● In Hadoop a small problem, sometimes even on a single node, can be a major pain
● Isolating and finding that small problem may be
difficult
● Symptoms are often obvious only at high level
(e.g. application)
● Complex hardware (like HP) adds more
potential problems
28. Example of one of the problems we
had initially
● Jobs were failing because of timeouts
● Numerous I/O errors observed in job and HDFS logs
● This simple test was failing:
$ dd if=/dev/zero of=test8Gb.bin bs=1M count=8192
$ time hdfs dfs -copyFromLocal test8Gb.bin /
Zzz..zzz...zzz...5min...zzz…
real 4m10.002s
user 0m15.130s
sys 0m4.094s
● IT was clueless but did not really bother
● In fact, 8192 MB / (4 × 60 + 10) s ≈ 32 MB/s (!?!?!)
● That is on a 10 Gbit network, which should be moving data into HDFS at ~160 MB/s or more
29. Role of HDFS in Hadoop
● In Hadoop HDFS is the key layer that provides
the distributed filesystem services for other
components
● Health of HDFS directly (and drastically) affects
the health of other components
Map-Reduce and HBase both sit on top of HDFS, which holds the data.
30. So, clearly HDFS was the problem
● But what exactly was the problem with HDFS??
● How exactly does HDFS writing work?
31. Chasing it down
● Due to node-to-node streaming it was difficult to understand which node was responsible
● The theory of "one bad node in the pipeline" was ruled out, as results were consistently bad across the cluster of 14 nodes
● Idea (isolating the problem is good):
$ time hdfs dfs -Ddfs.replication=1 -copyFromLocal test8Gb.bin /
real 0m42.002s
$ time hdfs dfs -Ddfs.replication=2 -copyFromLocal test8Gb.bin /
real 2m53.184s
$ time hdfs dfs -Ddfs.replication=3 -copyFromLocal test8Gb.bin /
real 3m41.072s
● 8192 MB / 42 s ≈ 195 MB/s – hmmm… a single-replica (local) write is fast, so throughput collapses as soon as a second replica has to cross the network
32. Discoveries
● To make an even longer story short...
– A bug in the "cubic" TCP congestion-control algorithm in the Linux kernel
– NIC firmware was too old
– The kernel driver for the Emulex 10G NICs was too old
– Only one out of 8 NIC RX queues was enabled on some hosts
– A number of network settings were inappropriate for a 10G network
– The "irqbalance" process (due to a kernel bug) was locking NIC RX queues by "losing" NIC IRQ handlers
– ...
33. More discoveries
– Nodes were set up multi-homed, even though HDFS at that time did not support that
– Misconfigured DNS and reverse DNS
● On the disk I/O side
– Bad filesystem parameters
– Read-ahead settings were wrong
– Disk controller firmware was old
34. HDFS “litmus” test - TestDFSIO
13/03/13 16:30:02 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
13/03/13 16:30:02 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:30:02 UTC 2013
13/03/13 16:30:02 INFO fs.TestDFSIO: Number of files: 16
13/03/13 16:30:02 INFO fs.TestDFSIO: Total MBytes processed: 160000.0
13/03/13 16:30:02 INFO fs.TestDFSIO: Throughput mb/sec: 103.42190773343779
13/03/13 16:30:02 INFO fs.TestDFSIO: Average IO rate mb/sec: 103.61066436767578
13/03/13 16:30:02 INFO fs.TestDFSIO: IO rate std deviation: 4.513343367320971
13/03/13 16:30:02 INFO fs.TestDFSIO: Test exec time sec: 114.876
13/03/13 16:31:31 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
13/03/13 16:31:31 INFO fs.TestDFSIO: Date & time: Wed Mar 13 16:31:31 UTC 2013
13/03/13 16:31:31 INFO fs.TestDFSIO: Number of files: 16
13/03/13 16:31:31 INFO fs.TestDFSIO: Total MBytes processed: 160000.0
13/03/13 16:31:31 INFO fs.TestDFSIO: Throughput mb/sec: 586.8243268024676
13/03/13 16:31:31 INFO fs.TestDFSIO: Average IO rate mb/sec: 648.8555908203125
13/03/13 16:31:31 INFO fs.TestDFSIO: IO rate std deviation: 267.0954600161208
13/03/13 16:31:31 INFO fs.TestDFSIO: Test exec time sec: 33.683
13/03/13 16:31:31 INFO fs.TestDFSIO:
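For reference, output like the above comes from runs of this shape (the test jar's exact name and location vary by distribution and version – this is a sketch):
$ hadoop jar hadoop-test.jar TestDFSIO -write -nrFiles 16 -fileSize 10000
$ hadoop jar hadoop-test.jar TestDFSIO -read -nrFiles 16 -fileSize 10000
$ hadoop jar hadoop-test.jar TestDFSIO -clean
# 16 files × 10000 MB = the 160000 MB total shown in the log above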
35. Fight #4 – tuning Hadoop
● Why do people tune things
(IT was not interested ;) )?
● With your own expensive hardware you want the maximum IOPS and CPU power for the $$$ you have paid
● Not to mention that you simply want your apps to
run faster
● Tuning is an endless process, but the 80/20 rule works perfectly
36. Even before you have something to
tune….
● Pick reasonably good hardware but do not go
high-end
● Same for network equipment
● Hadoop scales well, and redundancy is achieved in software
● More nodes are almost always better than extra per-node power and/or storage space
● Simpler systems are easier to tune, maintain and troubleshoot
● Use different machines for master nodes
37. Tuning the hardware and BIOS
● Update BIOS and firmware to recent versions
● Disable dynamic CPU frequency scaling
● Tune memory speed and the power profile
● Tune the disk controller and disk cache
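For example, a quick way to check and fix CPU frequency scaling from Linux (tool availability varies by distro; cpupower is one option, and this does not replace the BIOS-level setting):
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # "ondemand" = throttled
$ cpupower frequency-set -g performance                       # pin all cores to full speed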
38. OS Tuning
● Pick the filesystem (ext3, ext4, XFS...), its parameters (0% reserved blocks) and mount options (noatime, nodiratime, barriers, etc.)
● Pick the I/O scheduler depending on your disks and tasks
● Read-ahead settings
● Disable swap!
● irqbalance for big machines
● Tune other parameters (number of FDs, sockets)
● Install the major troubleshooting tools (iostat, iotop, tcpdump, strace…) on every node
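A condensed sketch of the commands behind these bullets (device and mount names and values are illustrative – test on your own disks, and make the settings permanent via fstab/sysctl.conf):
# 0% reserved blocks and noatime on a data disk
$ tune2fs -m 0 /dev/sdb1
$ mount -o remount,noatime,nodiratime /data01
# deadline scheduler and a larger read-ahead per data disk
$ echo deadline > /sys/block/sdb/queue/scheduler
$ blockdev --setra 8192 /dev/sdb
# effectively disable swapping; raise the FD limit for the hadoop user
$ sysctl -w vm.swappiness=0
$ ulimit -n 65536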
39. Network tuning
● Test your TCP performance with iperf, ttcp or any other
tools you like
● Know your NICs well, install right firmware and kernel
modules
● Tune your TCP and IP parameters (work harder if you have an expensive 10G network)
● If your NIC supports TCP offload and it works – use it
● txqueuelen, MTU 9000 (if appropriate), HDFS is chatty
● Learn ethtool and see what it can do for you
● Basic IP networking set-up (DNS etc) has to be 100%
perfect
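The same in command form – a sketch with illustrative values, not a recipe:
# measure the raw TCP throughput first
$ iperf -s                     # on one node
$ iperf -c node01 -P 4         # on another, 4 parallel streams
# bigger socket buffers for 10G (tune to your own gear)
$ sysctl -w net.core.rmem_max=16777216 net.core.wmem_max=16777216
$ sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
$ sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# inspect offloads and ring buffers, learn what your NIC can do
$ ethtool -k eth0
$ ethtool -g eth0
# longer TX queue; jumbo frames only if every hop supports them
$ ip link set dev eth0 mtu 9000 txqueuelen 10000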
40. JVM tuning
● Hadoop allows you to set JVM options for all
processes
● Your DataNodes, NameNode and HBase RegionServers are going to work hard, and you need to help them deal with your workload
● If your MR code is well designed you will most
likely NOT need to tune JVM for MR tasks
● Your main enemy will be GC – until you become at least allies, if not friends :)
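As an illustration, daemon JVM flags live in hadoop-env.sh; a CMS-era sketch for the DataNode (the heap size and log path are placeholders, not recommendations):
# hadoop-env.sh
export HADOOP_DATANODE_OPTS="-Xmx4g \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hadoop/dn-gc.log $HADOOP_DATANODE_OPTS"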
41. Tuning Hadoop services
● NameNode deals with many connections and
needs ~150 bytes per HDFS block
● NameNode and DataNode are highly concurrent; the latter needs many threads
● Use HDFS short-circuit reads if appropriate
● ZooKeeper needs to handle enough connections
● HBase uses LOTS of heap
● Reuse JVMs for MR jobs if appropriate
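Several of these knobs live in hdfs-site.xml; the values below are illustrative, and on older releases the transfer-thread limit is called dfs.datanode.max.xcievers:
<property><name>dfs.namenode.handler.count</name><value>64</value></property>
<property><name>dfs.datanode.max.transfer.threads</name><value>8192</value></property>
<property><name>dfs.client.read.shortcircuit</name><value>true</value></property>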
42. Tuning MapReduce tasks (that means
tuning for your code and data)
● If you run different MR jobs, consider tuning
parameters for each of them, not once and for
all of them
● Configure job scheduler to enforce the SLAs
● Estimate the resources needed for each job
● Plan how you are going to run your jobs
43. Tuning your own code
● Test and profile your complex MR code outside of
Hadoop (your savings will scale too!)
● Check for GC overhead
● Use reusable objects
● Avoid using expensive formats like JSON and XML
● Anything you waste is multiplied by the number of
rows and the number of tasks!
● Evaluate the need for intermediate data compression
44. Tuning HBase
● That topic deserves a separate presentation of its own
● You will need to fight hard to reduce GC pauses and overhead
● Pre-splitting regions may be a good idea to better balance the load
● Understand HBase compactions and deal with major compactions on your own terms
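For example, pre-splitting and manual major compactions from the HBase shell (the table name, column family and split points are made up):
$ echo "create 'session_logs', {NAME => 'd'}, {SPLITS => ['1', '2', '3', '4']}" | hbase shell
# turn off time-based major compactions (hbase.hregion.majorcompaction=0)
# and trigger them yourself during off-peak hours:
$ echo "major_compact 'session_logs'" | hbase shell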
45. Set up your monitoring (and alerting)
● You cannot improve what you cannot see!
● Monitor OS, Hadoop and your app metrics
● Ganglia, Graphite, LogStash, even Cloudera
Manager are your friends
● Set the baseline, track your changes, observe
the outcome
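For example, every Hadoop daemon of that era exposes its metrics as JSON over HTTP, which is easy to scrape into Graphite (host and port are illustrative; 50070 was the default NameNode web UI port at the time):
$ curl -s http://namenode:50070/jmx
$ curl -s 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'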
46. Fight #5 - Operations
● A real hand-over to the Operations people never actually happened
● Any problem was either ignored or escalated to the engineers within about a minute
● Neither NOC nor Operations staff wanted to acquire enough knowledge of Hadoop and the apps
● Monitoring was nearly non-existent
● Same for appropriate alarms
47. If you are serious...
● Send your Ops for Hadoop training (or buy
them books and have them read those!)
● Have them automate everything
● Ops have to understand your applications, not
just the platform they are running on
● Your Ops need to be decent Linux admins
● ...and it would be great if they are also decent programmers (scripting, Java…)
● Of course, motivation is key
48. Plan and train for disaster
● Train your Ops to help your system survive until Monday morning
● Decide what sort of loss you can tolerate (Big Data is not always so precious)
● Design your system for resilience, async
processing, queuing etc
49. Fight #6 - evolution
● Sooner or later you will need to increase your
capacity
– Unless your business is stagnating
● Technically, you will either
– Run out of storage space
– Start hitting the wall on IOPS or CPU and fail to meet your SLAs (even if only internal ones)
– Be unable to deploy new applications
50. Understand your application - again
● Even if your apps run fine you need to monitor the performance factors
● Build spreadsheets reflecting your current numbers
● Plan for the business growth
● Translate this into the number of additional nodes
and networking equipment
● Especially important if your hardware purchase
cycle takes months
51. Conclusions
● Not all companies are ready for BigData – often
because of conservative people in key positions
● Traditional IT/Ops/NOC organizations are often
unable to support these platforms
● Engineers have to be given more power to control how the things they build are run (DevOps)
● Hadoop is a complex platform and has to be
taken seriously for serious applications
● If you really depend on Hadoop you do need to
build in-house expertise