LCA13: Hadoop DFS Performance
1. LEG – Hadoop DFS Performance
Steve Capper <steve.capper@linaro.org>
2. Hadoop Performance on ARM
I concentrated on reducing the time taken for CPU-bound tasks (the latency).
My work so far has focused on the underlying cluster filesystem, HDFS, as this underpins a lot of Hadoop workloads.
The latest release-tagged Hadoop at the time of experimentation was 2.0.2-alpha, so this was the version used (2.0.3-alpha has just come out).
Hadoop installations consist of rather a lot of moving parts, so please let me know if I should be concentrating on something else! :-)
3. Hadoop Distributed Filesystem – Overview
HDFS is the default Hadoop distributed filesystem.
It consists of multiple “nodes”:
Namenode – holds the metadata for the filesystem and keeps track of the datanodes.
Only one namenode is active at a time per namespace.
The metadata is held in memory, so there is a high memory requirement for a namenode.
Datanodes – store the files' blocks.
The default block size is 64 MB.
(Optional) Passive namenodes – to maintain a High Availability configuration, one can set up shared storage (via NFS or via journalnodes) and have namenodes on standby to fail over.
Filesystem blocks are replicated between datanodes.
Datanodes can be “rack aware”; data will be distributed between
racks.
Nodes can run on the same machine or on different
machines.
4. Data Integrity in HDFS
Hadoop mitigates hardware failure in the following ways:
Metadata can be saved to multiple filesystems, and regular snapshots are usually taken to allow for disaster recovery.
In some HA configurations, multiple journalnodes maintain a quorum of the metadata.
Data blocks are replicated across multiple datanodes (preferably to different racks).
Data blocks are regularly transmitted between nodes:
All data streams are checksummed.
The default checksum algorithm is CRC32c.
The default number of bytes per checksum is 512.
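To make the checksumming scheme concrete, here is a minimal sketch in C (not Hadoop's code; the function names are made up for the example): a bitwise CRC32c applied one 512-byte chunk at a time, mirroring the default bytes-per-checksum. The real implementations (slice-by-8 tables, NEON) are much faster.

    #include <stdint.h>
    #include <stddef.h>

    /* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78.
       One bit per step: easy to follow, but far slower than slice-by-8. */
    static uint32_t crc32c(uint32_t crc, const uint8_t *buf, size_t len)
    {
        crc = ~crc;
        while (len--) {
            crc ^= *buf++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
        }
        return ~crc;
    }

    /* One checksum per 512-byte chunk, as HDFS does by default
       (the final chunk may be shorter). */
    static void checksum_chunks(const uint8_t *buf, size_t len,
                                uint32_t *sums /* len/512 + 1 entries */)
    {
        for (size_t i = 0; len > 0; i++) {
            size_t n = len < 512 ? len : 512;
            sums[i] = crc32c(0, buf, n);
            buf += n;
            len -= n;
        }
    }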
5. Test Configuration
So far I have been micro-benchmarking single-machine workloads (1 namenode, 1 datanode, 1 workload).
I have been working on a Cortex-A9 platform.
TestDFSIO -read and -write have been tested with a 10 GB file size.
Soft-float Oracle JDK 1.7.0_10 was used.
Performance was measured using Linux perf -a; Java samples were taken with the built-in profiler:
-Xrunhprof:cpu=samples,depth=10
8. Some analysis of plain TestDFSIO
The namenode doesn't do much when it's in charge of one datanode. (This will change when things scale up.)
When writing data:
Most of the Java time is spent in PureJavaCrc32C.update
Most of the CPU time is spent in Java land.
When reading data:
Most of the Java time is spent in
NativeCrc32.nativeVerifyChunkedSums.
Most of the CPU time is in the native function: crc32c_sb8
For both reading and writing data:
~10% of the CPU time is spent copying memory.
9. Optimising the read and write paths
Trevor Robinson has proposed a patch to improve the write
path:
HDFS-3529 “Use direct buffers for data in write path”
I have been working on speeding up the computation of CRC32c checksums by using NEON.
Also, I've performed some very preliminary experiments on
replacing PureJavaCrc32C with a JNI class that references
the NEON CRC code.
10. NEON optimisation of the CRC32c algorithm
I worked on an algorithm in Q4 last year:
https://wiki.linaro.org/LEG/Engineering/CRC
Given an input buffer, we perform polynomial multiplication and addition to give a slightly smaller buffer with the same CRC. This is referred to as “folding”.
The algorithm reduced the buffer by 32 bytes at a time.
The final buffer of 32 bytes was then processed by slice-by-8 as normal (sketched below).
I found that the vast majority of CRCs in Hadoop (in both NativeCrc32.nativeVerifyChunkedSums and PureJavaCrc32C.update) were computed for 512-byte buffers.
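For reference, the slice-by-8 technique (as used by the existing crc32c_sb8 native routine, and here for the tail) looks roughly like this. This is a plain-C sketch of the idea, not the actual Hadoop source:

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t tbl[8][256];

    /* tbl[t][i] holds the CRC of byte i followed by t zero bytes, so
       eight input bytes can be consumed per loop with table lookups. */
    static void crc32c_sb8_init(void)
    {
        for (int i = 0; i < 256; i++) {
            uint32_t c = (uint32_t)i;
            for (int k = 0; k < 8; k++)
                c = (c >> 1) ^ (0x82F63B78u & -(c & 1u));
            tbl[0][i] = c;
        }
        for (int i = 0; i < 256; i++)
            for (int t = 1; t < 8; t++)
                tbl[t][i] = (tbl[t - 1][i] >> 8) ^ tbl[0][tbl[t - 1][i] & 0xff];
    }

    static uint32_t crc32c_sb8(uint32_t crc, const uint8_t *p, size_t len)
    {
        crc = ~crc;
        while (len >= 8) {
            /* Fold the running CRC into the first four bytes, then look
               up all eight bytes in parallel across the eight tables. */
            uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
                                 (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
            uint32_t hi = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
                          (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;
            crc = tbl[7][lo & 0xff] ^ tbl[6][(lo >> 8) & 0xff]
                ^ tbl[5][(lo >> 16) & 0xff] ^ tbl[4][lo >> 24]
                ^ tbl[3][hi & 0xff] ^ tbl[2][(hi >> 8) & 0xff]
                ^ tbl[1][(hi >> 16) & 0xff] ^ tbl[0][hi >> 24];
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = (crc >> 8) ^ tbl[0][(crc ^ *p++) & 0xff];
        return ~crc;
    }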
11. A single 64-bit fold
[Diagram: a single 64-bit fold. The head words A(x) and B(x) of the message M(x) are carry-less multiplied by the precomputed constants (x^64 mod P(x)) and (x^96 mod P(x)) respectively; the two 64-bit products are XORed into the following data, giving a shorter message M'(x) with CRC(M'(x)) = CRC(M(x)).]
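In plain C, one fold step can be modelled like this. This is a sketch of the maths above, not the NEON code: a bitwise carry-less multiply stands in for the polynomial multiply, and the two fold constants are left as parameters since their exact values depend on the polynomial and bit-order convention.

    #include <stdint.h>

    /* 32x32 -> 64-bit carry-less (polynomial) multiply; a plain-C
       stand-in for NEON polynomial multiplication. */
    static uint64_t clmul32(uint32_t a, uint32_t b)
    {
        uint64_t r = 0;
        for (int i = 0; i < 32; i++)
            if ((b >> i) & 1)
                r ^= (uint64_t)a << i;
        return r;
    }

    /* One fold: the head words a and b (sitting 64 and 96 bits above
       the tail) are multiplied by the precomputed constants
       k64 = x^64 mod P(x) and k96 = x^96 mod P(x), and the products
       are XORed into the next 64 bits. The message shrinks; the CRC
       is preserved. */
    static uint64_t fold64(uint32_t a, uint32_t b, uint64_t next64,
                           uint32_t k64, uint32_t k96)
    {
        return clmul32(a, k64) ^ clmul32(b, k96) ^ next64;
    }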
12. NEON implementations
With ARMv7 NEON we can only perform polynomial multiplication on 8-bit lanes:
We need to be able to multiply at least 32 bits, thus multiple vmull.p8s were chained together to achieve this (as sketched below).
There are 16 vmull.p8s per fold, so there is a little register pressure!
A gcc intrinsic version has been coded up:
It is considerably simpler to look at.
Unfortunately, it runs a little slower than the hand-optimised assembler.
A test case was sent to the Linaro Toolchain Working Group a couple of weeks ago and is being analysed by them.
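To show why the chaining is needed (again a sketch, not the actual NEON kernel): one lane of vmull.p8 is an 8x8-bit carry-less multiply, and a 32x32-bit multiply can be built schoolbook-style from 4x4 = 16 such partial products, which is where the 16 vmull.p8s per fold come from.

    #include <stdint.h>

    /* 8x8 -> 16-bit carry-less multiply: what a single vmull.p8 lane
       computes. */
    static uint16_t clmul8(uint8_t a, uint8_t b)
    {
        uint16_t r = 0;
        for (int i = 0; i < 8; i++)
            if ((b >> i) & 1)
                r ^= (uint16_t)(a << i);
        return r;
    }

    /* 32x32 -> 64-bit carry-less multiply from 16 byte-sized partial
       products, XOR-accumulated at the right byte offsets: the same
       chaining the NEON code performs across vector registers. */
    static uint64_t clmul32_chained(uint32_t a, uint32_t b)
    {
        uint64_t r = 0;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                r ^= (uint64_t)clmul8((uint8_t)(a >> 8 * i),
                                      (uint8_t)(b >> 8 * j)) << (8 * (i + j));
        return r;
    }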
13. Replacing PureJavaCrc32C
PureJavaCrc32C has two noteworthy methods:
public void update(byte[] b, int off, int len) – called mostly for lengths
of 512.
public void update(int b) – seldom called.
I created a new class implementing the Checksum interface
that:
Used the same implementation of update(int b) as PureJavaCrc32C.
Called straight into JNI for update(byte[] b, int off, int len).
The class name NativeCrc32 was already taken by
something else, so I chose a rather silly temporary name:
HyperCrc32C
14. Dealing with byte[] in JNI
We are only reading from the byte[] array, and only for a very short
time.
Thus I pinned the buffer in memory with:
GetPrimitiveArrayCritical
Then subsequently released the buffer with
ReleasePrimitiveArrayCritical(..., JNI_ABORT)
This worked for me, but perhaps a better long term solution would
be to change the backing data type to a ByteBuffer?
We could also be clever and change the alignment of these?
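A minimal sketch of what such a JNI entry point can look like (hypothetical names, not the actual patch, and a simplified shape that returns the updated CRC rather than storing it in a field as a real Checksum implementation would):

    #include <jni.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The NEON-accelerated CRC kernel, assumed to live elsewhere. */
    extern uint32_t crc32c_neon(uint32_t crc, const uint8_t *buf, size_t len);

    /* Hypothetical native backend for HyperCrc32C.update(byte[], int, int).
       GetPrimitiveArrayCritical pins the array (or hands back a copy) for
       a short critical section; releasing with JNI_ABORT says "read-only,
       nothing to copy back". No other JNI calls may be made while pinned. */
    JNIEXPORT jint JNICALL
    Java_HyperCrc32C_update0(JNIEnv *env, jclass cls,
                             jint crc, jbyteArray b, jint off, jint len)
    {
        jbyte *buf = (*env)->GetPrimitiveArrayCritical(env, b, NULL);
        if (buf == NULL)
            return crc;  /* OutOfMemoryError is already pending */

        jint out = (jint)crc32c_neon((uint32_t)crc,
                                     (const uint8_t *)buf + off,
                                     (size_t)len);

        (*env)->ReleasePrimitiveArrayCritical(env, b, buf, JNI_ABORT);
        return out;
    }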
Rather than test every optimisation individually in this talk, I am going to put them all together.
15. TestDFSIO read & write CPU usage
[Charts: CPU-usage breakdown for the plain and new TestDFSIO write runs, and the plain and new TestDFSIO read runs. Segments: Other, Java, memcpy, crc.]
16. Some Analysis of the New TestDFSIO runs
The namenode samples are unchanged.
For the write path:
Most of the time is now spent running native code rather than Java
code.
There is a noticeable reduction in Hadoop user CPU usage.
For the read path:
Most of the time is still spent running native code.
There is again a reduction in Hadoop user CPU usage.
For both the read and write paths there is a significant
amount of CPU time spent copying memory around.
17. Conclusions and Further Work
The CPU usage required for TestDFSIO runs has been reduced for both the read and write paths.
This gives us more CPU cycles to run Hadoop jobs with!
Hadoop is known to be very sensitive to underlying disk IO.
To optimise HDFS IO, it would make sense to optimise disk/filesystem
IO as much as possible and re-run these benchmarks.
As Hadoop runs under Java:
It makes sense to keep track of JVMs.
A beta hard-float JVM has been released.
The CPU usage for memcpy is making me uneasy!
26. More about Linaro Connect: www.linaro.org/connect/
More about Linaro: www.linaro.org/about/
More about Linaro engineering: www.linaro.org/engineering/