LCA13: Hadoop DFS Performance
1. LEG – Hadoop DFS Performance
Steve Capper <steve.capper@linaro.org>
2. Hadoop Performance on ARM
I concentrated on reducing the time taken for CPU-bound tasks (the latency).
My work so far has focused on the underlying cluster filesystem, HDFS, as this underpins a lot of Hadoop workloads.
The latest release-tagged Hadoop at the time of experimentation was 2.0.2-alpha, so this was the version used (2.0.3-alpha has just come out).
Hadoop installations consist of rather a lot of moving parts, so please let me know if I should be concentrating on something else! :-)
3. Hadoop Distributed Filesystem – Overview
HDFS is the default Hadoop distributed filesystem.
It consists of multiple “nodes”:
Namenode – holds the metadata for the filesystem and keeps track of the datanodes.
Only one namenode is active at a time per namespace.
The metadata is held in memory, so there is a high memory requirement for a namenode.
Datanodes – store the files' blocks.
The default block size is 64 MB.
(Optional) Passive namenodes – to maintain a High Availability configuration, one can set up shared storage (via NFS or via journalnodes) and have namenodes on standby to fail over.
Filesystem blocks are replicated between datanodes.
Datanodes can be “rack aware”; data will be distributed between
racks.
Nodes can run on the same machine or on different
machines.
4. Data Integrity in HDFS
Hadoop mitigates hardware failure in the following ways:
Metadata can be saved to multiple filesystems, and regular snapshots are usually taken to allow for disaster recovery.
In some HA configurations, multiple journalnodes maintain a quorum of the metadata.
Data blocks are replicated across multiple datanodes (preferably to different racks).
Data blocks are regularly transmitted between nodes:
All data streams are checksummed.
The default checksum algorithm is CRC32c.
The default number of bytes per checksum is 512.
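To make the checksumming scheme concrete, here is a minimal sketch in C (not Hadoop's code; the function names are made up for the example): a bitwise CRC32c applied one 512-byte chunk at a time, mirroring the default bytes-per-checksum. The real implementations (slice-by-8 tables, NEON) are much faster.

    #include <stdint.h>
    #include <stddef.h>

    /* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78.
       One bit per step: easy to follow, but far slower than slice-by-8. */
    static uint32_t crc32c(uint32_t crc, const uint8_t *buf, size_t len)
    {
        crc = ~crc;
        while (len--) {
            crc ^= *buf++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
        }
        return ~crc;
    }

    /* One checksum per 512-byte chunk, as HDFS does by default
       (the final chunk may be shorter). */
    static void checksum_chunks(const uint8_t *buf, size_t len,
                                uint32_t *sums /* len/512 + 1 entries */)
    {
        for (size_t i = 0; len > 0; i++) {
            size_t n = len < 512 ? len : 512;
            sums[i] = crc32c(0, buf, n);
            buf += n;
            len -= n;
        }
    }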
5. Test Configuration
So far I have been micro-benchmarking single-machine workloads (1 namenode, 1 datanode, 1 workload).
I have been working on a Cortex-A9 platform.
TestDFSIO -read and -write have been tested with a 10 GB file size.
Soft-float Oracle JDK 1.7.0_10 was used.
Performance was measured using Linux perf -a; Java samples were taken with the built-in profiler:
-Xrunhprof:cpu=samples,depth=10
8. Some analysis of plain TestDFSIO
The namenode doesn't do much when it's in charge of one datanode. (This will change when things scale up.)
When writing data:
Most of the Java time is spent in PureJavaCrc32C.update
Most of the CPU time is spent in Java land.
When reading data:
Most of the Java time is spent in
NativeCrc32.nativeVerifyChunkedSums.
Most of the CPU time is in the native function: crc32c_sb8
For both reading and writing data:
~10% of the CPU time is spent copying memory.
9. Optimising the read and write paths
Trevor Robinson has proposed a patch to improve the write
path:
HDFS-3529 “Use direct buffers for data in write path”
I have been working on speeding up the computation of CRC32c checksums by using NEON.
Also, I've performed some very preliminary experiments on
replacing PureJavaCrc32C with a JNI class that references
the NEON CRC code.
10. NEON optimisation of the CRC32c algorithm
I worked on an algorithm in Q4 last year:
https://wiki.linaro.org/LEG/Engineering/CRC
Given an input buffer, we perform polynomial multiplication and addition to give a slightly smaller buffer with the same CRC. This is referred to as “folding”.
The algorithm reduced the buffer by 32 bytes at a time.
The final buffer of 32 bytes was then processed by slice-by-8 as normal (sketched below).
I found that the vast majority of CRCs in Hadoop (in both NativeCrc32.nativeVerifyChunkedSums and PureJavaCrc32C.update) were computed for 512-byte buffers.
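For reference, the slice-by-8 technique (as used by the existing crc32c_sb8 native routine, and here for the tail) looks roughly like this. This is a plain-C sketch of the idea, not the actual Hadoop source:

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t tbl[8][256];

    /* tbl[t][i] holds the CRC of byte i followed by t zero bytes, so
       eight input bytes can be consumed per loop with table lookups. */
    static void crc32c_sb8_init(void)
    {
        for (int i = 0; i < 256; i++) {
            uint32_t c = (uint32_t)i;
            for (int k = 0; k < 8; k++)
                c = (c >> 1) ^ (0x82F63B78u & -(c & 1u));
            tbl[0][i] = c;
        }
        for (int i = 0; i < 256; i++)
            for (int t = 1; t < 8; t++)
                tbl[t][i] = (tbl[t - 1][i] >> 8) ^ tbl[0][tbl[t - 1][i] & 0xff];
    }

    static uint32_t crc32c_sb8(uint32_t crc, const uint8_t *p, size_t len)
    {
        crc = ~crc;
        while (len >= 8) {
            /* Fold the running CRC into the first four bytes, then look
               up all eight bytes in parallel across the eight tables. */
            uint32_t lo = crc ^ ((uint32_t)p[0] | (uint32_t)p[1] << 8 |
                                 (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24);
            uint32_t hi = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
                          (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;
            crc = tbl[7][lo & 0xff] ^ tbl[6][(lo >> 8) & 0xff]
                ^ tbl[5][(lo >> 16) & 0xff] ^ tbl[4][lo >> 24]
                ^ tbl[3][hi & 0xff] ^ tbl[2][(hi >> 8) & 0xff]
                ^ tbl[1][(hi >> 16) & 0xff] ^ tbl[0][hi >> 24];
            p += 8;
            len -= 8;
        }
        while (len--)
            crc = (crc >> 8) ^ tbl[0][(crc ^ *p++) & 0xff];
        return ~crc;
    }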
11. A single 64-bit fold
[Diagram: a single 64-bit fold. The head words A(x) and B(x) of the message M(x) are carry-less multiplied by the precomputed constants (x^64 mod P(x)) and (x^96 mod P(x)) respectively; the two 64-bit products are XORed into the following data, giving a shorter message M'(x) with CRC(M'(x)) = CRC(M(x)).]
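In plain C, one fold step can be modelled like this. This is a sketch of the maths above, not the NEON code: a bitwise carry-less multiply stands in for the polynomial multiply, and the two fold constants are left as parameters since their exact values depend on the polynomial and bit-order convention.

    #include <stdint.h>

    /* 32x32 -> 64-bit carry-less (polynomial) multiply; a plain-C
       stand-in for NEON polynomial multiplication. */
    static uint64_t clmul32(uint32_t a, uint32_t b)
    {
        uint64_t r = 0;
        for (int i = 0; i < 32; i++)
            if ((b >> i) & 1)
                r ^= (uint64_t)a << i;
        return r;
    }

    /* One fold: the head words a and b (sitting 64 and 96 bits above
       the tail) are multiplied by the precomputed constants
       k64 = x^64 mod P(x) and k96 = x^96 mod P(x), and the products
       are XORed into the next 64 bits. The message shrinks; the CRC
       is preserved. */
    static uint64_t fold64(uint32_t a, uint32_t b, uint64_t next64,
                           uint32_t k64, uint32_t k96)
    {
        return clmul32(a, k64) ^ clmul32(b, k96) ^ next64;
    }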
12. NEON implementations
With ARMv7 NEON we can only perform polynomial multiplication on 8-bit lanes:
We need to be able to multiply at least 32 bits, thus multiple vmull.p8s were chained together to achieve this (as sketched below).
There are 16 vmull.p8s per fold, so there is a little register pressure!
A gcc intrinsic version has been coded up:
It is considerably simpler to look at.
Unfortunately, it runs a little slower than the hand-optimised assembler.
A test case was sent to the Linaro Toolchain Working Group a couple of weeks ago and is being analysed by them.
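To show why the chaining is needed (again a sketch, not the actual NEON kernel): one lane of vmull.p8 is an 8x8-bit carry-less multiply, and a 32x32-bit multiply can be built schoolbook-style from 4x4 = 16 such partial products, which is where the 16 vmull.p8s per fold come from.

    #include <stdint.h>

    /* 8x8 -> 16-bit carry-less multiply: what a single vmull.p8 lane
       computes. */
    static uint16_t clmul8(uint8_t a, uint8_t b)
    {
        uint16_t r = 0;
        for (int i = 0; i < 8; i++)
            if ((b >> i) & 1)
                r ^= (uint16_t)(a << i);
        return r;
    }

    /* 32x32 -> 64-bit carry-less multiply from 16 byte-sized partial
       products, XOR-accumulated at the right byte offsets: the same
       chaining the NEON code performs across vector registers. */
    static uint64_t clmul32_chained(uint32_t a, uint32_t b)
    {
        uint64_t r = 0;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                r ^= (uint64_t)clmul8((uint8_t)(a >> 8 * i),
                                      (uint8_t)(b >> 8 * j)) << (8 * (i + j));
        return r;
    }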
13. Replacing PureJavaCrc32C
PureJavaCrc32C has two noteworthy methods:
public void update(byte[] b, int off, int len) – called mostly for lengths
of 512.
public void update(int b) – seldom called.
I created a new class implementing the Checksum interface
that:
Used the same implementation of update(int b) as PureJavaCrc32C.
Called straight into JNI for update(byte[] b, int off, int len).
The class name NativeCrc32 was already taken by
something else, so I chose a rather silly temporary name:
HyperCrc32C
14. Dealing with byte[] in JNI
We are only reading from the byte[] array, and only for a very short
time.
Thus I pinned the buffer in memory with:
GetPrimitiveArrayCritical
Then subsequently released the buffer with
ReleasePrimitiveArrayCritical(..., JNI_ABORT)
This worked for me, but perhaps a better long term solution would
be to change the backing data type to a ByteBuffer?
We could also be clever and change the alignment of these?
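A minimal sketch of what such a JNI entry point can look like (hypothetical names, not the actual patch, and a simplified shape that returns the updated CRC rather than storing it in a field as a real Checksum implementation would):

    #include <jni.h>
    #include <stddef.h>
    #include <stdint.h>

    /* The NEON-accelerated CRC kernel, assumed to live elsewhere. */
    extern uint32_t crc32c_neon(uint32_t crc, const uint8_t *buf, size_t len);

    /* Hypothetical native backend for HyperCrc32C.update(byte[], int, int).
       GetPrimitiveArrayCritical pins the array (or hands back a copy) for
       a short critical section; releasing with JNI_ABORT says "read-only,
       nothing to copy back". No other JNI calls may be made while pinned. */
    JNIEXPORT jint JNICALL
    Java_HyperCrc32C_update0(JNIEnv *env, jclass cls,
                             jint crc, jbyteArray b, jint off, jint len)
    {
        jbyte *buf = (*env)->GetPrimitiveArrayCritical(env, b, NULL);
        if (buf == NULL)
            return crc;  /* OutOfMemoryError is already pending */

        jint out = (jint)crc32c_neon((uint32_t)crc,
                                     (const uint8_t *)buf + off,
                                     (size_t)len);

        (*env)->ReleasePrimitiveArrayCritical(env, b, buf, JNI_ABORT);
        return out;
    }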
Rather than test every optimisation individually in this talk, I am going to put them all together.
15. TestDFSIO read & write CPU usage
[Charts: CPU-usage breakdown for the plain and new TestDFSIO write runs, and the plain and new TestDFSIO read runs. Segments: Other, Java, memcpy, crc.]
16. Some Analysis of the New TestDFSIO runs
The namenode samples are unchanged.
For the write path:
Most of the time is now spent running native code rather than Java
code.
There is a noticeable reduction in Hadoop user CPU usage.
For the read path:
Most of the time is still spent running native code.
There is again a reduction in Hadoop user CPU usage.
For both the read and write paths there is a significant
amount of CPU time spent copying memory around.
17. Conclusions and Further Work
The CPU usage required for TestDFSIO runs has been reduced for both the read and write paths.
This gives us more CPU cycles to run Hadoop jobs with!
Hadoop is known to be very sensitive to underlying disk IO.
To optimise HDFS IO, it would make sense to optimise disk/filesystem
IO as much as possible and re-run these benchmarks.
As Hadoop runs under Java:
It makes sense to keep track of JVMs.
A beta hard-float JVM has been released.
The CPU usage for memcpy is making me uneasy!
26. More about Linaro Connect: www.linaro.org/connect/
More about Linaro: www.linaro.org/about/
More about Linaro engineering: www.linaro.org/engineering/