This document summarizes the results of profiling a Hadoop cluster to analyze infrastructure needs. Key findings include:
- The I/O and network subsystems were underutilized: I/O ran at roughly 10% of capacity and the network at less than 25%.
- CPU utilization was high with low I/O wait, indicating the hardware was well matched to the workload. A high percentage of memory was used for caching, reducing I/O waits.
- Initial HBase testing showed that two YCSB workloads could not fully drive the CPU. Performance was good while the data fit in cache but dropped once the data exceeded the cache size.
Optimizing your Infrastructure and Operating System for Hadoop (DataWorks Summit)
Apache Hadoop is clearly one of the fastest-growing big data platforms for storing and analyzing arbitrarily structured data in search of business insights. However, applicable commodity infrastructure has advanced greatly in recent years, and there is not a lot of accurate, current information to assist the community in optimally designing and configuring Hadoop platforms (infrastructure and O/S). In this talk we'll present guidance on Linux and infrastructure deployment, configuration, and optimization from both Red Hat and HP (derived from actual performance data), for clusters optimized for single workloads or for balanced clusters that host multiple concurrent workloads.
2. Agenda – Clearing up common misconceptions
Web-scale Hadoop origins:
- Single/dual socket, 1+ GHz
- 4-8 GB RAM
- 2-4 cores
- 1 x 1GbE NIC
- (2-4) x 1 TB SATA drives
Commodity in 2013:
- Dual socket, 2+ GHz
- 24-48 GB RAM
- 4-6 cores
- (2-4) x 1GbE NICs
- (4-14) x 2 TB SATA drives
The enterprise perspective is also different.
3. You can quantify what is right for you
Balancing performance and storage capacity with price.
[Diagram: a triangle trading off Price, Performance, and Storage Capacity]
4. “We’ve profiled our Hadoop applications so we know what type of infrastructure we need”
Said no-one. Ever.
5. Profiling your Hadoop Cluster
High-level design goals
It’s pretty simple:
1) Instrument the cluster
2) Run your workloads
3) Analyze the numbers
Don’t do paper exercises. Hadoop has a way of blowing all your hypotheses out of the water.
Let’s walk through a 10 TB TeraSort on a full 42u rack:
- 2 x HP DL360p (JobTracker and NameNode)
- 18 x HP DL380p Hadoop slaves (18 maps, 12 reducers), each with 64 GB RAM, dual 6-core Intel 2.9 GHz CPUs, 2 x HP P420 Smart Array controllers, 16 x 1TB SFF disks, and 4 x 1GbE bonded NICs
6. Instrumenting the Cluster
The key is to capture the data. Use whatever framework you’re comfortable with.
Analysis using the Linux SAR tool:
- An outer script starts the SAR gathering scripts on each node, then starts the Hadoop job.
- The SAR scripts on each node gather I/O, CPU, memory, and network metrics for that node for the duration of the job.
- Upon completion, the SAR data is converted to CSV and loaded into MySQL so we can do ad-hoc analysis of the data.
- Aggregations/summations are done via SQL.
- Excel is used to generate charts.
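A minimal sketch of that outer script, assuming passwordless ssh to each node and sysstat installed everywhere; the node names, file paths, and job invocation are illustrative:

#!/bin/bash
# Start a sar collector on each node, writing binary data to sar_test.dat.
# 30-second samples match the data points in the charts that follow.
NODES="wkr01 wkr02 wkr03"   # illustrative node names
for node in $NODES; do
  ssh "$node" "nohup sar -o sar_test.dat 30 >/dev/null 2>&1 &"
done
# Run the workload being profiled (illustrative TeraSort invocation).
hadoop jar hadoop-examples.jar terasort /teragen/in /terasort/out
# Stop the collectors once the job completes.
for node in $NODES; do
  ssh "$node" "pkill -f 'sar -o sar_test.dat'"
done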
7. Examples
Performed for each node; results are copied to a repository node. Each file name is prefixed with the node name (i.e., “wkrnn”):
ssh wkr01 sadf -d sar_test.dat -- -u > wkr01_cpu_util.csv
ssh wkr01 sadf -d sar_test.dat -- -b > wkr01_io_rate.csv
ssh wkr01 sadf -d sar_test.dat -- -n DEV > wkr01_net_dev.csv
ssh wkr02 sadf -d sar_test.dat -- -u > wkr02_cpu_util.csv
ssh wkr02 sadf -d sar_test.dat -- -b > wkr02_io_rate.csv
ssh wkr02 sadf -d sar_test.dat -- -n DEV > wkr02_net_dev.csv
…
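These per-node commands are easy to script; a sketch that collects every node and loads the CPU CSVs into MySQL for ad-hoc SQL analysis (the database and table names are illustrative, and note that sadf -d emits semicolon-separated fields):

for node in wkr01 wkr02 wkr03; do
  ssh "$node" sadf -d sar_test.dat -- -u > "${node}_cpu_util.csv"
  ssh "$node" sadf -d sar_test.dat -- -b > "${node}_io_rate.csv"
  ssh "$node" sadf -d sar_test.dat -- -n DEV > "${node}_net_dev.csv"
done
for f in wkr*_cpu_util.csv; do
  # IGNORE 1 LINES skips the "# hostname;interval;timestamp;..." header.
  mysql profiling -e "LOAD DATA LOCAL INFILE '$f' INTO TABLE cpu_util FIELDS TERMINATED BY ';' IGNORE 1 LINES;"
done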
8. I/O Subsystem Test Chart
The I/O subsystem reached only around 10% of total throughput utilization.
Run dd tests first to understand the read and write throughput capabilities of your I/O subsystem.
[Chart: aggregate disk throughput over the run; X axis is time, Y axis is MB per second]
TeraSort on this server design is not I/O bound: 1.6 GB/s is the upper bound, and utilization stayed under 10% of that.
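A sketch of the dd baseline, run against each data disk in turn; the mount point is illustrative, and direct I/O is used so the page cache does not inflate the numbers:

# Write throughput: stream 32 GB to one data disk, bypassing the page cache.
dd if=/dev/zero of=/data01/ddtest bs=1M count=32768 oflag=direct
# Read throughput: read the same file back with direct I/O.
dd if=/data01/ddtest of=/dev/null bs=1M iflag=direct
# Summing the per-disk results across all 16 disks gives the aggregate
# upper bound (about 1.6 GB/s for this server design).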
9. Network Subsystem Test Chart
Network throughput utilization per server was less than a quarter of capacity.
Each 1GbE NIC can drive up to 1 Gb/s for reads and 1 Gb/s for writes, so the four bonded NICs together provide roughly 4 Gb/s, or about 400 MB/s, in each direction.
[Chart: per-server network throughput; X axis is elapsed time, Y axis is throughput in MB/sec]
Rx = received MB/sec, Tx = transmitted MB/sec, Tot = total MB/sec
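To compare measured traffic against that bound, the rx and tx columns from the sadf network CSV can be summed across the bonded NICs; a sketch, assuming the stock sadf -d -- -n DEV layout (timestamp in field 3, IFACE in field 4, rxkB/s and txkB/s in fields 7 and 8):

# Total MB/sec per sample across all eth interfaces.
awk -F';' '$4 ~ /^eth/ { tot[$3] += ($7 + $8) / 1024 }
END { for (t in tot) printf "%s %.1f MB/s\n", t, tot[t] }' wkr01_net_dev.csv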
10. CPU Subsystem Test Chart
CPU utilization is high and I/O wait is low: where you want to be.
[Chart: CPU utilization over the run; each data point is taken every 30 seconds; X axis is the timeline of the run, Y axis is percent utilization of CPUs]
CPU utilization is captured by analyzing how busy each core is and averaging across cores.
I/O wait (1 of 10 other SAR CPU metrics) measures the percentage of time the CPU is waiting on I/O.
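The %iowait figure comes straight out of the CPU CSV collected earlier; a sketch, assuming the stock sadf -d -- -u layout where field 4 is the CPU number (-1 for the all-CPU aggregate) and field 8 is %iowait:

# Average %iowait across the run, using the aggregate all-CPU samples.
awk -F';' '$4 == "-1" { sum += $8; n++ }
END { printf "average iowait: %.1f%%\n", sum / n }' wkr01_cpu_util.csv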
11. Memory Subsystem Test Chart
A high ratio of cache to total memory shows the CPU is not waiting on memory.
[Chart: memory utilization over the run; X axis is elapsed time, Y axis is memory utilization]
- Percent cached is a SAR memory metric (kbcached). It tells you how many blocks of memory are held in the Linux page cache, i.e. the amount of memory the kernel uses to cache data.
- This means we’re doing a good job of caching data, so the JVM is not having to do I/Os and incur I/O wait time (reflected in the previous slide).
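The cache ratio can be computed the same way from the SAR memory metrics; a sketch, assuming the stock sadf -d -- -r layout where kbmemfree, kbmemused, and kbcached are fields 4, 5, and 8:

# Percent of total memory held in the Linux page cache, per sample.
ssh wkr01 sadf -d sar_test.dat -- -r | awk -F';' '
NR > 1 { printf "%s %.1f%% cached\n", $3, 100 * $8 / ($4 + $5) }'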
12. Tuning from your Server Configuration baseline
We’ve now established that the server configuration we used works pretty well. But what if you want to tune it, i.e. sacrifice some performance for cost, or remove an I/O, network, or memory limitation?
I/O subsystem:
- Types of disks/controllers
- Number of disks/controllers
Compute:
- CPU type (Socket R 130 W vs. Socket B 95 W)
- Number of cores
- Number of memory channels
- 4 GB vs. 8 GB DIMMs
Data center:
- Floor space (rack density)
- Power and cooling constraints
Network:
- Type of network (1 or 10GbE)
- Number of server NICs
- Switch port availability
- Deep buffering
Annual power and cooling costs of a cluster are 1/3 the cost of acquiring the cluster.
14. What you really want is a shared service...
The Server Building Blocks
NameNode and JobTracker configurations:
- 1u, 64 GB of RAM, 2 x 2.9 GHz 6-core Intel Socket R processors, 4 small form factor disks in a RAID configuration, 1 disk controller
Hadoop slave configurations:
- 2u, 48 GB of RAM, 2 x 2.4 GHz 6-core Intel Socket B processors, 1 high-performing disk controller with twelve 2 TB large form factor data disks in a JBOD configuration
- 24 TB of storage capacity, low-power-consumption CPUs
15. Single Rack Config
Single rack configuration: one 42u rack enclosure.
This rack is the initial building block: it configures all the key management services for a production scale-out cluster of up to 4,000 nodes.
- 2 x 1GbE TOR switches (1u)
- 1 x Management node (1u, 2 sockets, 4 disks)
- 1 x JobTracker node
- 1 x NameNode
- (3-18) x Worker nodes (2u, 2 sockets, 12 disks)
- Open slot for a KVM switch
- Open 1u
16. Multi-Rack Config
Scale-out configuration: 1 or more 42u racks.
This is the extension building block: add one or more racks of this block to the single-rack configuration to build out a cluster that scales to 4,000 nodes.
- 2 x 10GbE aggregation switches (1u)
- 2 x 1GbE TOR switches (1u)
- 19 x Worker nodes (2u, 2 sockets, 12 disks)
- Open slot for a KVM switch
- Open 2u
17. System on a Chip and Hadoop
Intel Atom and ARM processor server cartridges: Hadoop has come full circle.
- 4-8 GB RAM
- 4 cores at 1.8 GHz
- 2-4 TB of storage
- 1GbE NICs
- Amazing density advances: 270+ servers in a rack (compared to 21!).
- Trevor Robinson from Calxeda is a member of our community working on this. He’d love some help: trevor.robinson@calxeda.com
18. Thanks for attending
Thanks to our partner HP for letting me present their findings. If you’re interested in learning more about these findings, please check out hp.com/go/hadoop and see detailed reference architectures and varying server configurations for Hortonworks, MapR, and Cloudera.
Steve Watt swatt@redhat.com @wattsteve
19. HBase Testing – Results still TBD
YCSB workloads A and C were unable to drive the CPU.
Two workloads tested:
- YCSB workload “C”: 100% read-only requests, random access to the DB, using a unique primary key value
- YCSB workload “A”: 50% reads and 50% updates, random access to the DB, using a unique primary key value
[Charts: CPU utilization, network usage, disk usage, and memory usage over the roughly 21-minute run; X axis is minutes, Y axis is percent utilization or MB/sec]
- When data is cached in the HBase cache, performance is really good and we can occupy the CPU. Ops throughput (and hence CPU utilization) dwindles when data exceeds the HBase cache but is still available in the Linux cache. Resolution TBD; related JIRA: HDFS-2246.
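For reference, a sketch of how the two workloads above can be driven against HBase with YCSB; the binding name and parameters vary by YCSB version, and the column family and counts here are illustrative (the usertable must be created in HBase beforehand):

# Load the dataset once.
bin/ycsb load hbase -P workloads/workloada -p columnfamily=family -p recordcount=100000000
# Workload C: 100% reads, random access by unique primary key.
bin/ycsb run hbase -P workloads/workloadc -p columnfamily=family -p operationcount=100000000
# Workload A: 50% reads / 50% updates, random access by unique primary key.
bin/ycsb run hbase -P workloads/workloada -p columnfamily=family -p operationcount=100000000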
Editor's Notes
The average enterprise customer runs 1 or 2 racks in production. In that scenario, you have redundant switches in each rack, you RAID-mirror the O/S and Hadoop runtime, and you JBOD the data drives, because losing racks and servers is costly. Very different from the web-scale perspective.
Hadoop slave server configurations are about balancing the following: performance (keeping the CPU as busy as possible), storage capacity (important for HDFS), and price (managing the above at a price you can afford). Commodity servers do not mean cheap servers. At scale you want to fully optimize performance and storage capacity for your workloads to keep costs down.
So, as any good infrastructure designer will tell you, you begin by profiling your Hadoop applications. But generally speaking, you can design a pretty decent cluster infrastructure baseline that you can later optimize, provided you get these things right.
In reality, you don’t want Hadoop clusters popping up all over the place in your company; that becomes untenable to manage. You really want a single large cluster managed as a service. If that’s the case, then you don’t want to optimize your cluster for just one workload; you want to optimize it for multiple concurrent workloads (dependent on the scheduler) and have a balanced cluster.