Acceleration for Big Data, Hadoop and
               Memcached
A Presentation at HPC Advisory Council Workshop, Lugano 2012
                              by



                  Dhabaleswar K. (DK) Panda
                   The Ohio State University
               E-mail: panda@cse.ohio-state.edu
             http://www.cse.ohio-state.edu/~panda
Recap of the Last Two Days' Presentations
• MPI is a dominant programming model for HPC Systems
• Introduced some of the MPI Features and their Usage
• Introduced MVAPICH2 stack
• Illustrated many performance optimizations and tuning techniques for
  MVAPICH2
• Provided an overview of MPI-3 Features
• Introduced challenges in designing MPI for Exascale systems
• Presented approaches being taken by MVAPICH2 for Exascale systems




HPC Advisory Council, Lugano Switzerland '12                             2
High-Performance Networks in the Top500




                        Percentage share of InfiniBand is steadily increasing


HPC Advisory Council, Lugano Switzerland '12                                    3
Use of High-Performance Networks for Scientific
Computing
• The OpenFabrics software stack, with IB, iWARP and RoCE
  interfaces, is driving HPC systems
• Message Passing Interface (MPI)
• Parallel File Systems
• Almost 11.5 years of Research and Development since
  InfiniBand was introduced in October 2000
• Other Programming Models are emerging to take
  advantage of High-Performance Networks
      – UPC
      – SHMEM


HPC Advisory Council, Lugano Switzerland '12            4
One-way Latency: MPI over IB

[Charts: Small Message Latency and Large Message Latency, Latency (us) vs. Message Size (bytes), comparing MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR. Annotated small-message latencies: 1.82, 1.66, 1.64, 1.56 and 0.81 us.]

                                    DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
                                    FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 without IB switch

                 HPC Advisory Council, Lugano Switzerland '12                                                                5
Bandwidth: MPI over IB

[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth, Bandwidth (MBytes/sec) vs. Message Size (bytes), for the same five MVAPICH configurations. Annotated peak unidirectional bandwidths: 6333, 3385, 3280, 1917 and 1706 MBytes/sec; bidirectional: 11043, 6521, 4407, 3704 and 3341 MBytes/sec.]

                                            DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
                                            FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 without IB switch

                         HPC Advisory Council, Lugano Switzerland '12                                                                                 6
Large-scale InfiniBand Installations

• 209 IB Clusters (41.8%) in the November‘11 Top500 list
    (http://www.top500.org)
• Installations in the Top 30 (13 systems):

 120,640 cores (Nebulae) in China (4th)
 73,278 cores (Tsubame-2.0) in Japan (5th)
 111,104 cores (Pleiades) at NASA Ames (7th)
 138,368 cores (Tera-100) in France (9th)
 122,400 cores (RoadRunner) at LANL (10th)
 137,200 cores (Sunway Blue Light) in China (14th)
 46,208 cores (Zin) at LLNL (15th)
 33,072 cores (Lomonosov) in Russia (18th)
 29,440 cores (Mole-8.5) in China (21st)
 42,440 cores (Red Sky) at Sandia (24th)
 62,976 cores (Ranger) at TACC (25th)
 20,480 cores (Bull Benchmarks) in France (27th)
 20,480 cores (Helios) in Japan (28th)
 More are getting installed!



HPC Advisory Council, Lugano Switzerland '12                                                           7
Enterprise/Commercial Computing

• Focuses on big data and data analytics
• Multiple environments and middleware are gaining
  momentum
      – Hadoop (HDFS, HBase and MapReduce)
      – Memcached




HPC Advisory Council, Lugano Switzerland '12         8
Can High-Performance Interconnects Benefit Enterprise
Computing?
• Most of the current enterprise systems use 1GE
• Concerns for performance and scalability
• Usage of High-Performance Networks is beginning to draw
  interest
     – Oracle, IBM, Google are working along these directions
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar
  to the designs adopted for MPI)?



 HPC Advisory Council, Lugano Switzerland '12                     9
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         10
Memcached Architecture
[Diagram: Web Frontend Servers (Memcached Clients) connect over High Performance Networks to a tier of Memcached Servers (each with CPUs, main memory, SSD and HDD), which in turn connect over High Performance Networks to Database Servers.]

  • Integral part of Web 2.0 architecture
  • Distributed Caching Layer
           – Aggregates spare memory from multiple nodes
           – General purpose
  • Typically used to cache database queries and results of API calls (see the client sketch below)
  • Scalable model, but typical usage is very network intensive
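  As a minimal illustration of this cache-aside usage pattern, here is a sketch using the spymemcached
  Java client (the experiments later in this talk use the libmemcached C client; host name, key and
  expiry value are illustrative):

  import java.net.InetSocketAddress;
  import net.spy.memcached.MemcachedClient;

  public class CacheAsideSketch {
      public static void main(String[] args) throws Exception {
          // Connect to one memcached server; real deployments pass the full server list.
          MemcachedClient cache = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));

          String key = "user:42:profile";
          Object value = cache.get(key);              // one network round trip to the caching tier
          if (value == null) {
              value = "result of an expensive database query or API call";
              cache.set(key, 300, value);             // cache for 300 seconds
          }
          System.out.println(value);
          cache.shutdown();
      }
  }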

    HPC Advisory Council, Lugano Switzerland '12                                                                                                                   11
Hadoop Architecture

• Underlying Hadoop Distributed
  File System (HDFS)
• Fault-tolerance by replicating
  data blocks
• NameNode: stores information
  on data blocks
• DataNodes: store blocks and host MapReduce computation (see the block-location sketch below)
• JobTracker: tracks jobs and detects failures
• Model scales, but there is a high amount of communication during intermediate phases
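  For reference, a minimal sketch of how an HDFS client sees this split of roles, using the Hadoop
  FileSystem API (the file path is illustrative): the NameNode answers the metadata query, and the
  hosts returned for each block are the DataNodes that store it.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocationsSketch {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());   // reads core-site.xml / hdfs-site.xml
          FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));

          // Metadata comes from the NameNode; the blocks themselves live on the DataNodes
          // listed for each block (one entry per replica).
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation b : blocks) {
              System.out.println("offset " + b.getOffset() + " length " + b.getLength()
                      + " hosts " + java.util.Arrays.toString(b.getHosts()));
          }
          fs.close();
      }
  }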

  HPC Advisory Council, Lugano Switzerland '12   12
Network-Level Interaction Between Clients and Data
Nodes in HDFS


[Diagram: multiple HDFS Clients connect over High Performance Networks to HDFS Data Nodes, each backed by HDD/SSD storage.]




 HPC Advisory Council, Lugano Switzerland '12                                       13
Overview of HBase Architecture

• An open-source database project based on the Hadoop framework for hosting very large tables

• Major components: HBaseMaster, HRegionServer and HBaseClient (a minimal client sketch follows below)

• HBase and HDFS are
  deployed in the same
  cluster to get better
  data locality
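
  A minimal sketch of the HBaseClient side, assuming the HBase 0.90 client API used later in this talk
  (table and column-family names are illustrative and must already exist):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseClientSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();      // picks up hbase-site.xml
          HTable table = new HTable(conf, "usertable");

          // Put: the client locates the owning HRegionServer and sends the Key/Value pair to it.
          byte[] row = Bytes.toBytes("user0000000000000001");    // the experiments use 20-byte keys
          Put put = new Put(row);
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), new byte[1024]);   // 1KB value
          table.put(put);

          // Get: served by the RegionServer from its memstore/HFiles (HDFS underneath).
          Result result = table.get(new Get(row));
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
          System.out.println("Read " + value.length + " bytes");
          table.close();
      }
  }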



  HPC Advisory Council, Lugano Switzerland '12                                                  14
Network-Level Interaction Between
HBase Clients, Region Servers and Data Nodes



[Diagram: HBase Clients connect over High Performance Networks to HRegion Servers, which connect over High Performance Networks to Data Nodes backed by HDD/SSD storage.]




 HPC Advisory Council, Lugano Switzerland '12                                                             15
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         16
Designing Communication and I/O Libraries for
Enterprise Systems: Challenges


[Layered diagram:
  Applications
  Datacenter Middleware (HDFS, HBase, MapReduce, Memcached)
  Programming Models (Socket)
  Communication and I/O Library: Point-to-Point Communication; Threading Models and Synchronization; I/O and Filesystems; QoS; Fault Tolerance
  Commodity Computing System: Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs); Architectures (single, dual, quad, ..); Multi/Many-core Architecture and Accelerators; Storage Technologies (HDD or SSD)]



 HPC Advisory Council, Lugano Switzerland '12                                                           17
Common Protocols using Open Fabrics

[Table: common protocols using OpenFabrics. The application interface is either Sockets or Verbs.
  1/10/40 GigE:   Sockets over kernel-space TCP/IP (Ethernet driver); Ethernet adapter; Ethernet switch
  IPoIB:          Sockets over kernel-space TCP/IP (IPoIB driver); InfiniBand adapter; InfiniBand switch
  10/40 GigE-TOE: Sockets over hardware-offloaded TCP/IP; Ethernet adapter; Ethernet switch
  SDP:            Sockets over RDMA; InfiniBand adapter; InfiniBand switch
  iWARP:          Verbs, user space; iWARP adapter; Ethernet switch
  RoCE:           Verbs over RDMA, user space; RoCE adapter; Ethernet switch
  IB Verbs:       Verbs over RDMA, user space; InfiniBand adapter; InfiniBand switch]


            HPC Advisory Council, Lugano Switzerland '12                                                                 18
Can New Data Analysis and Management Systems be
  Designed with High-Performance Networks and Protocols?

[Diagram: three design stacks.
  Current Design:   Application -> Sockets -> 1/10 GigE Network
  Enhanced Designs: Application -> Accelerated Sockets (Verbs / Hardware Offload) -> 10 GigE or InfiniBand
  Our Approach:     Application -> OSU Design (Verbs Interface) -> 10 GigE or InfiniBand]

• Sockets not designed for high-performance
    – Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)
    – Zero-copy not available for non-blocking sockets


   HPC Advisory Council, Lugano Switzerland '12                                                   19
Interplay between Storage and Interconnect/Protocols



• Most of the current generation enterprise systems use the
  traditional hard disks
• Since hard disks are slower, high-performance communication
  protocols may have limited impact
• SSDs and other storage technologies are emerging
• Does it change the landscape?




 HPC Advisory Council, Lugano Switzerland '12                  20
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         21
Memcached Design Using Verbs

[Diagram: a Sockets Client and an RDMA Client connect to the Memcached server. A Master Thread accepts each connection (1) and assigns it (2) to a Sockets Worker Thread or a Verbs Worker Thread; all worker threads operate on the shared data (memory, slabs, items, ...).]



•   Server and client perform a negotiation protocol
     – Master thread assigns clients to the appropriate worker thread
•   Once a client is assigned a verbs worker thread, it can communicate directly and is
    “bound” to that thread; each verbs worker thread can support multiple clients
•   All other Memcached data structures are shared among RDMA and Sockets worker
    threads
•   Memcached applications need not be modified; the verbs interface is used if available
•   The Memcached server can serve both socket and verbs clients simultaneously


     HPC Advisory Council, Lugano Switzerland '12                                         22
Experimental Setup
• Hardware
     – Intel Clovertown
            • Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
              6 GB main memory, 250 GB hard disk
            • Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

     – Intel Westmere
            • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
              12 GB main memory, 160 GB hard disk
            • Network: 1GigE, IPoIB, and IB (QDR)

• Software
     – Memcached Server: 1.4.9
     – Memcached Client: (libmemcached) 0.52
     – In all experiments, ‘memtable’ is contained in memory (no disk
       access involved)

 HPC Advisory Council, Lugano Switzerland '12                                            23
Memcached Get Latency (Small Message)

[Charts: Memcached Get latency, Time (us) vs. Message Size (1 byte to 2K), comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB.
 Left: Intel Clovertown Cluster (IB: DDR). Right: Intel Westmere Cluster (IB: QDR).]

                  • Memcached Get latency
                           – 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
                           – 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
                  • Almost factor of four improvement over 10GE (TOE) for 2K bytes on
                    the DDR cluster
                  HPC Advisory Council, Lugano Switzerland '12                                                                                                 24
Memcached Get Latency (Large Message)

[Charts: Memcached Get latency, Time (us) vs. Message Size (2K to 512K), comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB.
 Left: Intel Clovertown Cluster (IB: DDR). Right: Intel Westmere Cluster (IB: QDR).]

                   • Memcached Get latency
                          – 8K bytes RC/UD – DDR: 18.9/19.1 us; QDR: 11.8/12.2 us
                          – 512K bytes RC/UD -- DDR: 369/403 us; QDR: 173/203 us
                   • Almost factor of two improvement over 10GE (TOE) for 512K bytes on
                     the DDR cluster
                   HPC Advisory Council, Lugano Switzerland '12                                                                                                   25
Memcached Get TPS (4byte)
[Charts: Thousands of Memcached Get Transactions per second (TPS) for 4-byte messages vs. number of clients, comparing SDP, IPoIB, 1GigE, OSU-RC-IB and OSU-UD-IB. Left: 4 and 8 clients. Right: 1 to 1K clients.]


                                                    • Memcached Get transactions per second for 4 bytes
                                                           – On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients
                                                    • Significant improvement with native IB QDR compared to SDP and IPoIB



                                                    HPC Advisory Council, Lugano Switzerland '12                                                                                                                      26
Memcached - Memory Scalability
[Chart: Memory Footprint (MB) vs. number of clients (1 to 4K), comparing SDP, IPoIB, 1GigE, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB.]

• Steady memory footprint for the UD design
     – ~ 200 MB

• RC memory footprint increases with the number of clients (see the estimate below)
     – ~ 500 MB for 4K clients
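
A rough per-connection estimate derived from these two curves (a back-of-the-envelope figure, not a measured per-connection cost):

    \frac{500\,\mathrm{MB} - 200\,\mathrm{MB}}{4096\ \text{clients}} \approx 75\ \text{KB of additional memory per RC connection}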




  HPC Advisory Council, Lugano Switzerland '12                                                                                    27
Application Level Evaluation – Olio Benchmark

[Charts: Olio benchmark time (ms) vs. number of clients, comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB. Left: 1 to 8 clients. Right: 64 to 1024 clients.]




             • Olio Benchmark
                     – RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients
              • 4X better than IPoIB for 8 clients
             • Hybrid design achieves comparable performance to that of pure RC design

                  HPC Advisory Council, Lugano Switzerland '12                                                                               28
Application Level Evaluation – Real Application Workloads
[Charts: real application workload time (ms) vs. number of clients, comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB. Left: 1 to 8 clients. Right: 64 to 1024 clients.]

             • Real Application Workload
                      – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
              • 12X better than IPoIB for 8 clients
             • Hybrid design achieves comparable performance to that of pure RC design
                  J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.
                  Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
                  J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on
                  High Performance RDMA Capable Interconnects, CCGrid’12
                   HPC Advisory Council, Lugano Switzerland '12                                                                  29
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         30
HBase Design Using Verbs

[Diagram: Current Design - HBase over Sockets over a 1/10 GigE Network. OSU Design - HBase over a JNI Interface to the OSU Module over InfiniBand (Verbs).]
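
  As a rough illustration of this JNI-bridge pattern, the sketch below shows how such a native module
  could be exposed to Java code; the class, library and method names are hypothetical and are not the
  actual OSU module API.

  // Hypothetical sketch of exposing a sockets-bypassing native module to HBase/HDFS via JNI;
  // class, library and method names are illustrative only.
  public final class IBVerbsBridge {
      static {
          // Loads a native library (e.g., libibverbsbridge.so) that wraps the InfiniBand
          // verbs calls; the real OSU module name may differ.
          System.loadLibrary("ibverbsbridge");
      }

      // Establish an RDMA-capable connection to a peer (hypothetical signature).
      public native long connect(String host, int port);

      // Send/receive byte buffers over verbs instead of java.net sockets.
      public native int send(long connHandle, byte[] data, int offset, int length);
      public native int recv(long connHandle, byte[] buffer, int offset, int length);

      // Tear down the native connection.
      public native void close(long connHandle);
  }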




  HPC Advisory Council, Lugano Switzerland '12                                                       31
Experimental Setup
• Hardware
     – Intel Clovertown
            • Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
              6 GB main memory, 250 GB hard disk
            • Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

     – Intel Westmere
            • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
              12 GB main memory, 160 GB hard disk
            • Network: 1GigE, IPoIB, and IB (QDR)

     – 3 Nodes used
            • Node1 [NameNode & HBase Master]
            • Node2 [DataNode & HBase RegionServer]
            • Node3 [Client]

• Software
     – Hadoop 0.20.0, HBase 0.90.3 and Sun Java SDK 1.7.
     – In all experiments, ‘memtable’ is contained in memory (no disk access
       involved)
 HPC Advisory Council, Lugano Switzerland '12                                            32
Details on Experiments
 • Key/Value size
      – Key size: 20 Bytes
      – Value size: 1KB/4KB
 • Get operation
      – One Key/Value pair is inserted, so that the Key/Value pair stays in memory
      – Get operation is repeated 80,000 times (see the measurement-loop sketch below)
      – Skipped the first 40,000 iterations as warm-up
 • Put operation
      – Memstore_Flush_Size is set to 256 MB
      – No memory flush operation involved
      – Put operation is repeated 40,000 times
      – Skipped the first 10,000 iterations as warm-up
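
  A minimal sketch of this measurement loop, assuming the HBase 0.90 client API (table and column
  names are illustrative):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GetLatencyBench {
      public static void main(String[] args) throws Exception {
          HTable table = new HTable(HBaseConfiguration.create(), "bench");
          Get get = new Get(Bytes.toBytes("row-0"));       // single pre-inserted Key/Value pair
          int total = 80000, warmup = 40000;
          long start = 0;
          for (int i = 0; i < total; i++) {
              if (i == warmup) start = System.nanoTime();  // timing starts after the warm-up phase
              table.get(get);                              // value is served from memory (memstore)
          }
          double avgUs = (System.nanoTime() - start) / 1000.0 / (total - warmup);
          System.out.println("Average Get latency: " + avgUs + " us");
          table.close();
      }
  }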
  HPC Advisory Council, Lugano Switzerland '12                                33
Get Operation (IB:DDR)

[Charts: HBase Get latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB, 10GE and OSU Design.]
• HBase Get Operation
   – 1K bytes – 65 us (15K TPS)
   – 4K bytes -- 88 us (11K TPS)
• Almost factor of two improvement over 10GE (TOE)

            HPC Advisory Council, Lugano Switzerland '12                                                       34
Get Operation (IB:QDR)

[Charts: HBase Get latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB and OSU Design.]



        • HBase Get Operation
           – 1K bytes – 47 us (22K TPS)
           – 4K bytes -- 64 us (16K TPS)
        • Almost factor of four improvement over IPoIB for 1KB

                  HPC Advisory Council, Lugano Switzerland '12                                                      35
Put Operation (IB:DDR)

[Charts: HBase Put latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB, 10GE and OSU Design.]


        • HBase Put Operation
           – 1K bytes – 114 us (8.7K TPS)
           – 4K bytes -- 179 us (5.6K TPS)
        • 34% improvement over 10GE (TOE) for 1KB

                 HPC Advisory Council, Lugano Switzerland '12                                                        36
Put Operation (IB:QDR)

[Charts: HBase Put latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB and OSU Design.]


   • HBase Put Operation
      – 1K bytes – 78 us (13K TPS)
      – 4K bytes -- 122 us (8K TPS)
   • A factor of two improvement over IPoIB for 1KB

                 HPC Advisory Council, Lugano Switzerland '12                                                      37
HBase Put/Get – Detailed Analysis
[Charts: time breakdown (us) of HBase Put 1KB and HBase Get 1KB for 1GigE, IPoIB, 10GigE and OSU-IB, split into Communication, Communication Preparation, Server Processing, Server Serialization, Client Processing and Client Serialization.]
                  • HBase 1KB Put
                     – Communication Time – 8.9 us
                     – A factor of 6X improvement over 10GE for communication time
                  • HBase 1KB Get
                     – Communication Time – 8.9 us
                     – A factor of 6X improvement over 10GE for communication time
                    W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda,
                    Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?,
                    ISPASS’12
                  HPC Advisory Council, Lugano Switzerland '12                                                                                    38
HBase Single Server-Multi-Client Results
[Charts: HBase Get latency, Time (us), and throughput, Ops/sec, vs. number of clients (1 to 16), comparing 1GigE, IPoIB, 10GigE and OSU-IB.]

                  • HBase Get latency
                         – 4 clients: 104.5 us; 16 clients: 296.1 us
                  • HBase Get throughput
                         – 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
                  • 27% improvement in throughput for 16 clients over 10GE
                  HPC Advisory Council, Lugano Switzerland '12                                                                     39
HBase – YCSB Read-Write Workload
[Charts: YCSB Read latency and Write latency, Time (us), vs. number of clients (8 to 128), comparing 1GigE, IPoIB, 10GigE and OSU-IB.]

                    • HBase Get latency (Yahoo! Cloud Serving Benchmark)
                       – 64 clients: 2.0 ms; 128 clients: 3.5 ms
                       – 42% improvement over IPoIB for 128 clients
                    • HBase Put latency
                       – 64 clients: 1.9 ms; 128 clients: 3.5 ms
                       – 40% improvement over IPoIB for 128 clients
                      J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-
                      Performance Design of HBase with RDMA over InfiniBand, IPDPS’12
                   HPC Advisory Council, Lugano Switzerland '12                                                                          40
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         41
Studies and Experimental Setup

• Two Kinds of Designs and Studies we have Done
    – Studying the impact of HDD vs. SSD for HDFS
           • Unmodified Hadoop for experiments
    – Preliminary design of HDFS over Verbs
• Hadoop Experiments
    – Intel Clovertown 2.33GHz, 6GB RAM, InfiniBand DDR, Chelsio T320
    – Intel X-25E 64GB SSD and 250GB HDD
    – Hadoop version 0.20.2, Sun/Oracle Java 1.6.0
    – Dedicated NameServer and JobTracker
    – Number of Datanodes used: 2, 4, and 8



  HPC Advisory Council, Lugano Switzerland '12                                                                        42
Hadoop: DFS IO Write Performance
[Chart: Average Write Throughput (MB/sec) vs. File Size (GB, 1 to 10) with four Data Nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

                                     •      DFS IO, included in Hadoop, measures sequential access throughput
                                     •      We have two map tasks, each writing to a file of increasing size (1-10 GB); a sketch of this write pattern follows below
                                     •      Significant improvement with IPoIB, SDP and 10GigE
                                     •      With SSD, the performance improvement is almost seven- or eight-fold!
                                     •      SSD benefits are not seen without using a high-performance interconnect
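
  A minimal sketch of the sequential-write pattern that DFS IO measures (this is not the TestDFSIO
  benchmark itself; the path, buffer size and file size are illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SequentialWriteSketch {
      public static void main(String[] args) throws Exception {
          long fileSizeBytes = 1L * 1024 * 1024 * 1024;           // 1 GB; grown up to 10 GB in the runs
          FileSystem fs = FileSystem.get(new Configuration());
          byte[] buf = new byte[1024 * 1024];                     // 1 MB write buffer

          long start = System.currentTimeMillis();
          FSDataOutputStream out = fs.create(new Path("/benchmarks/seq-write.dat"), true);
          for (long written = 0; written < fileSizeBytes; written += buf.length) {
              out.write(buf);                                     // sequential append to one HDFS file
          }
          out.close();
          double secs = (System.currentTimeMillis() - start) / 1000.0;
          System.out.printf("Throughput: %.1f MB/sec%n", fileSizeBytes / (1024.0 * 1024.0) / secs);
      }
  }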

                                         HPC Advisory Council, Lugano Switzerland '12                                    43
Hadoop: RandomWriter Performance
[Chart: RandomWriter Execution Time (sec) for 2 and 4 data nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]



• Each map generates 1GB of random binary data and writes to HDFS
• SSD improves execution time by 50% with 1GigE for two DataNodes
• For four DataNodes, benefits are observed only with HPC interconnect
• IPoIB, SDP and 10GigE can improve performance by 59% on four Data Nodes




   HPC Advisory Council, Lugano Switzerland '12                                            44
Hadoop Sort Benchmark
[Chart: Sort Execution Time (sec) for 2 and 4 data nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]


   •                       Sort: baseline benchmark for Hadoop
   •                       Sort phase: I/O bound; Reduce phase: communication bound
   •                       SSD improves performance by 28% using 1GigE with two DataNodes
   •                       Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE

 S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda “Can High-Performance Interconnects Benefit Hadoop
 Distributed File System?”, MASVDC ‘10 in conjunction with MICRO 2010, Atlanta, GA.

                                                                                                           45
  HPC Advisory Council, Lugano Switzerland '12
HDFS Design Using Verbs

Current Design:  HDFS -> Sockets -> 1/10 GigE Network
OSU Design:      HDFS -> JNI Interface -> OSU Module -> InfiniBand (Verbs)
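
The figure only names the layers; as an illustration of what the JNI boundary between the Java HDFS code and a native verbs module could look like, the sketch below uses hypothetical class, method and library names (it is not the actual OSU module API):

// Hypothetical JNI boundary between HDFS (Java) and a native verbs/RDMA
// transport module. All names here are illustrative, not the OSU design.
public class VerbsTransport {
  static {
    System.loadLibrary("osu_hdfs_verbs");   // hypothetical native library
  }

  // Establish an RDMA connection to a DataNode (native side uses IB verbs).
  public native long connect(String host, int port);

  // Send a block buffer over the RDMA channel instead of a TCP socket.
  public native int writeBlock(long connHandle, byte[] buf, int len);

  // Tear down the RDMA connection and release native resources.
  public native void close(long connHandle);
}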




                                                                     46
 HPC Advisory Council, Lugano Switzerland '12
RDMA-based Design for Native HDFS –
Preliminary Results
[Line chart: HDFS file write time vs. file size (1–5 GB) for 1GigE, IPoIB, 10GigE and the OSU verbs-based design]
• HDFS file write experiment using four DataNodes on an IB-DDR cluster
• HDFS file write time (a measurement sketch follows below)
      – 2 GB – 14 s, 5 GB – 86 s
      – For the 5 GB file size: 20% improvement over IPoIB, 14% improvement over 10GigE
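
A measurement of this kind can be approximated with the stock HDFS client API alone; the sketch below (paths, buffer size and the 1–5 GB sweep are illustrative) times one sequential file write per size:

// Sketch: time sequential HDFS file writes of 1-5 GB via the FileSystem API.
// Paths and buffer size are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTimer {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[4 * 1024 * 1024];              // 4 MB write buffer
    for (int gb = 1; gb <= 5; gb++) {
      long bytes = (long) gb << 30;
      long start = System.currentTimeMillis();
      FSDataOutputStream out = fs.create(new Path("/bench/write-" + gb + "g"));
      for (long w = 0; w < bytes; w += buf.length) {
        out.write(buf);                                  // data flows to the DataNodes over the interconnect
      }
      out.close();                                       // finish the write before stopping the clock
      System.out.println(gb + " GB written in " + (System.currentTimeMillis() - start) + " ms");
    }
    fs.close();
  }
}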
HPC Advisory Council, Lugano Switzerland '12                                    47
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         48
Concluding Remarks

• InfiniBand with its RDMA feature is gaining momentum in HPC systems, delivering the best performance and seeing growing usage
• It is possible to use the RDMA feature in enterprise environments
  for accelerating big data processing
• Presented some initial designs and performance numbers
• Many open research challenges remain to be solved so that
  middleware for enterprise environments can take advantage of
    – modern high-performance networks
    – multi-core technologies
    – emerging storage technologies



  HPC Advisory Council, Lugano Switzerland '12                    49
Designing Communication and I/O Libraries for
Enterprise Systems: Solved a Few Initial Challenges


                                Applications

                 Datacenter Middleware (HDFS, HBase, MapReduce, Memcached)

                           Programming Models (Socket)

                          Communication and I/O Library
        Point-to-Point Communication, Threading Models and Synchronization,
        I/O and Filesystems, QoS, Fault Tolerance

                           Commodity Computing System
        Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs),
        Architectures (single, dual, quad, ...), Multi/Many-core Architecture and Accelerators,
        Storage Technologies (HDD or SSD)

 HPC Advisory Council, Lugano Switzerland '12                                                           50
Web Pointers


                            http://www.cse.ohio-state.edu/~panda
                               http://nowlab.cse.ohio-state.edu

                                       MVAPICH Web Page
                                http://mvapich.cse.ohio-state.edu




                                         panda@cse.ohio-state.edu




HPC Advisory Council, Lugano Switzerland '12                        51
