1. Acceleration for Big Data, Hadoop and
Memcached
A Presentation at HPC Advisory Council Workshop, Lugano 2012
by
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
2. Recap of the Last Two Days' Presentations
• MPI is a dominant programming model for HPC Systems
• Introduced some of the MPI Features and their Usage
• Introduced MVAPICH2 stack
• Illustrated many performance optimizations and tuning techniques for
MVAPICH2
• Provided an overview of MPI-3 Features
• Introduced challenges in designing MPI for Exascale systems
• Presented approaches being taken by MVAPICH2 for Exascale systems
3. High-Performance Networks in the Top500
Percentage share of InfiniBand is steadily increasing
4. Use of High-Performance Networks for Scientific
Computing
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• Message Passing Interface (MPI)
• Parallel File Systems
• Almost 11.5 years of research and development since InfiniBand was introduced in October 2000
• Other Programming Models are emerging to take
advantage of High-Performance Networks
– UPC
– SHMEM
7. Large-scale InfiniBand Installations
• 209 IB clusters (41.8%) in the November '11 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):
– 120,640 cores (Nebulae) in China (4th)
– 73,278 cores (Tsubame-2.0) in Japan (5th)
– 111,104 cores (Pleiades) at NASA Ames (7th)
– 138,368 cores (Tera-100) in France (9th)
– 122,400 cores (RoadRunner) at LANL (10th)
– 137,200 cores (Sunway Blue Light) in China (14th)
– 46,208 cores (Zin) at LLNL (15th)
– 33,072 cores (Lomonosov) in Russia (18th)
– 29,440 cores (Mole-8.5) in China (21st)
– 42,440 cores (Red Sky) at Sandia (24th)
– 62,976 cores (Ranger) at TACC (25th)
– 20,480 cores (Bull Benchmarks) in France (27th)
– 20,480 cores (Helios) in Japan (28th)
• More are getting installed!
8. Enterprise/Commercial Computing
• Focuses on big data and data analytics
• Multiple environments and middleware are gaining
momentum
– Hadoop (HDFS, HBase and MapReduce)
– Memcached
9. Can High-Performance Interconnects Benefit Enterprise
Computing?
• Most of the current enterprise systems use 1GE
• Concerns for performance and scalability
• Usage of High-Performance Networks is beginning to draw
interest
– Oracle, IBM and Google are working in this direction
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar
to the designs adopted for MPI)?
10. Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
11. Memcached Architecture
[Figure: Memcached deployment: web frontend servers (Memcached clients) communicate over high-performance networks with Memcached servers and database servers; each node has CPUs, main memory, and SSD/HDD storage.]
• Integral part of the Web 2.0 architecture
• Distributed caching layer
– Allows aggregation of spare memory from multiple nodes
– General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive (a minimal client sketch follows)
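For context, a minimal sketch of this cache-aside usage with libmemcached (the client library used in the experiments later in this deck); the server address, key and value are illustrative assumptions. Link with -lmemcached.

/* Sketch: cache-aside usage with libmemcached; server address and key
 * are illustrative assumptions. */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_st *servers = memcached_servers_parse("127.0.0.1:11211");
    memcached_server_push(memc, servers);

    /* Cache the result of an expensive database query under a key. */
    const char *key = "user:42:profile";
    const char *val = "{\"name\": \"alice\"}";
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          val, strlen(val),
                                          (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    /* Later: consult the cache before going back to the database. */
    size_t len; uint32_t flags;
    char *cached = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (cached) {
        printf("cache hit: %.*s\n", (int)len, cached);
        free(cached);
    }
    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}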
12. Hadoop Architecture
• Underlying Hadoop Distributed
File System (HDFS)
• Fault-tolerance by replicating
data blocks
• NameNode: stores information
on data blocks
• DataNodes: store blocks and host MapReduce computation
• JobTracker: tracks jobs and detects failures
• The model scales, but there is a high amount of communication during the intermediate phases (a minimal HDFS client sketch follows)
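For illustration, a minimal sketch of a client write through libhdfs, the C API shipped with Hadoop (link with -lhdfs and a JVM); the NameNode host, port and file path are assumptions:

/* Sketch: writing a file to HDFS through libhdfs; the NameNode
 * host/port and the path are assumptions. */
#include <hdfs.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The NameNode maps file blocks to DataNodes. */
    hdfsFS fs = hdfsConnect("namenode.example.com", 9000);
    if (!fs) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Blocks of this file are replicated (here 3x) across DataNodes
     * for fault tolerance; 0 selects default buffer and block sizes. */
    hdfsFile f = hdfsOpenFile(fs, "/tmp/sample.txt", O_WRONLY, 0, 3, 0);
    const char *buf = "hello hdfs\n";
    hdfsWrite(fs, f, (void *)buf, (tSize)strlen(buf));

    hdfsCloseFile(fs, f);
    hdfsDisconnect(fs);
    return 0;
}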
13. Network-Level Interaction Between Clients and Data
Nodes in HDFS
[Figure: HDFS clients communicate with HDFS DataNodes (HDD/SSD) over high-performance networks.]
14. Overview of HBase Architecture
• An open-source database project based on the Hadoop framework, for hosting very large tables
• Major components:
HBaseMaster,
HRegionServer and
HBaseClient
• HBase and HDFS are
deployed in the same
cluster to get better
data locality
15. Network-Level Interaction Between
HBase Clients, Region Servers and Data Nodes
[Figure: HBase clients communicate with HRegion servers, and HRegion servers with DataNodes (HDD/SSD), over high-performance networks.]
16. Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
17. Designing Communication and I/O Libraries for
Enterprise Systems: Challenges
[Figure: layered stack, top to bottom:
Applications
Datacenter Middleware (HDFS, HBase, MapReduce, Memcached)
Programming Models (Socket)
Communication and I/O Library: Point-to-Point Communication, Threading Models and Synchronization, I/O and Filesystems, QoS, Fault Tolerance
Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs), Commodity Computing System Architectures (single, dual, quad, ...; Multi-/Many-core Architectures and Accelerators), Storage Technologies (HDD or SSD)]
18. Common Protocols using Open Fabrics
[Figure: protocol stacks over OpenFabrics, reconstructed as a table:]

Protocol        | App. interface | Protocol implementation   | Adapter     | Switch
----------------+----------------+---------------------------+-------------+-----------
1/10/40 GigE    | Sockets        | TCP/IP (kernel space)     | Ethernet    | Ethernet
IPoIB           | Sockets        | TCP/IP (kernel space)     | InfiniBand  | InfiniBand
10/40 GigE-TOE  | Sockets        | TCP/IP (hardware offload) | Ethernet    | Ethernet
SDP             | Sockets        | RDMA (kernel space)       | InfiniBand  | InfiniBand
iWARP           | Verbs          | RDMA (user space)         | Ethernet    | Ethernet
RoCE            | Verbs          | RDMA (user space)         | Ethernet    | Ethernet
IB Verbs        | Verbs          | RDMA (user space)         | InfiniBand  | InfiniBand
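A note on the sockets rows: unmodified sockets applications can run over SDP either via the libsdp preload library or by switching the address family at socket creation. A minimal sketch, assuming OFED's AF_INET_SDP value of 27:

/* Sketch: open an SDP connection instead of TCP; only the address
 * family changes. AF_INET_SDP = 27 is OFED's value (an assumption). */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27 /* OFED's SDP address family */
#endif

int connect_sdp(const char *ip, unsigned short port)
{
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0); /* SDP, not TCP */
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET; /* the address format is unchanged */
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* now usable like any connected TCP socket */
}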
19. Can New Data Analysis and Management Systems be
Designed with High-Performance Networks and Protocols?
[Figure: Current design: Application → Sockets → 1/10 GigE network. Enhanced designs: Application → accelerated sockets → verbs/hardware offload → 10 GigE or InfiniBand. Our approach: Application → OSU design → verbs interface → 10 GigE or InfiniBand.]
• Sockets were not designed for high performance
– Stream semantics often mismatch the needs of upper layers (Memcached, HBase, Hadoop)
– Zero-copy is not available for non-blocking sockets (a verbs-level sketch follows)
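To make the contrast concrete, below is a minimal client-side sketch of verbs-level messaging using librdmacm's connected-endpoint helpers (link with -lrdmacm -libverbs). The peer address, port and message size are assumptions, and a listening peer is assumed; the point is that a registered buffer is handed to the NIC directly, the zero-copy path that sockets lack.

/* Sketch: verbs-level send over a reliable connection via librdmacm;
 * peer address, port and message size are assumptions. */
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#include <string.h>

int main(void)
{
    struct rdma_addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;
    if (rdma_getaddrinfo("192.168.0.1", "7471", &hints, &res))
        return 1;

    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 1;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;

    struct rdma_cm_id *id;
    if (rdma_create_ep(&id, res, NULL, &attr)) /* creates PD, CQs, QP */
        return 1;

    char buf[64] = "hello over verbs";
    struct ibv_mr *mr = rdma_reg_msgs(id, buf, sizeof(buf)); /* pin + register */

    if (rdma_connect(id, NULL)) /* assumes a listening peer */
        return 1;

    /* The HCA DMAs straight from the registered buffer: no copies. */
    rdma_post_send(id, NULL, buf, sizeof(buf), mr, 0);
    struct ibv_wc wc;
    while (rdma_get_send_comp(id, &wc) == 0) /* block for the completion */
        ;

    rdma_disconnect(id);
    rdma_dereg_mr(mr);
    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}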
20. Interplay between Storage and Interconnect/Protocols
• Most current-generation enterprise systems use traditional hard disks
• Since hard disks are slower, high-performance communication protocols may not have much impact
• SSDs and other storage technologies are emerging
• Does it change the landscape?
21. Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
22. Memcached Design Using Verbs
[Figure: Memcached server design: a master thread accepts clients over Sockets and RDMA and assigns each to a Sockets worker thread or a verbs worker thread; all worker threads share the Memcached data structures (memory slabs, items).]
• Server and client perform a negotiation protocol
– The master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it communicates directly with that thread and is "bound" to it; each verbs worker thread can support multiple clients
• All other Memcached data structures are shared among the RDMA and Sockets worker threads
• Memcached applications need not be modified; the verbs interface is used if available
• The Memcached server can serve both socket and verbs clients simultaneously (an illustrative dispatch sketch follows)
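Below is an illustrative sketch, not the OSU implementation, of the master/worker dispatch described above: a master thread accepts connections and binds each client to a worker thread via a pipe. The worker count is an assumption; 11211 is Memcached's default port.

/* Illustrative sketch (not the OSU implementation) of the master/worker
 * pattern: the master accepts clients and binds each to a worker. */
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#define NWORKERS 4 /* assumed worker-thread count */

static int chan[NWORKERS][2]; /* master -> worker fd channels */

static void *worker(void *arg)
{
    int *p = arg, client;
    while (read(p[0], &client, sizeof client) == sizeof client) {
        /* The client is now "bound" to this worker, which services all
         * of its requests; cache structures are shared across workers. */
        close(client); /* placeholder for the request-serving loop */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) {
        pipe(chan[i]);
        pthread_create(&tid[i], NULL, worker, chan[i]);
    }

    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_port = htons(11211), /* Memcached default */
                             .sin_addr.s_addr = INADDR_ANY };
    bind(lfd, (struct sockaddr *)&a, sizeof a);
    listen(lfd, 128);

    for (int next = 0; ; next = (next + 1) % NWORKERS) {
        int client = accept(lfd, NULL, NULL); /* negotiation would occur here */
        write(chan[next][1], &client, sizeof client); /* assign to a worker */
    }
}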
23. Experimental Setup
• Hardware
– Intel Clovertown
• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
6 GB main memory, 250 GB hard disk
• Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
– Intel Westmere
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
12 GB main memory, 160 GB hard disk
• Network: 1GigE, IPoIB, and IB (QDR)
• Software
– Memcached Server: 1.4.9
– Memcached Client: (libmemcached) 0.52
– In all experiments, ‘memtable’ is contained in memory (no disk
access involved)
24. Memcached Get Latency (Small Message)
[Figure: Memcached Get latency for small messages, Time (us) vs. message size (1 B to 2 KB), comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB; left panel: Intel Clovertown cluster (IB: DDR), right panel: Intel Westmere cluster (IB: QDR).]
• Memcached Get latency
– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost factor of four improvement over 10GE (TOE) for 2K bytes on
the DDR cluster
25. Memcached Get Latency (Large Message)
[Figure: Memcached Get latency for large messages, Time (us) vs. message size (2 KB to 512 KB), same protocols; left panel: Intel Clovertown cluster (IB: DDR), right panel: Intel Westmere cluster (IB: QDR).]
• Memcached Get latency
– 8K bytes RC/UD – DDR: 18.9/19.1 us; QDR: 11.8/12.2 us
– 512K bytes RC/UD – DDR: 369/403 us; QDR: 173/203 us
• Almost factor of two improvement over 10GE (TOE) for 512K bytes on
the DDR cluster
26. Memcached Get TPS (4-byte)
[Figure: Memcached Get throughput for 4-byte messages, thousands of transactions per second (TPS) vs. number of clients, comparing SDP, IPoIB, 1GigE, OSU-RC-IB and OSU-UD-IB.]
• Memcached Get transactions per second for 4-byte messages
– On IB QDR: 1.4 M/s (RC) and 1.3 M/s (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB
27. Memcached - Memory Scalability
[Figure: Memcached server memory footprint (MB) vs. number of clients (1 to 4K), comparing SDP, IPoIB, 1GigE, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB.]
• Steady memory footprint for the UD design: ~200 MB
• The RC memory footprint increases with the number of clients: ~500 MB for 4K clients
28. Application Level Evaluation – Olio Benchmark
[Figure: Olio benchmark execution time (ms) vs. number of clients (left panel: 1 to 8 clients, right panel: 64 to 1024 clients), comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB.]
• Olio Benchmark
– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients
• 4X better than IPoIB for 8 clients
• Hybrid design achieves comparable performance to that of pure RC design
29. Application Level Evaluation – Real Application Workloads
[Figure: real application workload execution time (ms) vs. number of clients (left panel: 1 to 8 clients, right panel: 64 to 1024 clients), comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB.]
• Real Application Workload
– RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• Hybrid design achieves comparable performance to that of pure RC design
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, "Memcached Design on High Performance RDMA Capable Interconnects", ICPP '11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula and D. K. Panda, "Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transports", CCGrid '12
30. Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
31. HBase Design Using Verbs
[Figure: Current design: HBase → Sockets → 1/10 GigE network. OSU design: HBase → JNI interface → OSU module → InfiniBand (verbs).]
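A hypothetical sketch of the JNI boundary implied by the figure: Java-side HBase code calls a native method that forwards bytes to a verbs transport. The class name and the native send routine are illustrative assumptions, not the actual OSU module.

/* Hypothetical JNI glue; the class name and osu_verbs_send() are
 * illustrative assumptions, not the actual OSU module. */
#include <jni.h>

extern int osu_verbs_send(const void *buf, int len); /* assumed native transport */

JNIEXPORT jint JNICALL
Java_OsuIbTransport_nativeSend(JNIEnv *env, jobject self, jbyteArray data)
{
    jint len = (*env)->GetArrayLength(env, data);
    /* Obtain (and possibly pin) the Java byte array for the NIC to read. */
    jbyte *buf = (*env)->GetByteArrayElements(env, data, NULL);
    int rc = osu_verbs_send(buf, len);
    (*env)->ReleaseByteArrayElements(env, data, buf, JNI_ABORT); /* no copy-back */
    return rc;
}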
32. Experimental Setup
• Hardware
– Intel Clovertown
• Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
6 GB main memory, 250 GB hard disk
• Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
– Intel Westmere
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
12 GB main memory, 160 GB hard disk
• Network: 1GigE, IPoIB, and IB (QDR)
– 3 Nodes used
• Node1 [NameNode & HBase Master]
• Node2 [DataNode & HBase RegionServer]
• Node3 [Client]
• Software
– Hadoop 0.20.0, HBase 0.90.3 and Sun Java SDK 1.7.
– In all experiments, ‘memtable’ is contained in memory (no disk access
involved)
33. Details on Experiments
• Key/Value size
– Key size: 20 bytes
– Value size: 1 KB / 4 KB
• Get operation
– One Key/Value pair is inserted, so that the pair stays in memory
– The Get operation is repeated 80,000 times
– The first 40,000 iterations are skipped as warm-up
• Put operation
– Memstore_Flush_Size is set to 256 MB
– No memory flush operation is involved
– The Put operation is repeated 40,000 times
– The first 10,000 iterations are skipped as warm-up
A sketch of this measurement loop follows.
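A minimal sketch of the methodology, with warm-up iterations discarded and the remainder averaged; do_get() is a hypothetical stand-in for one HBase client Get operation.

/* Sketch of the measurement loop; do_get() is a hypothetical stand-in
 * for one HBase client Get on the inserted Key/Value pair. */
#include <stdio.h>
#include <sys/time.h>

extern void do_get(void); /* assumed: performs one Get operation */

int main(void)
{
    const int total = 80000, warmup = 40000;
    struct timeval t0, t1;
    double sum_us = 0.0;

    for (int i = 0; i < total; i++) {
        gettimeofday(&t0, NULL);
        do_get();
        gettimeofday(&t1, NULL);
        if (i >= warmup) /* discard warm-up iterations */
            sum_us += (t1.tv_sec - t0.tv_sec) * 1e6
                    + (t1.tv_usec - t0.tv_usec);
    }
    printf("average Get latency: %.2f us\n", sum_us / (total - warmup));
    return 0;
}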
34. Get Operation (IB:DDR)
[Figure: HBase Get latency (us) and throughput (operations/sec) for 1 KB and 4 KB values, comparing 1GE, IPoIB, 10GE and the OSU design; Intel Clovertown cluster (IB: DDR).]
• HBase Get Operation
– 1K bytes – 65 us (15K TPS)
– 4K bytes – 88 us (11K TPS)
• Almost factor of two improvement over 10GE (TOE)
35. Get Operation (IB:QDR)
[Figure: HBase Get latency (us) and throughput (operations/sec) for 1 KB and 4 KB values, comparing 1GE, IPoIB and the OSU design; Intel Westmere cluster (IB: QDR).]
• HBase Get Operation
– 1K bytes – 47 us (22K TPS)
– 4K bytes – 64 us (16K TPS)
• Almost factor of four improvement over IPoIB for 1KB
36. Put Operation (IB:DDR)
[Figure: HBase Put latency (us) and throughput (operations/sec) for 1 KB and 4 KB values, comparing 1GE, IPoIB, 10GE and the OSU design; Intel Clovertown cluster (IB: DDR).]
• HBase Put Operation
– 1K bytes – 114 us (8.7K TPS)
– 4K bytes – 179 us (5.6K TPS)
• 34% improvement over 10GE (TOE) for 1KB
37. Put Operation (IB:QDR)
[Figure: HBase Put latency (us) and throughput (operations/sec) for 1 KB and 4 KB values, comparing 1GE, IPoIB and the OSU design; Intel Westmere cluster (IB: QDR).]
• HBase Put Operation
– 1K bytes – 78 us (13K TPS)
– 4K bytes – 122 us (8K TPS)
• A factor of two improvement over IPoIB for 1KB
38. HBase Put/Get – Detailed Analysis
[Figure: time breakdown (us) of HBase 1 KB Put and 1 KB Get into communication, communication preparation, server processing, server serialization, client processing and client serialization, for 1GigE, IPoIB, 10GigE and OSU-IB.]
• HBase 1KB Put
– Communication Time – 8.9 us
– A factor of 6X improvement over 10GE for communication time
• HBase 1KB Get
– Communication Time – 8.9 us
– A factor of 6X improvement over 10GE for communication time
W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy and D. K. Panda, "Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?", ISPASS '12
39. HBase Single-Server, Multi-Client Results
[Figure: HBase Get latency (us) and throughput (ops/sec) vs. number of clients (1 to 16) against a single server, comparing IPoIB, OSU-IB, 1GigE and 10GigE.]
• HBase Get latency
– 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GE
40. HBase – YCSB Read-Write Workload
[Figure: YCSB read-write workload, read latency and write latency (us) vs. number of clients (8 to 128), comparing IPoIB, OSU-IB, 1GigE and 10GigE.]
• HBase Get latency (Yahoo! Cloud Serving Benchmark)
– 64 clients: 2.0 ms; 128 clients: 3.5 ms
– 42% improvement over IPoIB for 128 clients
• HBase Put latency
– 64 clients: 1.9 ms; 128 clients: 3.5 ms
– 40% improvement over IPoIB for 128 clients
J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy and D. K. Panda, "High-Performance Design of HBase with RDMA over InfiniBand", IPDPS '12
41. Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
42. Studies and Experimental Setup
• Two kinds of designs and studies have been done
– Studying the impact of HDD vs. SSD for HDFS
• Unmodified Hadoop for experiments
– Preliminary design of HDFS over Verbs
• Hadoop Experiments
– Intel Clovertown 2.33GHz, 6GB RAM, InfiniBand DDR, Chelsio T320
– Intel X-25E 64GB SSD and 250GB HDD
– Hadoop version 0.20.2, Sun/Oracle Java 1.6.0
– Dedicated NameNode and JobTracker
– Number of Datanodes used: 2, 4, and 8
43. Hadoop: DFS IO Write Performance
[Figure: Hadoop DFS IO average write throughput (MB/sec) vs. file size (1-10 GB) with four DataNodes, comparing 1GigE, IPoIB, SDP and 10GigE-TOE, each with HDD and with SSD.]
• DFS IO, included in Hadoop, measures sequential access throughput
• We run two map tasks, each writing to a file of increasing size (1-10 GB)
• Significant improvement with IPoIB, SDP and 10GigE
• With SSD, the performance improvement is almost seven- to eight-fold!
• SSD benefits are not seen without a high-performance interconnect
44. Hadoop: RandomWriter Performance
[Figure: Hadoop RandomWriter execution time (sec) for 2 and 4 DataNodes, comparing 1GigE, IPoIB, SDP and 10GigE-TOE, each with HDD and with SSD.]
• Each map generates 1GB of random binary data and writes to HDFS
• SSD improves execution time by 50% with 1GigE for two DataNodes
• For four DataNodes, benefits are observed only with high-performance interconnects
• IPoIB, SDP and 10GigE can improve performance by 59% on four Data Nodes
45. Hadoop Sort Benchmark
[Figure: Hadoop Sort execution time (sec) for 2 and 4 DataNodes, comparing 1GigE, IPoIB, SDP and 10GigE-TOE, each with HDD and with SSD.]
• Sort: baseline benchmark for Hadoop
• Sort phase: I/O bound; Reduce phase: communication bound
• SSD improves performance by 28% using 1GigE with two DataNodes
• Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE
S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, "Can High-Performance Interconnects Benefit Hadoop Distributed File System?", MASVDC '10 (in conjunction with MICRO 2010), Atlanta, GA
46. HDFS Design Using Verbs
[Figure: Current design: HDFS → Sockets → 1/10 GigE network. OSU design: HDFS → JNI interface → OSU module → InfiniBand (verbs); the same structure as the HBase design.]
47. RDMA-based Design for Native HDFS –
Preliminary Results
[Figure: HDFS file write time vs. file size (1-5 GB), comparing 1GigE, IPoIB, 10GigE and the OSU design.]
• HDFS file write experiment using four DataNodes on the IB DDR cluster
• HDFS file write time
– 2 GB: 14 s; 5 GB: 86 s
– For a 5 GB file: 20% improvement over IPoIB and 14% improvement over 10GigE
48. Presentation Outline
• Overview of Hadoop, Memcached and HBase
• Challenges in Accelerating Enterprise Middleware
• Designs and Case Studies
– Memcached
– HBase
– HDFS
• Conclusion and Q&A
49. Concluding Remarks
• InfiniBand with its RDMA feature is gaining momentum in HPC systems, delivering the best performance and seeing growing usage
• It is possible to use the RDMA feature in enterprise environments
for accelerating big data processing
• Presented some initial designs and performance numbers
• Many open research challenges remain to be solved so that
middleware for enterprise environments can take advantage of
– modern high-performance networks
– multi-core technologies
– emerging storage technologies
50. Designing Communication and I/O Libraries for
Enterprise Systems: Solved a Few Initial Challenges
[Figure: layered stack, top to bottom:
Applications
Datacenter Middleware (HDFS, HBase, MapReduce, Memcached)
Programming Models (Socket)
Communication and I/O Library: Point-to-Point Communication, Threading Models and Synchronization, I/O and Filesystems, QoS, Fault Tolerance
Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs), Commodity Computing System Architectures (single, dual, quad, ...; Multi-/Many-core Architectures and Accelerators), Storage Technologies (HDD or SSD)]
51. Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu
panda@cse.ohio-state.edu