Acceleration for Big Data, Hadoop and Memcached: Presentation Transcript

• Acceleration for Big Data, Hadoop and Memcached
  A presentation at the HPC Advisory Council Workshop, Lugano 2012
  by Dhabaleswar K. (DK) Panda, The Ohio State University
  E-mail: panda@cse.ohio-state.edu
  http://www.cse.ohio-state.edu/~panda
• Recap of the Last Two Days' Presentations
  – MPI is a dominant programming model for HPC systems
  – Introduced some of the MPI features and their usage
  – Introduced the MVAPICH2 stack
  – Illustrated many performance optimizations and tuning techniques for MVAPICH2
  – Provided an overview of MPI-3 features
  – Introduced challenges in designing MPI for Exascale systems
  – Presented approaches being taken by MVAPICH2 for Exascale systems
• High-Performance Networks in the Top500
  [Chart: the percentage share of InfiniBand in the Top500 is steadily increasing]
• Use of High-Performance Networks for Scientific Computing
  – The OpenFabrics software stack, with IB, iWARP and RoCE interfaces, is driving HPC systems
  – Message Passing Interface (MPI)
  – Parallel file systems
  – Almost 11.5 years of research and development since InfiniBand was introduced in October 2001
  – Other programming models are emerging to take advantage of high-performance networks: UPC, SHMEM
• One-way Latency: MPI over IB
  [Charts: small- and large-message MPI latency for MVAPICH over Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR and ConnectX3-PCIe3-FDR HCAs; annotated small-message latencies range from about 0.81 us to 1.82 us]
  DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch
  FDR: 2.6 GHz octa-core (Sandybridge) Intel, PCI Gen3, without IB switch
• Bandwidth: MPI over IB
  [Charts: unidirectional and bidirectional MPI bandwidth for the same five MVAPICH/HCA configurations; annotated unidirectional bandwidths of roughly 1706, 1917, 3280, 3385 and 6333 MB/s, and bidirectional bandwidths of roughly 3341, 3704, 4407, 6521 and 11043 MB/s]
  DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch
  FDR: 2.6 GHz octa-core (Sandybridge) Intel, PCI Gen3, without IB switch
• Large-scale InfiniBand Installations
  – 209 IB clusters (41.8%) in the November '11 Top500 list (http://www.top500.org)
  – Installations in the Top 30 (13 systems):
    120,640 cores (Nebulae) in China (4th)
    73,278 cores (Tsubame-2.0) in Japan (5th)
    111,104 cores (Pleiades) at NASA Ames (7th)
    138,368 cores (Tera-100) in France (9th)
    122,400 cores (RoadRunner) at LANL (10th)
    137,200 cores (Sunway Blue Light) in China (14th)
    46,208 cores (Zin) at LLNL (15th)
    33,072 cores (Lomonosov) in Russia (18th)
    29,440 cores (Mole-8.5) in China (21st)
    42,440 cores (Red Sky) at Sandia (24th)
    62,976 cores (Ranger) at TACC (25th)
    20,480 cores (Bull Benchmarks) in France (27th)
    20,480 cores (Helios) in Japan (28th)
  – More are getting installed!
• Enterprise/Commercial Computing
  – Focuses on big data and data analytics
  – Multiple environments and middleware are gaining momentum: Hadoop (HDFS, HBase and MapReduce), Memcached
• Can High-Performance Interconnects Benefit Enterprise Computing?
  – Most of the current enterprise systems use 1GigE
  – Concerns for performance and scalability
  – Usage of high-performance networks is beginning to draw interest; Oracle, IBM and Google are working along these directions
  – What are the challenges? Where do the bottlenecks lie? Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
• Presentation Outline
  – Overview of Hadoop, Memcached and HBase
  – Challenges in Accelerating Enterprise Middleware
  – Designs and Case Studies: Memcached, HBase, HDFS
  – Conclusion and Q&A
• Memcached Architecture
  [Diagram: web frontend servers (Memcached clients) connected over high-performance networks to Memcached servers, which cache data in main memory, and to database servers backed by SSD/HDD storage]
  – Integral part of the Web 2.0 architecture
  – Distributed caching layer: allows spare memory from multiple nodes to be aggregated; general purpose
  – Typically used to cache database queries and results of API calls
  – Scalable model, but typical usage is very network intensive
  (A minimal usage sketch follows below.)
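To make the caching-layer role concrete, here is a minimal cache-aside sketch in Java. The experiments later in this deck use the C libmemcached client; this sketch uses the spymemcached Java client purely for illustration, and the host name, key and TTL are placeholders.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAsideExample {
    public static void main(String[] args) throws Exception {
        // host/port are placeholders for one of the Memcached servers
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("memcached1", 11211));

        String key = "user:42:profile";
        Object value = cache.get(key);                // 1. look in the distributed cache first
        if (value == null) {
            value = "row fetched from the database";  // 2. cache miss: query the backend database
            cache.set(key, 300, value);               // 3. populate the cache (300-second TTL)
        }
        System.out.println(value);
        cache.shutdown();
    }
}
```

After a miss the application pays the database cost once; subsequent reads are served from the aggregated memory of the Memcached servers, which is why the access pattern is dominated by small network operations.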
• Hadoop Architecture
  – Built on the underlying Hadoop Distributed File System (HDFS)
  – Fault tolerance by replicating data blocks
  – NameNode: stores information on data blocks
  – DataNodes: store blocks and host MapReduce computation
  – JobTracker: tracks jobs and detects failures
  – The model scales, but there is a high amount of communication during intermediate phases
• Network-Level Interaction Between Clients and DataNodes in HDFS
  [Diagram: HDFS clients communicating over high-performance networks with HDFS DataNodes backed by HDD/SSD storage]
  (A minimal client-side sketch of this interaction follows below.)
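The client/DataNode interaction above can be illustrated with the standard Hadoop FileSystem API. This is a minimal sketch, not the OSU-modified HDFS; the NameNode URI and file path are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Opening the file system contacts the NameNode for metadata only;
        // "hdfs://namenode:9000" is a placeholder for the cluster's NameNode URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path path = new Path("/demo/example.bin");
        byte[] buf = new byte[64 * 1024];

        // Block data is streamed to the DataNodes, which replicate it.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write(buf);
        }
        // Reads fetch block data back from one of the DataNode replicas.
        try (FSDataInputStream in = fs.open(path)) {
            in.readFully(0, buf);
        }
        fs.delete(path, true);
    }
}
```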
• Overview of HBase Architecture
  – An open-source database project based on the Hadoop framework for hosting very large tables
  – Major components: HBaseMaster, HRegionServer and HBaseClient
  – HBase and HDFS are deployed in the same cluster to get better data locality
  (A minimal client-side Put/Get sketch follows below.)
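As a concrete view of the HBaseClient role, below is a minimal Put/Get sketch using the pre-1.0 HTable client API, matching the HBase 0.90.3 release used later in the experiments. The table, column family and qualifier names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml
        HTable table = new HTable(conf, "usertable");        // table name is illustrative

        // Put travels from the HBaseClient to the owning HRegionServer,
        // which eventually persists it to HDFS DataNodes.
        Put put = new Put(Bytes.toBytes("row-0001"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        table.put(put);

        // Get is served by the HRegionServer (MemStore/block cache or HDFS).
        Result result = table.get(new Get(Bytes.toBytes("row-0001")));
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

        table.close();
    }
}
```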
• Network-Level Interaction Between HBase Clients, Region Servers and Data Nodes
  [Diagram: HBase clients, HRegionServers and DataNodes (HDD/SSD-backed) communicating over high-performance networks]
• Presentation Outline: Overview of Hadoop, Memcached and HBase; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies (Memcached, HBase, HDFS); Conclusion and Q&A
• Designing Communication and I/O Libraries for Enterprise Systems: Challenges
  [Diagram: applications and datacenter middleware (HDFS, HBase, MapReduce, Memcached) sit on socket-based programming models over a communication and I/O library providing point-to-point communication, threading models and synchronization, I/O and file systems, QoS and fault tolerance; underneath are commodity computing system architectures (single, dual, quad sockets, ...), networking technologies (InfiniBand, 1/10/40 GigE, RNICs and intelligent NICs), storage technologies (HDD or SSD) and multi-/many-core architectures and accelerators]
• Common Protocols Using OpenFabrics
  [Diagram: protocol options spanning the application's sockets and verbs interfaces: 1/10/40 GigE over kernel TCP/IP, 10/40 GigE with TOE (hardware TCP/IP offload), IPoIB, SDP, iWARP, RoCE and native IB verbs, each running over the corresponding Ethernet or InfiniBand adapters and switches]
• Can New Data Analysis and Management Systems Be Designed with High-Performance Networks and Protocols?
  [Diagram: current design (application over sockets over 1/10 GigE); enhanced designs (accelerated sockets with verbs/hardware offload over 10 GigE or InfiniBand); our approach (application over an OSU design using verbs over 10 GigE or InfiniBand)]
  – Sockets were not designed for high performance
  – Stream semantics often mismatch the needs of upper layers (Memcached, HBase, Hadoop)
  – Zero-copy is not available for non-blocking sockets
• Interplay Between Storage and Interconnects/Protocols
  – Most current-generation enterprise systems use traditional hard disks
  – Since hard disks are slower, high-performance communication protocols may have little impact
  – SSDs and other storage technologies are emerging: does this change the landscape?
• Presentation Outline: Overview of Hadoop, Memcached and HBase; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies (Memcached, HBase, HDFS); Conclusion and Q&A
• Memcached Design Using Verbs
  [Diagram: a master thread accepts incoming socket and verbs clients and hands them to sockets or verbs worker threads, which share the Memcached data structures (slabs, items)]
  – Server and client perform a negotiation protocol; the master thread assigns clients to the appropriate worker thread
  – Once a client is assigned a verbs worker thread, it communicates directly with, and is "bound" to, that thread; each verbs worker thread can support multiple clients
  – All other Memcached data structures are shared among the RDMA and sockets worker threads
  – Memcached applications need not be modified; the verbs interface is used if available
  – The Memcached server can serve both socket and verbs clients simultaneously
  (A hypothetical sketch of this dispatch pattern follows below.)
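The master/worker dispatch described on this slide can be pictured with the hypothetical Java sketch below. It is only an analogue of the pattern (the actual OSU design lives inside the C Memcached server and its verbs transport); the class names, round-robin assignment policy and port are assumptions.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical analogue of the slide's design: a master thread accepts clients
// and binds each one to a worker thread; each worker multiplexes its own set
// of clients, while the cached items themselves are shared across all workers.
public class MasterWorkerDispatch {
    static final int NUM_WORKERS = 4;

    public static void main(String[] args) throws IOException {
        Worker[] workers = new Worker[NUM_WORKERS];
        for (int i = 0; i < NUM_WORKERS; i++) {
            workers[i] = new Worker();
            workers[i].start();
        }
        int next = 0;
        try (ServerSocket master = new ServerSocket(11211)) {   // master thread: accept only
            while (true) {
                Socket client = master.accept();
                // "negotiation": pick a worker and bind this client to it for its lifetime
                workers[next++ % NUM_WORKERS].adopt(client);
            }
        }
    }

    static class Worker extends Thread {
        private final List<Socket> boundClients = new CopyOnWriteArrayList<>();

        void adopt(Socket s) { boundClients.add(s); }

        @Override public void run() {
            // A real worker would run an event loop over its bound clients
            // (sockets here; RDMA verbs endpoints in the OSU design) and serve
            // get/set requests against the shared slab/item structures.
            while (!isInterrupted()) {
                for (Socket s : boundClients) {
                    if (s.isClosed()) boundClients.remove(s);   // real code: poll and serve requests
                }
                Thread.yield();
            }
        }
    }
}
```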
• Experimental Setup
  – Hardware
    Intel Clovertown: each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk; networks: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
    Intel Westmere: each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk; networks: 1GigE, IPoIB and IB (QDR)
  – Software
    Memcached server: 1.4.9; Memcached client: libmemcached 0.52
    In all experiments the 'memtable' is contained in memory (no disk access involved)
• Memcached Get Latency (Small Messages)
  [Charts: Get latency vs. message size (1 byte to 2 KB) for SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR)]
  – 4-byte Get latency, RC/UD: DDR 6.82/7.55 us; QDR 4.28/4.86 us
  – 2 KB Get latency, RC/UD: DDR 12.31/12.78 us; QDR 8.19/8.46 us
  – Almost a factor of four improvement over 10GigE (TOE) for 2 KB on the DDR cluster
• Memcached Get Latency (Large Messages)
  [Charts: Get latency vs. message size (2 KB to 512 KB) for the same configurations on the DDR and QDR clusters]
  – 8 KB Get latency, RC/UD: DDR 18.9/19.1 us; QDR 11.8/12.2 us
  – 512 KB Get latency, RC/UD: DDR 369/403 us; QDR 173/203 us
  – Almost a factor of two improvement over 10GigE (TOE) for 512 KB on the DDR cluster
• Memcached Get Transactions per Second (4 bytes)
  [Charts: thousands of transactions per second vs. number of clients for SDP, IPoIB, 1GigE, OSU-RC-IB and OSU-UD-IB]
  – On IB QDR: 1.4 M transactions/s (RC) and 1.3 M transactions/s (UD) for 8 clients
  – Significant improvement with native IB QDR compared to SDP and IPoIB
• Memcached: Memory Scalability
  [Chart: memory footprint (MB) vs. number of clients (1 to 4K) for SDP, IPoIB, 1GigE, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]
  – Steady memory footprint for the UD design: ~200 MB
  – RC memory footprint increases with the number of clients: ~500 MB for 4K clients
• Application-Level Evaluation: Olio Benchmark
  [Charts: execution time vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]
  – Olio Benchmark: RC 1.6 s, UD 1.9 s, Hybrid 1.7 s for 1024 clients
  – 4X better than IPoIB for 8 clients
  – The hybrid design achieves performance comparable to the pure RC design
• Application-Level Evaluation: Real Application Workloads
  [Charts: execution time vs. number of clients for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB]
  – Real application workload: RC 302 ms, UD 318 ms, Hybrid 314 ms for 1024 clients
  – 12X better than IPoIB for 8 clients
  – The hybrid design achieves performance comparable to the pure RC design
  J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, "Memcached Design on High Performance RDMA Capable Interconnects", ICPP '11
  J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula and D. K. Panda, "Memcached Design on High Performance RDMA Capable Interconnects", CCGrid '12
• Presentation Outline: Overview of Hadoop, Memcached and HBase; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies (Memcached, HBase, HDFS); Conclusion and Q&A
• HBase Design Using Verbs
  [Diagram: the current design runs HBase over the sockets interface on a 1/10 GigE network; the OSU design adds a JNI interface to an OSU module running over InfiniBand (verbs)]
• Experimental Setup
  – Hardware
    Intel Clovertown: each node has 8 processor cores on 2 Intel Xeon 2.33 GHz quad-core CPUs, 6 GB main memory, 250 GB hard disk; networks: 1GigE, IPoIB, 10GigE TOE and IB (DDR)
    Intel Westmere: each node has 8 processor cores on 2 Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, 160 GB hard disk; networks: 1GigE, IPoIB and IB (QDR)
    3 nodes used: Node1 (NameNode and HBase Master), Node2 (DataNode and HBase RegionServer), Node3 (client)
  – Software
    Hadoop 0.20.0, HBase 0.90.3 and Sun Java SDK 1.7
    In all experiments the 'memtable' is contained in memory (no disk access involved)
• Details on Experiments
  – Key/value size: key size 20 bytes; value size 1 KB or 4 KB
  – Get operation: one key/value pair is inserted so that it stays in memory; the Get operation is repeated 80,000 times, and the first 40,000 iterations are skipped as warm-up
  – Put operation: Memstore_Flush_Size is set to 256 MB, so no memory flush is involved; the Put operation is repeated 40,000 times, and the first 10,000 iterations are skipped as warm-up
  (A sketch of such a latency microbenchmark follows below.)
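The measurement methodology (repeat the operation, discard the warm-up iterations, average the rest) can be sketched as below. This is an illustrative reconstruction, not the authors' benchmark code; the table name, row key and column family are placeholders, and the iteration counts follow the Get figures on the slide.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class GetLatencyBench {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "bench");                // table name is a placeholder
        byte[] row = Bytes.toBytes("key-0000000000000000");      // 20-byte key, as in the slides
        byte[] family = Bytes.toBytes("cf");

        int warmup = 40_000, total = 80_000;                     // Get: 80K iterations, 40K warm-up
        long sumNs = 0;
        for (int i = 0; i < total; i++) {
            long t0 = System.nanoTime();
            table.get(new Get(row).addFamily(family));           // value stays in the MemStore
            long t1 = System.nanoTime();
            if (i >= warmup) sumNs += (t1 - t0);                 // discard warm-up iterations
        }
        System.out.printf("average Get latency: %.2f us%n",
                sumNs / 1000.0 / (total - warmup));
        table.close();
    }
}
```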
• Get Operation (IB: DDR)
  [Charts: latency and throughput for 1GigE, IPoIB, 10GigE and the OSU design]
  – HBase Get: 1 KB in 65 us (15K TPS); 4 KB in 88 us (11K TPS)
  – Almost a factor of two improvement over 10GigE (TOE)
• Get Operation (IB: QDR)
  [Charts: latency and throughput for 1GigE, IPoIB and the OSU design]
  – HBase Get: 1 KB in 47 us (22K TPS); 4 KB in 64 us (16K TPS)
  – Almost a factor of four improvement over IPoIB for 1 KB
• Put Operation (IB: DDR)
  [Charts: latency and throughput for 1GigE, IPoIB, 10GigE and the OSU design]
  – HBase Put: 1 KB in 114 us (8.7K TPS); 4 KB in 179 us (5.6K TPS)
  – 34% improvement over 10GigE (TOE) for 1 KB
• Put Operation (IB: QDR)
  [Charts: latency and throughput for 1GigE, IPoIB and the OSU design]
  – HBase Put: 1 KB in 78 us (13K TPS); 4 KB in 122 us (8K TPS)
  – A factor of two improvement over IPoIB for 1 KB
• HBase Put/Get: Detailed Analysis
  [Charts: time breakdown (communication, communication preparation, server processing, server serialization, client processing, client serialization) for 1 KB Put and Get over 1GigE, IPoIB, 10GigE and OSU-IB]
  – HBase 1 KB Put: communication time 8.9 us, a factor of 6X improvement in communication time over 10GigE
  – HBase 1 KB Get: communication time 8.9 us, a factor of 6X improvement in communication time over 10GigE
  W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, C. Murthy and D. K. Panda, "Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?", ISPASS '12
• HBase Single-Server, Multi-Client Results
  [Charts: latency and throughput vs. number of clients for IPoIB, OSU-IB, 1GigE and 10GigE]
  – HBase Get latency: 4 clients 104.5 us; 16 clients 296.1 us
  – HBase Get throughput: 4 clients 37.01 Kops/sec; 16 clients 53.4 Kops/sec
  – 27% improvement in throughput over 10GigE for 16 clients
• HBase: YCSB Read-Write Workload
  [Charts: read and write latency vs. number of clients for IPoIB, OSU-IB, 1GigE and 10GigE]
  – HBase Get latency (Yahoo! Cloud Serving Benchmark): 64 clients 2.0 ms; 128 clients 3.5 ms; 42% improvement over IPoIB for 128 clients
  – HBase Put (write) latency: 64 clients 1.9 ms; 128 clients 3.5 ms; 40% improvement over IPoIB for 128 clients
  J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy and D. K. Panda, "High-Performance Design of HBase with RDMA over InfiniBand", IPDPS '12
• Presentation Outline: Overview of Hadoop, Memcached and HBase; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies (Memcached, HBase, HDFS); Conclusion and Q&A
• Studies and Experimental Setup
  – Two kinds of designs and studies have been done:
    Studying the impact of HDD vs. SSD for HDFS (using unmodified Hadoop)
    A preliminary design of HDFS over verbs
  – Hadoop experiments:
    Intel Clovertown 2.33 GHz, 6 GB RAM, InfiniBand DDR, Chelsio T320
    Intel X25-E 64 GB SSD and 250 GB HDD
    Hadoop version 0.20.2, Sun/Oracle Java 1.6.0
    Dedicated NameServer and JobTracker
    Number of DataNodes used: 2, 4 and 8
• Hadoop: DFS IO Write Performance
  [Chart: average write throughput (MB/s) vs. file size (1-10 GB) on four DataNodes, using HDD and SSD, for 1GigE, IPoIB, SDP and 10GigE-TOE]
  – DFS IO, included in Hadoop, measures sequential access throughput
  – Two map tasks, each writing to a file of increasing size (1-10 GB)
  – Significant improvement with IPoIB, SDP and 10GigE
  – With SSD, the performance improvement is almost seven- to eight-fold
  – SSD benefits are not seen without using a high-performance interconnect
  (A sketch of a sequential-write throughput measurement follows below.)
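A sequential-write throughput measurement in the spirit of DFS IO can be sketched with the FileSystem API as below. This is a simplified, assumed reconstruction (single writer, fixed 4 MB buffer, placeholder NameNode URI and output path), not the actual DFS IO benchmark shipped with Hadoop.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeqWriteThroughput {
    public static void main(String[] args) throws Exception {
        long fileSizeGB = Long.parseLong(args[0]);               // e.g. 1 through 10, as in the chart
        long totalBytes = fileSizeGB * 1024L * 1024L * 1024L;
        byte[] buf = new byte[4 * 1024 * 1024];                  // 4 MB write buffer (assumed)

        Configuration conf = new Configuration();
        // "hdfs://namenode:9000" is a placeholder for the cluster's NameNode URI
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        long start = System.currentTimeMillis();
        try (FSDataOutputStream out = fs.create(new Path("/bench/seqwrite-" + fileSizeGB + "g"), true)) {
            for (long written = 0; written < totalBytes; written += buf.length) {
                out.write(buf);                                   // sequential streaming write
            }
        }
        double seconds = (System.currentTimeMillis() - start) / 1000.0;
        System.out.printf("%d GB written: %.1f MB/s%n",
                fileSizeGB, totalBytes / (1024.0 * 1024.0) / seconds);
    }
}
```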
• Hadoop: RandomWriter Performance
  [Chart: execution time (s) for 2 and 4 DataNodes, using HDD and SSD, for 1GigE, IPoIB, SDP and 10GigE-TOE]
  – Each map generates 1 GB of random binary data and writes it to HDFS
  – SSD improves execution time by 50% with 1GigE for two DataNodes
  – For four DataNodes, benefits are observed only with an HPC interconnect
  – IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes
• Hadoop Sort Benchmark
  [Chart: execution time (s) for 2 and 4 DataNodes, using HDD and SSD, for 1GigE, IPoIB, SDP and 10GigE-TOE]
  – Sort is the baseline benchmark for Hadoop
  – The sort phase is I/O bound; the reduce phase is communication bound
  – SSD improves performance by 28% using 1GigE with two DataNodes
  – Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE
  S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda, "Can High-Performance Interconnects Benefit Hadoop Distributed File System?", MASVDC '10 (in conjunction with MICRO 2010), Atlanta, GA
• HDFS Design Using Verbs
  [Diagram: the current design runs HDFS over the sockets interface on a 1/10 GigE network; the OSU design adds a JNI interface to an OSU module running over InfiniBand (verbs)]
• RDMA-based Design for Native HDFS: Preliminary Results
  [Chart: HDFS file write time vs. file size (1-5 GB) for 1GigE, IPoIB, 10GigE and the OSU design]
  – HDFS file write experiment using four DataNodes on the IB DDR cluster
  – HDFS file write time: 2 GB in 14 s, 5 GB in 86 s
  – For a 5 GB file: 20% improvement over IPoIB, 14% improvement over 10GigE
• Presentation Outline: Overview of Hadoop, Memcached and HBase; Challenges in Accelerating Enterprise Middleware; Designs and Case Studies (Memcached, HBase, HDFS); Conclusion and Q&A
• Concluding Remarks
  – InfiniBand with the RDMA feature is gaining momentum in HPC systems, with the best performance and growing usage
  – It is possible to use the RDMA feature in enterprise environments to accelerate big data processing
  – Presented some initial designs and performance numbers
  – Many open research challenges remain to be solved so that middleware for enterprise environments can take advantage of modern high-performance networks, multi-core technologies and emerging storage technologies
• Designing Communication and I/O Libraries for Enterprise Systems: Solved a Few Initial Challenges
  [Diagram: the same layered stack shown in the earlier "Challenges" slide, from applications and datacenter middleware down to networking, storage and multi-/many-core technologies]
• Web Pointers
  http://www.cse.ohio-state.edu/~panda
  http://nowlab.cse.ohio-state.edu
  MVAPICH web page: http://mvapich.cse.ohio-state.edu
  panda@cse.ohio-state.edu