Acceleration for Big Data, Hadoop and
               Memcached
A Presentation at HPC Advisory Council Workshop, Lugano 2012
                              by



                  Dhabaleswar K. (DK) Panda
                   The Ohio State University
               E-mail: panda@cse.ohio-state.edu
             http://www.cse.ohio-state.edu/~panda
Recap of the Last Two Days' Presentations
• MPI is a dominant programming model for HPC Systems
• Introduced some of the MPI Features and their Usage
• Introduced MVAPICH2 stack
• Illustrated many performance optimizations and tuning techniques for
  MVAPICH2
• Provided an overview of MPI-3 Features
• Introduced challenges in designing MPI for Exascale systems
• Presented approaches being taken by MVAPICH2 for Exascale systems




HPC Advisory Council, Lugano Switzerland '12                             2
High-Performance Networks in the Top500




                        Percentage share of InfiniBand is steadily increasing


HPC Advisory Council, Lugano Switzerland '12                                    3
Use of High-Performance Networks for Scientific
Computing
• The OpenFabrics software stack, with IB, iWARP and RoCE
  interfaces, is driving HPC systems
• Message Passing Interface (MPI)
• Parallel File Systems
• Almost 11.5 years of Research and Development since
  InfiniBand was introduced in October 2000
• Other Programming Models are emerging to take
  advantage of High-Performance Networks
      – UPC
      – SHMEM


HPC Advisory Council, Lugano Switzerland '12            4
One-way Latency: MPI over IB

[Charts: Small Message Latency and Large Message Latency, Latency (us) vs. Message Size (bytes), comparing MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR and MVAPICH-ConnectX3-PCIe3-FDR. Annotated small-message latencies: 1.82, 1.66, 1.64, 1.56 and 0.81 us.]

                                    DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
                                    FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 without IB switch

                 HPC Advisory Council, Lugano Switzerland '12                                                                5
Bandwidth: MPI over IB

[Charts: Unidirectional Bandwidth and Bidirectional Bandwidth, Bandwidth (MBytes/sec) vs. Message Size (bytes), for the same five MVAPICH configurations. Annotated peak unidirectional bandwidths: 6333, 3385, 3280, 1917 and 1706 MBytes/sec; bidirectional: 11043, 6521, 4407, 3704 and 3341 MBytes/sec.]

                                            DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
                                            FDR - 2.6 GHz Octa-core (Sandybridge) Intel PCI Gen3 without IB switch

                         HPC Advisory Council, Lugano Switzerland '12                                                                                 6
Large-scale InfiniBand Installations

• 209 IB Clusters (41.8%) in the November‘11 Top500 list
    (http://www.top500.org)
• Installations in the Top 30 (13 systems):

 120,640 cores (Nebulae) in China (4th)
 73,278 cores (Tsubame-2.0) in Japan (5th)
 111,104 cores (Pleiades) at NASA Ames (7th)
 138,368 cores (Tera-100) in France (9th)
 122,400 cores (RoadRunner) at LANL (10th)
 137,200 cores (Sunway Blue Light) in China (14th)
 46,208 cores (Zin) at LLNL (15th)
 33,072 cores (Lomonosov) in Russia (18th)
 29,440 cores (Mole-8.5) in China (21st)
 42,440 cores (Red Sky) at Sandia (24th)
 62,976 cores (Ranger) at TACC (25th)
 20,480 cores (Bull Benchmarks) in France (27th)
 20,480 cores (Helios) in Japan (28th)
 More are getting installed!



HPC Advisory Council, Lugano Switzerland '12                                                           7
Enterprise/Commercial Computing

• Focuses on big data and data analytics
• Multiple environments and middleware are gaining
  momentum
      – Hadoop (HDFS, HBase and MapReduce)
      – Memcached




HPC Advisory Council, Lugano Switzerland '12         8
Can High-Performance Interconnects Benefit Enterprise
Computing?
• Most of the current enterprise systems use 1GE
• Concerns for performance and scalability
• Usage of High-Performance Networks is beginning to draw
  interest
     – Oracle, IBM, Google are working along these directions
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar
  to the designs adopted for MPI)?



 HPC Advisory Council, Lugano Switzerland '12                     9
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         10
Memcached Architecture
[Diagram: Web Frontend Servers (Memcached Clients) connect over High Performance Networks to a tier of Memcached Servers (each with CPUs, main memory, SSD and HDD), which in turn connect over High Performance Networks to Database Servers.]

  • Integral part of Web 2.0 architecture
  • Distributed Caching Layer
           – Aggregates spare memory from multiple nodes
           – General purpose
  • Typically used to cache database queries and results of API calls (see the client sketch below)
  • Scalable model, but typical usage is very network intensive
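  As a minimal illustration of this cache-aside usage pattern, here is a sketch using the spymemcached
  Java client (the experiments later in this talk use the libmemcached C client; host name, key and
  expiry value are illustrative):

  import java.net.InetSocketAddress;
  import net.spy.memcached.MemcachedClient;

  public class CacheAsideSketch {
      public static void main(String[] args) throws Exception {
          // Connect to one memcached server; real deployments pass the full server list.
          MemcachedClient cache = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));

          String key = "user:42:profile";
          Object value = cache.get(key);              // one network round trip to the caching tier
          if (value == null) {
              value = "result of an expensive database query or API call";
              cache.set(key, 300, value);             // cache for 300 seconds
          }
          System.out.println(value);
          cache.shutdown();
      }
  }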

    HPC Advisory Council, Lugano Switzerland '12                                                                                                                   11
Hadoop Architecture

• Underlying Hadoop Distributed
  File System (HDFS)
• Fault-tolerance by replicating
  data blocks
• NameNode: stores information
  on data blocks
• DataNodes: store blocks and host MapReduce computation (see the block-location sketch below)
• JobTracker: tracks jobs and detects failures
• Model scales, but there is a high amount of communication during intermediate phases
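  For reference, a minimal sketch of how an HDFS client sees this split of roles, using the Hadoop
  FileSystem API (the file path is illustrative): the NameNode answers the metadata query, and the
  hosts returned for each block are the DataNodes that store it.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class BlockLocationsSketch {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());   // reads core-site.xml / hdfs-site.xml
          FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));

          // Metadata comes from the NameNode; the blocks themselves live on the DataNodes
          // listed for each block (one entry per replica).
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation b : blocks) {
              System.out.println("offset " + b.getOffset() + " length " + b.getLength()
                      + " hosts " + java.util.Arrays.toString(b.getHosts()));
          }
          fs.close();
      }
  }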

  HPC Advisory Council, Lugano Switzerland '12   12
Network-Level Interaction Between Clients and Data
Nodes in HDFS


[Diagram: multiple HDFS Clients connect over High Performance Networks to HDFS Data Nodes, each backed by HDD/SSD storage.]




 HPC Advisory Council, Lugano Switzerland '12                                       13
Overview of HBase Architecture

• An open-source database project based on the Hadoop framework for hosting very large tables

• Major components: HBaseMaster, HRegionServer and HBaseClient (a minimal client sketch follows below)

• HBase and HDFS are
  deployed in the same
  cluster to get better
  data locality
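
  A minimal sketch of the HBaseClient side, assuming the HBase 0.90 client API used later in this talk
  (table and column-family names are illustrative and must already exist):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseClientSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();      // picks up hbase-site.xml
          HTable table = new HTable(conf, "usertable");

          // Put: the client locates the owning HRegionServer and sends the Key/Value pair to it.
          byte[] row = Bytes.toBytes("user0000000000000001");    // the experiments use 20-byte keys
          Put put = new Put(row);
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), new byte[1024]);   // 1KB value
          table.put(put);

          // Get: served by the RegionServer from its memstore/HFiles (HDFS underneath).
          Result result = table.get(new Get(row));
          byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
          System.out.println("Read " + value.length + " bytes");
          table.close();
      }
  }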



  HPC Advisory Council, Lugano Switzerland '12                                                  14
Network-Level Interaction Between
HBase Clients, Region Servers and Data Nodes



[Diagram: HBase Clients connect over High Performance Networks to HRegion Servers, which connect over High Performance Networks to Data Nodes backed by HDD/SSD storage.]




 HPC Advisory Council, Lugano Switzerland '12                                                             15
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         16
Designing Communication and I/O Libraries for
Enterprise Systems: Challenges


[Layered diagram:
  Applications
  Datacenter Middleware (HDFS, HBase, MapReduce, Memcached)
  Programming Models (Socket)
  Communication and I/O Library: Point-to-Point Communication; Threading Models and Synchronization; I/O and Filesystems; QoS; Fault Tolerance
  Commodity Computing System: Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs); Architectures (single, dual, quad, ..); Multi/Many-core Architecture and Accelerators; Storage Technologies (HDD or SSD)]



 HPC Advisory Council, Lugano Switzerland '12                                                           17
Common Protocols using Open Fabrics

[Table: common protocols using OpenFabrics. The application interface is either Sockets or Verbs.
  1/10/40 GigE:   Sockets over kernel-space TCP/IP (Ethernet driver); Ethernet adapter; Ethernet switch
  IPoIB:          Sockets over kernel-space TCP/IP (IPoIB driver); InfiniBand adapter; InfiniBand switch
  10/40 GigE-TOE: Sockets over hardware-offloaded TCP/IP; Ethernet adapter; Ethernet switch
  SDP:            Sockets over RDMA; InfiniBand adapter; InfiniBand switch
  iWARP:          Verbs, user space; iWARP adapter; Ethernet switch
  RoCE:           Verbs over RDMA, user space; RoCE adapter; Ethernet switch
  IB Verbs:       Verbs over RDMA, user space; InfiniBand adapter; InfiniBand switch]


            HPC Advisory Council, Lugano Switzerland '12                                                                 18
Can New Data Analysis and Management Systems be
  Designed with High-Performance Networks and Protocols?

[Diagram: three design stacks.
  Current Design:   Application -> Sockets -> 1/10 GigE Network
  Enhanced Designs: Application -> Accelerated Sockets (Verbs / Hardware Offload) -> 10 GigE or InfiniBand
  Our Approach:     Application -> OSU Design (Verbs Interface) -> 10 GigE or InfiniBand]

• Sockets not designed for high-performance
    – Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)
    – Zero-copy not available for non-blocking sockets


   HPC Advisory Council, Lugano Switzerland '12                                                   19
Interplay between Storage and Interconnect/Protocols



• Most of the current generation enterprise systems use the
  traditional hard disks
• Since hard disks are slower, high-performance communication
  protocols may have limited impact
• SSDs and other storage technologies are emerging
• Does it change the landscape?




 HPC Advisory Council, Lugano Switzerland '12                  20
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         21
Memcached Design Using Verbs

[Diagram: a Sockets Client and an RDMA Client connect to the Memcached server. A Master Thread accepts each connection (1) and assigns it (2) to a Sockets Worker Thread or a Verbs Worker Thread; all worker threads operate on the shared data (memory, slabs, items, ...).]



•   Server and client perform a negotiation protocol
     – Master thread assigns clients to the appropriate worker thread
•   Once a client is assigned a verbs worker thread, it can communicate directly and is
    “bound” to that thread; each verbs worker thread can support multiple clients
•   All other Memcached data structures are shared among RDMA and Sockets worker
    threads
•   Memcached applications need not be modified; the verbs interface is used if available
•   The Memcached server can serve both socket and verbs clients simultaneously


     HPC Advisory Council, Lugano Switzerland '12                                         22
Experimental Setup
• Hardware
     – Intel Clovertown
            • Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
              6 GB main memory, 250 GB hard disk
            • Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

     – Intel Westmere
            • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
              12 GB main memory, 160 GB hard disk
            • Network: 1GigE, IPoIB, and IB (QDR)

• Software
     – Memcached Server: 1.4.9
     – Memcached Client: (libmemcached) 0.52
     – In all experiments, ‘memtable’ is contained in memory (no disk
       access involved)

 HPC Advisory Council, Lugano Switzerland '12                                            23
Memcached Get Latency (Small Message)

[Charts: Memcached Get latency, Time (us) vs. Message Size (1 byte to 2K), comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB.
 Left: Intel Clovertown Cluster (IB: DDR). Right: Intel Westmere Cluster (IB: QDR).]

                  • Memcached Get latency
                           – 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
                           – 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
                  • Almost factor of four improvement over 10GE (TOE) for 2K bytes on
                    the DDR cluster
                  HPC Advisory Council, Lugano Switzerland '12                                                                                                 24
Memcached Get Latency (Large Message)

[Charts: Memcached Get latency, Time (us) vs. Message Size (2K to 512K), comparing SDP, IPoIB, 1GigE, 10GigE, OSU-RC-IB and OSU-UD-IB.
 Left: Intel Clovertown Cluster (IB: DDR). Right: Intel Westmere Cluster (IB: QDR).]

                   • Memcached Get latency
                          – 8K bytes RC/UD – DDR: 18.9/19.1 us; QDR: 11.8/12.2 us
                          – 512K bytes RC/UD -- DDR: 369/403 us; QDR: 173/203 us
                   • Almost factor of two improvement over 10GE (TOE) for 512K bytes on
                     the DDR cluster
                   HPC Advisory Council, Lugano Switzerland '12                                                                                                   25
Memcached Get TPS (4byte)
[Charts: Thousands of Memcached Get Transactions per second (TPS) for 4-byte messages vs. number of clients, comparing SDP, IPoIB, 1GigE, OSU-RC-IB and OSU-UD-IB. Left: 4 and 8 clients. Right: 1 to 1K clients.]


                                                    • Memcached Get transactions per second for 4 bytes
                                                           – On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients
                                                    • Significant improvement with native IB QDR compared to SDP and IPoIB



                                                    HPC Advisory Council, Lugano Switzerland '12                                                                                                                      26
Memcached - Memory Scalability
[Chart: Memory Footprint (MB) vs. number of clients (1 to 4K), comparing SDP, IPoIB, 1GigE, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB.]

• Steady memory footprint for the UD design
     – ~ 200 MB

• RC memory footprint increases with the number of clients (see the estimate below)
     – ~ 500 MB for 4K clients
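
A rough per-connection estimate derived from these two curves (a back-of-the-envelope figure, not a measured per-connection cost):

    \frac{500\,\mathrm{MB} - 200\,\mathrm{MB}}{4096\ \text{clients}} \approx 75\ \text{KB of additional memory per RC connection}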




  HPC Advisory Council, Lugano Switzerland '12                                                                                    27
Application Level Evaluation – Olio Benchmark

[Charts: Olio benchmark time (ms) vs. number of clients, comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB. Left: 1 to 8 clients. Right: 64 to 1024 clients.]




             • Olio Benchmark
                     – RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients
              • 4X better than IPoIB for 8 clients
             • Hybrid design achieves comparable performance to that of pure RC design

                  HPC Advisory Council, Lugano Switzerland '12                                                                               28
Application Level Evaluation – Real Application Workloads
[Charts: real application workload time (ms) vs. number of clients, comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB. Left: 1 to 8 clients. Right: 64 to 1024 clients.]

             • Real Application Workload
                      – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
              • 12X better than IPoIB for 8 clients
             • Hybrid design achieves comparable performance to that of pure RC design
                  J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.
                  Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
                  J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Memcached Design on
                  High Performance RDMA Capable Interconnects, CCGrid’12
                   HPC Advisory Council, Lugano Switzerland '12                                                                  29
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         30
HBase Design Using Verbs

[Diagram: Current Design - HBase over Sockets over a 1/10 GigE Network. OSU Design - HBase over a JNI Interface to the OSU Module over InfiniBand (Verbs).]
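
  As a rough illustration of this JNI-bridge pattern, the sketch below shows how such a native module
  could be exposed to Java code; the class, library and method names are hypothetical and are not the
  actual OSU module API.

  // Hypothetical sketch of exposing a sockets-bypassing native module to HBase/HDFS via JNI;
  // class, library and method names are illustrative only.
  public final class IBVerbsBridge {
      static {
          // Loads a native library (e.g., libibverbsbridge.so) that wraps the InfiniBand
          // verbs calls; the real OSU module name may differ.
          System.loadLibrary("ibverbsbridge");
      }

      // Establish an RDMA-capable connection to a peer (hypothetical signature).
      public native long connect(String host, int port);

      // Send/receive byte buffers over verbs instead of java.net sockets.
      public native int send(long connHandle, byte[] data, int offset, int length);
      public native int recv(long connHandle, byte[] buffer, int offset, int length);

      // Tear down the native connection.
      public native void close(long connHandle);
  }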




  HPC Advisory Council, Lugano Switzerland '12                                                       31
Experimental Setup
• Hardware
     – Intel Clovertown
            • Each node has 8 processor cores on 2 Intel Xeon 2.33 GHz Quad-core CPUs,
              6 GB main memory, 250 GB hard disk
            • Network: 1GigE, IPoIB, 10GigE TOE and IB (DDR)

     – Intel Westmere
            • Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs,
              12 GB main memory, 160 GB hard disk
            • Network: 1GigE, IPoIB, and IB (QDR)

     – 3 Nodes used
            • Node1 [NameNode & HBase Master]
            • Node2 [DataNode & HBase RegionServer]
            • Node3 [Client]

• Software
     – Hadoop 0.20.0, HBase 0.90.3 and Sun Java SDK 1.7.
     – In all experiments, ‘memtable’ is contained in memory (no disk access
       involved)
 HPC Advisory Council, Lugano Switzerland '12                                            32
Details on Experiments
 • Key/Value size
      – Key size: 20 Bytes
      – Value size: 1KB/4KB
 • Get operation
      – One Key/Value pair is inserted, so that the Key/Value pair stays in memory
      – Get operation is repeated 80,000 times (see the measurement-loop sketch below)
      – Skipped the first 40,000 iterations as warm-up
 • Put operation
      – Memstore_Flush_Size is set to 256 MB
      – No memory flush operation involved
      – Put operation is repeated 40,000 times
      – Skipped the first 10,000 iterations as warm-up
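
  A minimal sketch of this measurement loop, assuming the HBase 0.90 client API (table and column
  names are illustrative):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.util.Bytes;

  public class GetLatencyBench {
      public static void main(String[] args) throws Exception {
          HTable table = new HTable(HBaseConfiguration.create(), "bench");
          Get get = new Get(Bytes.toBytes("row-0"));       // single pre-inserted Key/Value pair
          int total = 80000, warmup = 40000;
          long start = 0;
          for (int i = 0; i < total; i++) {
              if (i == warmup) start = System.nanoTime();  // timing starts after the warm-up phase
              table.get(get);                              // value is served from memory (memstore)
          }
          double avgUs = (System.nanoTime() - start) / 1000.0 / (total - warmup);
          System.out.println("Average Get latency: " + avgUs + " us");
          table.close();
      }
  }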
  HPC Advisory Council, Lugano Switzerland '12                                33
Get Operation (IB:DDR)

[Charts: HBase Get latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB, 10GE and OSU Design.]
• HBase Get Operation
   – 1K bytes – 65 us (15K TPS)
   – 4K bytes -- 88 us (11K TPS)
• Almost factor of two improvement over 10GE (TOE)

            HPC Advisory Council, Lugano Switzerland '12                                                       34
Get Operation (IB:QDR)

[Charts: HBase Get latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB and OSU Design.]



        • HBase Get Operation
           – 1K bytes – 47 us (22K TPS)
           – 4K bytes -- 64 us (16K TPS)
        • Almost factor of four improvement over IPoIB for 1KB

                  HPC Advisory Council, Lugano Switzerland '12                                                      35
Put Operation (IB:DDR)

[Charts: HBase Put latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB, 10GE and OSU Design.]


        • HBase Put Operation
           – 1K bytes – 114 us (8.7K TPS)
           – 4K bytes -- 179 us (5.6K TPS)
        • 34% improvement over 10GE (TOE) for 1KB

                 HPC Advisory Council, Lugano Switzerland '12                                                        36
Put Operation (IB:QDR)

[Charts: HBase Put latency, Time (us), and throughput, Operations/sec, for 1K and 4K message sizes, comparing 1GE, IPoIB and OSU Design.]


   • HBase Put Operation
      – 1K bytes – 78 us (13K TPS)
      – 4K bytes -- 122 us (8K TPS)
   • A factor of two improvement over IPoIB for 1KB

                 HPC Advisory Council, Lugano Switzerland '12                                                      37
HBase Put/Get – Detailed Analysis
[Charts: time breakdown (us) of HBase Put 1KB and HBase Get 1KB for 1GigE, IPoIB, 10GigE and OSU-IB, split into Communication, Communication Preparation, Server Processing, Server Serialization, Client Processing and Client Serialization.]
                  • HBase 1KB Put
                     – Communication Time – 8.9 us
                     – A factor of 6X improvement over 10GE for communication time
                  • HBase 1KB Get
                     – Communication Time – 8.9 us
                     – A factor of 6X improvement over 10GE for communication time
                    W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda,
                    Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?,
                    ISPASS’12
                  HPC Advisory Council, Lugano Switzerland '12                                                                                    38
HBase Single Server-Multi-Client Results
[Charts: HBase Get latency, Time (us), and throughput, Ops/sec, vs. number of clients (1 to 16), comparing 1GigE, IPoIB, 10GigE and OSU-IB.]

                  • HBase Get latency
                         – 4 clients: 104.5 us; 16 clients: 296.1 us
                  • HBase Get throughput
                         – 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
                  • 27% improvement in throughput for 16 clients over 10GE
                  HPC Advisory Council, Lugano Switzerland '12                                                                     39
HBase – YCSB Read-Write Workload
[Charts: YCSB Read latency and Write latency, Time (us), vs. number of clients (8 to 128), comparing 1GigE, IPoIB, 10GigE and OSU-IB.]

                    • HBase Get latency (Yahoo! Cloud Serving Benchmark)
                       – 64 clients: 2.0 ms; 128 clients: 3.5 ms
                       – 42% improvement over IPoIB for 128 clients
                    • HBase Put latency
                       – 64 clients: 1.9 ms; 128 clients: 3.5 ms
                       – 40% improvement over IPoIB for 128 clients
                      J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-
                      Performance Design of HBase with RDMA over InfiniBand, IPDPS’12
                   HPC Advisory Council, Lugano Switzerland '12                                                                          40
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         41
Studies and Experimental Setup

• Two Kinds of Designs and Studies we have Done
    – Studying the impact of HDD vs. SSD for HDFS
           • Unmodified Hadoop for experiments
    – Preliminary design of HDFS over Verbs
• Hadoop Experiments
    – Intel Clovertown 2.33GHz, 6GB RAM, InfiniBand DDR, Chelsio T320
    – Intel X-25E 64GB SSD and 250GB HDD
    – Hadoop version 0.20.2, Sun/Oracle Java 1.6.0
    – Dedicated NameServer and JobTracker
    – Number of Datanodes used: 2, 4, and 8



  HPC Advisory Council, Lugano Switzerland '12                                                                        42
Hadoop: DFS IO Write Performance
[Chart: Average Write Throughput (MB/sec) vs. File Size (GB, 1 to 10) with four Data Nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]

                                     •      DFS IO, included in Hadoop, measures sequential access throughput
                                     •      We have two map tasks, each writing to a file of increasing size (1-10 GB); a sketch of this write pattern follows below
                                     •      Significant improvement with IPoIB, SDP and 10GigE
                                     •      With SSD, the performance improvement is almost seven- or eight-fold!
                                     •      SSD benefits are not seen without using a high-performance interconnect
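
  A minimal sketch of the sequential-write pattern that DFS IO measures (this is not the TestDFSIO
  benchmark itself; the path, buffer size and file size are illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SequentialWriteSketch {
      public static void main(String[] args) throws Exception {
          long fileSizeBytes = 1L * 1024 * 1024 * 1024;           // 1 GB; grown up to 10 GB in the runs
          FileSystem fs = FileSystem.get(new Configuration());
          byte[] buf = new byte[1024 * 1024];                     // 1 MB write buffer

          long start = System.currentTimeMillis();
          FSDataOutputStream out = fs.create(new Path("/benchmarks/seq-write.dat"), true);
          for (long written = 0; written < fileSizeBytes; written += buf.length) {
              out.write(buf);                                     // sequential append to one HDFS file
          }
          out.close();
          double secs = (System.currentTimeMillis() - start) / 1000.0;
          System.out.printf("Throughput: %.1f MB/sec%n", fileSizeBytes / (1024.0 * 1024.0) / secs);
      }
  }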

                                         HPC Advisory Council, Lugano Switzerland '12                                    43
Hadoop: RandomWriter Performance
[Chart: RandomWriter Execution Time (sec) for 2 and 4 data nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]



• Each map generates 1GB of random binary data and writes to HDFS
• SSD improves execution time by 50% with 1GigE for two DataNodes
• For four DataNodes, benefits are observed only with HPC interconnect
• IPoIB, SDP and 10GigE can improve performance by 59% on four Data Nodes




   HPC Advisory Council, Lugano Switzerland '12                                            44
Hadoop Sort Benchmark
[Chart: Sort Execution Time (sec) for 2 and 4 data nodes, comparing 1GE, IPoIB, SDP and 10GE-TOE, each with HDD and with SSD.]


   •                       Sort: baseline benchmark for Hadoop
   •                       Sort phase: I/O bound; Reduce phase: communication bound
   •                       SSD improves performance by 28% using 1GigE with two DataNodes
   •                       Benefit of 50% on four DataNodes using SDP, IPoIB or 10GigE

 S. Sur, H. Wang, J. Huang, X. Ouyang and D. K. Panda “Can High-Performance Interconnects Benefit Hadoop
 Distributed File System?”, MASVDC ‘10 in conjunction with MICRO 2010, Atlanta, GA.

                                                                                                           45
  HPC Advisory Council, Lugano Switzerland '12
HDFS Design Using Verbs

Current Design:  HDFS -> Sockets -> 1/10 GigE Network
OSU Design:      HDFS -> JNI Interface -> OSU Module -> InfiniBand (Verbs)
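
The figure only names the layers; as an illustration of what the JNI boundary between the Java HDFS code and a native verbs module could look like, the sketch below uses hypothetical class, method and library names (it is not the actual OSU module API):

// Hypothetical JNI boundary between HDFS (Java) and a native verbs/RDMA
// transport module. All names here are illustrative, not the OSU design.
public class VerbsTransport {
  static {
    System.loadLibrary("osu_hdfs_verbs");   // hypothetical native library
  }

  // Establish an RDMA connection to a DataNode (native side uses IB verbs).
  public native long connect(String host, int port);

  // Send a block buffer over the RDMA channel instead of a TCP socket.
  public native int writeBlock(long connHandle, byte[] buf, int len);

  // Tear down the RDMA connection and release native resources.
  public native void close(long connHandle);
}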




                                                                     46
 HPC Advisory Council, Lugano Switzerland '12
RDMA-based Design for Native HDFS –
Preliminary Results
[Line chart: HDFS file write time vs. file size (1–5 GB) for 1GigE, IPoIB, 10GigE and the OSU verbs-based design]
• HDFS file write experiment using four DataNodes on an IB-DDR cluster
• HDFS file write time (a measurement sketch follows below)
      – 2 GB – 14 s, 5 GB – 86 s
      – For the 5 GB file size: 20% improvement over IPoIB, 14% improvement over 10GigE
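
A measurement of this kind can be approximated with the stock HDFS client API alone; the sketch below (paths, buffer size and the 1–5 GB sweep are illustrative) times one sequential file write per size:

// Sketch: time sequential HDFS file writes of 1-5 GB via the FileSystem API.
// Paths and buffer size are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTimer {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buf = new byte[4 * 1024 * 1024];              // 4 MB write buffer
    for (int gb = 1; gb <= 5; gb++) {
      long bytes = (long) gb << 30;
      long start = System.currentTimeMillis();
      FSDataOutputStream out = fs.create(new Path("/bench/write-" + gb + "g"));
      for (long w = 0; w < bytes; w += buf.length) {
        out.write(buf);                                  // data flows to the DataNodes over the interconnect
      }
      out.close();                                       // finish the write before stopping the clock
      System.out.println(gb + " GB written in " + (System.currentTimeMillis() - start) + " ms");
    }
    fs.close();
  }
}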
HPC Advisory Council, Lugano Switzerland '12                                    47
Presentation Outline

• Overview of Hadoop, Memcached and HBase

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies
      – Memcached

      – HBase

      – HDFS

• Conclusion and Q&A




HPC Advisory Council, Lugano Switzerland '12         48
Concluding Remarks

• InfiniBand with its RDMA feature is gaining momentum in HPC systems, delivering the best performance and seeing growing usage
• It is possible to use the RDMA feature in enterprise environments
  for accelerating big data processing
• Presented some initial designs and performance numbers
• Many open research challenges remain to be solved so that
  middleware for enterprise environments can take advantage of
    – modern high-performance networks
    – multi-core technologies
    – emerging storage technologies



  HPC Advisory Council, Lugano Switzerland '12                    49
Designing Communication and I/O Libraries for
Enterprise Systems: Solved a Few Initial Challenges


                                Applications

                 Datacenter Middleware (HDFS, HBase, MapReduce, Memcached)

                           Programming Models (Socket)

                          Communication and I/O Library
        Point-to-Point Communication, Threading Models and Synchronization,
        I/O and Filesystems, QoS, Fault Tolerance

                           Commodity Computing System
        Networking Technologies (InfiniBand, 1/10/40 GigE, RNICs & Intelligent NICs),
        Architectures (single, dual, quad, ...), Multi/Many-core Architecture and Accelerators,
        Storage Technologies (HDD or SSD)

 HPC Advisory Council, Lugano Switzerland '12                                                           50
Web Pointers


                            http://www.cse.ohio-state.edu/~panda
                               http://nowlab.cse.ohio-state.edu

                                       MVAPICH Web Page
                                http://mvapich.cse.ohio-state.edu




                                         panda@cse.ohio-state.edu




HPC Advisory Council, Lugano Switzerland '12                        51
