Convey White Paper
Hybrid-core: The “Big Data” Computing Architecture

November, 2012 Edition

High-performance computing (HPC) is no longer just numerically intensive; it’s now
data-intensive, with more and different demands on HPC system architectures.

Over the past few decades, the explosion of worldwide technology has led to a
corresponding explosion in the amount of data that is “in your face.” Cell phones,
personal computers, tablets, audio and video recording devices, social media
interaction: all of these add to the storm of bits swarming around us every day.
                                   An obvious question is “is there opportunity here?” It certainly seems like there should
                                   be. For example, if I’m an advertiser on Facebook, and someone has requested
                                   additional information from a placed advertisement, it makes sense to advertise similar
                                   products to all the friends, and friends of friends, of the information requester. Or, as a
                                   bioinformatician, you have 20 billion short-read sequences obtained from a DNA
                                   sequencing instrument, and need to assemble them into a single genome. If you’re
                                   successful, you can create the next drought-tolerant corn plant.
The point is, if we can capture, manage, analyze, and understand all of this data, there is
an opportunity to increase research quality, design better products, and streamline
otherwise inefficient business processes. And it’s not just the volume of data that’s the
challenge; it’s analyzing, visualizing, and understanding the relationships between
elements of data.

Contents
Introduction
Data-intensive problems
Graph algorithms
Architectural requirements
The Graph 500 benchmark
Hybrid-core computing and the Graph 500 benchmark
More Performance—Less Power
Conclusions

Data-intensive problems
A new class of problems
While there is no precise definition of “data intensive,” these kinds of problems can
certainly be identified when data is the primary challenge, whether it is the complexity,
size, or rate of the data acquisition.[1] Attributes generally include:
                                   •	Large data sets (multi-terabyte, petabyte, exabyte, zettabyte, etc.)
                                   •	High ratio of memory accesses to computing
                                   •	Extremely parallelizable read accesses/computing
                                   •	Highly dynamic data, often processed in real time
Obviously, this covers a lot of ground. For the most part, one specific algorithm type,
graph algorithms, represents many of the general characteristics of data-intensive
computing applications.

Graph algorithms
To narrow down the scope of this paper (and provide some supporting quantitative
analysis), we will examine one subset of data-intensive problems: those that use some
kind of graph data structure. Graphs provide a flexible mechanism for describing
relationships between data elements. They are among the most ubiquitous models of
both natural and human-made structures, and can be used to model many types of
relations and process dynamics in physical, biological, and social systems.[2]
                                 Certainly graph problems aren’t new. Modeling data relationships using graphs in
                                 software and hardware is “reasonably” easy. Data structures can be constructed that
                                 contain information about specific nodes in the graph, connections and relationships
                                 to other nodes, and so on.
However, using graph algorithms to model real-world relationships between data, and
managing graphs that have billions of nodes and edges and require multiple terabytes of
storage, is new (and “different”). Computer architectures that effectively execute
these algorithms are also new; as the National Science Foundation states: “Data
intensive computing demands a fundamentally different set of principles than
mainstream computing.”[3]
                                 Let’s look at some of the architectural requirements for computing systems that can
                                 efficiently tackle problems with these attributes.

                                 Architectural requirements
As preparation for developing an inventory of architectural “wants” to solve these kinds
of problems, let’s start by examining how today’s conventional commodity computer
systems[4] handle (or don’t handle) graph algorithms.

Existing HPC architectures are designed for simulation and modeling; they’re great
for floating-point operations, but not for graph traversal. The new HPC demands
different architectures.

Deficiencies of commodity systems
In general, commodity systems have multiple difficulties in managing data processing
on such a scale:

Very little parallelism. The number of concurrent operations in a typical commodity
server is limited to the number of CPU cores, on the order of eight to thirty-two. In
order to get thousands or tens of thousands of concurrent operations, huge clusters of
systems are needed. Limited parallelism means low CPU utilization, and execution time
becomes dominated by memory latency.
Memory architecture poorly mapped to the type of memory accesses. The memory
architecture built into today’s commodity processors is cache-line oriented (64 bytes).
Whenever a data structure is requested from memory (let’s say an 8-byte doubleword),
all 64 bytes are transferred from memory and brought into cache. Essentially, only eight
out of every sixty-four bytes transferred (12.5% of the system’s memory bandwidth) are
usable. This hurts performance when data accesses exhibit poor locality (as is typical in
graph problems).
                                 Lack of synchronization primitives. In a parallel programming environment, modification
                                 of data structures must be coordinated to eliminate contention. That coordination needs
                                 to have minimal impact on execution time—especially with high thread counts.
                                 Typically, software (e.g. the operating system) manages accesses to memory, and
                                 introduces significant execution time overhead.
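
The first two deficiencies above can be put into rough numbers. The sketch below is a back-of-the-envelope illustration, not a measurement: the 64-byte line and 8-byte doubleword come from the text, while the 100 ns memory latency and the idea of one outstanding memory request per thread are assumptions chosen only to show the trend. The function names are hypothetical.

```python
CACHE_LINE_BYTES = 64  # cache-line size quoted in the text
WORD_BYTES = 8         # one 8-byte doubleword actually needed per access


def useful_fraction(word_bytes: int = WORD_BYTES,
                    line_bytes: int = CACHE_LINE_BYTES) -> float:
    """Fraction of the transferred bytes that the program actually uses."""
    return word_bytes / line_bytes


def latency_bound_rate(outstanding_requests: int, latency_ns: float) -> float:
    """Little's law: sustained requests per second = concurrency / latency."""
    return outstanding_requests / (latency_ns * 1e-9)


if __name__ == "__main__":
    print(f"usable share of memory bandwidth: {useful_fraction():.1%}")

    ASSUMED_LATENCY_NS = 100.0  # hypothetical latency, not from the paper
    for concurrency in (8, 32, 32_000):
        rate = latency_bound_rate(concurrency, ASSUMED_LATENCY_NS)
        print(f"{concurrency:>6} outstanding requests -> {rate:.2e} accesses/s")
```

Under these assumptions the access rate grows linearly with the number of requests in flight, which is why a machine limited to a few dozen concurrent operations ends up waiting on memory latency.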
[4] The term “commodity computer system” refers to “pizza-box” or blade systems based on
standard x86 processors and chipsets.

Desirable architectural features
What types of architectural features are desirable in a computer system that executes
graph algorithms? The following features give the most performance for the
least cost/space/power:

Balance between compute elements and memory subsystem performance. Most
data-intensive problems require minimal compute resources (especially in terms of
floating-point operations), and require more memory subsystem performance. Ideally, as in a
reconfigurable or hybrid-core computing system, the compute elements can be changed
on-the-fly to adapt to the application’s compute needs.
                               Memory subsystem designed for random access. A memory system that returns exactly
                               what was requested will maximize the available memory bandwidth.
                               Highly parallel memory subsystem. In addition to optimizing bandwidth, the memory
                               subsystem should support a large number (thousands) of concurrent operations. Such
                               parallelism mitigates the effects of memory latency by overlapping compute operations
                               with memory references.
Massive multi-threaded capability. Meeting the compute and memory requirements together
calls for the ability to support tens or hundreds of thousands of concurrent
execution threads. More parallelism reduces time-to-answer, improves
hardware utilization, and increases efficiency.
                               Hardware-based synchronization primitives. With high degrees of parallelism comes the
                               challenge of synchronizing read/write access to memory locations. Data integrity
                               demands that a read-modify-write operation to a memory location is an indivisible
                               operation. When the synchronization mechanism is “further away” from the operation,
                               more time is spent waiting for the synchronization, with a corresponding reduction in
                               efficiency of parallelization. Ideally, synchronization is implemented in hardware in the
                               memory subsystem.
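
To make the indivisibility requirement concrete, here is a minimal Python sketch of a shared counter updated by several threads. It is only an illustration: the lock is a software stand-in for the hardware primitive the text argues for, and `parallel_increment` is a hypothetical helper, not part of any Convey interface.

```python
import threading


def parallel_increment(num_threads: int, increments_per_thread: int) -> int:
    """Have several threads increment one shared counter and return its final value."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments_per_thread):
            # The lock makes the read-modify-write of `counter` indivisible;
            # without it, concurrent updates could overwrite one another.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter


if __name__ == "__main__":
    print(parallel_increment(num_threads=8, increments_per_thread=1000))
```

Every increment here passes through a software lock, so the threads spend much of their time waiting on each other. That waiting is the overhead the paragraph above describes when the synchronization mechanism sits far from the memory operation.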

The Graph 500 benchmark
As described previously, today’s commodity servers, as well as systems designed
specifically for numerically intensive algorithms (“supercomputers”), are ill suited for
most applications in the data-intensive sciences. Similarly, attempts to characterize
system performance using popular benchmarks for these kinds of systems (e.g. the
Top500) will not yield representative results.

“Current benchmarks and performance metrics do not provide useful information on the
suitability of supercomputing systems for data intensive applications. A new set of
benchmarks is needed in order to guide the design of hardware architectures and software
systems intended to support such applications and to help procurements.”

Recognizing the need for a benchmark suite that will more accurately measure
performance on graph-type problems, a steering committee of HPC experts from
academia, industry, and national laboratories created the Graph 500 benchmark.
Currently the benchmark constructs an undirected graph and measures the performance
of a kernel that performs a breadth-first search of the graph.[5, 6]

As shown in Figure 1, the benchmark has multiple data set sizes (“Scale”).[7]
Problem class      Scale  Edge factor  Approx. storage size in TB
Toy (level 10)        26           16                      0.0172
Mini (level 11)       29           16                      0.1374
Small (level 12)      32           16                      1.0995
Medium (level 13)     36           16                     17.5922
Large (level 14)      39           16                    140.7375
Huge (level 15)       42           16                   1125.8999
Figure 1. Data set sizes (“Scale”) of the Graph 500 benchmark.
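
The storage column in Figure 1 can be reproduced with a short calculation. The sketch below assumes that a graph at a given scale has 2^scale vertices, that the edge factor is the number of edges per vertex, and that each edge is stored as two 8-byte vertex identifiers (16 bytes). The 16-byte figure is an assumption that happens to match the table; `storage_tb` is a hypothetical helper name.

```python
BYTES_PER_EDGE = 16  # assumption: two 8-byte vertex IDs per edge


def storage_tb(scale: int, edge_factor: int = 16) -> float:
    """Approximate edge-list size in terabytes (10**12 bytes)."""
    vertices = 2 ** scale
    edges = vertices * edge_factor
    return edges * BYTES_PER_EDGE / 1e12


if __name__ == "__main__":
    classes = [("Toy", 26), ("Mini", 29), ("Small", 32),
               ("Medium", 36), ("Large", 39), ("Huge", 42)]
    for name, scale in classes:
        print(f"{name:<6} scale {scale}: {storage_tb(scale):.4f} TB")
```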


In addition, incremental scale sizes are allowed by appending a “+” or “++,” e.g.
“Mini++” is scale 31. The performance metric of the benchmark is GTEPS (Billions of
Traversed Edges Per Second).
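
As a small illustration of the naming rule and the metric, the sketch below maps a class label to its scale and converts a traversal count into GTEPS. It assumes each “+” adds one to the base scale, which is inferred from the “Mini++” example above rather than stated outright. The helper names are hypothetical.

```python
BASE_SCALE = {"Toy": 26, "Mini": 29, "Small": 32,
              "Medium": 36, "Large": 39, "Huge": 42}


def class_scale(label: str) -> int:
    """Scale for a label such as 'Mini++' (assumes each '+' adds one)."""
    base = label.rstrip("+")
    return BASE_SCALE[base] + (len(label) - len(base))


def gteps(traversed_edges: int, seconds: float) -> float:
    """Billions of traversed edges per second."""
    return traversed_edges / seconds / 1e9


if __name__ == "__main__":
    print(class_scale("Mini++"))
    print(gteps(traversed_edges=2_000_000_000, seconds=1.0))
```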

The kernel of the breadth-first search portion of the benchmark (Figure 2) contains
multiple constructs that are common to many graph-type algorithms—specifically a
high degree of parallelism, and indirect (or “vector of indices”) memory references.

Figure 2. The kernel of the breadth-first search algorithm.
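
As a rough illustration of these constructs (this is not the code shown in Figure 2), a level-synchronous breadth-first search can be sketched in Python as follows. The compressed sparse row layout (`row_offsets`, `columns`) and the function name `bfs_parents` are assumptions made for the example.

```python
def bfs_parents(row_offsets, columns, root):
    """Level-synchronous BFS over a CSR graph.

    Returns a list where entry v is the BFS parent of vertex v,
    or -1 if v is not reachable from the root.
    """
    num_vertices = len(row_offsets) - 1
    parent = [-1] * num_vertices
    parent[root] = root
    frontier = [root]
    while frontier:
        next_frontier = []
        # Every vertex in the frontier can be expanded independently,
        # which is where the high degree of parallelism comes from.
        for u in frontier:
            for i in range(row_offsets[u], row_offsets[u + 1]):
                v = columns[i]        # index loaded from memory ...
                if parent[v] == -1:   # ... then used to address parent[]
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent


if __name__ == "__main__":
    # Undirected edges 0-1, 0-2, 1-3, 2-3; vertex 4 is isolated.
    row_offsets = [0, 2, 4, 6, 8, 8]
    columns = [1, 2, 0, 3, 0, 3, 1, 2]
    print(bfs_parents(row_offsets, columns, root=0))
```

The access `parent[columns[i]]` is the “vector of indices” pattern: the address of the second load depends on the value returned by the first, so consecutive accesses have little locality.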

Hybrid-core computing and the Graph 500 benchmark
As we have reviewed, solving graph problems calls for a different approach to computing.
One such approach is the Convey hybrid-core family of computer systems.[8] The Convey
systems offer a balanced architecture: reconfigurable compute elements (implemented with
Field Programmable Gate Arrays, or FPGAs) and a supercomputing-inspired memory subsystem
(Figure 3).

[Figure 3: block diagram. Left: Standard Intel® x86-64 Server (x86-64 Linux (RH 6.x);
choice of processors, form factor, I/O chassis, memory size), with blocks labeled
Standard PCI I/O, Cluster Interconnect, PCIe, and Host Memory. Right: Convey coprocessor
(Massively Multithreaded Architecture; Highly Parallel, High-bandwidth Memory;
Hybrid-Core Globally Shared Memory (HCGSM)), with blocks labeled HCMI, Application
Engine Hub (AEH: Instruction Decode, Scalar Engine, Virtual Memory Interface),
Application Engines (AEs) AE0 to AE3, Memory Controllers (MCs) MC0 to MC7, each with a
TLB and “aops”/“sched” blocks, and DDR3 SG-DIMM memory.]


Figure 3. Overview of the Convey hybrid-core computing architecture.
The benefit of hybrid-core computing is that the compute-intensive kernel of the Graph
500 breadth-first search is implemented in hardware on the FPGAs in the coprocessor.
The FPGA implementation allows much more parallelism than a commodity system (the
current Graph 500 implementation employs over 32,000 threads of execution). The
massive parallelism, combined with the hardware implementation of the logic portions of
the algorithm, allows for increased overall performance with much less hardware.
In addition to increased parallelism, the memory subsystem of the Convey systems is
specifically designed to provide high bandwidth for references that exhibit poor locality
(i.e. it offers high performance for random accesses). Thus, the vector of indices portion
of the code is highly accelerated over architectures that are not well-suited for random
accesses.

                               More Performance—Less Power
The hybrid-core architecture of the Convey family lends itself exceedingly
well to the Graph 500 benchmark (Figure 4). While the problem size is considered
“small” (which is understandable, given that the benchmark is run on a single-node
system), the performance-per-watt and performance-per-dollar are well beyond those of any
other system on the list.
Figure 4 is a partial list of the performance results for the current Graph 500
benchmark.[9] Approximating power requirements allows for an arbitrary metric
illustrating power efficiency (GTEPS/kW).

“Pound-for-pound, watt-for-watt, the Convey hybrid-core systems provide more graph
processing power than any other system on the Graph 500 list, hands-down.”

   Rank  Machine                                                  Installation                                 Nodes    Cores  Scale     GTEPS  GTEPS/kW
★    48  thunder4                                                 Convey Computer Corporation                      1       12     27     11.61     16.59
★    49  Celero                                                   Argonne National Laboratory                      1        8     27     11.45     16.35
★    50  Convey HC-2ex                                            UC Riverside                                     1        8     27     11.45     16.35
★    45  fox6                                                     Convey Computer Corporation                      1       16     29     14.56     13.87
★    62  Vortex                                                   Convey Computer Corporation                      1        4     27      6.64      9.48
     22  -                                                        -                                                -        -     34    382.00         -
      -  -                                                        -                                                -        -     14      0.03         -
      2  DOE/SC/Argonne National Laboratory Mira                  Argonne National Laboratory                  32768   524288     39  10461.00      5.22
     30  BlueGene/Q                                               Brookhaven National Laboratory                1024    16384     34    294.29      4.70
      1  DOE/NNSA/Lawrence Livermore National Laboratory Sequoia  Lawrence Livermore National Laboratory       65536  1048576     40  15363.00      2.91
     77  Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz                   Chuo University                                  1       40     30      2.16         -
      -  JUQUEEN                                                  Forschungszentrum Juelich (FZJ)              16384   262144     38   5848.00         -
*    52  Grace                                                    Mayo Clinic, Rochester                          64       64     31     10.94      0.50
     21  TSUBAME 2.0                                              GSIC Center, Tokyo Institute of Technology    1366    16392     35    462.25      0.48
    111  ultraviolet                                              Sandia National Laboratories                     -       32     29      0.42      0.42
     46  -                                                        -                                                -        -      -         -         -
     93  PowerEdge R815 Opteron 6174                              STE Lab, Nagoya University                       4      192     22      1.16      0.17
*   101  XMT2                                                     CSCS                                            64        -     31      0.86      0.04
**   56  Lonestar                                                 TACC                                          1024    12288     35      9.23      0.03
     32  DOE/SC Hopper                                            NERSC/Lawrence Berkeley National Laboratory   4817   115600     35    254.07      0.03
     90  XMT                                                      Sandia National Laboratories                   128        -     29      1.26      0.02
*    91  cougarxmt                                                Pacific Northwest National Laboratory          128        -     29      1.18      0.02
*    94  graphstorm                                               Sandia National Laboratories                   128        -     29      1.07      0.02
    108  Hyperion + FusionIO                                      Lawrence Livermore National Laboratory          64      512     36      0.60      0.01
    124  Gordon                                                   San Diego Supercomputing Center                  7       84     29      0.02      0.00

                               Figure 4. Performance and power on the Graph 500 benchmark.
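
The power estimates behind the GTEPS/kW column are not listed, but they can be back-computed from the two reported columns. The sketch below does this for three rows of Figure 4. It is a reading aid under that assumption, not a statement of measured power draw, and `implied_power_kw` is a hypothetical helper name.

```python
def implied_power_kw(gteps: float, gteps_per_kw: float) -> float:
    """Power in kW implied by a row's GTEPS and GTEPS/kW columns."""
    return gteps / gteps_per_kw


# (machine, GTEPS, GTEPS/kW) copied from three rows of Figure 4
ROWS = [
    ("thunder4", 11.61, 16.59),
    ("fox6", 14.56, 13.87),
    ("DOE/SC/Argonne National Laboratory Mira", 10461.00, 5.22),
]

if __name__ == "__main__":
    for machine, gteps, efficiency in ROWS:
        print(f"{machine}: about {implied_power_kw(gteps, efficiency):.2f} kW")
```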

Conclusions
The massive explosion of data available for analysis and understanding is creating a
“whole new HPC,” with demands on existing HPC architectures that cannot be fulfilled
by current commodity systems. Future generations of HPC systems will be required to
acknowledge some of the architectural requirements of data-intensive algorithms. For
example, memory subsystems will need to increase effective bandwidth, more
parallelism will be needed, and synchronization primitives will need to be “closer” to the
memory subsystem.
By implementing a balanced, hybrid approach, Convey’s hybrid-core family of systems
is able to execute problems in the data-intensive sciences much more effectively. The
hybrid-core architecture is poised for exascale levels of computing in the data-intensive
sciences because it offers reconfigurable compute elements balanced with a
supercomputer-inspired memory subsystem.

[9] With a few exceptions for comparison, results of scale less than 27 are omitted.

Convey Computer Corporation
1302 E. Collins Boulevard
Richardson, Texas 75081
Phone: 214.666.6024 Fax: 214.576.9848
Toll Free: 866.338.1768

CONV-11-023.DEC2012 © 2011-2012 Convey Computer Corporation. Convey Computer, the Convey logo, HC-2 and Convey HC-2ex are trademarks of Convey Computer Corporation in the U.S. and other countries. Printed in the U.S.A.

More Related Content

What's hot

2013 International Conference on Knowledge, Innovation and Enterprise Presen...
2013  International Conference on Knowledge, Innovation and Enterprise Presen...2013  International Conference on Knowledge, Innovation and Enterprise Presen...
2013 International Conference on Knowledge, Innovation and Enterprise Presen...
Cloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for MapreduceCloak-Reduce Load Balancing Strategy for Mapreduce
Cloak-Reduce Load Balancing Strategy for Mapreduce
AIRCC Publishing Corporation
IOSR Journals
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
Comparative Analysis, Security Aspects & Optimization of Workload in Gfs Base...
IOSR Journals
Bt9002 grid computing 1
Bt9002 grid computing 1Bt9002 grid computing 1
Bt9002 grid computing 1
Survey on cloud backup services of personal storage
Survey on cloud backup services of personal storageSurvey on cloud backup services of personal storage
Survey on cloud backup services of personal storage
eSAT Journals
Wp greenplum
Wp greenplumWp greenplum
Wp greenplumAccenture
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
Big Data Analytics(Intro,Hadoop Map Reduce,Mahout,K-means clustering,H-base)
MIT College Of Engineering,Pune
Efficient Image Compression Technique using Clustering and Random Permutation
Efficient Image Compression Technique using Clustering and Random PermutationEfficient Image Compression Technique using Clustering and Random Permutation
Efficient Image Compression Technique using Clustering and Random Permutation
IJERA Editor
Cisco and Greenplum Partner to Deliver High-Performance Hadoop Reference ...
Cisco and Greenplum  Partner to Deliver  High-Performance  Hadoop Reference  ...Cisco and Greenplum  Partner to Deliver  High-Performance  Hadoop Reference  ...
Cisco and Greenplum Partner to Deliver High-Performance Hadoop Reference ...
Building a data warehouse of call data records
Building a data warehouse of call data recordsBuilding a data warehouse of call data records
Building a data warehouse of call data recordsDavid Walker

The big data_computing_architecture-graph500

  • 1. Convey White Paper. Hybrid-core: The “Big Data” Computing Architecture. November 2012 Edition
  • 2. Convey White Paper

    Hybrid-core: The “Big Data” Computing Architecture

    High-performance computing (HPC) is no longer just numerically intensive; it is now data-intensive, with more and different demands on HPC system architectures.

    Contents
    • Introduction
    • Data-intensive problems
    • Graph algorithms
    • Architectural requirements
    • The Graph 500 benchmark
    • Hybrid-core computing and the Graph 500 benchmark
    • More Performance, Less Power
    • Conclusions

    Introduction

    Over the past few decades, the explosion of worldwide technology has led to a corresponding explosion in the amount of data that is “in your face.” Cell phones, personal computers, tablets, audio and video recording devices, social media interaction: all of these add to the storm of bits swarming around us every day.

    An obvious question is “is there opportunity here?” It certainly seems like there should be. For example, if I’m an advertiser on Facebook, and someone has requested additional information from a placed advertisement, it makes sense to advertise similar products to all the friends, and friends of friends, of the information requester. Or, as a bioinformatician, you have 20 billion short-read sequences obtained from a DNA sequencing instrument, and need to assemble them into a single genome. If you’re successful, you can create the next drought-tolerant corn plant.

    The point is, if we can capture, manage, analyze, and understand all of this data, there is an opportunity to increase research quality, design better products, and streamline otherwise inefficient business processes. And it’s not just the volume of data that’s the challenge; it’s analyzing, visualizing, and understanding the relationships between elements of data.

    Data-intensive problems

    A new class of problems

    While there is no precise definition of “data intensive,” these kinds of problems can certainly be identified when data is the primary challenge, whether it is the complexity, size, or rate of the data acquisition.[1] Attributes generally include:
    • Large data sets (multi-terabyte, petabyte, exabyte, zettabyte, etc.)
    • High ratio of memory accesses to computing
    • Extremely parallelizable read accesses/computing
    • Highly dynamic data that may often be processed in real time

    Obviously, this covers a lot of ground. For the most part, one specific algorithm type, graph algorithms, represents many of the general characteristics of data-intensive computing applications.
  • 3. Graph algorithms

    To narrow the scope of this paper (and provide some supporting quantitative analysis), we will examine one subset of data-intensive problems: those that use some kind of graph data structure. Graphs provide a flexible mechanism for describing relationships between data elements. They are among the most ubiquitous models of both natural and human-made structures, and can be used to model many types of relations and process dynamics in physical, biological, and social systems.[2]

    Certainly graph problems aren’t new. Modeling data relationships using graphs in software and hardware is “reasonably” easy. Data structures can be constructed that contain information about specific nodes in the graph, connections and relationships to other nodes, and so on.

    However, using graph algorithms to model real-world relationships between data, and managing graphs whose billions of nodes and edges require multiple terabytes of storage, is new (and “different”). Computer architectures that effectively execute these algorithms are also new; as the National Science Foundation states: “Data intensive computing demands a fundamentally different set of principles than mainstream computing.”[3]

    Let’s look at some of the architectural requirements for computing systems that can efficiently tackle problems with these attributes.

    Architectural requirements

    Existing HPC architectures are designed for simulation and modeling; they’re great for floating-point operations, but not for graph traversal. The new HPC demands different architectures.

    As preparation for developing an inventory of architectural “wants” to solve these kinds of problems, let’s start by examining how today’s conventional commodity computer systems[4] handle (or don’t handle) graph algorithms.

    Deficiencies of commodity systems

    In general, commodity systems have multiple difficulties in managing data processing on such a scale:

    • Very little parallelism. The number of concurrent operations in a typical commodity server is limited to the number of CPU cores, on the order of eight to thirty-two. In order to get thousands or tens of thousands of concurrent operations, huge clusters of systems are needed. Limited parallelism means low CPU utilization, and execution time becomes dominated by memory latency.
    • Memory architecture poorly mapped to the type of memory accesses. The memory architecture built into today’s commodity processors is cache-line oriented (64 bytes). Whenever a data structure is requested from memory (let’s say an 8-byte doubleword), all 64 bytes are transferred from memory and brought into cache. Essentially, only 8 of every 64 bytes transferred are used, so just 12.5% of the system’s memory bandwidth does useful work. This hurts performance when data accesses exhibit poor locality (as is typical in graph problems).
    • Lack of synchronization primitives. In a parallel programming environment, modification of data structures must be coordinated to eliminate contention. That coordination needs to have minimal impact on execution time, especially with high thread counts. Typically, software (e.g. the operating system) manages accesses to memory, and introduces significant execution time overhead.

    Desirable architectural features

    What types of architectural features are desirable in a computer system that executes graph algorithms? Following are the features that give the most performance for the least cost/space/power:

    [4] The term “commodity computer system” refers to “pizza-box” or blade systems based on standard x86 processors and chipsets.
  • 4. • Balance between compute elements and memory subsystem performance. Most data-intensive problems require minimal compute resources (especially in terms of floating-point operations), and require more memory subsystem performance. Ideally, as in a reconfigurable or hybrid-core computing system, the compute elements can be changed on the fly to adapt to the application’s compute needs.
    • Memory subsystem designed for random access. A memory system that returns exactly what was requested will maximize the available memory bandwidth.
    • Highly parallel memory subsystem. In addition to optimizing bandwidth, the memory subsystem should support a large number (thousands) of concurrent operations. Such parallelism mitigates the effects of memory latency by overlapping compute operations with memory references.
    • Massive multi-threaded capability. Combining the compute and memory requirements, the system must be able to support tens or hundreds of thousands of concurrent execution threads. More parallelism reduces time-to-answer, improves hardware utilization, and increases efficiency.
    • Hardware-based synchronization primitives. With high degrees of parallelism comes the challenge of synchronizing read/write access to memory locations. Data integrity demands that a read-modify-write operation to a memory location be an indivisible operation. When the synchronization mechanism is “further away” from the operation, more time is spent waiting for the synchronization, with a corresponding reduction in the efficiency of parallelization. Ideally, synchronization is implemented in hardware in the memory subsystem.

    The Graph 500 benchmark

    “Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements.” (www.Graph)

    As described previously, today’s commodity servers, as well as systems designed specifically for numerically intensive algorithms (“supercomputers”), are ill suited for most applications in the data-intensive sciences. Similarly, attempts to characterize system performance using popular benchmarks for these kinds of systems (e.g. the Top500) will not yield representative results.

    Recognizing the need for a benchmark suite that will more accurately measure performance on graph-type problems, a steering committee of HPC experts from academia, industry, and national laboratories created the Graph 500 benchmark. Currently the benchmark constructs an undirected graph and measures the performance of a kernel that performs a breadth-first search of the graph.[5, 6]

    As shown in Figure 1, the benchmark has multiple data set sizes (“Scale”).[7]

    Problem class | Scale | Edge factor | Approx. storage size in TB
    Toy (level 10) | 26 | 16 | 0.0172
    Mini (level 11) | 29 | 16 | 0.1374
    Small (level 12) | 32 | 16 | 1.0995
    Medium (level 13) | 36 | 16 | 17.5922
    Large (level 14) | 39 | 16 | 140.7375
    Huge (level 15) | 42 | 16 | 1125.8999

    Figure 1. Data set sizes (“Scale”) of the Graph 500 benchmark.

    [5] http://www.Graph
    [7] http://www.Graph
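    The storage column of Figure 1 can be reproduced with a few lines of arithmetic. The sketch below assumes that a graph of a given scale has 2^scale vertices and (edge factor × vertices) edges, and that each edge occupies 16 bytes (for example, two 64-bit vertex identifiers). The 16-byte figure and the decimal terabyte are assumptions chosen because they match the table; the paper does not state them. The names `graph500_size`, `BYTES_PER_EDGE`, and `PROBLEM_CLASSES` are illustrative only.

```python
# Sketch: reproduce the "Approx. storage size in TB" column of Figure 1.
# ASSUMPTION (not stated in the paper): 16 bytes per edge, decimal terabytes.
BYTES_PER_EDGE = 16      # assumed, e.g. two 64-bit vertex identifiers
TERABYTE = 10 ** 12      # decimal TB, assumed because it matches the table


def graph500_size(scale: int, edge_factor: int = 16):
    """Return (vertices, edges, storage in TB) for a given problem scale."""
    vertices = 2 ** scale
    edges = vertices * edge_factor
    storage_tb = edges * BYTES_PER_EDGE / TERABYTE
    return vertices, edges, storage_tb


# Problem classes and scales as listed in Figure 1.
PROBLEM_CLASSES = {
    "Toy": 26,
    "Mini": 29,
    "Small": 32,
    "Medium": 36,
    "Large": 39,
    "Huge": 42,
}

for name, scale in PROBLEM_CLASSES.items():
    vertices, edges, storage_tb = graph500_size(scale)
    print(f"{name:<6} scale={scale} vertices={vertices} "
          f"edges={edges} storage={storage_tb:.4f} TB")
```

    Under these assumptions the "Small" class, at scale 32, already holds about 4.3 billion vertices and 68.7 billion edges, which is why even a single-node result at scale 27 to 29 exercises the memory system heavily.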
  • 5. In addition, incremental scale sizes are allowed by appending a “+” or “++,” e.g. “Mini++” is scale 31. The performance metric of the benchmark is GTEPS (billions of traversed edges per second).

    The kernel of the breadth-first search portion of the benchmark (Figure 2) contains multiple constructs that are common to many graph-type algorithms: specifically, a high degree of parallelism and indirect (or “vector of indices”) memory references.

    Figure 2. The kernel of the breadth-first search algorithm.

    Hybrid-core computing and the Graph 500 benchmark

    As we have reviewed, solving graph problems takes a different approach to computing. One such approach is the Convey hybrid-core family of computer systems.[8] The Convey systems offer a balanced architecture: reconfigurable compute elements (implemented with Field Programmable Gate Arrays, or FPGAs) and a supercomputing-inspired memory subsystem (Figure 3).

    Figure 3. Overview of the Convey hybrid-core computing architecture. Diagram labels: Instruction Decode; Scalar Engine; Application Engine Hub (AEH); Application Engines (AEs) AE0 to AE3; Standard PCI I/O; HCMI; PCIe Interface; Virtual Memory; Memory Interconnect; Memory Controllers (MCs) MC0 to MC7, each with a TLB; Host Memory; DDR3 SG-DIMMs; Hybrid-core Globally Shared Memory (HCGSM).

    Standard Intel® x86-64 Server
    • x86-64 Linux (RH 6.x)
    • Choice of processors, form factor, I/O chassis, memory size

    Convey coprocessor
    • Massively Multithreaded Architecture
    • Highly Parallel, High-bandwidth Memory
    • Hybrid-Core Globally Shared Memory (HCGSM)

    The benefit of hybrid-core computing is that the compute-intensive kernel of the Graph 500 breadth-first search is implemented in hardware on the FPGAs in the coprocessor. The FPGA implementation allows much more parallelism than a commodity system (the current Graph 500 implementation employs over 32,000 threads of execution). The massive parallelism, combined with the hardware implementation of the logic portions of the algorithm, allows for increased overall performance with much less hardware.

    In addition to increased parallelism, the memory subsystem of the Convey systems is specifically designed to provide high bandwidth for references that exhibit poor locality (i.e., it offers high performance for random accesses). Thus, the vector-of-indices portion of the code is highly accelerated over architectures that are not well suited for random accesses.

  • 6. More Performance, Less Power

    “Pound-for-pound, watt-for-watt, the Convey hybrid-core systems provide more graph processing power than any other system on the Graph 500 list, hands-down.”

    The hybrid-core architecture of the Convey hybrid-core family lends itself exceedingly well to the Graph 500 benchmark (Figure 4). While the problem size is considered “small” (which is understandable, given that the benchmark is run on a single-node system), the performance-per-watt and performance-per-dollar are well beyond any other system on the list.

    Figure 4 is a partial list of the performance results for the current Graph 500 benchmark.[9] Approximating power requirements allows for an arbitrary metric illustrating power efficiency (GTEPS/kW). Where the source gives only one value for Nodes/Cores, it is shown in the Nodes column and Cores is left as “-”; the original layout does not make clear which column the value belongs to.

    Mark | Rank | Machine | Installation | Nodes | Cores | Scale | GTEPS | GTEPS/kW
    ★ | 48 | thunder4 | Convey Computer Corporation | 1 | 12 | 27 | 11.61 | 16.59
    ★ | 49 | Celero | Argonne National Laboratory | 1 | 8 | 27 | 11.45 | 16.35
    ★ | 50 | Convey HC-2ex | UC Riverside | 1 | 8 | 27 | 11.45 | 16.35
    ★ | 45 | fox6 | Convey Computer Corporation | 1 | 16 | 29 | 14.56 | 13.87
    ★ | 62 | Vortex | Convey Computer Corporation | 1 | 4 | 27 | 6.64 | 9.48
      | 22 | Vesta | DOE/SC/Argonne National Laboratory | 1024 | 16384 | 34 | 382.00 | 6.10
      | 123 | Scott Beamer's iPad | UC Berkeley | 1 | 2 | 14 | 0.03 | 6.08
      | 2 | DOE/SC/Argonne National Laboratory Mira | Argonne National Laboratory | 32768 | 524288 | 39 | 10461.00 | 5.22
      | 30 | BlueGene/Q | Brookhaven National Laboratory | 1024 | 16384 | 34 | 294.29 | 4.70
      | 1 | DOE/NNSA/Lawrence Livermore National Laboratory Sequoia | Lawrence Livermore National Laboratory | 65536 | 1048576 | 40 | 15363.00 | 2.91
      | 77 | Intel(R) Xeon(R) CPU E7-4870 @ 2.40GHz | Chuo University | 1 | 40 | 30 | 2.16 | 2.21
      | 3 | JUQUEEN | Forschungszentrum Juelich (FZJ) | 16384 | 262144 | 38 | 5848.00 | 1.95
    * | 52 | Grace | Mayo Clinic, Rochester | 64 | 64 | 31 | 10.94 | 0.50
      | 21 | TSUBAME 2.0 | GSIC Center, Tokyo Institute of Technology | 1,366 | 16,392 | 35 | 462.25 | 0.48
      | 111 | ultraviolet | Sandia National Laboratories | 32 | - | 29 | 0.42 | 0.42
      | 46 | Altix ICE 8400EX | SGI | 256 | 1024 | 31 | 13.96 | 0.36
      | 72 | GPU-based cluster | Seoul National University, Korea | 8 | 192 | 26 | 4.06 | 0.25
      | 93 | PowerEdge R815 Opteron 6174 | STE Lab, Nagoya University | 4 | 192 | 22 | 1.16 | 0.17
    * | 101 | XMT2 | CSCS | 64 | - | 31 | 0.86 | 0.04
    ** | 56 | Lonestar | TACC | 1024 | 12288 | 35 | 9.23 | 0.03
      | 32 | DOE/SC Hopper | NERSC/Lawrence Berkeley National Laboratory | 115600 | 4817 | 35 | 254.07 | 0.03
      | 90 | XMT | Sandia National Laboratories | 128 | - | 29 | 1.26 | 0.02
    * | 91 | cougarxmt | Pacific Northwest National Laboratory | 128 | - | 29 | 1.18 | 0.02
    * | 94 | graphstorm | Sandia National Laboratories | 128 | - | 29 | 1.07 | 0.02
      | 108 | Hyperion + FusionIO | Lawrence Livermore National Laboratory | 64 | 512 | 36 | 0.60 | 0.01
      | 124 | Gordon | San Diego Supercomputing Center | 7 | 84 | 29 | 0.02 | 0.00

    Figure 4. Performance and power on the Graph 500 benchmark.

    Conclusions

    The massive explosion of data available for analysis and understanding is creating a “whole new HPC,” with demands on existing HPC architectures that cannot be fulfilled by current commodity systems. Future generations of HPC systems will be required to acknowledge some of the architectural requirements of data-intensive algorithms. For example, memory subsystems will need to increase effective bandwidth, more parallelism will be needed, and synchronization primitives will need to be “closer” to the memory subsystem.

    By implementing a balanced, hybrid approach, Convey’s hybrid-core family of systems is able to execute problems in the data-intensive sciences much more effectively. The hybrid-core architecture is poised for exascale levels of computing in the data-intensive sciences because it offers reconfigurable compute elements balanced with a supercomputer-inspired memory subsystem.

    [9] With a few exceptions for comparison, results of scale less than 27 are omitted.
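    To make the metrics behind Figure 4 more concrete, the following is a minimal sketch of a level-synchronous breadth-first search over a compressed adjacency structure, together with the GTEPS and GTEPS/kW arithmetic. It is not Convey's FPGA kernel and not the official Graph 500 reference code. The function names, the toy graph, and the edge-counting rule (adjacency entries scanned, divided by two) are assumptions made for illustration. The 0.7 kW figure is only inferred by dividing the first row's GTEPS by its GTEPS/kW; the paper does not state the power values it used.

```python
import time


def build_csr(num_vertices, edges):
    """Build (offsets, targets) adjacency arrays for an undirected graph."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    offsets = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        offsets[i + 1] = offsets[i] + degree[i]
    targets = [0] * offsets[-1]
    cursor = offsets[:-1]  # slice copy: next free slot per vertex
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
        targets[cursor[v]] = u
        cursor[v] += 1
    return offsets, targets


def bfs(offsets, targets, root):
    """Level-synchronous BFS; returns (parent list, adjacency entries scanned)."""
    parent = [-1] * (len(offsets) - 1)
    parent[root] = root
    frontier = [root]
    scanned = 0
    while frontier:
        next_frontier = []
        for u in frontier:  # every frontier vertex could be its own thread
            for k in range(offsets[u], offsets[u + 1]):
                v = targets[k]        # indirect ("vector of indices") load
                scanned += 1
                if parent[v] == -1:   # data-dependent access, poor locality
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent, scanned


def gteps(traversed_edges, seconds):
    """Billions of traversed edges per second."""
    return traversed_edges / seconds / 1e9


def gteps_per_kw(gteps_value, power_kw):
    """Power-efficiency metric used in Figure 4."""
    return gteps_value / power_kw


# Hypothetical toy graph: one 4-vertex component and one isolated edge.
toy_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (4, 5)]
offsets, targets = build_csr(6, toy_edges)

start = time.perf_counter()
parent, scanned = bfs(offsets, targets, root=0)
elapsed = max(time.perf_counter() - start, 1e-9)

# Simplification: each undirected edge in the component is scanned twice.
traversed = scanned // 2
print("parent:", parent, "edges traversed:", traversed)
print(f"toy rate: {gteps(traversed, elapsed):.9f} GTEPS")
print(f"11.61 GTEPS at an assumed 0.7 kW: {gteps_per_kw(11.61, 0.7):.2f} GTEPS/kW")
```

    The two reads marked in the inner loop are the pattern the paper calls indirect memory references: the address of `parent[v]` is known only after `targets[k]` has been loaded, so a cache-line-oriented memory system fetches 64 bytes to use a few of them.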
  • 7. Convey Computer Corporation
    1302 E. Collins Boulevard
    Richardson, Texas 75081
    Phone: 214.666.6024
    Fax: 214.576.9848
    Toll Free: 866.338.1768

    CONV-11-023.DEC2012
    © 2011-2012 Convey Computer Corporation. Convey Computer, the Convey logo, HC-2 and Convey HC-2ex are trademarks of Convey Computer Corporation in the U.S. and other countries. Printed in the U.S.A.