This document discusses the evolution of computing from compute-centric to data-centric. It introduces concepts such as the memory-wall problem and the concurrent average memory access time (C-AMAT) model, which accounts for data-access concurrency. It also discusses solutions for improving parallel I/O performance, such as prefetching, data-layout optimization, and the merging of HPC and big-data frameworks. The challenges in data-intensive computing, and how the author's lab can help with applications and methods, are also summarized.
Iterative computations are at the core of the vast majority of data-intensive scientific computations. Recent advances in data-intensive computational fields are fueling dramatic growth in both the number and the usage of such iterative computations. The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure services, offers a viable environment for scientists to perform data-intensive computations. However, clouds by nature pose unique reliability and sustained-performance challenges to large-scale distributed computations, necessitating computation frameworks tailored to cloud characteristics so that the power of clouds can be harnessed easily and effectively. My research focuses on identifying and developing user-friendly distributed parallel computation frameworks that facilitate the optimized, efficient execution of iterative as well as non-iterative data-intensive computations in cloud environments, alongside the evaluation of heterogeneous cloud resources that offer GPGPU resources in addition to CPUs for data-intensive iterative computations.
What is high-performance computing? Examples of OpenMP, MPI, and Pthreads, and how to obtain HPC benefits using OpenSolaris.
This version was presented at OpenSolaris Day in Porto Alegre, Brazil, on April 16, 2008.
Assisting Users’ Transition to Titan’s Accelerated Architecture (inside-BigData.com)
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. This alone can be intimidating for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools to successfully port applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC, at MLconf NYC (MLconf)
Graph Traversal at 30 billion edges per second with NVIDIA GPUs: I will discuss current research on the MapGraph platform. MapGraph is a new and disruptive technology for ultra-fast processing of large graphs on commodity many-core hardware. On a single GPU you can analyze the bitcoin transaction graph in .35 seconds. With MapGraph on 64 NVIDIA K20 GPUs, you can traverse a scale-free graph of 4.3 billion directed edges in .13 seconds for a throughput of 32 Billion Traversed Edges Per Second (32 GTEPS). I will explain why GPUs are an interesting option for data intensive applications, how we map graphs onto many-core processors, and what the future looks like for the MapGraph platform.
MapGraph provides a familiar vertex-centric abstraction, but its GPU acceleration is 100s of times faster than main-memory CPU-only technologies and up to 100,000 times faster than graph technologies based on MapReduce or key-value stores such as HBase, Titan, and Accumulo. Learn more at http://MapGraph.io.
In this deck, Peter Braam looks at how the TensorFlow framework could be used to accelerate high-performance computing.
"Google has developed TensorFlow, a truly complete platform for ML. The performance of the platform is amazing, and it raises the question of whether it will be useful for HPC in the same manner that GPUs heralded a revolution.
As described in his talk at the CHPC 2018 Conference in South Africa, TensorFlow contains many ingredients, for example:
* many domain-specific libraries for machine learning
* the TensorFlow domain-specific data-flow language
* carefully organized input and output for data flow
* an optimizing runtime and compiler
* hardware implementations of TensorFlow operations in TensorFlow Processing Unit (TPU) chips
Learn more: https://wp.me/p3RLHQ-jMv
and
https://www.tensorflow.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Big Graph Analytics Systems (SIGMOD'16 Tutorial), Yuanyuan Tian
In recent years we have witnessed a surging interest in developing Big Graph processing systems. To date, tens of Big Graph systems have been proposed. This tutorial provides a timely and comprehensive review of existing Big Graph systems, and summarizes their pros and cons from various perspectives. We start from the existing vertex-centric systems, in which a programmer thinks intuitively like a vertex when developing parallel graph algorithms. We then introduce systems that adopt other computation paradigms and execution settings. The topics covered in this tutorial include programming models and algorithm design, computation models, communication mechanisms, out-of-core support, fault tolerance, dynamic graph support, and so on. We also highlight future research opportunities on Big Graph analytics.
NERSC is the production high-performance computing (HPC) center for the United States Department of Energy (DOE) Office of Science. The center supports over 6,000 users in 600 projects, using a variety of applications in materials science, chemistry, biology, astrophysics, high energy physics, climate science, fusion science, and more.
NERSC deployed the Cori system with over 9,000 Intel® Xeon Phi™ processors. This session describes the optimization strategy for porting codes that target traditional architectures to these manycore processors. We also discuss highlights and lessons learned from the optimization process on 20 applications associated with the NERSC Exascale Science Application Program (NESAP).
Large Data with Scikit-learn (Boston Data Mining Meetup), Alexis Perrier
A presentation of adaptive classification and regression algorithms available in scikit-learn, with a focus on Stochastic Gradient Descent and KNN. Performance examples on two large datasets are presented for SGD, Multinomial Naive Bayes, Perceptron, and Passive-Aggressive algorithms.
In this deck, Torsten Hoefler from ETH Zurich presents: Data-Centric Parallel Programming.
"The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating code definition from its optimization. We show how to tune several applications in this model and IR. Furthermore, we show a global, data-centric view of a state-of-the-art quantum transport simulator to optimize its execution on supercomputers. The approach yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication avoidance, and data-layout transformations. The transformations are tuned for the Piz Daint and Summit supercomputers, where each platform requires different caching and fusion strategies to perform optimally. We show that SDFGs deliver competitive performance, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code."
Watch the video: https://wp.me/p3RLHQ-kup
Learn more: http://htor.inf.ethz.ch
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS (csandit)
Graphics Processing Units (GPUs) have emerged as powerful parallel compute platforms for various application domains. A GPU consists of hundreds or even thousands of processor cores and adopts the Single Instruction Multiple Threading (SIMT) architecture. Previously, we proposed an approach that optimizes the Tabu Search algorithm for solving the Permutation Flowshop Scheduling Problem (PFSP) on a GPU by using a math function to generate all distinct permutations, avoiding the need to place all the permutations in global memory. Building on that result, this paper proposes another approach that further improves performance by avoiding duplicated computation among threads, which arises when two permutations share the same prefix. Experimental results show that the GPU implementation of our proposed Tabu Search for PFSP runs up to 1.5 times faster than the GPU implementation proposed by Czapinski and Barnes.
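The abstract does not spell out the math function used to generate permutations. A standard technique with the same effect is factorial-base unranking, which lets each GPU thread compute its own permutation directly from its thread index instead of reading it from a table in global memory. A minimal Python sketch of that general technique (the function name and interface are illustrative, not from the paper):

```python
from math import factorial

def unrank_permutation(items, k):
    """Return the k-th permutation (0-indexed, lexicographic) of `items`
    using the factorial number system, without enumerating all n! orders."""
    items = list(items)
    n = len(items)
    assert 0 <= k < factorial(n)
    result = []
    for i in range(n - 1, -1, -1):
        f = factorial(i)          # size of the block of permutations
        idx, k = divmod(k, f)     # which remaining item leads this block
        result.append(items.pop(idx))
    return result

# Every permutation is reachable from its rank alone, so a GPU thread t
# can derive permutation t on the fly rather than load it from memory.
print(unrank_permutation([0, 1, 2], 0))  # [0, 1, 2]
print(unrank_permutation([0, 1, 2], 5))  # [2, 1, 0]
```

On a GPU, the same arithmetic would be done per thread in the kernel; the factorials can be precomputed into constant memory.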
A Tale of Data Pattern Discovery in Parallel, Jenny Liu
In the era of IoT and A.I., distributed and parallel computing is embracing big-data-driven and algorithm-focused applications and services. Despite rapid progress in parallel frameworks, algorithms, and accelerated computing capacities, it remains challenging to deliver an efficient and scalable data-analysis solution. This talk shares research experience on data pattern discovery in domain applications. In particular, the research scrutinizes key factors in analysis workflow design and data-parallelism improvement on the cloud.
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible with traditional optical microscopes and their sizes are measured in just tens of angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other effects, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
Xian-He Sun: Data-Centric Intro
1. Computing: From Compute-Centric to Data-Centric
Xian-He Sun, Illinois Institute of Technology, sun@iit.edu
(Scalable Computing Software Lab, Illinois Institute of Technology)
2. The Design of Parallel Algorithms (CS546, Lecture 8)
* Unbalanced workload
* Communication delay
* Overhead increases with the ensemble size
3. Evolution of Computing: The Scalability (10/3/2017)
[Figure: IBM BG/P system build-up; source: ANL ALCF]
* Chip: 4 cores, 850 MHz, 8 MB EDRAM, 13.6 GF/s
* Compute card: 1 chip, 20 DRAMs, 2.0 GB DDR; supports 4-way SMP
* Node board: 32 compute cards (32 chips, 4x4x2), 0-2 I/O cards, 435 GF/s, 64 GB
* Rack: 32 node cards, 1024 chips, 4096 procs, 14 TF/s, 2 TB
* Petaflops system: 72 racks, cabled 8x8x16, 1 PF/s, 144 TB
* Maximum system: 256 racks, 3.5 PF/s, 512 TB
* Front-end node / service node: System p servers, Linux SLES10; HPC software: compilers, GPFS, ESSL, LoadLeveler
4. The Memory-Wall Problem
[Figure: performance vs. year, 1980-2010, log scale from 1 to 100,000, for uni-processors, multi-core/many-core processors, and memory; annotated growth rates of 25%, 52%, 20%, and 60% per year for processors and 9% for memory; sources: Intel, OCZ]
* Processor performance increases rapidly: ~52% per year for uni-processors until 2004; aggregate multi-core/many-core processor performance even higher since 2004
* Memory: ~9% per year; storage: ~6% per year
* The processor-memory speed gap keeps increasing
* Memory-bounded speedup (1990); memory-wall problem (1994)
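A quick back-of-the-envelope calculation (not from the deck) shows how the growth rates on this slide compound into a "wall"; assuming the ~52%-per-year processor and ~9%-per-year memory improvements hold for a decade:

```python
# Growth rates taken from the slide: processor ~52%/year (pre-2004),
# memory ~9%/year. The ratio of the two compounds every year.
cpu_rate, mem_rate = 1.52, 1.09
years = 10
gap = (cpu_rate / mem_rate) ** years
print(f"processor/memory gap after {years} years: {gap:.1f}x")  # ~27.8x
```

Even over a single decade the relative gap grows by well over an order of magnitude, which is why deepening the memory hierarchy alone eventually stops helping.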
5. Current Solution: Memory Hierarchy
[Figure: the memory hierarchy from the upper (faster, smaller) to the lower (larger, slower) levels, showing capacity, access time, bandwidth, and the staging transfer unit and its manager between levels]
* CPU registers: <8 KB; 0.2~0.5 ns; 500~800 GB/s/core; instruction operands, 1-8 bytes (program/compiler)
* Cache: <50 MB; 1-10 ns; 50~150 GB/s/core; blocks, 32-128 bytes (cache controller)
* Main memory: gigabytes; 50-100 ns; 5~10 GB/s/channel; pages, 4 KB-4 MB (OS)
* Disk: terabytes; ~5 ms; 100~300 MB/s; files, megabytes (user/operator)
* Tape: petabytes or effectively infinite; seconds to minutes
11. Model of Data Access (Memory Hierarchy)
The conventional AMAT (Average Memory Access Time) model:
AMAT = H1 + MR1 × AMP1, where AMP1 = H2 + MR2 × AMP2, so
AMAT = H1 + MR1 × (H2 + MR2 × (H3 + MR3 × AMP3))
The new C-AMAT (Concurrent AMAT) model:
C-AMAT1 = H1/C_H1 + MR1 × κ1 × C-AMAT2
where C-AMAT2 = H2/C_H2 + pMR2 × pAMP2/C_M2
and κ1 = (pMR1/MR1) × (pAMP1/AMP1) × (C_m1/C_M1); here κ1 is the overlapping contribution
X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014
12. Data Access Time: AMAT
[Figure: on-die L1, L2, and L3 caches and main memory (DRAM); hit times: L1 = 1 clk, L2 = 10 clks, L3 = 20 clks, memory = 300 clks]
Average Memory Access Time (AMAT)
= Thit(L1) + Miss%(L1) × (Thit(L2) + Miss%(L2) × (Thit(L3) + Miss%(L3) × T(memory)))
Example (latencies as shown above):
Miss rates: L1 = 10%, L2 = 5%, L3 = 1% (be careful with the miss-rate definition; these are local, per-level rates)
AMAT = 1 + 0.10 × (10 + 0.05 × (20 + 0.01 × 300)) = 2.115 clks
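The recursive AMAT expression evaluates innermost-first: the memory time feeds L3's miss penalty, L3 feeds L2, and so on. A minimal Python sketch (not part of the deck) that reproduces the slide's 2.115-cycle result:

```python
def amat(hit_times, miss_rates, mem_time):
    """AMAT with local (per-level) miss rates: fold the hierarchy from
    main memory back up to L1."""
    t = mem_time
    for h, mr in zip(reversed(hit_times), reversed(miss_rates)):
        t = h + mr * t  # this level's hit time plus its expected miss penalty
    return t

# Slide numbers: L1/L2/L3 hit times of 1/10/20 clks, memory 300 clks,
# local miss rates of 10%/5%/1%.
print(round(amat([1, 10, 20], [0.10, 0.05, 0.01], 300), 3))  # 2.115
```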
13. Data Access Time: C-AMAT
[Figure: the same hierarchy; hit times: L1 = 1 clk, L2 = 10 clks, L3 = 20 clks, memory = 300 clks; hit concurrencies: L1 = 2, L2 = 3, L3 = 4, memory = 6]
Concurrent Average Memory Access Time (C-AMAT)
= H1/C_H1 + MR1 × κ1 × (H2/C_H2 + MR2 × κ2 × (H3/C_H3 + MR3 × κ3 × HMem/C_HMem))
Example
Miss rates: L1 = 10%, L2 = 5%, L3 = 1%
(pMR, pAMP, AMP, C_M, C_m): L1 = (7%, 10, 10, 5, 4); L2 = (3%, 60, 40, 9, 6); L3 = (0.8%, 400, 300, 16, 12)
κ: L1 = 0.56, L2 = 0.6, L3 = 0.8
C-AMAT ≈ 0.696 clks
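Plugging the slide's numbers into the C-AMAT recursion confirms the result. A minimal Python sketch (not part of the deck), with κ computed as (pMR/MR) × (pAMP/AMP) × (C_m/C_M), consistent with the κ values and the ≈0.696 total on this slide:

```python
def kappa(pmr, mr, pamp, amp, c_m, c_M):
    # Overlapping contribution: (pMR/MR) * (pAMP/AMP) * (C_m/C_M)
    return (pmr / mr) * (pamp / amp) * (c_m / c_M)

def c_amat(hit_times, hit_conc, miss_rates, kappas):
    """C-AMAT = H1/C_H1 + MR1*k1*(H2/C_H2 + MR2*k2*(... + MR3*k3*HMem/C_HMem))"""
    t = hit_times[-1] / hit_conc[-1]  # main memory: HMem / C_HMem
    for h, c, mr, k in zip(hit_times[-2::-1], hit_conc[-2::-1],
                           reversed(miss_rates), reversed(kappas)):
        t = h / c + mr * k * t
    return t

# Slide numbers: (pMR, pAMP, AMP, C_M, C_m) for L1, L2, L3.
k1 = kappa(0.07, 0.10, 10, 10, 4, 5)        # ≈ 0.56
k2 = kappa(0.03, 0.05, 60, 40, 6, 9)        # ≈ 0.6
k3 = kappa(0.008, 0.01, 400, 300, 12, 16)   # ≈ 0.8
print(round(c_amat([1, 10, 20, 300], [2, 3, 4, 6],
                   [0.10, 0.05, 0.01], [k1, k2, k3]), 3))  # 0.696
```

Note that the concurrent result (≈0.70 clks) is far below the sequential AMAT of 2.115 clks for the same hierarchy, which is the point of the model: concurrency hides a large fraction of the access time.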
14. What Does C-AMAT Say?
* C-AMAT is an extension of AMAT that considers concurrency
* C-AMAT can be measured at each layer with APC
* C-AMAT is data-centric thinking
* Data access is as important as computing
* High locality may hurt performance: the pure-miss concept
* Balance locality, concurrency, and overlapping with C-AMAT
* C-AMAT uniquely integrates the joint impact of locality, concurrency, and overlapping for optimization (analysis and measurement)
15. Pace-Matching Data Access
[Figure: four memory layers between the processor side and the off-chip side, with cases #1-#4 of matching data-access paces across layers]
No-delay data access, using C-AMAT as the gate to guide and LPM as the global controller for in-situ optimization.
16. I/O Optimization: From Prefetching to Merging
Application-aware optimization: methods (LPM) and software; model (C-AMAT) and measurement
* Prefetching: SC'08, "Parallel I/O Prefetching Using MPI File Caching and I/O Signatures"
* Data layout: HPDC'11, "A Cost-Intelligent Application-Specific Data Layout Scheme for Parallel File Systems"
* Data coordination: SC'11, "Server-Side I/O Coordination for Parallel File Systems"
* Data organization: PDSW'11, "Pattern-Aware File Reorganization in MPI-IO"
* Data replication: IPDPS'13, "Layout-Aware Replication Scheme for Parallel I/O Systems"
* Cache: ICDCS'14, "S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems"
* Heterogeneity: IPDPS'15, "HAS: Heterogeneity-Aware Selective Data Layout Scheme for Parallel File Systems on Hybrid Servers"
* Merging: BigData'15, "PortHadoop: Support Direct HPC Data Processing in Hadoop"
* Data compression
Website: www.cs.iit.edu/~scs/iosig/
17. Merging with an Industrial Solution: Distributed I/O Systems
[Figure: architecture of a typical HDFS system]
Models show that, for many applications, the best optimization is to use HDFS.
18. Integrated Data Access System (IDAS)
Two camps:
o MPI for HPC
o MapReduce for big data
Native data access:
o PFS (POSIX I/O and MPI-IO)
o Big data (HDFS API)
Limited support for data access across the two:
o Explicit serial copies
o Redundant data stored in HDFS
* Merging is more than access (maintain the merits of both)
Yang, N. Liu, B. Feng, X.-H. Sun and S. Zhou, "PortHadoop: Support Direct HPC Data Processing in Hadoop," in Proc. of the IEEE International Conference on Big Data (IEEE BigData 2015), Santa Clara, CA, USA, Oct. 2015.
19. Application: NASA Earth Science Data Analysis (Super-Cloud Project)
Simulation is done in an MPI environment and data analysis is done in a Hadoop environment.
20. PortHadoop-R vs. Conventional Methods: Performance (Vis-Only)
[Figure: elapsed time (seconds, 0-6000) vs. number of files (48, 96, 192, 384) for Serial Vis, Serial Copy, Parallel Vis, Parallel Copy, PortHadoop-R Vis, and Serial Animation]
* Serial copy: one copy command issued on one Hadoop node
* Parallel copy: 4 copy commands issued on each Hadoop node (32 in total)
* Serial vis: read all data into one fat node, then visualize
* Parallel vis: visualize all files on all Hadoop nodes
* PortHadoop-R vis: visualize all files into images
* Serial animation: merge all images into one animation
Speedup(PortHadoop) = 1.47; Speedup(total) = 15
Good speedup: total time is reduced by up to 15x compared to the serial solution and 1.47x compared to the parallel solution.
21. What Are the Challenges in Computing?
* When communication is the issue: domain decomposition, pipelining
* When locality (the memory hierarchy) is the concern: block-based computing
* When memory bound is the constraint: memory-bound analysis
* When data-access concurrency is a determining factor: what is the new frame of algorithm design?
22. How We Can Help You
* If you have big-data applications
* If you have data-intensive scientific applications
* If you need both computing and data power
* If you are interested in data-centric algorithm design
How You Can Help Us
* Application examples
* Methods: probability sampling, etc.
23. An Unbalanced System (CS570: Advanced Computer Architecture)
Source: Bob Colwell's keynote at ISCA 29, 2002: http://systems.cs.colorado.edu/ISCA2002/Colwell-ISCA-KEYNOTE-2002-final.ppt