NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph500 Benchmarks on SGI UV 2000
1. NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph500 Benchmarks on SGI UV 2000
Yuichiro Yasui & Katsuki Fujisawa
Kyushu University
ISM High Performance Computing Conference
11:00 − 11:50, Oct 9-10, 2015
2. Outline
• Introduction
• NUMA-aware threading
  – NUMA architecture and NUMA-based systems
  – Our library "ULIBC" for NUMA-aware threading
• Efficient BFS algorithm for the Graph500 benchmark
  – NUMA-based distributed graph representation [BD13]
  – Efficient algorithm considering the vertex degree [ISC14, HPCS15]
• Conclusion
3. Graph processing for large-scale networks
• Large-scale graphs in various fields:
  – US road network: 24 million vertices & 58 million edges
  – Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
  – Cyber-security: 15 billion log entries per day
  – Neuronal network (Human Brain Project): 89 billion vertices & 100 trillion edges
• Fast and scalable graph processing by using HPC
4. Graph analysis and the important kernel BFS
• Graph analysis helps us understand relationships in real networks
• Application fields: transportation, social networks, cyber-security, bioinformatics
• Graph-processing kernels built on BFS: breadth-first search, single-source shortest path, maximum flow, maximal independent set, centrality metrics, clustering, graph mining
[Figure: Graph500 processing pipeline: input parameters (SCALE, edgefactor) → graph generation → graph construction → BFS × 64 iterations with validation → results (BFS time, traversed edges, TEPS ratio)]
5. Breadth-first search (BFS)
• One of the most important and fundamental algorithms for traversing graph structures
• Many algorithms and applications are based on BFS (e.g. maximum flow and centrality)
• A linear-time algorithm, but it requires many wide memory accesses without reuse
• Inputs: graph, source vertex
• Outputs: predecessor tree (BFS tree), distance
[Figure: BFS from a source vertex, expanding level by level (Lv. 1, Lv. 2, Lv. 3)]
6. Betweenness centrality (BC)
• Computes an importance for each vertex and edge utilizing all-to-all shortest paths (breadth-first search) without vertex coordinates:

  C_B(v) = Σ_{s ≠ v ≠ t ∈ V} σ_st(v) / σ_st

  where σ_st is the number of shortest (s, t)-paths and σ_st(v) is the number of shortest (s, t)-paths passing through vertex v.
• BC requires one BFS per vertex, because each BFS obtains the one-to-all shortest paths
• Our software "NETAL" solves BC for the Osaka road network (13,076 vertices and 40,528 edges) within one second
[Figure: Osaka road network around Osaka station, colored by importance (low to high); highways and bridges rank high]
Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama, and M. Takamatsu: NETAL: High-performance Implementation of Network Analysis Library Considering Computer Memory Hierarchy, JORSJ, Vol. 54-4, 2011.
7. Single-node NUMA system
• Single-node (single-OS) or multi-node system? Uniform memory access, or not?
• UMA (uniform memory access): every CPU accesses RAM at the same cost
• NUMA (non-uniform memory access): fast local access, slow non-local access; currently the major CPU architecture
• Many single-node configurations:
  – UV 2000 @ ISM (256 CPUs)
  – Intel Xeon server (4 CPUs)
  – Laptop PC (1 CPU)
  – Smartphone (1 CPU)
8. Threading or process-parallel
• Which parallel programming model should we choose?
• OpenMP (Pthreads), single-process:
  – Shared memory
  – Implicit memory access reduces programming cost
• MPI-OpenMP hybrid, multi-process:
  – Distributed memory
  – Explicit memory access between processes (MPI_Send() & MPI_Recv()) for good locality
9. NUMA system
• NUMA = non-uniform memory access: each NUMA node (a CPU socket with its RAM) has fast local memory access and slower remote (non-local) memory access
• 4-way Intel Xeon E5-4640 (Sandy Bridge-EP):
  – 4 CPU sockets
  – 8 physical cores per socket (each with an L2 cache, sharing an L3 cache)
  – 2 threads per core
  – 4 × 8 × 2 = 64 threads max.
10. Memory bandwidth between NUMA nodes
• Measured with the STREAM TRIAD kernel: threads on NUMA 0, vectors a[n], b[n], c[n] placed on NUMA 0-3
• Local memory is approx. 2.6x faster than remote memory:
  – Local access (NUMA 0 → NUMA 0): 13 GB/s
  – Remote access (NUMA 0 → NUMA 1/2/3): 5 GB/s
[Figure: bandwidth (GB/s) vs. number of elements log2 n for each target NUMA node]

STREAM TRIAD:
double a[N], b[N], c[N];
void STREAM_Triad(double scalar)
{
  long j;
#pragma omp parallel for
  for (j = 0; j < N; ++j)
    a[j] = b[j] + scalar * c[j];
}
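To see the local-vs-remote gap on your own machine, the STREAM arrays can be placed on a chosen NUMA node explicitly. Below is a minimal sketch using libnuma, assuming it is installed and the program is compiled with -fopenmp -lnuma; this is not the measurement code behind the slide's figure:

#include <numa.h>     /* numa_available, numa_alloc_onnode, numa_free */
#include <stdio.h>

#define N (1L << 25)  /* 2^25 doubles per array (256 MB each) */

int main(void) {
  if (numa_available() < 0) {
    fprintf(stderr, "NUMA is not available on this system\n");
    return 1;
  }
  /* Place all three vectors on NUMA node 0; pin the threads elsewhere
     (e.g. numactl --cpunodebind=1 ./a.out) to measure remote bandwidth. */
  double *a = numa_alloc_onnode(N * sizeof(double), 0);
  double *b = numa_alloc_onnode(N * sizeof(double), 0);
  double *c = numa_alloc_onnode(N * sizeof(double), 0);
  double scalar = 3.0;
  long j;
  #pragma omp parallel for
  for (j = 0; j < N; ++j) {       /* initialize the inputs */
    b[j] = 1.0;
    c[j] = 2.0;
  }
  #pragma omp parallel for
  for (j = 0; j < N; ++j)         /* the TRIAD kernel itself */
    a[j] = b[j] + scalar * c[j];
  printf("a[0] = %.1f\n", a[0]);  /* keep the loop from being optimized away */
  numa_free(a, N * sizeof(double));
  numa_free(b, N * sizeof(double));
  numa_free(c, N * sizeof(double));
  return 0;
}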
11. Problem and motivation
• Goal: to develop an efficient graph algorithm on a NUMA system
• Default threading: the OS may move threads (T) onto other cores, so they access their data (D) in remote memory
• NUMA-aware threading: pinning threads and memory so that each thread accesses local memory only
[Figure: default vs. NUMA-aware placement of threads (T) and data (D) on the 4-socket Xeon E5-4640 system]
12. Programming cost for NUMA-aware threading
• It is not easy to apply NUMA-aware threading:
  – Linux provides the processor topology only as a machine file; /proc/cpuinfo is 8.0 KB on a desktop PC but 2.4 MB on UV 2000
  – The pinning function sched_setaffinity() uses processor IDs

Example /proc/cpuinfo entry:
processor       : 23          ← processor ID
model name      : Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
stepping        : 7
cpu MHz         : 1200.000
cache size      : 20480 KB
physical id     : 2           ← NUMA node ID
siblings        : 16
core id         : 7           ← core ID in NUMA node
cpu cores       : 8
apicid          : 78
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual

#define _GNU_SOURCE
#include <sched.h>
int bind_thread(int procid) {  /* procid: processor ID */
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
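For illustration, bind_thread() can be combined with OpenMP so that each thread pins itself on entering the parallel region. A minimal sketch; the identity mapping from thread ID to processor ID is a naive assumption, since the correct mapping depends on the machine topology, which is exactly what ULIBC resolves on the next slide:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int bind_thread(int procid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}

int main(void) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    bind_thread(tid);  /* naive: assumes processor IDs 0..#threads-1 */
    printf("thread %d now runs on processor %d\n", tid, sched_getcpu());
  }
  return 0;
}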
13. NUMA-aware computation with ULIBC
• ULIBC is a callable library for CPU and memory affinity settings
• Detects the processor topology of a system at run time
• Enables pinning threads and memory so that each thread accesses local memory
• https://bitbucket.org/yuichiro_yasui/ulibc

#include <stdio.h>
#include <ulibc.h>
#include <omp.h>
int main(void) {
  ULIBC_init();                        /* Init. */
  #pragma omp parallel
  {                                    /* Thread pinning */
    const struct numainfo_t ni = ULIBC_get_current_numainfo();
    printf("[%02d] Node: %d of %d, Core: %d of %d\n",
           ni.id, ni.node, ULIBC_get_online_nodes(),
           ni.core, ULIBC_get_online_cores(ni.node));
  }
  return 0;
}

struct numainfo_t {
  int id;    /* Thread ID */
  int proc;  /* Processor ID */
  int node;  /* NUMA node ID */
  int core;  /* Core ID in NUMA node */
};

Output (thread ID, NUMA node ID, core ID):
[04] Node 0 of 4, Core: 1 of 16
[55] Node 3 of 4, Core: 13 of 16
[16] Node 0 of 4, Core: 4 of 16
[37] Node 1 of 4, Core: 9 of 16
[30] Node 2 of 4, Core: 7 of 16
. . .
14. CPU affinity construction with ULIBC
1. Detects the entire topology:
   CPU 0: P0, P4, P8, P12
   CPU 1: P1, P5, P9, P13
   CPU 2: P2, P6, P10, P14
   CPU 3: P3, P7, P11, P15
2. Detects the online (available) topology, e.g. the cores left over by other processes, a job manager (PBS), or numactl --cpunodebind=1,2:
   NUMA 0: P1, P5, P9, P13
   NUMA 1: P2, P6, P10
3. Constructs the ULIBC affinity (2 types):
   ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE)
   → NUMA 0: threads 0(P1), 2(P5), 4(P9), 6(P13); NUMA 1: threads 1(P2), 3(P6), 5(P10)
   ULIBC_set_affinity_policy(7, COMPACT_MAPPING, THREAD_TO_CORE)
   → NUMA 0: threads 0(P1), 1(P5), 2(P9), 3(P13); NUMA 1: threads 4(P2), 5(P6), 6(P10)
15. Graph500 benchmark (www.graph500.org)
• Measures the performance of irregular memory accesses
• TEPS score: # of traversed edges per second in a BFS
• Input parameters for the problem size: SCALE and edgefactor (= 16)
• Generates a synthetic scale-free network with 2^SCALE vertices and 2^SCALE × edgefactor edges by using SCALE-times recursive Kronecker products
• Pipeline: 1. generation (SCALE, edgefactor) → 2. construction → 3. BFS × 64 iterations, each followed by validation; a submission must report five TEPS ratios (minimum, first quartile, median, third quartile, and maximum), and the ranking uses the median
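The recursive Kronecker generation can be sketched as choosing one of four quadrants of the adjacency matrix, SCALE times per edge. A simplified sketch with the Graph500 initiator probabilities (A, B, C, D) = (0.57, 0.19, 0.19, 0.05); the reference generator additionally adds noise and permutes vertex labels, which is omitted here:

#include <stdio.h>
#include <stdlib.h>

/* Draw one edge of a 2^scale-vertex Kronecker graph: at each of the
   `scale` recursion levels, pick one quadrant of the adjacency matrix. */
void kronecker_edge(int scale, long *u, long *v) {
  long i = 0, j = 0;
  for (int k = 0; k < scale; ++k) {
    double r = rand() / (RAND_MAX + 1.0);
    i <<= 1;
    j <<= 1;
    if (r < 0.57) {          /* A: top-left quadrant */
    } else if (r < 0.76) {   /* B: top-right */
      j |= 1;
    } else if (r < 0.95) {   /* C: bottom-left */
      i |= 1;
    } else {                 /* D: bottom-right */
      i |= 1;
      j |= 1;
    }
  }
  *u = i;
  *v = j;
}

int main(void) {
  int scale = 10, edgefactor = 16;
  long m = (1L << scale) * edgefactor;
  for (long e = 0; e < m; ++e) {
    long u, v;
    kronecker_edge(scale, &u, &v);
    /* a real run would store (u, v) into the edge list here */
  }
  printf("generated %ld edges over %ld vertices\n", m, 1L << scale);
  return 0;
}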
16. Green Graph500 benchmark (http://green.graph500.org)
• Measures power efficiency using the TEPS/W score: the median TEPS divided by the measured watts
• Same pipeline as Graph500 (generation → construction → BFS × 64 with validation), plus a power measurement during the BFS phase
• We have results on various systems such as the SGI UV series, Xeon servers, and Android devices
17. Level-synchronized parallel BFS (top-down)
• The input of a BFS is a graph G = (V, A) with vertex set V and edge set E, where an adjacency list A(v) contains the edges (v, w) ∈ E for each vertex v ∈ V; the output is the predecessor map π of the BFS tree rooted at the source s ∈ V
• Starts from the source vertex and executes two phases for each level:
  – Traversal finds the neighbors QN (level k+1, unvisited) of the current frontier QF (level k, visited)
  – Swap exchanges the frontier QF and the neighbors QN for the next level

Algorithm 1: Level-synchronized parallel BFS
Input    : G = (V, A) : unweighted directed graph
           s : source vertex
Variables: QF : frontier queue
           QN : neighbor queue
           visited : vertices already visited
Output   : π(v) : predecessor map of BFS tree
 1  π(v) ← −1, ∀v ∈ V
 2  π(s) ← s
 3  visited ← {s}
 4  QF ← {s}
 5  QN ← ∅
 6  while QF ≠ ∅ do
 7    for v ∈ QF in parallel do
 8      for w ∈ A(v) do
 9        if w ∉ visited atomic then
10          π(w) ← v
11          visited ← visited ∪ {w}
12          QN ← QN ∪ {w}
13    QF ← QN
14    QN ← ∅
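As a concrete (serial) rendering of Algorithm 1, the sketch below runs the traversal and swap phases on a CSR graph; the array names (xadj, adjncy) and layout are illustrative assumptions. In the parallel version, line 9's check-and-mark becomes an atomic compare-and-swap on visited, and the queues become per-thread buffers:

#include <stdlib.h>
#include <string.h>

/* Serial sketch of Algorithm 1 on a CSR graph: the neighbors of vertex v
   are adjncy[xadj[v] .. xadj[v+1]-1]. */
void bfs_top_down(long n, const long *xadj, const long *adjncy,
                  long s, long *pi) {
  long *qf = malloc(n * sizeof(long));   /* frontier queue QF */
  long *qn = malloc(n * sizeof(long));   /* neighbor queue QN */
  char *visited = calloc(n, 1);
  for (long v = 0; v < n; ++v) pi[v] = -1;
  pi[s] = s;
  visited[s] = 1;
  long nf = 0;
  qf[nf++] = s;
  while (nf > 0) {                       /* one iteration per BFS level */
    long nn = 0;
    for (long k = 0; k < nf; ++k) {      /* "for v in QF in parallel" */
      long v = qf[k];
      for (long e = xadj[v]; e < xadj[v + 1]; ++e) {
        long w = adjncy[e];
        if (!visited[w]) {               /* atomic CAS in the parallel code */
          visited[w] = 1;
          pi[w] = v;
          qn[nn++] = w;
        }
      }
    }
    memcpy(qf, qn, nn * sizeof(long));   /* swap QF and QN for the next level */
    nf = nn;
  }
  free(qf); free(qn); free(visited);
}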
18. Direction-optimizing BFS (Beamer @ SC12)
• Chooses the direction, top-down or bottom-up, at each level
• Top-down direction: efficient for a small frontier; uses outgoing edges; writes v → w (from the current frontier to its unvisited neighbors)
• Bottom-up direction: efficient for a large frontier; uses incoming edges; writes w → v (each candidate neighbor scans for a parent in the current frontier) and skips unnecessary edge traversals by breaking after the first parent found

Observation of data accesses in the forward (top-down) and backward (bottom-up) searches:

Top-down traversal
Input: directed graph G = (V, A^F), queue QF
Data : queue QN, visited, tree π(v)
QN ← ∅
for v ∈ QF in parallel do
  for w ∈ A^F(v) do
    if w ∉ visited atomic then
      π(w) ← v
      visited ← visited ∪ {w}
      QN ← QN ∪ {w}
QF ← QN

Bottom-up traversal
Input: directed graph G = (V, A^B), queue QF
Data : queue QN, visited, tree π(v)
QN ← ∅
for w ∈ V \ visited in parallel do
  for v ∈ A^B(w) do
    if v ∈ QF then
      π(w) ← v
      visited ← visited ∪ {w}
      QN ← QN ∪ {w}
      break
QF ← QN
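The per-level direction choice can be sketched as follows. The shape of the test follows the heuristic of Beamer et al. (SC12): switch to bottom-up when the frontier's outgoing edges exceed a fraction of the unexplored edges, and switch back when the frontier shrinks. The constants alpha and beta are tuning parameters assumed here, not values taken from these slides:

enum dir { TOP_DOWN, BOTTOM_UP };

/* m_frontier : # of edges leaving the current frontier
   m_unvisited: # of edges leaving still-unvisited vertices
   n_frontier : # of vertices in the frontier; n: # of all vertices */
enum dir choose_direction(enum dir cur, long m_frontier, long m_unvisited,
                          long n_frontier, long n) {
  const long alpha = 14, beta = 24;  /* assumed tuning constants */
  if (cur == TOP_DOWN && m_frontier > m_unvisited / alpha)
    return BOTTOM_UP;  /* frontier grew large: scan unvisited vertices */
  if (cur == BOTTOM_UP && n_frontier < n / beta)
    return TOP_DOWN;   /* frontier shrank again: resume top-down */
  return cur;
}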
19. # of traversed edges on a Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30)
• Forward (top-down) and backward (bottom-up) traversals compared per level (distance from the source):

Level | Top-down      | Bottom-up     | Hybrid
    0 |             2 | 2,103,840,895 |          2
    1 |        66,206 | 1,766,587,029 |     66,206
    2 |   346,918,235 |    52,677,691 | 52,677,691
    3 | 1,727,195,615 |    12,820,854 | 12,820,854
    4 |    29,557,400 |       103,184 |    103,184
    5 |        82,357 |        21,467 |     21,467
    6 |           221 |        21,240 |        227
Total | 2,103,820,036 | 3,936,072,360 | 65,689,631
Ratio |       100.00% |       187.09% |      3.12%

• Hybrid BFS reduces unnecessary edge traversals: top-down for the small frontier and bottom-up for the large frontier (levels 2-4)
20. NUMA-opt. + dir.-opt. BFS [BD13]: our previous result (2013)
• Manages memory accesses on a NUMA system carefully: binds a partial adjacency matrix (1-D column blocks 0-3 of the adjacency matrix) to each of the 4 NUMA nodes using our library, for both the top-down and bottom-up directions
[Figure: GTEPS progression on the 4-way Xeon (bars labeled 2011, SC10, SC12, BigData13, ISC14, G500/ISC14): Reference 87 MTEPS (×1); NUMA-aware 800 MTEPS (×9); Dir.Opt. 5 GTEPS (×58); NUMA-Opt. 11 GTEPS (×125); NUMA-Opt.+Deg.aware 29 GTEPS (×334); NUMA-Opt.+Deg.aware+Vtx.Sort 42 GTEPS (×489)]
23. NUMA-based 1-D partitioned graph representation
• Forward graph G^F for top-down; backward graph G^B for bottom-up (partial adjacency matrices A0-A3, each kept in local RAM). These sub-graphs represent the same area of the adjacency matrix, but not the same data structures.
• Each NUMA node has local memory, and the nodes connect to one another via an interconnect such as Intel QPI, AMD HyperTransport, or NUMAlink 6. On such systems, processor cores can access their local memory faster than remote (non-local) memory, i.e. memory local to another processor or shared between processors. To some degree, the performance of BFS depends on the speed of memory access, and the complexity of its memory accesses is greater than that of its computation. Therefore, we propose a general placement approach for processor and memory affinities on a NUMA system.
• The graph representation and working variables are laid out over the local memory before the traversal, and all accesses to remote memory in the bottom-up phase are avoided, using the following column-wise partitioning:

  V = V_0 | V_1 | · · · | V_{ℓ−1},  A = A_0 | A_1 | · · · | A_{ℓ−1},

  where each set of partial vertices V_k on the k-th NUMA node is defined by

  V_k = { v_j ∈ V | j ∈ [ (k/ℓ)·n, ((k+1)/ℓ)·n ) },

  n is the number of vertices, and the divisor ℓ is set to the number of NUMA nodes (CPU sockets).
• In addition, to avoid accessing remote memory, we define partial adjacency lists A^F_k and A^B_k for the top-down and bottom-up policies as follows:

  A^F_k(v) = { w | w ∈ V_k ∩ A(v) }, v ∈ V,
  A^B_k(w) = { v | v ∈ A(w) },       w ∈ V_k.

• Furthermore, the working spaces NQ_k, VS_k, and π_k for the partial vertices V_k are allocated to the local memory on the k-th NUMA node with the memory pinned. Note that the range of each current queue CQ_k is all vertices V in a given graph, and these are also allocated to the local memory on the k-th NUMA node.

Bottom-up(G, CQ, VS, π)
NQ ← ∅
for w ∈ V \ VS in parallel do
  for v ∈ A^B(w) do
    if v ∈ CQ then
      π(w) ← v
      VS ← VS ∪ {w}
      NQ ← NQ ∪ {w}
      break
return NQ
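The column-wise partition above is easy to compute directly. A small sketch (names are illustrative) that maps a vertex to its owning NUMA node and enumerates each block V_k = [k·n/ℓ, (k+1)·n/ℓ), assuming n is a multiple of ℓ as in the slides (n = 2^SCALE):

#include <stdio.h>

/* Which NUMA node owns vertex j under the 1-D column-wise partition. */
int owner_node(long j, long n, int ell) {
  return (int)((j * (long)ell) / n);
}

/* Half-open index range [lo, hi) of the partial vertex set V_k. */
void node_range(int k, long n, int ell, long *lo, long *hi) {
  *lo = ((long)k * n) / ell;
  *hi = ((long)(k + 1) * n) / ell;
}

int main(void) {
  long n = 1L << 26;  /* number of vertices (SCALE 26) */
  int ell = 4;        /* number of NUMA nodes (CPU sockets) */
  for (int k = 0; k < ell; ++k) {
    long lo, hi;
    node_range(k, n, ell, &lo, &hi);
    printf("V_%d = [%ld, %ld)\n", k, lo, hi);
  }
  return 0;
}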
24. Vertex sorting for BFS
• The # of traversals of each vertex is equal to its out-degree
• The locality of accessing vertices depends on the vertex index
• Applying vertex sorting (relabeling vertex indices in descending order of degree), the access frequency follows the degree distribution: many accesses concentrate on small-index, high-degree vertices, improving cache hit ratios
[Figure: degree distribution vs. access frequency with vertex sorting; example relabeling from original indices (0, 2, 1, 3, 4) to sorted indices (4, 0, 2, 3, 1)]
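A sketch of the relabeling step described above, using qsort to order vertices by descending degree; the array names and the file-scope degree table for the comparator are illustrative choices, not the slide's implementation:

#include <stdlib.h>

static const long *g_degree;  /* degree table visible to the comparator */

static int by_degree_desc(const void *a, const void *b) {
  long u = *(const long *)a, v = *(const long *)b;
  if (g_degree[u] != g_degree[v])
    return g_degree[v] > g_degree[u] ? 1 : -1;  /* higher degree first */
  return (u > v) - (u < v);                     /* tie-break by old index */
}

/* newid[v] = new (sorted) index of original vertex v. */
void build_relabeling(long n, const long *degree, long *newid) {
  long *order = malloc(n * sizeof(long));  /* order[rank] = original vertex */
  for (long v = 0; v < n; ++v) order[v] = v;
  g_degree = degree;
  qsort(order, n, sizeof(long), by_degree_desc);
  for (long rank = 0; rank < n; ++rank)
    newid[order[rank]] = rank;
  free(order);
}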
25. Two strategies for implementation
1. Highest TEPS for Graph500
   – The Graph500 list uses TEPS scores only
2. Largest SCALE (problem size) for Green Graph500
   – The Green Graph500 list is separated into two categories by problem size; the big-data category collects entries over SCALE 30, the median size of all entries
• On the 4-way NUMA system (Xeon E5-4640):
   – The highest-TEPS model obtains 42 GTEPS for SCALE 27
   – The largest-SCALE model can solve up to SCALE 30 (#1 on the 5th Green Graph500)
[Figure: GTEPS vs. SCALE (20-30) for the DG-V, DG-S, and SG implementations]
26. Comparison of two implementations: dual-direction graphs or a single graph
• Highest-TEPS mode (dual-direction graphs):
  – Forward graph G^F for top-down: input frontier over V; data (visited) and output (neighbors) over V_k; local RAM only
  – Backward graph G^B for bottom-up: data (frontier over V, visited over V_k) and output (neighbors over V_k); local RAM only
• Largest-SCALE mode (single graph):
  – Transposed G^B for top-down: input visited over V, data frontier over V, output neighbors over V; local and remote RAM
  – Backward graph G^B for bottom-up: same as in the highest-TEPS mode
• The bottom-up traversal is the same in both modes; only the top-down traversal differs
27. Results on the 4-way NUMA system (Xeon)
• CPU: 4-way Intel Xeon E5-4640 (64 threads), base architecture Sandy Bridge-EP; RAM: 512 GB; CC: GCC 4.4.7
• The highest-TEPS model handles up to SCALE 29 with the highest TEPS score; the largest-SCALE model can solve up to SCALE 30 (#1 entry on the 5th Green Graph500)
• Strong scaling for SCALE 27:
  – With 64 threads, these models achieve over 20x speedup relative to sequential execution
  – Comparing 32 and 64 threads, Hyper-Threading (HT) produces a speedup of more than 20%
[Figures: GTEPS vs. SCALE (20-30), and strong-scaling speedup vs. number of threads (#NUMA nodes × #cores × #threads, from 1×1×1 up to 4×8×2) for DG-V, DG-S, and SG]
28. SGI UV 2000 system
• SGI UV 2000
  – Shared-memory supercomputer based on a cc-NUMA architecture
  – Runs a single Linux OS
  – Users can handle a large memory space via thread parallelization, e.g. OpenMP or Pthreads (MPI can also be used)
  – The full-spec UV 2000 (4 racks) has 2,560 cores and 64 TB of memory
• ISM, SGI, and we collaborate on the Graph500 benchmarks
  – The Institute of Statistical Mathematics (ISM) is Japan's national research institute for statistical science
  – ISM has two full-spec UV 2000 systems (8 racks in total)
29. System configuration of UV 2000
• UV 2000 has hierarchical hardware topologies: sockets, nodes, cubes, inner-racks, and inter-racks (the inter-rack level cannot be detected)
• We used NUMA-based flat parallelization: each NUMA node contains a Xeon CPU E5-2470 v2 and 256 GB of RAM, connected by NUMAlink (6.7 GB/s)
• Node = 2 sockets = 2 NUMA nodes (20 cores, 512 GB); Cube = 8 nodes = 16 NUMA nodes (160 cores, 4 TB); Rack = 32 nodes = 64 NUMA nodes (640 cores, 16 TB)
32. Results on UV 2000
• Weak scaling from 1 to 128 CPU sockets (1,280 threads) with SCALE 26 per NUMA node, i.e. SCALE 26 (1 socket) up to SCALE 33 (128 sockets, two UV 2000 racks)
• DG-V (highest-TEPS model) is fastest from 1 to 32 sockets; from 64 sockets, DG-S is faster than DG-V, and SG (large-problem model) is fastest and scalable
• SG achieves 174 GTEPS for SCALE 33 (8.59 billion vertices, 137.44 billion edges), the fastest single-node entry: 9th on the SC14 and 10th on the ISC15 Graph500 lists
[Figure: GTEPS vs. SCALE 26-33 (#sockets 1-128) for DG-V (SCALE 26 and SCALE 25 per NUMA node), DG-S, and SG]
33. Breakdown with the 2-rack UV 2000
• Breakdown of SCALE 33 on UV 2000 with 128 CPUs, per BFS level
• Computation (57%) > communication (43%): the implementation remains scalable
• Levels 0-1 run top-down; levels 2-7 run bottom-up
• Most of the CPU time is spent at the middle levels
[Figure: CPU time (ms) per level, split into traversal (computation) and remote-memory communication]
34. Our achievements on the Graph500 benchmarks
• 4-way Intel Xeon server
  – The DG-V (highest-TEPS) model achieves the fastest single-server entries
  – The SG (largest-SCALE) model won #1 on the 3rd, 4th, and 5th Green Graph500 lists:
      3rd list: 59.12 MTEPS/W, 28.48 GTEPS
      4th list: 61.48 MTEPS/W, 28.61 GTEPS
      5th list: 62.93 MTEPS/W, 31.33 GTEPS
• UV 2000
  – The DG-S (middle) model achieves 131 GTEPS with 640 threads, the most power-efficient among commercial supercomputers (#7 on the 3rd and #9 on the 4th Green Graph500 lists)
  – The SG (largest-SCALE) model achieves 174 GTEPS for SCALE 33 with 1,280 threads, the fastest single-node entry (ISC15, last week)
36. Conclusion
1. An efficient graph algorithm considering the processor topology on a single-node NUMA system
2. NUMA-aware programming utilizing our library ULIBC: pinning threads and memory so that each thread accesses local memory
3. Our implementation works well on many computers:
   – Scales up to 1,280 threads on the UV 2000 at ISM
   – The UV 2000 achieves the fastest single-node entries on the 9th and 10th Graph500 lists
   – The Xeon server won the most energy-efficient entries on the 3rd, 4th, and 5th Green Graph500 lists
• Our library ULIBC is available at Bitbucket:
  https://bitbucket.org/yuichiro_yasui/ulibc
37. References
• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a Single NUMA System, IEEE ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel Breadth-first Search, HPCS 2015, ACM/IEEE/IFIP, 2015. (this talk)
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of the Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015. (other results of our Graph500 team)