This document summarizes a research paper that proposes a degree-aware breadth-first search (BFS) algorithm to improve the performance and energy efficiency of graph processing on non-uniform memory access (NUMA) systems. The paper introduces related work on BFS optimization. It then analyzes bottlenecks in previous NUMA-optimized BFS algorithms and proposes a degree-aware BFS approach. Experimental results show the proposal achieves faster performance on the Graph500 benchmark and improved energy efficiency on the Green Graph500 benchmark compared to prior work.
Fast and Scalable NUMA-based Thread Parallel Breadth-first Search - Yuichiro Yasui
The 2015 International Conference on High Performance Computing & Simulation (HPCS2015)
Session 9A: July 22, 14:45 − 16:00
July 20 – 24, 2015, Amsterdam, the Netherlands
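For orientation, the Graph500 BFS kernel produces a parent array that describes a BFS tree rooted at the source vertex. The following is a minimal single-threaded, level-synchronous sketch of that output. It is not the degree-aware NUMA algorithm from the paper; the function name `bfs_parents` and the adjacency-list input format are assumptions made for illustration.

```python
def bfs_parents(adj, source):
    """Level-synchronous BFS over an adjacency list.

    Returns parent[v] for every vertex v, with -1 for unreachable
    vertices. The source is recorded as its own parent.
    """
    parent = [-1] * len(adj)
    parent[source] = source
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if parent[v] == -1:      # first visit decides the parent
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier         # advance one BFS level
    return parent


if __name__ == "__main__":
    # Hypothetical 5-vertex graph; vertex 4 is isolated.
    graph = [[1, 2], [0, 3], [0, 3], [1, 2], []]
    print(bfs_parents(graph, 0))
```

A parallel version would split each frontier across threads, which is where the memory placement issues that the paper studies on NUMA systems come into play.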
By Tobias Grosser, Scalable Parallel Computing Laboratory
The COSMO climate and weather model delivers daily forecasts for Switzerland and many other nations. As a traditional HPC application it was developed with SIMD CPUs in mind, and large manual efforts were required to enable the 2016 move to GPU acceleration. As today's high-performance computer systems increasingly rely on accelerators to reach peak performance, and manual translation to accelerators is both costly and difficult to maintain, we propose a fully automatic accelerator compiler for the translation of scientific Fortran codes to CUDA GPU accelerated systems. Several challenges had to be overcome to make this a reality: 1) improved scalability, 2) automatic data placement using unified memory, 3) loop rescheduling to expose coarse-grained parallelism, 4) inter-procedural loop optimization, and 5) plenty of performance tuning. Our evaluation shows that end-to-end automatic accelerator compilation is possible for non-trivial portions of the COSMO climate model, despite the lack of complete static information. Non-trivial loop optimizations previously implemented manually are performed fully automatically, and memory management happens fully transparently using unified memory. Our preliminary results show notable performance improvements over sequential CPU code (40s to 8s reduction in execution time), and we are currently working on closing the remaining gap to hand-tuned GPU code. This talk is a status update on our most recent efforts and is also intended to gather feedback on future research plans towards automatically mapping COSMO to FPGAs.
Tobias Grosser Bio
Tobias Grosser is a senior researcher in the Scalable Parallel Computing Laboratory (SPCL) of Torsten Hoefler at the Computer Science Department of ETH Zürich. Supported by a Google PhD Fellowship, he received his doctoral degree from Université Pierre et Marie Curie under the supervision of Albert Cohen. Tobias' research takes place at the border of low-level compilers and high-level program transformations, with the goal of enabling complex but highly beneficial program transformations in a production compiler environment. With the Polly loop optimizer he develops a loop transformation framework that today is a community project supported through the Polly Labs research laboratory. Tobias also developed advanced tiling schemes for the efficient execution of iterated stencils. Today Tobias leads the heterogeneous compute efforts in the Swiss University funded ComPASC project and is about to start a three-year NSF Ambizione project on advancing automatic compilation and heterogenization techniques at ETH Zürich.
Email
bgerofi@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
Cycle’s topological optimizations and the iterative decoding problem on gener... - Usatyuk Vasiliy
We consider several problems concerning graph models of error-correcting codes, from the basic problems of breaking cycles and eliminating or bypassing trapping sets to the fundamental problems of the graph model. Thanks to the hard work of Michail Chertkov, Michail Stepanov and Andrea Montanari, which inspired me...
Slides presented at Applied Mathematics Day, Steklov Mathematical Institute of the Russian Academy of Sciences, September 22, 2017 http://www.mathnet.ru/conf1249
Understanding GPS & NMEA Messages and Algo to extract Information from NMEA - Robo India
This article is about learning the Global Positioning System (GPS).
To understand GPS, we need to understand its communication protocol. GPS communicates in NMEA messages.
This document describes NMEA messages and an algorithm to extract data from them.
We welcome all of your queries and views. You can find us at:
Website: http://roboindia.com
Email: info@roboindia.com
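As an illustration of what extracting data from an NMEA message can look like, here is a minimal Python sketch that validates the checksum of a GGA sentence and pulls out the position fix. It is not the algorithm from the Robo India document; the function names (`nmea_checksum`, `to_degrees`, `parse_gga`) and the returned dictionary keys are assumptions chosen for this example.

```python
from functools import reduce


def nmea_checksum(body):
    """XOR of every character between '$' and '*'."""
    return reduce(lambda acc, ch: acc ^ ord(ch), body, 0)


def to_degrees(value, hemisphere):
    """Convert NMEA (d)ddmm.mmmm notation to signed decimal degrees."""
    dot = value.index(".")
    degrees = int(value[:dot - 2])       # everything before the minutes
    minutes = float(value[dot - 2:])     # two minute digits plus fraction
    result = degrees + minutes / 60.0
    return -result if hemisphere in ("S", "W") else result


def parse_gga(sentence):
    """Check the checksum of a GGA sentence and extract the fix."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        raise ValueError("not an NMEA sentence")
    body, _, given = sentence[1:].partition("*")
    if int(given, 16) != nmea_checksum(body):
        raise ValueError("checksum mismatch")
    fields = body.split(",")
    if not fields[0].endswith("GGA"):
        raise ValueError("not a GGA sentence")
    return {
        "utc_time": fields[1],
        "latitude": to_degrees(fields[2], fields[3]),
        "longitude": to_degrees(fields[4], fields[5]),
        "fix_quality": int(fields[6]),
        "satellites": int(fields[7]),
        "altitude_m": float(fields[9]),
    }


if __name__ == "__main__":
    # Sample GGA body; the checksum is computed here rather than hard-coded.
    body = "GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,"
    print(parse_gga("$%s*%02X" % (body, nmea_checksum(body))))
```

Other sentence types (for example RMC or GSV) use the same framing and checksum but different field layouts, so each would need its own field mapping.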
Area-Delay Efficient Binary Adders in QCA - IJERA Editor
In this paper, a novel quantum-dot cellular automata (QCA) adder design is presented that decreases the number of QCA cells compared to previously reported designs. The proposed one-bit QCA adder is based on a new algorithm that requires only three majority gates and two inverters for the QCA addition. A novel 128-bit adder designed in QCA was implemented. It achieved higher speed than all existing QCA adders, with an area requirement comparable to that of the established low-area RCA and CFA designs. The novel adder operates in RCA fashion, but it propagates a carry signal through a significantly lower number of cascaded majority gates (MGs) than conventional RCA adders. In addition, because of the adopted basic logic and layout strategy, the number of clock cycles required to complete the computation was limited. As transistors shrink, more and more of them can be accommodated on a single die, thus increasing chip computational capabilities. However, transistors cannot get much smaller than their current size. The QCA approach represents one of the possible solutions for overcoming this physical limit, even though the design of logic modules in QCA is not always straightforward.
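To make the gate count concrete, the sketch below models one well-known way to build a one-bit full adder from three majority gates and two inverters, and checks it in software. Whether this exact formulation matches the paper's algorithm is an assumption; the function names are hypothetical, and the ripple loop models only the logic, not QCA cells, layout, or clocking.

```python
def majority(a, b, c):
    """Three-input majority gate, the basic QCA logic primitive."""
    return (a & b) | (b & c) | (a & c)


def inv(x):
    """Inverter on a single bit."""
    return x ^ 1


def full_adder(a, b, cin):
    """One-bit full adder using three majority gates and two inverters."""
    cout = majority(a, b, cin)                    # gate 1
    inner = majority(a, b, inv(cin))              # gate 2, inverter 1
    total = majority(inv(cout), cin, inner)       # gate 3, inverter 2
    return total, cout


def ripple_add(x, y, width):
    """Chain full adders bit by bit; returns (sum, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


if __name__ == "__main__":
    print(ripple_add(200, 100, 8))
```

Note that the carry path passes through only one majority gate per bit, which is why carry propagation depth is the quantity such designs try to reduce.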
In this paper we propose Regularised Cross-Modal Hashing (RCMH), a new cross-modal hashing model that projects annotation and visual feature descriptors into a common Hamming space. RCMH optimises the hashcode similarity of related data-points in the annotation modality using an iterative three-step hashing algorithm: in the first step each training image is assigned a K-bit hashcode based on hyperplanes learnt at the previous iteration; in the second step the binary bits are smoothed by a formulation of graph regularisation so that similar data-points have similar bits; in the third step a set of binary classifiers are trained to predict the regularised bits with maximum margin. Visual descriptors are projected into the annotation Hamming space by a set of binary classifiers learnt using the bits of the corresponding annotations as labels. RCMH is shown to consistently improve retrieval effectiveness over state-of-the-art baselines.
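The first of the three steps, assigning a K-bit hashcode from a set of hyperplanes, can be sketched in a few lines of Python. This is only an illustration of hyperplane-based hashing and Hamming distance in general; it does not implement the graph regularisation or classifier training steps, and the names `hashcode` and `hamming` as well as the sample hyperplanes are assumptions.

```python
def hashcode(x, hyperplanes):
    """One bit per hyperplane: 1 if x lies on its non-negative side."""
    bits = []
    for w in hyperplanes:
        projection = sum(w_i * x_i for w_i, x_i in zip(w, x))
        bits.append(1 if projection >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of positions where two equal-length hashcodes differ."""
    return sum(p != q for p, q in zip(a, b))


if __name__ == "__main__":
    # Hypothetical K = 3 hyperplanes in a 2-dimensional feature space.
    planes = [[1, 0], [0, 1], [1, 1]]
    code_a = hashcode([2, -1], planes)
    code_b = hashcode([-1, 2], planes)
    print(code_a, code_b, hamming(code_a, code_b))
```

Retrieval in a common Hamming space then amounts to ranking database items by this distance to the query's hashcode.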
An Efficient High Speed Design of 16-Bit Sparse-Tree RSFQ Adder - IJERA Editor
In this paper, we present 16-bit designs of a sparse-tree RSFQ (rapid single flux quantum) adder, a Kogge-Stone adder, and a carry-lookahead adder. In general, N-bit adders such as ripple-carry adders (slow compared to other adders) and carry-lookahead adders (area-consuming) were used in earlier days. Most of the industry now uses parallel prefix adders because of their advantages. Compared to the Kogge-Stone adder and the carry-lookahead adder, our sparse-tree prefix adders are faster and more area-efficient. Parallel prefix addition is a technique for increasing the speed of addition in DSP processors. We simulate and synthesize different types of 16-bit sparse-tree RSFQ adders using the Xilinx ISE 10.1i tool. From the synthesis results, we note performance parameters such as the number of LUTs and the delay. We compare these three adders in terms of LUTs (representing area) and delay.
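For readers unfamiliar with parallel prefix addition, the following Python sketch models the Kogge-Stone scheme mentioned above at the bit level: generate and propagate signals are combined over distances 1, 2, 4 and 8, so all 16 carries are available after four stages. It is a software model for illustration only, not the paper's sparse-tree RSFQ design; `kogge_stone_add` and the constants are hypothetical names.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def kogge_stone_add(a, b):
    """16-bit addition with a Kogge-Stone prefix network.

    Returns (sum, carry_out). Each integer holds one signal per bit
    position, so a single bitwise operation models a whole stage.
    """
    a &= MASK
    b &= MASK
    generate = a & b          # bit i creates a carry by itself
    propagate = a ^ b         # bit i passes an incoming carry on
    half_sum = propagate      # sum bits before carries are applied
    distance = 1
    while distance < WIDTH:
        # Combine each position with the group ending `distance` bits below.
        generate = (generate | (propagate & (generate << distance))) & MASK
        propagate = (propagate & (propagate << distance)) & MASK
        distance <<= 1
    carries = (generate << 1) & MASK              # carry into each bit
    carry_out = (generate >> (WIDTH - 1)) & 1
    return half_sum ^ carries, carry_out


if __name__ == "__main__":
    print(kogge_stone_add(0xFFFF, 1))
```

A sparse-tree adder computes only a subset of these prefix carries in the tree and fills in the rest locally, trading some delay for fewer prefix cells.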
Development of Routing for Car Navigation Systems - Atsushi Koike
Car navigation systems are devices that show us routes to our destination. Finding good routes is a key feature in the systems. I will explain the development of routing for the systems.
Presentation of 'Reliable Rate-Optimized Video Multicasting Services over LTE... - Andrea Tassi
In this paper, we propose a novel advanced multi-rate design for evolved Multimedia Multicast/Broadcast Service (eMBMS) in fourth generation (4G) Long-Term Evolution (LTE)/LTE-Advanced (LTE-A) networks. The proposed design provides: i) reliability, based on random network coded (RNC) transmission, and ii) efficiency, obtained by optimized rate allocation across multi-rate RNC streams. The paper provides an in-depth description of the system realization and demonstrates the feasibility of the proposed eMBMS design using both analytical and simulation results. The system performance is compared with popular multi-rate multicast approaches in a realistic simulated LTE/LTE-A environment.
Hadoop classes in Mumbai
Best Android classes in Mumbai with job assistance.
Our features are:
expert guidance by IT industry professionals
lowest fees of 5000
practical exposure to handling projects
well-equipped lab
resume-writing guidance after the course
NV_path_rendering is an OpenGL extension for CUDA-capable NVIDIA GPUs for performing resolution-independent 2D rendering. Standards such as Scalable Vector Graphics (SVG), PostScript, PDF, Adobe Flash, and TrueType fonts rely on path rendering. With NV_path_rendering, this important class of rendering is accelerated by the GPU in a way that co-exists with conventional 3D rendering.
For more information see:
http://developer.nvidia.com/nv-path-rendering
Data-Oriented Programming with Clojure and Jackdaw (Charles Reese, Funding Ci... - confluent
When Funding Circle needed to scale its lending platform, we chose Kafka and Clojure. More than a programming language, Clojure is an interactive development environment with which you can build up an application function by function in a continuous unbroken flow. Since 2016 we have been developing our lending platform using Clojure and Kafka Streams, and today we process millions of transaction dollars daily. In 2018 we released "Jackdaw", our open-source Clojure library for working with Kafka Streams. In this talk, attendees will learn a radical new approach to building stream processing applications in a highly productive environment--one they can use immediately via Jackdaw or apply to their favorite programming system.
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds - Flink Forward
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Locating objects in images (“detection”) quickly and efficiently enables object tracking and counting applications on embedded visual sensors (fixed and mobile). By 2012, progress on techniques for detecting objects in images – a topic of perennial interest in computer vision – had plateaued, and techniques based on histogram of oriented gradients (HOG) were state of the art. Soon, though, convolutional neural networks (CNNs), in addition to classifying objects, were also beginning to become effective at simultaneously detecting objects. Research in CNN-based object detection was jump-started by the groundbreaking region-based CNN (R-CNN). We’ll follow the evolution of neural network algorithms for object detection, starting with R-CNN and proceeding to Fast R-CNN, Faster R-CNN, “You Only Look Once” (YOLO), and up to the latest Single Shot Multibox detector. In this talk, we’ll examine the successive innovations in performance and accuracy embodied in these algorithms – which is a good way to understand the insights behind effective neural-network-based object localization. We’ll also contrast bounding-box approaches with pixel-level segmentation approaches and present pros and cons.
zkStudyClub: CirC and Compiling Programs to Circuits - Alex Pruden
The programming languages community, the cryptography community, and others rely on translating programs in high-level source languages (e.g., C) to logical constraint representations. Unfortunately, building compilers for this task is difficult and time-consuming. In this work, Alex Ozdemir et al. present CirC, an infrastructure for building compilers for SNARKs that builds upon a common abstraction: stateless, non-deterministic computations called existentially quantified circuits, or EQCs.
We update our introduction to the DLA system here, covering its design, add-on functions, and applications. During 2018-2019, we developed the tools needed for IC simulation and verification, constructed a quantization-aware and hardware-aware training flow, and improved the automation of verification. We have verified this system on an FPGA and on a solid-state SoC.
Graphs are the natural data structure for representing relations. Graph algorithms show irregular memory access patterns, which causes distributed-memory parallel graph algorithms to do more communication than computation. The more work an algorithm generates, the more communication it needs to do. The amount of work can be reduced with frequent synchronization; however, the overhead of frequent synchronization reduces the performance of distributed-memory parallel graph algorithms. The Abstract Graph Machine (AGM) is a model that can control the amount of synchronization and the amount of work generated by an algorithm.
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs - Pandey_G
Presentation for the paper C-SAW: A Framework for Graph Sampling and Random Walk on GPUs, published at SC20.
Paper link: https://arxiv.org/pdf/2009.09103.pdf
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P... - Databricks
The increasing availability of mobile phones with embedded GPS devices and sensors has spurred the use of vehicle telematics in recent years. Telematics provides detailed and continuous information of a vehicle such as the location, speed, and movement. Vehicle telematics can be further linked with other spatial data to provide context to understand driving behaviors. The collection of high-frequency telematics data results in huge volumes of data that must be processed efficiently. We present a solution that uses Apache Spark to load and transform large-scaled telematics data. We then present how to use machine learning on telematics data to derive insights about driving safety.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Fast & Energy-Efficient Breadth-First Search on a Single NUMA System
1. Fast & Energy-Efficient Breadth-First Search on a Single NUMA System
Yuichiro Yasui & Katsuki Fujisawa (Kyushu University & JST CREST)
Yukinori Sato (JAIST & JST CREST)
ISC14 (International Supercomputing Conference 2014)
Research Papers 08 - Energy Efficiency, June 26, 2014
2. Outline
1. Background
2. Fast computation of graph processing
– Related work and our previous contributions
3. Bottleneck analysis of our previous NUMA-optimized BFS
4. Our proposal: Degree-aware BFS
5. Performance evaluation of the proposed BFS
– Fast on the Graph500 benchmark
– Energy-efficient on the Green Graph500 benchmark
3. Background
• Large-scale graphs in various fields
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
– Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC
4. Graph analysis and important kernel BFS
• The cycle of graph analysis for understanding real networks
– Application fields: transportation, social network, cyber-security, bioinformatics
– Cycle (Steps 1 to 3): application field, relationships, graph, graph processing, results, understanding
– Graph processing kernels: concurrent search (breadth-first search), optimization (single source shortest path), edge-oriented (maximal independent set)
Breadth-first search (BFS)
• One of the most important and fundamental graph processing kernels
• Many algorithms and applications are based on it (max-flow and centrality)
• Low arithmetic intensity & irregular memory accesses
• Inputs: graph and source vertex
• Outputs: distance (Lv.) and predecessor of each vertex from the source
5. Target: NUMA arch. system
• 4-way Intel Xeon E5-4640 (Sandybridge-EP)
– 4 (# of CPU sockets)
– 8 (# of physical cores per socket)
– 2 (# of threads per core)
– Max. 4 x 8 x 2 = 64 threads
• NUMA node = CPU socket (8-core Xeon E5-4640, 16 logical cores, L2 cache per processor core, shared L3 cache) + local RAM
– Memory access to local RAM (fast)
– Memory access to remote RAM (slow)
• NUMA-aware computation
– Reduces and avoids memory accesses to remote RAM
6. Graph500 Benchmark
• Fast computation of graph processing is a significant topic in HPC
• The Graph500 benchmark measures computer performance using the TEPS ratio (# of traversed edges per second) in graph processing such as BFS (breadth-first search)
• Benchmark flow
– Input parameters: SCALE and edgefactor (= 16)
– 1. Generation (graph generation)
– 2. Construction (graph construction)
– 3. BFS x 64 (BFS followed by validation, 64 iterations)
– Results: SCALE, edgefactor, BFS time, traversed edges, TEPS; the score is the median TEPS ratio
• Kronecker graph
– Synthetic scale-free network generated by using the recursive Kronecker product
– 2^SCALE vertices and 2^SCALE x edgefactor edges
– e.g., SCALE 30 and edgefactor 16: 1 billion vertices and 17.2 billion edges
www.graph500.org
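The size formula and the TEPS ratio above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `kronecker_size` and `teps` are made up for this example and are not part of the Graph500 reference code.

```python
def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) for a Graph500 Kronecker graph.

    Vertices = 2^SCALE, edges = 2^SCALE * edgefactor, as stated on the slide.
    """
    num_vertices = 1 << scale
    num_edges = num_vertices * edgefactor
    return num_vertices, num_edges


def teps(traversed_edges, bfs_seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / bfs_seconds


if __name__ == "__main__":
    n, m = kronecker_size(30, 16)
    # About 1 billion vertices and 17.2 billion edges
    print(f"SCALE 30: {n:,} vertices, {m:,} edges")
```

For SCALE 30 and edgefactor 16 this gives 1,073,741,824 vertices and 17,179,869,184 edges, which matches the rounded figures on the slide.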
7. Level-synchronized parallel BFS (Top-down)
• Starts from the source vertex and executes the following two phases for each level
– Traversal: finds unvisited adjacent vertices from the current frontier QF and appends them to the neighbor queue QN
– Swap: swaps the frontier QF and the neighbor queue QN for the next level (Level k to Level k+1)
• The input of a BFS is a graph G = (V, E); the adjacency list A(v) contains the edges (v, w) ∈ E for each vertex v ∈ V
• A Graph500 submission must report five TEPS ratios: the minimum, first quartile, median, third quartile, and maximum

Algorithm 1: Level-synchronized Parallel BFS.
Input: G = (V, A): unweighted directed graph; s: source vertex.
Variables: QF: frontier queue; QN: neighbor queue; visited: vertices already visited.
Output: π(v): predecessor map of BFS tree.
  π(v) ← −1, ∀v ∈ V
  π(s) ← s
  visited ← {s}
  QF ← {s}
  QN ← ∅
  while QF ≠ ∅ do
    for v ∈ QF in parallel do
      for w ∈ A(v) do
        if w ∉ visited atomic then
          π(w) ← v
          visited ← visited ∪ {w}
          QN ← QN ∪ {w}
    QF ← QN
    QN ← ∅
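The traversal and swap phases can be sketched in Python as follows. This is a sequential illustration, not the authors' threaded implementation: the "in parallel" loop and the atomic visited check of Algorithm 1 are replaced by a plain loop, and the predecessor array doubles as the visited set (an assumption made to keep the example short).

```python
def level_synchronized_bfs(adj, source):
    """Top-down BFS over adjacency lists; returns the predecessor map.

    adj[v] lists the out-neighbors of v. parent[v] == -1 means v is
    unvisited, so no separate visited set is kept in this sketch.
    """
    parent = [-1] * len(adj)
    parent[source] = source
    frontier = [source]              # QF
    while frontier:
        neighbors = []               # QN
        # Traversal: scan out-going edges of the current frontier.
        for v in frontier:
            for w in adj[v]:
                if parent[w] == -1:
                    parent[w] = v
                    neighbors.append(w)
        # Swap: the neighbor queue becomes the next frontier.
        frontier = neighbors
    return parent


if __name__ == "__main__":
    adj = [[1, 2], [3], [3], [4], [], [0]]
    print(level_synchronized_bfs(adj, 0))
```

Vertex 5 in the example has an edge into the source but is not reachable from it, so its predecessor stays at -1.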
8. Hybrid-BFS (Direction-optimizing BFS) [Beamer2012]
• Chooses Top-down or Bottom-up at each level according to the frontier size
• Top-down algorithm
– Efficient for a small frontier
– Uses out-going edges
– Finds unvisited neighbors of the current frontier (Level k to Level k+1)
• Bottom-up algorithm
– Efficient for a large frontier
– Uses in-coming edges
– Checks candidates of neighbors (unvisited vertices) against the current frontier
– Skips unnecessary edge traversal

Observation of data accesses in the forward search (Top-down) and the backward search (Bottom-up)
• Data writes in the forward search: v → w
  Input: Directed graph G = (V, AF), Queue QF
  Data: Queue QN, visited, Tree π(v)
  QN ← ∅
  for v ∈ QF in parallel do
    for w ∈ AF(v) do
      if w ∉ visited atomic then
        π(w) ← v
        visited ← visited ∪ {w}
        QN ← QN ∪ {w}
  QF ← QN
• Data writes in the backward search: w → v
  Input: Directed graph G = (V, AB), Queue QF
  Data: Queue QN, visited, Tree π(v)
  QN ← ∅
  for w ∈ V \ visited in parallel do
    for v ∈ AB(w) do
      if v ∈ QF then
        π(w) ← v
        visited ← visited ∪ {w}
        QN ← QN ∪ {w}
        break
  QF ← QN
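A minimal Python sketch of the two steps and the per-level choice between them is given below. The switching rule is an assumption for illustration: it compares the frontier size with a fixed `threshold` parameter, which is a stand-in and not the heuristic used by Beamer2012 or by the authors. The code is sequential, so the atomic check is omitted.

```python
def top_down_step(adj_out, frontier, parent):
    """Scan out-going edges of frontier vertices (v -> w)."""
    next_frontier = []
    for v in frontier:
        for w in adj_out[v]:
            if parent[w] == -1:
                parent[w] = v
                next_frontier.append(w)
    return next_frontier


def bottom_up_step(adj_in, frontier, parent):
    """Scan in-coming edges of unvisited vertices (w -> v) and stop at
    the first neighbor that is in the frontier."""
    in_frontier = set(frontier)
    next_frontier = []
    for w in range(len(adj_in)):
        if parent[w] != -1:
            continue
        for v in adj_in[w]:
            if v in in_frontier:
                parent[w] = v
                next_frontier.append(w)
                break                # skips the remaining edges of w
    return next_frontier


def hybrid_bfs(adj_out, adj_in, source, threshold):
    """Choose a direction at each level; `threshold` is hypothetical."""
    parent = [-1] * len(adj_out)
    parent[source] = source
    frontier = [source]
    while frontier:
        if len(frontier) > threshold:
            frontier = bottom_up_step(adj_in, frontier, parent)
        else:
            frontier = top_down_step(adj_out, frontier, parent)
    return parent


if __name__ == "__main__":
    # Undirected example, so the same lists serve as adj_out and adj_in.
    adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
    print(hybrid_bfs(adj, adj, 0, threshold=1))
```

Both directions produce a valid BFS tree; they differ only in which edges are scanned, which is what the traversed-edge counts on the following slides measure.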
9. Hybrid-BFS (Direction-optimizing BFS) [Beamer2012]
• Chooses Top-down or Bottom-up at each level according to the number of traversed edges
• Hybrid-BFS reduces unnecessary edge traversals

Number of traversed edges of the forward search (Top-down) and the backward search (Bottom-up) for a Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30); Level is the distance from the source
Level   Top-down        Bottom-up       Hybrid
0       2               2,103,840,895   2
1       66,206          1,766,587,029   66,206
2       346,918,235     52,677,691      52,677,691
3       1,727,195,615   12,820,854      12,820,854
4       29,557,400      103,184         103,184
5       82,357          21,467          21,467
6       221             21,240          227
Total   2,103,820,036   3,936,072,360   65,689,631
Ratio   100.00%         187.09%         3.12%
10. NUMA-optimized BFS
• Clearly separates accesses to local and remote memory
– Edge traversal on local RAM
– All-gather of local queues and bitmaps on remote RAM
• At each level: traversal on local RAM, then swap on remote RAM
– Large frontier? Yes: NUMA-optimized Bottom-up; No: NUMA-optimized Top-down
– Swap aggregates the local frontier queues (QN0, QN1, QN2, QN3) into the frontier QF
• Searches local neighbors from the locally copied frontier
– Out-going edges for Top-down
– In-coming edges for Bottom-up
• NUMA-opt. requires two CSR graphs (※ not the same for an undirected graph)
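The separation of local traversal and all-gather swap can be imitated in plain Python. The sketch below is a simulation under stated assumptions, not the authors' implementation: vertices are split into contiguous blocks (one per simulated NUMA node), each block is scanned in an ordinary loop instead of by threads pinned to that node, and no memory is actually bound to a node. The function names are hypothetical.

```python
def partition_ranges(num_vertices, num_nodes):
    """Split vertex IDs into contiguous blocks, one per simulated node."""
    size = (num_vertices + num_nodes - 1) // num_nodes
    return [(k * size, min(num_vertices, (k + 1) * size))
            for k in range(num_nodes)]


def numa_bottom_up_level(adj_in, frontier_bitmap, parent, ranges):
    """One bottom-up level: local traversal, then all-gather swap."""
    local_queues = []
    # Traversal: each node scans only the vertices it owns and writes
    # only to its own local queue.
    for lo, hi in ranges:
        queue = []
        for w in range(lo, hi):
            if parent[w] != -1:
                continue
            for v in adj_in[w]:
                if frontier_bitmap[v]:
                    parent[w] = v
                    queue.append(w)
                    break
        local_queues.append(queue)
    # Swap: gather all local queues into the next frontier bitmap.
    next_bitmap = [False] * len(adj_in)
    for queue in local_queues:
        for w in queue:
            next_bitmap[w] = True
    return local_queues, next_bitmap


if __name__ == "__main__":
    adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
    parent = [0, -1, -1, -1, -1, -1]
    bitmap = [True, False, False, False, False, False]
    print(numa_bottom_up_level(adj, bitmap, parent, partition_ranges(6, 2)))
```

In a real NUMA setting the point of this layout is that the inner loops touch only node-local data, and the only cross-node traffic is the gather step.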
12. CPU Affinity and local memory binding
• ULIBC: Ubiquity Library for Intelligently Binding Cores
– Provides some routines for CPU affinity + local memory binding
– Manages each processor core (processor ID) by topology information as a tuple of (SMT ID, core ID, package ID)
– Processor ID: index of logical processor core
– Package ID: index of CPU socket
– Core ID: index of physical core in each CPU socket
– SMT ID: index of thread in each physical core
• CPU affinity
– 1. Detects online processors (those allocated to the current process, out of all processors) using the sched_getaffinity system call
– 2. Binds each thread to a logical core using the sched_setaffinity system call or the Intel compiler Thread Affinity Interface
(Figure: NUMA node 0 and NUMA node 1, each with cores and local RAM; some cores are in use by other processes.)
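The two affinity steps and the topology tuple can be illustrated in Python, whose `os.sched_getaffinity` wraps the same Linux system call named on the slide. This is not ULIBC's API. The mapping in `topology_tuple` assumes one particular processor-ID enumeration (all first SMT threads numbered before their siblings, sockets contiguous), which is a hypothetical layout chosen for the example; real systems should read the topology from the operating system.

```python
import os


def online_processors():
    """Logical processors this process may run on (step 1 on the slide)."""
    if hasattr(os, "sched_getaffinity"):      # available on Linux
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))   # fallback elsewhere


def topology_tuple(proc_id, num_packages=4, cores_per_package=8):
    """Map a processor ID to (SMT ID, core ID, package ID).

    Assumed enumeration (hypothetical):
    proc_id = smt * (packages * cores) + package * cores + core
    """
    per_smt = num_packages * cores_per_package
    smt_id, rest = divmod(proc_id, per_smt)
    package_id, core_id = divmod(rest, cores_per_package)
    return (smt_id, core_id, package_id)


if __name__ == "__main__":
    for proc_id in online_processors()[:4]:
        print(proc_id, topology_tuple(proc_id))
```

Step 2, binding, would correspond to `os.sched_setaffinity(0, {proc_id})` on Linux; it is left out of the runnable part because it changes the scheduling of the calling process.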
13. Related work: TEPS ratios on a single node
• Our BFS achieves 31.7 GTEPS for a Kronecker graph (SCALE 27)
• Reference code: 0.1 GTEPS
• Agarwal2010: NUMA-aware Top-down BFS, 4-way Intel Xeon 7560, 0.8 GTEPS (m/n = 64: 1.1 GTEPS)
• Beamer2012: Hybrid-BFS, 4-way Intel Xeon E5-8870, 5.1 GTEPS
• Yasui2013: NUMA-opt. Hybrid-BFS, 4-way Intel Xeon E5-4640, 11.1 GTEPS
• Yasui2014 (this paper): Degree-aware NUMA-opt. BFS, 4-way Intel Xeon E5-4650, 31.7 GTEPS
(Figure: GTEPS versus SCALE, 20 to 29, for the reference code, Agarwal2010, Beamer2012, Yasui2013, and Yasui2014; speedup annotations: x 2.2, x 2.6, x 5.9 faster.)
14. Breakdown of hybrid-BFS for a Kronecker graph with SCALE 27 (2^27 vertices and 2^31 edges)
Breakdown of traversed edges
• Most of the CPU time is spent in the Bottom-up steps of Hybrid-BFS
• In particular, the Bottom-up step at Level 2 accounts for almost all edge traversals: 88.1 % of the traversed edges, and the Bottom-up steps together for 99.9 %
Level   Step        #Traversed edges (Hybrid-BFS)
0       Top-down    22
1       Top-down    239,930
2       Bottom-up   150,006,673
3       Bottom-up   19,742,764
4       Bottom-up   139,817
5       Bottom-up   41,846
6       Top-down    260
Total   –           170,171,312
• The total is 4.0 % of the Top-down-only traversal and about 8 % of 2^31 = 2,147,483,648 (100 %)
Breakdown of vertex traversal
• Most of the vertex traversal is done by the Bottom-up steps of Hybrid-BFS
• Half of the vertices are unvisited
Category                  Vertices       Share
Visited, Top-down         283            0.0%
Visited, Bottom-up        63,035,833     47.0%
Unvisited, zero-degree    71,140,085     53.0%
Unvisited, isolated       41,527         0.0%
Total (= 2^27)            134,217,728    100.0%
15. Influence of ordering for adjacency vertices
• The computational complexity of the Bottom-up step depends on the ordering of the adjacent vertices of each vertex
– The Bottom-up loop over A(v) finds a frontier vertex and breaks; the loop count τ is the number of traversed adjacent vertices, and the remaining adjacent vertices are skipped
– Sorted adjacency list A(v): adjacent vertices w ordered by out-degree (descending order: high-degree first, low-degree last)
• The # of traversed edges is strongly affected by the ordering at Lv. 2; Descending is better

Table 3. Number of traversed edges in a BFS for Kronecker graph with scale 27. Step: T = Top-down, B = Bottom-up. Ascending, Randomized, and Descending are the adjacency orderings used by the Hybrid algorithm.
Level   Top-down        Bottom-up       Step   Ascending     Randomized    Descending
0       22              4,223,250,243   T      22            22            22
1       239,930         3,258,645,723   T      239,930       239,930       239,930
2       1,040,268,126   83,878,899      B      848,743,124   150,006,673   83,878,899
3       3,145,608,885   19,616,130      B      19,935,737    19,742,764    19,616,130
4       37,007,608      139,606         B      139,868       139,817       139,606
5       98,339          41,846          B      41,846        41,846        41,846
6       260             41,586          T      260           260           260
Total   4,223,223,170   7,585,614,033   –      869,100,787   170,171,312   103,916,693
%       100 %           179.6 %                20.6 %        4.0 %         2.5 %
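The effect of the adjacency ordering on the Bottom-up loop count can be reproduced on a toy graph. The sketch below is illustrative only: the function names are invented, the graph is undirected so degree stands in for out-degree, and the count covers a single bottom-up level.

```python
def sort_adjacency_by_degree(adj, descending=True):
    """Return adjacency lists with neighbors ordered by their degree."""
    degree = [len(neighbors) for neighbors in adj]
    return [sorted(neighbors, key=lambda w: degree[w], reverse=descending)
            for neighbors in adj]


def bottom_up_scanned_edges(adj, frontier, visited):
    """Count adjacency entries touched by one bottom-up level.

    Each unvisited vertex scans its list until it meets a frontier
    vertex (the loop count tau), then breaks.
    """
    in_frontier = set(frontier)
    scanned = 0
    for w in range(len(adj)):
        if w in visited:
            continue
        for v in adj[w]:
            scanned += 1
            if v in in_frontier:
                break
    return scanned


if __name__ == "__main__":
    # Hub 0 with neighbors 1..4; each of those also has a pendant vertex.
    adj = [[1, 2, 3, 4], [5, 0], [6, 0], [7, 0], [8, 0],
           [1], [2], [3], [4]]
    for descending in (False, True):
        ordered = sort_adjacency_by_degree(adj, descending)
        print(descending, bottom_up_scanned_edges(ordered, [0], {0}))
```

With the hub in the frontier, the descending order lets vertices 1 to 4 stop at their first entry, so fewer adjacency entries are scanned than with the ascending order.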
16. Analysis of loop count for each vertex
• Bottom-up found 46.8 % of the vertices
• The Descending order finds most vertices at the first loop (τ = 1, the first vertex of the adjacency list)
– Shares of vertices found at τ = 1 and at τ ≥ 2 shown on the slide (each pair sums to 46.8 %): 19.0 % + 27.8 %; 28.1 % + 18.7 %; 45.0 % + 1.8 %
Fig. 3. Distribution of the loop count τ of the bottom-up step at each level in a BFS (panels: Ascending, max τ = 5,873; Randomized, max τ = 58; Descending, max τ = 28; x-axis: loop count τ; y-axis: number of fixed vertices, 10^0 to 10^8; series: Lv. 2 to Lv. 5).
17. 3 features and Degree-aware BFS
1. Half of the vertices have no adjacent vertices
– Zero-degree opt.: suppression of zero-degree vertices using a renumbering technique for non-zero-degree vertices
2. The computational complexity of Bottom-up depends on the ordering of the adjacent vertices of each vertex
– Adjacency lists sorted by out-degree in descending order
3. Most vertices are found at the first loop of Bottom-up
– High-degree opt.: separated graph representation; highest-degree adjacency vertex list A+ and remaining CSR graph A-
– Standard CSR graph = A+ (the highest-degree adjacent vertex of each vertex, n entries) + A- (the remaining adjacent vertices, m - n entries)
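The renumbering and the A+ / A- split can be sketched in Python with plain lists in place of CSR arrays. The sketch assumes an undirected graph, so every neighbor of a non-zero-degree vertex also has non-zero degree; the function names and the use of `None` for "no frontier neighbor" are choices made for this example, not the authors' data layout.

```python
def renumber_nonzero(adj):
    """Drop zero-degree vertices and renumber the rest contiguously.

    Returns the new adjacency lists and the list of original IDs.
    """
    keep = [v for v, neighbors in enumerate(adj) if neighbors]
    new_id = {old: new for new, old in enumerate(keep)}
    new_adj = [[new_id[w] for w in adj[old]] for old in keep]
    return new_adj, keep


def split_highest_degree(adj):
    """Split each sorted list into A+ (first entry) and A- (the rest)."""
    degree = [len(neighbors) for neighbors in adj]
    a_plus, a_minus = [], []
    for neighbors in adj:
        ordered = sorted(neighbors, key=lambda w: degree[w], reverse=True)
        a_plus.append(ordered[0])      # highest-degree neighbor
        a_minus.append(ordered[1:])    # remaining neighbors
    return a_plus, a_minus


def find_parent(w, a_plus, a_minus, in_frontier):
    """Bottom-up check for one vertex: try A+ first, then scan A-."""
    if a_plus[w] in in_frontier:
        return a_plus[w]
    for v in a_minus[w]:
        if v in in_frontier:
            return v
    return None


if __name__ == "__main__":
    adj = [[2, 3], [], [0], [0, 4], [3], []]
    new_adj, keep = renumber_nonzero(adj)
    print(keep, new_adj, split_highest_degree(new_adj))
```

Because most vertices are found at τ = 1, the first check against the compact A+ array resolves most of them without touching the larger A- structure.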
21. Strong scaling on SGI Altix UV1000
• Kronecker graph with SCALE 30
– 1.07 billion vertices, 17.18 billion edges
• As the number of threads increases,
– the CPU time for local memory access improves
– the CPU time for remote memory access stays about the same
Threads           GE/s    Local     Remote    L : R
128               18.76   733 ms    182 ms    80% : 20%
256               26.17   438 ms    218 ms    67% : 33%
512 (one rack)    37.70   258 ms    197 ms    57% : 43%
(Figure: CPU time (ms) per level, Init and Lv. 0 to Lv. 7, split into Traversal (local) and Swap (all-gather), with the Bottom-up levels marked, for 128, 256, and 512 threads.)
• Rank 50; fastest single-node entry on the Nov. 2013 list
22. BFS Performances for Real networks
• Suitable for small-world networks
– Efficient for a low diameter and a large edgefactor
• Small-world: Twitter follow-ship network in 2009
– 61.6 million vertices & 1.47 billion edges
– 10.90 GTEPS (max. 24.09 GTEPS)
• Non-small-world: US road network
– 24 million vertices & 58 million edges
– 0.09 GTEPS (max. 0.11 GTEPS)
... faster than the former owing to its edgefactor being 1.68 times larger relatively. In addition, twitter and friendster show similar BFS performances of approximately 10 GTEPS because they have similar edgefactor and similar diameters. Therefore, we verify whether our BFS is affected by using both the edgefactor and diameter of the network. From these numerical results, we could achieve high performance for large-scale small-world networks with a large edgefactor.

Table 9. BFS performance of real-world network on Sandybridge-EP system.
                        Graph size               edgefactor   Diameter   GTEPS
Instance                n          m             m/n          diam'_G    min    1/4     median   3/4     max
wiki-Talk [23, 24]      2.39 M     5.02 M        2.1          8          0.29   0.61    0.75     0.87    1.26
USA-road-d [25]         23.95 M    58.33 M       2.4          8,098      0.07   0.08    0.09     0.09    0.11
LiveJournal [26, 27]    4.85 M     68.99 M       14.2         16         2.76   3.76    4.07     4.32    4.94
twitter [28]            61.58 M    1,468.37 M    23.8         16         7.58   10.02   10.90    12.68   24.09
friendster [29]         65.61 M    1,806.07 M    27.5         25         4.89   9.61    10.74    11.29   11.81
23. The Green Graph500 list in Nov. 2013
• Measures power efficiency using the TEPS/W ratio
• Our results on various systems such as Xeon servers and Android devices
• Graph500: 1. Generation, 2. Construction, 3. BFS phase (x 64); scored by the median TEPS ratio
• Green Graph500: measures power consumption (watts) during the BFS phase and reports TEPS/W
http://green.graph500.org
24. TEPS and TEPS/W on 4-way Xeon
CPU affinity ℓ x t (threads)   GTEPS    Power      MTEPS/W
#threads = 16
1 x 16 (16)                    7.92     364.9 W    21.71
2 x 8 (16)                     11.83    452.6 W    26.13
4 x 4 (16)                     13.96    517.8 W    26.96
#NUMA nodes = 4
4 x 8 (32)                     22.03    586.7 W    37.55
4 x 16 (64)                    29.03    639.1 W    45.43
• 4 x 16 is both the fastest and the most energy-efficient setting listed
(Figure: GTEPS of Degree-aware, NUMA-opt., and Reference BFS for the CPU affinities ℓ x t (number of threads): 1 x 1 (1), 4 x 1 (4), 4 x 2 (8), 1 x 16 (16), 2 x 8 (16), 4 x 4 (16), 2 x 16 (32), 4 x 8 (32), 4 x 16 (64), and w/o affinity (64).)
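The TEPS/W figures follow from dividing the TEPS ratio by the measured power. The short Python sketch below recomputes them from the GTEPS and watt values quoted on the slide; because those inputs are already rounded, the results agree with the quoted MTEPS/W only to within rounding. The function name and the dictionary of settings are illustrative.

```python
def mteps_per_watt(gteps, watts):
    """Energy efficiency in MTEPS/W from GTEPS and average power in W."""
    return gteps * 1000.0 / watts


# (GTEPS, watts) per CPU-affinity setting, copied from the slide.
MEASUREMENTS = {
    "1x16": (7.92, 364.9),
    "2x8": (11.83, 452.6),
    "4x4": (13.96, 517.8),
    "4x8": (22.03, 586.7),
    "4x16": (29.03, 639.1),
}

if __name__ == "__main__":
    for name, (gteps, watts) in MEASUREMENTS.items():
        print(f"{name}: {mteps_per_watt(gteps, watts):.2f} MTEPS/W")
    best = max(MEASUREMENTS, key=lambda k: mteps_per_watt(*MEASUREMENTS[k]))
    print("most energy-efficient:", best)
```

Power grows far more slowly than throughput as threads are added, which is why the fastest setting is also the most energy-efficient one here.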
25. Green Graph500 on Xperia-A-SO-04E
• Manages to be both fast and energy-efficient
– 153.17 MTEPS/W (477.64 MTEPS) at roughly the same power consumption as the reference code (about 1 MTEPS/W): x150 more energy-efficient
– #1 in the Nov. 2013 list
• Smartphone: SONY Xperia-A-SO-04E
– CPU: 4-core Snapdragon
– RAM: 2 GB
Fig. 7. MTEPS of reference BFS and Degree-aware BFS on XperiaA SO-04E (x-axis: SCALE, 10 to 20; series: Reference (p = 4), Degree-aware BFS (p = 4)).
... suggesting that the effective power is not strongly affected by the number of threads and the algorithm used. With regard to energy-efficient computation, our BFS is around 100 times faster than the reference code for roughly the same effective power of 3.0 W; specifically, our BFS shows an energy-efficient performance of 153.17 MTEPS/W.

Table 10. Energy efficiency of BFS for Kronecker graph on XperiaA SO-04E.
Implementation        SCALE   MTEPS    watt   MTEPS/W
Reference (p = 1)     20      3.25     3.15   1.03
Reference (p = 4)     20      4.58     3.22   1.42
This study (p = 1)    20      136.29   3.23   42.25
This study (p = 2)    20      248.08   2.99   82.92
This study (p = 4)    20      477.63   3.12   153.17
26. Conclusion
• Degree-aware BFS
– Speedup techniques that consider the vertex degree
– 1) Zero-degree vertex suppression
– 2) Separated graph representation
– 2.68 times faster than our previous algorithm
• Our BFS is the fastest single-node implementation
– 37.7 GTEPS for SCALE 30 on SGI Altix UV1000 (one rack)
• Investigated affinity and power consumption
– The 4 sockets x 16 threads affinity gives the highest MTEPS and MTEPS/W on a 4-way Intel Xeon server
– First position in the small data category of the 2nd Green Graph500 on an Android device