Vector processors operate on arrays of data (vectors) with single instructions, unlike scalar processors which operate on single data items. This allows vector processors to greatly improve performance for tasks involving large datasets like numerical simulation. Early vector supercomputers dominated from the 1970s-1990s, but conventional microprocessors now use SIMD instructions to provide a form of vector processing. Modern GPUs also use vector-like processing for graphics and general computing.
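To make the contrast concrete, here is a minimal, hedged sketch (my own, not from the source) of the same element-wise addition written as a scalar loop and with AVX SIMD intrinsics; it assumes an x86-64 compiler with AVX support and an array length that is a multiple of 8.

```cpp
#include <immintrin.h>  // AVX intrinsics (assumes x86-64 with AVX available)

// Scalar version: one addition per instruction.
void add_scalar(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// SIMD version: eight additions per instruction, the same idea vector
// supercomputers applied to much longer hardware vectors.
void add_avx(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {            // assumes n % 8 == 0
        __m256 va = _mm256_loadu_ps(a + i);     // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));  // add and store 8 floats
    }
}
```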
This document discusses optimizations for TCP/IP networking performance on multicore systems. It describes several inefficiencies in the Linux kernel TCP/IP stack related to shared resources between cores, broken data locality, and per-packet processing overhead. It then introduces mTCP, a user-level TCP/IP stack that addresses these issues through a thread model with pairwise threading, batch packet processing from I/O to applications, and a BSD-like socket API. mTCP achieves a 2.35x performance improvement over the kernel TCP/IP stack on a web server workload.
From Rack scale computers to Warehouse scale computers (Ryousei Takano)
This document discusses the transition from rack-scale computers to warehouse-scale computers through the disaggregation of technologies. It provides examples of rack-scale architectures like Open Compute Project and Intel Rack Scale Architecture. For warehouse-scale computers, it examines HP's The Machine project using application-specific cores, universal memory, and photonics fabric. It also outlines UC Berkeley's FireBox project utilizing 1 terabit/sec optical fibers, many-core systems-on-chip, and non-volatile memory modules connected via high-radix photonic switches.
Exploring the Performance Impact of Virtualization on an HPC Cloud (Ryousei Takano)
The document evaluates the performance impact of virtualization on high-performance computing (HPC) clouds. Experiments were conducted on the AIST Super Green Cloud, a 155-node HPC cluster. Benchmark results show that while PCI passthrough mitigates I/O overhead, virtualization still incurs performance penalties for MPI collectives as node counts increase. Application benchmarks demonstrate overhead is limited to around 5%. The study concludes HPC clouds are promising due to utilization improvements from virtualization, but further optimization of virtual machine placement and pass-through technologies could help reduce overhead.
The document summarizes the author's participation report at the IEEE CloudCom 2014 conference. Some key points include:
- The author attended sessions on virtualization and HPC on cloud.
- Presentations had a strong academic focus and many presenters were Asian.
- Eight papers on HPC on cloud covered topics like reliability, energy efficiency, performance metrics, and applications like Monte Carlo simulations.
GPUs are specialized processors designed for graphics processing. CUDA (Compute Unified Device Architecture) allows general purpose programming on NVIDIA GPUs. CUDA programs launch kernels across a grid of blocks, with each block containing multiple threads that can cooperate. Threads have unique IDs and can access different memory types including shared, global, and constant memory. Applications that map well to this architecture include physics simulations, image processing, and other data-parallel workloads. The future of CUDA includes more general purpose uses through GPGPU and improvements in virtual memory, size, and cooling.
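As an illustration of the grid/block/thread model described above, here is a minimal CUDA C++ sketch (mine, not from the source) that launches a kernel over a 1D grid; each thread derives its global ID from its block and thread indices and processes one array element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element; blockIdx/threadIdx give the unique thread ID.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid of blocks
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);            // kernel launch
    cudaDeviceSynchronize();
    cudaFree(d);
    printf("done\n");
    return 0;
}
```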
Early Benchmarking Results for Neuromorphic Computing (DESMOND YUEN)
This document summarizes early benchmarking results for neuromorphic computing using Intel's Loihi chip. It finds that Loihi provides orders of magnitude gains over CPUs and GPUs for certain workloads that are directly trained on the chip or use novel bio-inspired algorithms. These include online learning, adaptive control, event-based vision and tactile sensing, constraint satisfaction problems, and nearest neighbor search. Larger networks and problems tend to provide greater performance gains with Loihi.
Hardware for deep learning includes CPUs, GPUs, FPGAs, and ASICs. CPUs are general purpose but support deep learning through instructions like AVX-512 and libraries. GPUs like NVIDIA and AMD models are commonly used due to high parallelism and memory bandwidth. FPGAs offer high efficiency but require specialized programming. ASICs like Google's TPU are customized for deep learning and provide high performance but limited flexibility. Emerging hardware aims to improve efficiency and better match neural network computations.
This document discusses various types and implementations of parallel architectures. It covers parallelism concepts like data, thread, and instruction level parallelism. It also describes Flynn's taxonomy of parallel systems and different parallel machine designs like SIMD, vector, VLIW, and MIMD architectures. Specific examples of parallel supercomputers are provided like Cray, Connection Machine, and SGI Origin. Challenges in parallel programming and portability are also summarized.
The document discusses Nvidia's Tesla personal supercomputer, which uses multiple Tesla C1060 GPUs to provide supercomputer-level performance. Each C1060 GPU contains 240 processor cores running at 1.296GHz, 4GB of memory, and provides 933 gigaflops of processing power. The GPUs use Nvidia's CUDA parallel computing architecture and can accelerate applications up to 250 times compared to standard PCs. The supercomputers are aimed at scientific and medical research by providing affordable access to high-performance computing.
This document compares the performance of six supercomputers with over 1,000 processors each on various synthetic benchmarks and applications. The supercomputers have different node sizes, processor counts, and interconnect technologies. Performance is analyzed using a model that breaks down run time into computation, communication, and I/O components. Results show that different systems perform best for different benchmarks and applications, depending on factors like the communication requirements and how well the application scales. The Blue Gene supercomputer shows strong scaling and I/O performance but has limitations in processor speed and memory size per node.
1. Building exascale computers requires moving to sub-nanometer scales and steering individual electrons to solve problems more efficiently.
2. Moving data is a major challenge, as moving data off-chip uses 200x more energy than computing with it on-chip.
3. Future computers should optimize for data movement at all levels, from system design to microarchitecture, to minimize energy usage.
This document discusses GPU memory and how to optimize memory access patterns. It begins with an example of how a wide memory bus is used in GPUs. It describes the importance of coalescing memory accesses from multiple threads to fully utilize the bus bandwidth. It also discusses memory bank conflicts that can occur if multiple threads access the same memory bank, degrading performance. The key to high GPU memory bandwidth is coalescing accesses and avoiding bank conflicts.
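The access-pattern point can be made concrete with a small hedged CUDA C++ sketch (mine, not the document's code): in the first kernel consecutive threads read consecutive addresses, so a warp's loads coalesce into a few wide transactions; in the second, a large stride spreads the warp's loads across many transactions and wastes bus bandwidth.

```cuda
// Coalesced: thread i reads element i, so a warp touches one contiguous region.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read addresses far apart, defeating coalescing.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long j = (long)i * stride % n;   // scattered access pattern
    if (i < n) out[i] = in[j];
}
```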
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
This document discusses the history and evolution of supercomputer architectures from the 1960s to present. Early supercomputers relied on compact designs and local parallelism. Starting in the 1990s, massively parallel systems with thousands of processors became common. Modern supercomputers can use over 100,000 processors connected by fast interconnects and may utilize GPUs, computer clusters, or distributed computing networks to achieve petaflop-scale performance. Vector processing is also discussed as an important technique used in many historical supercomputers to improve performance.
High performance computing - building blocks, production & perspective (Jason Shih)
This document provides an overview of high performance computing (HPC). It defines HPC as using supercomputers and computer clusters to solve advanced computation problems quickly and efficiently through parallel processing. The document discusses the building blocks of HPC systems including CPUs, memory, power consumption, and number of cores. It also outlines some common applications of HPC in fields like physics, engineering, and life sciences. Finally, it traces the evolution of HPC technologies over decades from early mainframes and supercomputers to today's clusters and parallel systems.
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible under traditional optical microscopes and their sizes are measured in just tens of Angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other factors, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
This document provides an overview of supercomputers including their uses, history, architectures, and key systems. Supercomputers are the most powerful and expensive computers available designed to solve complex problems quickly. They are used for tasks like weather forecasting, nuclear simulation, and cryptography. They differ from personal computers in cost, environment, programming languages, and lack of common peripherals. Major early systems included the CDC 6600 and Cray-1. Current systems use clustering, symmetric multiprocessing, or massively parallel processing architectures. The top systems include BlueGene/L, Columbia, and Earth Simulator.
The field of artificial intelligence (AI) has witnessed tremendous growth in recent years with the advent of Deep Neural Networks (DNNs) that surpass humans in a variety of cognitive tasks.
Supercomputers are the fastest and most powerful computers designed to solve complex problems quickly. They were introduced in the 1960s and are used for nuclear simulation, structural analysis, crash analysis, climatic predictions, cryptography, and computational chemistry. Modern supercomputer architectures trade processor speed for low power consumption to support more processors at room temperature. The IBM Blue Gene supercomputer and K computer are examples of large, energy efficient supercomputing systems that use different processor and cooling approaches.
Flow-centric Computing - A Datacenter Architecture in the Post Moore Era (Ryousei Takano)
1) The document proposes a new "flow-centric computing" data center architecture for the post-Moore era that focuses on data flows.
2) It involves disaggregating server components and reassembling them as "slices" consisting of task-specific processors and storage connected by an optical network to efficiently process data.
3) The authors expect optical networks to enable high-speed communication between processors, replacing general CPUs, and to potentially revolutionize how data is processed in future data centers.
Supercomputers are computers with a high level of computational capacity compared to general purpose computers. They are used for highly calculation intensive tasks like weather forecasting, nuclear weapon simulations, and data analysis. The history of supercomputing dates back to the 1960s. Modern supercomputers use parallel processing techniques like distributed and shared memory architectures to achieve very high performance measured in floating-point operations per second. Common supercomputer architectures include UMA, ccNUMA, massively parallel processing, and clusters.
This document provides an overview of supercomputers including their common uses, challenges, history and top systems. Some key points:
- Supercomputers are used for highly complex tasks like weather forecasting, climate modeling, and simulating nuclear weapons. They can process vast amounts of data and perform quadrillions of calculations per second.
- Major challenges include cooling systems to manage the large amounts of heat generated and high-speed data transfer between components.
- The US and Japan have historically dominated supercomputing. Early systems included the CDC 6600 (1964) and Cray-1 (1976). Modern systems use thousands of processors networked together.
- The top supercomputers today include China's Tianhe
Supercomputers are highly powerful computers that can perform massive calculations rapidly. They consist of tens of thousands of processors capable of billions or trillions of calculations per second. Supercomputers are used for data mining, predicting climate change, intelligence work, and nuclear weapon testing. They generate huge amounts of heat and data and consume large amounts of electricity. The fastest supercomputer is Summit, with a peak performance of about 200 petaflops. In India, the Aaditya supercomputer ranks among the top 500 and is used for climate research, while Param Yuva II performs at 524 teraflops and will be used for various research areas. Supercomputers have numerous benefits and uses, and will likely continue advancing in the future.
This project deals with the warehouse scale computers that power all the internet services we use today. It covers the hardware blocks used in a Google WSC, as well as the architecture of hardware accelerators such as the Graphics Processing Unit and the Tensor Processing Unit, which are highly useful for warehouse scale machines to run heavy tasks and to support application-specific machine learning and deep learning workloads. The project also explains the energy efficiency of the processors used by the Google WSC to achieve high performance, and the performance enhancement mechanisms the WSC employs.
This document discusses parallel processing and the evolution of computer systems. It covers several topics:
- The evolution of computer systems from vacuum tubes to integrated circuits, organized into generations.
- Concepts of parallel processing including Flynn's classification of computer architectures based on instruction and data streams.
- Parallel processing mechanisms in uniprocessor systems including pipelining and memory hierarchies.
- Three classes of parallel computer structures: pipeline computers, array processors, and multiprocessor systems.
- Architectural classification schemes including Flynn's, Feng's based on serial vs parallel processing, and Handler's based on parallelism levels.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A multi-core processor contains two or more independent processing units called cores that can execute program instructions simultaneously. This increases overall speed for programs that can be parallelized. Manufacturers integrate multiple cores onto a single chip using chip multiprocessing. Each core functions similarly to a single-core processor, running threads through time-slicing, while the operating system perceives each core as a separate processor and maps threads across cores. Multi-core architecture provides performance gains through parallel computing but software must be optimized for parallelism.
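As a small illustration of software exposing that parallelism (my own hedged sketch, not from the source), the program below splits an array sum across as many C++ threads as the machine reports hardware cores; the operating system is then free to schedule each thread on a different core.

```cpp
#include <cstdio>
#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<double> data(1'000'000, 1.0);
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> partial(cores, 0.0);
    std::vector<std::thread> workers;
    size_t chunk = data.size() / cores;
    for (unsigned t = 0; t < cores; ++t) {
        size_t begin = t * chunk;
        size_t end = (t + 1 == cores) ? data.size() : begin + chunk;
        // Each thread sums its own chunk; no sharing, so no locks needed.
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    printf("sum = %f using %u threads\n", total, cores);
    return 0;
}
```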
Parallel computing is a computing architecture paradigm in which the processing required to solve a problem is performed by more than one processor working in parallel.
Pragmatic optimization in modern programming - modern computer architecture c... (Marina Kolpakova)
There are three key aspects of computer architecture: instruction set architecture, microarchitecture, and hardware design. Modern architectures aim to either hide latency or maximize throughput. Reduced instruction set computers (RISC) became popular due to simpler decoding and pipelining allowing higher clock speeds. While complex instruction set computers (CISC) focused on code density, RISC architectures are now dominant due to their efficiency. Very long instruction word (VLIW) and vector processors targeted specialized workloads but their concepts influence modern designs. Load-store RISC architectures with fixed-width instructions and minimal addressing modes provide an optimal balance between performance and efficiency.
I understand that physics and hardware emmaded on the use of finete .pdf (anil0878)
I understand that physics and hardware advances, embodied in the use of finite element methods to predict fluid flow over airplane wings, suggest that progress is likely to continue. However, in recent years this progress has been achieved through greatly increased hardware complexity with the rise of multicore and manycore processors, and this is affecting the ability of application developers to achieve the full potential of these systems. Currently, performance is measured on a dense matrix-matrix multiplication test which has questionable relevance to real applications, despite the incredible advances in processor technology and all of the accompanying aspects of computer system design, such as the memory subsystem and networking.
In embedded systems, it seems to be a combination of both hardware and software, acting as a combined function within the system; applications must be developed accordingly to achieve the full potential of these systems built on advanced processor technology.
Hardware
(1) Memory
Advances in memory technology have struggled to keep pace with the phenomenal advances in processors. This difficulty in improving the main memory bandwidth led to the development of a cache hierarchy, with data being held in different cache levels within the processor. The idea is that instead of fetching the required data multiple times from the main memory, it is instead brought into the cache once and re-used multiple times. Intel allocates about half of the chip to cache, with the largest LLC (last-level cache) being 30 MB in size. IBM's new Power8 CPU has an even larger L3 cache of up to 96 MB [4]. By contrast, the largest L2 cache in NVIDIA's GPUs is only 1.5 MB. These different hardware design choices are motivated by careful consideration of the range of applications being run by typical users. One complication which has become more common and more important in the past few years is non-uniform memory access. Ten years ago, most shared-memory multiprocessors would have several CPUs sharing a memory bus to access a single main memory. A final comment on the memory subsystem concerns the energy cost of moving data compared to performing a single floating-point computation.
(2) Processors
Earlier CPUs had a single processing core, and the increase in performance came partly from an increase in the number of computational pipelines, but mainly through an increase in clock frequency. Unfortunately, power consumption is approximately proportional to the cube of the frequency, and this led to CPUs with a power consumption of up to 250 W. CPUs address memory bandwidth limitations by devoting half or more of the chip to LLC, so that small applications can be held entirely within the cache. They address the 200-cycle latency issue by using very complex cores which are capable of out-of-order execution. By contrast, GPUs adopt a very different design philosophy because of the different needs of the graphical applications they target. A GPU usually has a number of functional units.
Design and Implementation of Quintuple Processor Architecture Using FPGA (IJERA Editor)
The advanced quintuple processor core reflects a design philosophy that has become mainstream in scientific and engineering applications. The increasing performance and gate capacity of recent FPGA devices permit complex logic systems to be implemented on a single programmable device. Embedded multiprocessors face a new problem with thread synchronization, caused by the distributed memory: when thread synchronization is violated, processors can access the same value at the same time. Processor performance can basically be increased by adopting clock scaling techniques and microarchitectural enhancements. Therefore, a new architecture called Advanced Concurrent Computing was designed and implemented on an FPGA chip using VHDL. The Advanced Concurrent Computing architecture makes simultaneous use of both parallel and distributed computing. The full architecture of the quintuple processor core is designed to perform arithmetic, logical, shifting, and bit-manipulation operations. The proposed advanced quintuple processor core contains homogeneous RISC processors, augmented with pipelined processing units, a multi-bus organization, and I/O ports, along with the other functional elements required to implement embedded SoC solutions. For the designed quintuple core, performance issues such as area, speed, power dissipation, and propagation delay are analyzed at 90 nm process technology using the Xilinx tool.
The document compares computers of the past and present. It discusses how computers have evolved from early mechanical calculating devices to modern electronic computers. In the past, computers were only used for calculations but now they are used for a wide variety of tasks. The document then summarizes the history of computers from the abacus and Napier's bones to early electronic computers like the ENIAC. It also discusses the classification of computers from supercomputers to microcomputers and provides examples from each category along with their specifications.
Highlighted notes of article while studying Concurrent Data Structures, CSE:
Real-world Concurrency
Bryan Cantrill and Jeff Bonwick,
Sun Microsystems
ACM Queue, September 2008 https://doi.org/10.1145/1454456.1454462
BRYAN CANTRILL is a Distinguished Engineer at Sun Microsystems, where he has worked on concurrent systems since coming to Sun to work with Jeff Bonwick on Solaris performance in 1996. Along with colleagues Mike Shapiro and Adam Leventhal, Cantrill developed DTrace, a facility for dynamic instrumentation of production systems that was directly inspired by his frustration in understanding the behavior of concurrent systems.
JEFF BONWICK is a Fellow at Sun Microsystems, where he has worked on concurrent systems since 1990. He is best known for inventing and leading the development of Sun’s ZFS (Zettabyte File System), but prior to this he was known for having written (or rather, rewritten) many of the most parallel subsystems in the Solaris kernel, including the synchronization primitives, the kernel memory allocator, and the thread-blocking mechanism.
https://dl.acm.org/doi/10.1145/1454456.1454462
The document provides information about processors and CPU terminology. It defines terms like data bus, address bus, registers, instruction set, and cache. It describes how CPUs work using transistors and how manufacturers like Intel and AMD make CPUs. It outlines the components of CPUs like execution cores, arithmetic logic units, and memory controllers. The document provides a timeline of CPUs from the 1970s to recent years to show advances in processing power and core counts.
The document provides an introduction and overview of the Crusoe microprocessor developed by Transmeta Corp. Some key points:
- Crusoe uses a hybrid software/hardware approach, with software (Code Morphing) translating x86 binaries to native VLIW instructions at runtime.
- This decouples the x86 ISA from the underlying hardware, allowing the hardware to be simplified and changed without affecting software compatibility.
- The Crusoe processor uses a VLIW architecture with a 128-bit instruction word that can contain up to 4 "atoms" to be executed in parallel.
- Code Morphing software resides in ROM and acts as an emulator, caching translations in a translation cache for improved
The document provides information about processors:
- A processor, also known as the central processing unit (CPU), is an electronic circuit that executes instructions of a computer program and processes data. It handles the central management functions of a computer.
- The main components of a processor include the execution unit, branch predictor, floating point unit, primary cache, and bus interfaces.
- Processor speed is measured by its clock speed in gigahertz (GHz), and processors come in different architectures such as AMD and Intel. Parallel processing uses multiple CPUs or processor cores simultaneously.
A machine cycle consists of fetch, decode, and execute steps performed by a CPU. During fetch, an instruction is retrieved from memory and placed in an instruction register. In decode, the instruction is broken down into components. Execute then carries out the specified operation. Machine cycles occur millions of times per second as the CPU continuously runs programs stored in memory.
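A toy interpreter makes the fetch-decode-execute cycle tangible; this is a hedged sketch of the idea (the instruction encoding and opcodes are invented for illustration, not from the source).

```cpp
#include <cstdio>
#include <cstdint>
#include <vector>

int main() {
    // Invented 2-byte instructions: opcode, operand. 0 = LOAD acc, 1 = ADD, 2 = HALT.
    std::vector<uint8_t> memory = {0, 5, 1, 7, 2, 0};
    uint8_t acc = 0;
    size_t pc = 0;
    bool running = true;
    while (running) {
        uint8_t opcode = memory[pc];        // fetch: read the instruction
        uint8_t operand = memory[pc + 1];
        pc += 2;
        switch (opcode) {                   // decode: pick the operation
            case 0: acc = operand; break;   // execute: LOAD immediate
            case 1: acc += operand; break;  // execute: ADD immediate
            case 2: running = false; break; // execute: HALT
        }
    }
    printf("acc = %d\n", acc);              // prints 12
    return 0;
}
```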
Implementation of RISC-Based Architecture for Low power applications (IOSR Journals)
This document describes the design and implementation of a 32-bit RISC processor intended for low power applications. The processor was developed using VHDL and implemented on a Xilinx Spartan-3E FPGA. It has a simple instruction set, program and data memories, general purpose registers, and an ALU. The processor follows a multi-cycle execution model. Simulation results show the processor executing instructions correctly in a single clock cycle, as intended for a RISC design. Power analysis shows the proposed RISC architecture consumes 132mW of power, half of previous designs, demonstrating its efficiency for low power applications.
Stream processing is a computer programming paradigm that allows for parallel processing of data streams. It involves applying the same kernel function to each element in a stream. Stream processing is suitable for applications involving large datasets where each data element can be processed independently, such as audio, video, and signal processing. Modern GPUs use a stream processing approach to achieve high performance by running kernels on multiple data elements simultaneously.
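The paradigm can be sketched in a few lines of C++ (a hedged illustration, not the document's code): the same kernel function is applied independently to every element of a stream, which is exactly the structure a GPU can execute across thousands of threads.

```cpp
#include <cstdio>
#include <algorithm>
#include <cmath>
#include <vector>

// The "kernel": applied independently to each stream element.
float gain(float sample) { return std::tanh(2.0f * sample); }

int main() {
    std::vector<float> stream(8);
    for (size_t i = 0; i < stream.size(); ++i) stream[i] = 0.1f * i;
    // Map the kernel over the whole stream; each element is independent,
    // so this loop could run in parallel on a GPU or SIMD unit.
    std::transform(stream.begin(), stream.end(), stream.begin(), gain);
    for (float s : stream) printf("%.3f ", s);
    printf("\n");
    return 0;
}
```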
This document provides an introduction to parallel computing. It begins with definitions of parallel computing as using multiple compute resources simultaneously to solve problems. Popular parallel architectures include shared memory, where all processors can access a common memory, and distributed memory, where each processor has its own local memory and they communicate over a network. The document discusses key parallel computing concepts and terminology such as Flynn's taxonomy, parallel overhead, scalability, and memory models including uniform memory access (UMA), non-uniform memory access (NUMA), and distributed memory. It aims to provide background on parallel computing topics before examining how to parallelize different types of programs.
This document provides an overview of modern computer memory systems and discusses how programmers can optimize their code to improve performance. It begins with an introduction explaining that memory access has become the limiting factor for most programs as CPUs have gotten faster. It then covers the basic structure of memory subsystems, including CPU caches, memory controllers, RAM hardware, and non-uniform memory access (NUMA) architectures. The document aims to explain these concepts so programmers can understand how to optimize their code to utilize memory resources more efficiently.
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin... (Subhajit Sahu)
TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system.
The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).
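The ordering rule can be captured in a tiny hedged sketch (the names and structure are mine, not Spanner's actual API): the clock reading is modeled as an [earliest, latest] interval, and two events are known to be ordered only when their intervals do not overlap.

```cpp
#include <cstdio>

// Simplified model of a TrueTime-style interval; field names are illustrative only.
struct TTInterval {
    double earliest;  // lower bound on true time
    double latest;    // upper bound on true time
};

// Event A definitely happened before event B iff A's interval ends
// before B's interval begins (the intervals do not overlap).
bool definitelyBefore(const TTInterval& a, const TTInterval& b) {
    return a.latest < b.earliest;
}

int main() {
    TTInterval commitA{10.0, 17.0};  // uncertainty window of event A
    TTInterval commitB{18.0, 25.0};  // uncertainty window of event B
    printf("A before B: %s\n", definitelyBefore(commitA, commitB) ? "yes" : "unknown");
    return 0;
}
```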
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting Bitset for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.
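For concreteness, a hedged CSR sketch follows (field names are mine, not from the notes): source-offsets plus destination-indices, with a binary-search edge check over sorted per-vertex edge lists, which is the log(E)-per-lookup case discussed above.

```cpp
#include <cstdio>
#include <algorithm>
#include <vector>

// Minimal CSR: offsets[v]..offsets[v+1] indexes v's sorted destination list.
struct CSR {
    std::vector<int> offsets;
    std::vector<int> destinations;

    bool hasEdge(int u, int v) const {
        auto first = destinations.begin() + offsets[u];
        auto last  = destinations.begin() + offsets[u + 1];
        return std::binary_search(first, last, v);  // ~log(E) accesses per check
    }
};

int main() {
    // Graph: 0->{1,2}, 1->{2}, 2->{}
    CSR g{{0, 2, 3, 3}, {1, 2, 2}};
    printf("0->2: %d, 2->0: %d\n", g.hasEdge(0, 2), g.hasEdge(2, 0));
    return 0;
}
```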
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
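A hedged sketch of the first idea (skipping already-converged vertices) is shown below; it is a plain power-iteration PageRank with a per-vertex converged flag, my own simplification for illustration rather than the STICD implementation, and the graph and tolerance are invented.

```cpp
#include <cstdio>
#include <cmath>
#include <vector>

int main() {
    // Tiny directed graph given as in-neighbour lists; out[u] is u's out-degree.
    std::vector<std::vector<int>> in = {{1, 2}, {2}, {0, 1}};
    std::vector<int> out = {1, 2, 2};
    const int N = 3; const double d = 0.85, tol = 1e-10;
    std::vector<double> rank(N, 1.0 / N), next(N);
    std::vector<bool> converged(N, false);

    for (int iter = 0; iter < 100; ++iter) {
        for (int v = 0; v < N; ++v) {
            if (converged[v]) { next[v] = rank[v]; continue; }  // skip converged vertex
            double sum = 0.0;
            for (int u : in[v]) sum += rank[u] / out[u];
            next[v] = (1.0 - d) / N + d * sum;
            if (std::fabs(next[v] - rank[v]) < tol) converged[v] = true;
        }
        rank.swap(next);
    }
    for (int v = 0; v < N; ++v) printf("rank[%d] = %.4f\n", v, rank[v]);
    return 0;
}
```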
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation. The notes cover the following experiments:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Experiments with Primitive operations : SHORT REPORT / NOTES (Subhajit Sahu)
This includes:
- Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
- Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
- Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
- Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Adjusting OpenMP PageRank : SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate pagerank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement pagerank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
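A hedged sketch of the uniform approach follows (not the report's code; the rank-step signature and graph are invented): the per-vertex rank update is parallelized with an OpenMP parallel-for, whereas in the hybrid variant small primitives such as a final sum would simply stay sequential. It assumes compilation with OpenMP enabled (e.g., -fopenmp).

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

// One PageRank-style update step over in-neighbour lists, "uniform" OpenMP mode:
// the per-vertex work is divided among threads with a parallel for.
void rankStep(const std::vector<std::vector<int>>& in, const std::vector<int>& out,
              const std::vector<double>& rank, std::vector<double>& next, double d) {
    int N = (int)rank.size();
    #pragma omp parallel for schedule(dynamic, 1024)
    for (int v = 0; v < N; ++v) {
        double sum = 0.0;
        for (int u : in[v]) sum += rank[u] / out[u];
        next[v] = (1.0 - d) / N + d * sum;
    }
}

int main() {
    std::vector<std::vector<int>> in = {{1, 2}, {2}, {0, 1}};
    std::vector<int> out = {1, 2, 2};
    std::vector<double> rank(3, 1.0 / 3), next(3);
    rankStep(in, out, rank, next, 0.85);
    printf("next[0] = %.4f (threads: %d)\n", next[0], omp_get_max_threads());
    return 0;
}
```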
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o... (Subhajit Sahu)
Below are the important points I note from the 2020 paper by Martin Grohe:
- 1-WL distinguishes almost all graphs, in a probabilistic sense
- Classical WL is the two-dimensional Weisfeiler-Leman algorithm
- DeepWL is an unlimited version of WL that runs in polynomial time.
- Knowledge graphs are essentially graphs with vertex/edge attributes
ABSTRACT:
Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.
Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES (Subhajit Sahu)
https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1
There exist a number of dynamic graph generators. The Barabási-Albert model iteratively attaches new vertices to pre-existing vertices in the graph using preferential attachment (edges to high-degree vertices are more likely - rich get richer - Pareto principle). However, graph size increases monotonically, and the density of the graph keeps increasing (sparsity decreasing).
Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (e.g., triangles) to mimic properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase the size of a given graph, with a power-law distribution.
To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution.
In this paper, the authors propose Dygraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors to generate graphs with a more realistic degree distribution.
The process of generating a dynamic graph is as follows. First the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must receive. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, the authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion.
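The edge-addition step can be sketched as follows (a hedged illustration of my reading of the procedure, not the authors' code; the degree targets are invented): each vertex appears in the set once per edge it still needs, the set is shuffled, and random pairs are connected while self-loops and duplicate edges are rejected.

```cpp
#include <cstdio>
#include <algorithm>
#include <random>
#include <set>
#include <utility>
#include <vector>

int main() {
    // needed[v] = number of additional incident edges vertex v must receive
    std::vector<int> needed = {2, 1, 2, 1};
    std::vector<int> pool;                       // the edge-addition set
    for (int v = 0; v < (int)needed.size(); ++v)
        pool.insert(pool.end(), needed[v], v);   // v appears needed[v] times

    std::mt19937 rng(42);
    std::shuffle(pool.begin(), pool.end(), rng);

    std::set<std::pair<int,int>> edges;
    // Pair up entries from the shuffled pool, rejecting self-loops and duplicates.
    for (size_t i = 0; i + 1 < pool.size(); i += 2) {
        int u = pool[i], v = pool[i + 1];
        if (u == v) continue;                    // reject self-loop
        auto e = std::minmax(u, v);
        edges.insert({e.first, e.second});       // set silently drops duplicate edges
    }
    for (const auto& e : edges) printf("edge %d-%d\n", e.first, e.second);
    return 0;
}
```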
Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.
My notes on shared memory parallelism.
Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES (Subhajit Sahu)
**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include:
- Random walk methods
- Spectral partitioning
- Label propagation
- Greedy agglomerative and divisive algorithms
- Clique percolation
https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6
There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** in which relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*.
**Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number under a random null model. **Conductance** is another popular fitness score that measures the community cut, i.e., the inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with at least one endpoint in the community.
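For reference, the standard formulations of these two fitness functions (textbook definitions, not quoted from the paper) are:

$$
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),
\qquad
\phi(S) = \frac{\operatorname{cut}(S, V \setminus S)}{\min\left(\operatorname{vol}(S), \operatorname{vol}(V \setminus S)\right)},
$$

where $A$ is the adjacency matrix, $k_i$ the degree of vertex $i$, $m$ the number of edges, $\delta(c_i, c_j)$ the indicator that $i$ and $j$ share a community, $\operatorname{cut}(S, V \setminus S)$ the number of edges leaving $S$, and $\operatorname{vol}(S)$ the total degree of $S$. Higher modularity and lower conductance both indicate better communities.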
Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity.
Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*; overlapping communities are produced by sequentially running expansions from nodes not yet in any community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**.
The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESSubhajit Sahu
A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweenness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide the maximum gain in _modularity_. Newman and Girvan introduced the **modularity metric**. The problem of community detection is then reduced to the problem of modularity maximization, which is **NP-complete**. The **Louvain method** is a variant of the _agglomerative strategy_, in that it is a _multi-level heuristic_.
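To make the objective concrete, here is a sketch of computing modularity for a given partition of an undirected, unweighted graph in CSR form; the CSR layout and the assumption that community ids lie in [0, n) are mine, not Grappolo's API.

```c
#include <stdlib.h>

/* off/adj: CSR adjacency with each undirected edge stored in both directions.
   comm[u]: community id of vertex u, assumed to be in [0, n). */
double modularity(int n, const int *off, const int *adj, const int *comm) {
    double m2 = (double)off[n];                         /* 2m: every edge counted twice */
    double *dsum = calloc((size_t)n, sizeof(double));   /* total degree per community */
    double q = 0.0;
    for (int u = 0; u < n; u++) {
        dsum[comm[u]] += off[u + 1] - off[u];
        for (int e = off[u]; e < off[u + 1]; e++)
            if (comm[u] == comm[adj[e]]) q += 1.0;      /* intra-community edge endpoints */
    }
    q /= m2;                                            /* observed intra-community fraction */
    for (int c = 0; c < n; c++)
        q -= (dsum[c] / m2) * (dsum[c] / m2);           /* expected under the random null model */
    free(dsum);
    return q;
}
```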
https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d
In this paper, the authors discuss **four heuristics** for community detection with the _Louvain algorithm_, implemented on top of the recently developed **Grappolo**, a parallel variant of the Louvain algorithm. They are:
- Vertex following and Minimum label
- Data caching
- Graph coloring
- Threshold scaling
With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This reduces the number of vertices considered in each iteration and also helps initial seeds of communities to form. With the **Minimum label** heuristic, when a vertex is deciding which community to move to and multiple communities provide the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map and is reused in each iteration, at some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of the graph is performed in order to group vertices into colors. Each set of vertices (by color) is then processed _concurrently_, with synchronization after each color. This mimics the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of the modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster while still reaching a good modularity score.
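The Minimum label tie-break is simple to state in code. A hedged sketch (the candidate and gain arrays are hypothetical helpers, not Grappolo's actual data structures):

```c
/* Among candidate communities cand[0..k-1] with modularity gains gain[0..k-1],
   pick the highest gain, breaking ties by the smallest community id. */
int choose_community(const int *cand, const double *gain, int k) {
    int best = cand[0];
    double bestGain = gain[0];
    for (int i = 1; i < k; i++) {
        if (gain[i] > bestGain || (gain[i] == bestGain && cand[i] < best)) {
            bestGain = gain[i];
            best = cand[i];       /* ties resolved toward the smaller label */
        }
    }
    return best;
}
```

Deterministic tie-breaking like this is what prevents two adjacent vertices from endlessly swapping into each other's communities when processed in parallel.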
From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.
Application Areas of Community Detection: A Review : NOTESSubhajit Sahu
This is a short review of community detection methods (on graphs) and their applications. A **community** is a subset of a network whose members are *highly connected* to each other, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities*, since these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time.
https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0
Communities can be of the following **types**:
- Disjoint
- Overlapping
- Hierarchical
- Local
The following **static** community detection **methods** exist:
- Spectral-based
- Statistical inference
- Optimization
- Dynamics-based
The following **dynamic** community detection **methods** exist:
- Independent community detection and matching
- Dependent community detection (evolutionary)
- Simultaneous community detection on all snapshots
- Dynamic community detection on temporal networks
**Applications** of community detection include:
- Criminal identification
- Fraud detection
- Criminal activities detection
- Bot detection
- Dynamics of epidemic spreading (dynamic)
- Cancer/tumor detection
- Tissue/organ detection
- Evolution of influence (dynamic)
- Astroturfing
- Customer segmentation
- Recommendation systems
- Social network analysis (both)
- Network summarization
- Privacy, group segmentation
- Link prediction (both)
- Community evolution prediction (dynamic, hot field)
## References
- [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)
This paper discusses a GPU implementation of the Louvain community detection algorithm. The Louvain algorithm obtains hierarchical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of the neighbour that gives the greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex stays in its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated. Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between the respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops on the respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. The super-vertex graph is then used as input for the next stage. This process continues until the increase in modularity falls below a certain threshold. As a result, each stage yields one level of community membership for each vertex, forming a dendrogram.
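A simplified, single-threaded sketch of one local-moving sweep over a weighted CSR graph; the array names, the O(degree²) neighbour scan, and the sequential loop are my simplifications, not the paper's GPU kernels.

```c
/* weight of edges from u into community c (real code uses a scratch map per vertex) */
static double k_to(int u, int c, const int *off, const int *adj,
                   const double *wt, const int *comm) {
    double s = 0.0;
    for (int e = off[u]; e < off[u + 1]; e++)
        if (comm[adj[e]] == c && adj[e] != u) s += wt[e];
    return s;
}

/* off/adj/wt: weighted CSR graph; comm[u]: community of u; ktot[u]: weighted degree of u;
   ctot[c]: total weighted degree of community c; m: total edge weight.
   Returns the modularity gain of this sweep; repeat until it falls below a threshold. */
double local_moving_sweep(int n, const int *off, const int *adj, const double *wt,
                          int *comm, const double *ktot, double *ctot, double m) {
    double total_gain = 0.0;
    for (int u = 0; u < n; u++) {
        int cu = comm[u];
        ctot[cu] -= ktot[u];                       /* take u out of its community */
        double base = k_to(u, cu, off, adj, wt, comm) / m
                    - ctot[cu] * ktot[u] / (2.0 * m * m);
        double best = base;                        /* gain of re-inserting into old community */
        int best_c = cu;
        for (int e = off[u]; e < off[u + 1]; e++) {
            int c = comm[adj[e]];                  /* candidate: each neighbour's community */
            if (c == cu) continue;
            double g = k_to(u, c, off, adj, wt, comm) / m
                     - ctot[c] * ktot[u] / (2.0 * m * m);
            if (g > best) { best = g; best_c = c; }
        }
        ctot[best_c] += ktot[u];                   /* (re)insert u into the best community */
        comm[u] = best_c;
        total_gain += best - base;
    }
    return total_gain;
}
```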
Approaches to performing the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid-GPU algorithm using multiple GPUs has been implemented by Cheong et al., which grabbed my attention. In addition, their algorithm does not use hashing for the local moving phase, but instead sorts each neighbour list based on the community id of each vertex.
https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329
Survey for extra-child-process package : NOTESSubhajit Sahu
Useful additions to inbuilt child_process module.
📦 Node.js, 📜 Files, 📰 Docs.
Please see attached PDF for literature survey.
https://gist.github.com/wolfram77/d936da570d7bf73f95d1513d4368573e
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERSubhajit Sahu
This paper presents two algorithms for efficiently computing PageRank on dynamically updating graphs in a batched manner: DynamicLevelwisePR and DynamicMonolithicPR. DynamicLevelwisePR processes vertices level-by-level based on strongly connected components and avoids recomputing converged vertices on the CPU. DynamicMonolithicPR uses a full power iteration approach on the GPU that partitions vertices by in-degree and skips unaffected vertices. Evaluation on real-world graphs shows the batched algorithms provide speedups of up to 4000x over single-edge updates and outperform other state-of-the-art dynamic PageRank algorithms.
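To illustrate the idea of skipping work for unaffected or converged vertices, here is a simplified CPU sketch of a power-iteration step; the CSR layout, the affected[] flags, and all names are my own illustration, not the paper's DynamicLevelwisePR or DynamicMonolithicPR implementations.

```c
#include <stdlib.h>

/* off/in: CSR lists of in-neighbours; outdeg[v] > 0 assumed (no dangling vertices);
   affected[v] marks vertices whose ranks may change after the batch update. */
void pagerank_skip(int n, const int *off, const int *in, const int *outdeg,
                   const unsigned char *affected, double *r,
                   double damping, int iterations) {
    double *rn = malloc((size_t)n * sizeof *rn);
    for (int it = 0; it < iterations; it++) {
        for (int v = 0; v < n; v++) {
            if (!affected[v]) { rn[v] = r[v]; continue; }   /* skip unaffected vertex */
            double s = 0.0;
            for (int e = off[v]; e < off[v + 1]; e++)
                s += r[in[e]] / outdeg[in[e]];              /* in-neighbour contributions */
            rn[v] = (1.0 - damping) / n + damping * s;
        }
        for (int v = 0; v < n; v++) r[v] = rn[v];
    }
    free(rn);
}
```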
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Subhajit Sahu
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/1c1f730d20b51e0d2c6d477fd3713024
Fast Incremental Community Detection on Dynamic Graphs : NOTESSubhajit Sahu
In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm, a hierarchical, agglomerative algorithm that greedily maximizes modularity: BasicDyn and FastDyn. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of the merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if, in the previous time step, these vertices eventually ended up contracted into the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn can do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later.
https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a
Building a Raspberry Pi Robot with Dot NET 8, Blazor and SignalRPeter Gallagher
In this session delivered at NDC Oslo 2024, I talk about how you can control a 3D printed Robot Arm with a Raspberry Pi, .NET 8, Blazor and SignalR.
I also show how you can use a Unity app on a Meta Quest 3 to control the arm in VR too.
You can find the GitHub repo and workshop instructions here;
https://bit.ly/dotnetrobotgithub
Vector processor
In computing, a vector processor or array processor is a central processing unit (CPU) that implements an
instruction set containing instructions that operate on one-dimensional arrays of data called vectors, compared
to the scalar processors, whose instructions operate on single data items. Vector processors can greatly improve
performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared
in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various
Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to
the vector supercomputer's demise in the later 1990s.
As of 2016 most commodity CPUs implement architectures that feature SIMD instructions for a form of vector
processing on multiple (vectorized) data sets. Common examples include Intel x86's MMX, SSE and AVX
instructions, AMD's 3DNow! extensions, Sparc's VIS extension, PowerPC's AltiVec and MIPS' MSA. Vector
processing techniques also operate in video-game console hardware and in graphics accelerators. In 2000, IBM,
Toshiba and Sony collaborated to create the Cell processor.
Other CPU designs include some multiple instructions for vector processing on multiple (vectorised) data sets,
typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long
Instruction Word). The Fujitsu FR-V VLIW/vector processor combines both technologies.
Contents
History
Early work
Supercomputers
GPU
Description
Vector instructions
Performance and speed up
Programming heterogeneous computing architectures
See also
References
External links
History
Early work
Vector processing development began in the early 1960s at Westinghouse in their "Solomon" project.
Solomon's goal was to dramatically increase math performance by using a large number of simple math co-
processors under the control of a single master CPU. The CPU fed a single common instruction to all of the
arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This
allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array.
[Image: Cray J90 processor module with four scalar/vector processors]
In 1962, Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the
ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when
it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless,
it showed that the basic concept was sound, and, when used on data-intensive applications, such as
computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of
using separate ALUs for each data element is not common to later designs, and is often referred to under a
separate category, massively parallel computing.
A computer for operations with functions was presented and developed by Kartsev in 1967.[1]
Supercomputers
The first successful implementation of vector processing occurred in 1966, when both the Control Data
Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC) were introduced.
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector
computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing
long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or
4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
The STAR was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data related tasks
they could keep up while being much smaller and less expensive. However the machine also took considerable
time decoding the vector instructions and getting ready to run the process, so it required very specific data sets
to work on before it actually sped anything up.
The vector technique was first fully exploited in 1976 by the famous Cray-1. Instead of leaving the data in
memory like the STAR and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words
each. The vector instructions were applied between registers, which is much faster than talking to main
memory. Whereas the STAR would apply a single operation across a long vector in memory and then move on
to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as
many operations as it could to that data, thereby avoiding many of the much slower memory access operations.
The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In
addition, the design had completely separate pipelines for different instructions, for example,
addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector
instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining. The Cray-1
normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at
240 MFLOPS and averaged around 150 – far faster than any machine of the era.
Other examples followed. Control Data Corporation tried to re-enter the
high-end market again with its ETA-10 machine, but it sold poorly and
they took that as an opportunity to leave the supercomputing field
entirely. In the early and mid-1980s Japanese companies (Fujitsu, Hitachi and Nippon Electric
Corporation (NEC)) introduced register-
based vector machines similar to the Cray-1, typically being slightly
faster and much smaller. Oregon-based Floating Point Systems (FPS)
built add-on array processors for minicomputers, later building their
own minisupercomputers.
Throughout, Cray continued to be the performance leader, continually
beating the competition with a series of machines that led to the Cray-2,
Cray X-MP and Cray Y-MP. Since then, the supercomputer market has
focused much more on massively parallel processing rather than better
implementations of vector processors. However, recognising the benefits of vector processing, IBM developed
Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as a vector
processor.
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to
make this type of computer up to the present day, with their SX series of computers. Most recently, the SX-
Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within
a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main
computer with the PC-compatible computer into which it is plugged serving support functions.
GPU
Modern graphics processing units (GPUs) include an array of shader pipelines which may be driven by
compute kernels, which can be considered vector processors (using a similar strategy for hiding memory
latencies).
Description
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs
have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be
—in theory at least—encoded directly into the instruction. However, in efficient implementation things are
rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a
memory location that holds the data. Decoding this address and getting the data out of the memory takes some
time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU
speeds have increased, this memory latency has historically become a large impediment to performance; see
Memory wall.
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as
instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads
the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself.
With pipelining the "trick" is to start decoding the next instruction even before the first has left the CPU, in the
fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the
same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of
operations much faster and more efficiently than if it did so one at a time.
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also
pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the
numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode
instructions and then fetch the data needed to complete them, the processor reads a single instruction from
memory, and it is simply implied in the definition of the instruction itself that the instruction will operate again
on another item of data, at an address one increment larger than the last. This allows for significant savings in
decoding time.
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers
together. In a normal programming language one would write a "loop" that picked up each of the pairs of
numbers in turn, and then added them. To the CPU, this would look something like this:
; Hypothetical RISC machine
; add 10 numbers in a to 10 numbers in b, storing results in c
; assume a, b, and c are memory locations in their respective registers
move $10, count ; count := 10
loop:
load r1, a
load r2, b
add r3, r1, r2 ; r3 := r1 + r2
store r3, c
add a, a, $4 ; move on
add b, b, $4
add c, c, $4
dec count ; decrement
jnez count, loop ; loop back if count is not yet 0
ret
But to a vector processor, this task looks considerably different:
; assume we have vector registers v1-v3 with size larger than 10
move $10, count ; count = 10
vload v1, a, count
vload v2, b, count
vadd v3, v1, v2
vstore v3, c, count
ret
There are several savings inherent in this approach. For one, only two address translations are needed.
Depending on the architecture, this can represent a significant savings by itself. Another saving is fetching and
decoding the instruction itself, which has to be done only one time instead of ten. The code itself is also
smaller, which can lead to more efficient memory use.
But more than that, a vector processor may have multiple functional units adding those numbers in parallel. The
checking of dependencies between those numbers is not required as a vector instruction specifies multiple
independent operations. This simplifies the control logic required, and can improve performance by avoiding
stalls. The math operations thus completed far faster overall, the limiting factor being the time required to fetch
the data from memory.
Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily
adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e.,
whenever it is not adding up many numbers in a row. The more complex instructions also add to the
complexity of the decoders, which might slow down the decoding of the more common instructions such as
normal adding.
In fact, vector processors work best only when there are large amounts of data to be worked on. For this reason,
these sorts of CPUs were found primarily in supercomputers, as the supercomputers themselves were, in
general, found in places such as weather prediction centers and physics labs, where huge amounts of data are
"crunched".
Vector instructions
The vector pseudocode example above comes with a big assumption that the vector computer can process more
than ten numbers in one batch. For a greater number of numbers, it becomes unfeasible for the computer to
have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or
exposes some sort of vector register to the programmer.
The self-repeating instructions are found in early vector computers like the STAR, where the above action
would be described in a single instruction (somewhat like vadd c, a, b, $10). They are also found in
the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in
hardware without a very large cost increase. Since all operands have to be in memory, the latency caused by
access becomes huge too.
The Cray-1 introduced the idea of using processor registers to hold vector data in batches. This way, a lot more
work can be done in each batch, at the cost of requiring the programmer to manually load/store data from/to the
memory for each batch. Modern SIMD computers improve on Cray by directly using multiple ALUs, for a
higher degree of parallelism compared to only using the normal scalar pipeline. Masks can be used to
selectively load or store memory locations for a version of parallel logic.
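For illustration, the masking idea can be written in plain C as a predicated loop (a sketch, not tied to any particular instruction set): only lanes whose mask entry is set are computed and stored.

void masked_add(int n, const int *a, const int *b, int *c, const unsigned char *mask) {
    for (int i = 0; i < n; i++)
        if (mask[i])              /* store only where the mask is set */
            c[i] = a[i] + b[i];
}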
GPUs, which have many small compute units, use a variant of SIMD called Single Instruction Multiple
Threads (SIMT). This is similar to modern SIMD, with the exception that the "vector registers" are very wide
and the pipelines tend to be long. The "threading" part affects the way data are swapped between the compute
units. In addition, GPUs and other external vector processors like the NEC SX-Aurora TSUBASA may use
fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the
hardware might instead do a pipelined loop over 16 units for a hybrid approach.
The difference between a traditional vector processor and a modern SIMD one can be illustrated with this
variant of the "DAXPY" function:
void iaxpy(size_t n, int a, const int x[], int y[]) {
for (size_t i = 0; i < n; i++)
y[i] = a * x[i] + y[i];
}
The STAR-like code remains concise, but we now require an extra slot of memory to hold the intermediate
result. Roughly twice the memory latency is also incurred because of the extra round of memory accesses.
; Assume tmp is pre-allocated
vmul tmp, a, x, n ; tmp[i] = a * x[i]
vadd y, y, tmp, n ; y[i] = y[i] + tmp[i]
ret
This modern SIMD machine can do most of the operation in batches. The code is mostly similar to the scalar
version. We are assuming that both x and y are properly aligned here and that n is a multiple of 4, as otherwise
some setup code would be needed to calculate a mask or to run a scalar version. The time taken would be
basically the same as a vector implementation of c = a + b described above.
vloop:
load32x4 v1, x
load32x4 v2, y
mul32x4 v1, a, v1 ; v1 := v1 * a
add32x4 v3, v1, v2 ; v3 := v1 + v2
store32x4 v3, y
addl x, x, $16 ; x := x + 16 (advance by four 4-byte ints)
addl y, y, $16
subl n, n, $4 ; n := n - 4
jgz n, vloop ; loop back if n > 0
out:
ret
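Written as plain C, the same strip-mined structure looks roughly like the sketch below: groups of four elements are processed per iteration, and a scalar tail handles the remainder mentioned above. The function name and chunk size are illustrative; a vectorizing compiler would typically map the first loop onto SIMD instructions.

#include <stddef.h>

/* Strip-mined variant of iaxpy: whole groups of 4, then a scalar tail. */
void iaxpy_strip(size_t n, int a, const int x[], int y[]) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {      /* whole groups of 4 elements */
        y[i+0] = a * x[i+0] + y[i+0];
        y[i+1] = a * x[i+1] + y[i+1];
        y[i+2] = a * x[i+2] + y[i+2];
        y[i+3] = a * x[i+3] + y[i+3];
    }
    for (; i < n; i++)                /* scalar tail when n is not a multiple of 4 */
        y[i] = a * x[i] + y[i];
}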
Performance and speed up
Let r be the vector speed ratio and f be the vectorization ratio. If the time taken for the vector unit to add an
array of 64 numbers is 10 times faster than its equivalent scalar counterpart, r = 10. Also, if the total number of
operations in a program is 100, out of which only 10 are scalar (after vectorization), then f = 0.9, i.e., 90% of
the work is done by the vector unit. It follows that the achievable speedup is

    speedup = 1 / ((1 - f) + f / r)

With r = 10 and f = 0.9 this gives 1 / (0.1 + 0.09) ≈ 5.3. So, even if the performance of the vector unit is
very high (r approaching infinity), the speedup is bounded by 1 / (1 - f) (here, 10), which suggests that the
ratio f is crucial to performance. This ratio depends on the efficiency of the compilation, such as the
adjacency of the elements in memory.
Programming heterogeneous computing architectures
Various machines were designed to include both traditional processors and vector processors, such as the
Fujitsu AP1000 and AP3000. Programming such heterogeneous machines can be difficult since developing
programs that make best use of characteristics of different processors increases the programmer's burden. It
increases code complexity and decreases portability of the code by requiring hardware specific code to be
interleaved throughout application code.[2] Balancing the application workload across processors can be
problematic, especially given that they typically have different performance characteristics. There are different
conceptual models to deal with the problem, for example using a coordination language and program building
blocks (programming libraries or higher order functions). Each block can have a different native
implementation for each processor type. Users simply program using these abstractions and an intelligent
compiler chooses the best implementation based on the context.[3]
See also
SX architecture
GPGPU
Compute kernel
Stream processing
SIMD
Automatic vectorization
Chaining (vector processing)
Computer for operations with functions
RISC-V, an open ISA standard with an associated variable width vector extension.
Barrel processor
Tensor processing unit
References
1. Malinovsky, B.N. (1995). The History of Computer Technology in Their Faces (in Russian). Kiev: Firm "KIT". ISBN 5-7707-6131-8.
2. Kunzman, D. M.; Kale, L. V. (2011). "Programming Heterogeneous Systems". 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. p. 2061. doi:10.1109/IPDPS.2011.377. ISBN 978-1-61284-425-1.
3. John Darlington; Moustafa Ghanem; Yike Guo; Hing Wing To (1996). "Guided Resource Organisation in Heterogeneous Parallel Computing". Journal of High Performance Computing. 4 (1): 13-23. CiteSeerX 10.1.1.37.4309.
External links
The History of the Development of Parallel Computing (http://ei.cs.vt.edu/~history/Parallel.html) (from 1955 to 1993)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Vector_processor&oldid=980296064"
This page was last edited on 25 September 2020, at 18:11 (UTC).
Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this
site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia
Foundation, Inc., a non-profit organization.