Week 1 in the OpenHPI course on parallel programming concepts is about hardware and software trends that lead to the rise of parallel programming for ordinary developers.
Find the whole course at http://bit.ly/1l3uD4h.
CSS in JS - Writing CSS in JavaScript - Dan Vitoriano
CSS in JS is a technique in which CSS is written inside JavaScript rather than in separate files. This lets each component carry its own encapsulated CSS, written declaratively. Popular libraries such as JSS, Glamor, and Styled-Components make CSS in JS straightforward to adopt.
Benefits of Multi-rail Cluster Architectures for GPU-based Nodes - inside-BigData.com
Craig Tierney from NVIDIA presented this deck at the MVAPICH User Group meeting.
"As high performance computing moves toward GPU-accelerated architectures, single node application performance can be between 3x and 75x faster than the CPUs alone. Performance increases of this size will require increases in network bandwidth and message rate to prevent the network from becoming the bottleneck in scalability. In this talk, we will present results from NVLink enabled systems connected via quad-rail EDR Infiniband."
Watch the video: https://wp.me/p3RLHQ-hkr
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The document discusses minimizing crosstalk in VLSI routing. It begins with an overview of routing and discusses global routing versus detailed routing. It then covers crosstalk effects, including inductive and capacitive coupling between wires. Approaches to avoid crosstalk include segregating wires, increasing spacing, assigning wires to different layers, and estimating and minimizing crosstalk during routing. Techniques for detailed routing include net ordering, layer assignment, and rip-up and reroute to meet crosstalk constraints.
RA TechED 2019 - CL02 - Integrated Architecture System Software What's New - Rockwell Automation
The document provides an overview of new and updated system software from Rockwell Automation, including Studio 5000 Logix Designer, View Designer, Application Code Manager, Logix Emulate, and connectivity to CAD software packages. Key updates include expanded controller support, enhanced Logix tag-based alarming, productivity improvements, and digital engineering tools to enable virtual commissioning and operator training.
This document discusses packet capture, traceflows, and live logs for troubleshooting NSX. It provides examples of commands to capture packets on distributed NSX firewalls at different points including the switchport, uplink, and output. It also discusses using traceflow to trace packet flows and viewing live logs to see firewall packet logs.
Learn how Aerospike's Hybrid Memory Architecture brings transactions and analytics together to power real-time Systems of Engagement (SOEs) for companies across AdTech, financial services, telecommunications, and eCommerce. We take a deep dive into the architecture including use cases, topology, Smart Clients, XDR and more. Aerospike delivers predictable performance, high uptime and availability at the lowest total cost of ownership (TCO).
This document discusses the architectures and applications of CPLDs and FPGAs. It begins by classifying programmable logic devices and describing simple programmable logic devices like PLDs, PALs, and GALs. It then discusses more complex programmable logic devices like CPLDs, describing their architecture which consists of logic blocks, I/O blocks, and a global interconnect. Finally, it covers field programmable gate arrays including their architecture of configurable logic blocks, I/O blocks, and a programmable interconnect, as well as describing Xilinx's logic cell array architecture for FPGAs.
FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks… - Numenta
Nick Ni (Xilinx) and Lawrence Spracklen (Numenta) presented a talk at the FPGA Conference Europe on July 8th, 2021. In this talk, they presented a neuroscience-inspired approach to optimizing state-of-the-art deep learning networks into sparse topologies and showed how it can unlock significant performance gains on FPGAs without major loss of accuracy. They then walked through the FPGA implementation, where they exploited the advantages of sparse networks with a unique Domain Specific Architecture (DSA).
The document describes a NOR gate. A NOR gate is a logic gate that produces an output of 1 (high) only when all of its inputs are 0 (low); if any input is 1, the output is 0 (low). The document lists two graphs that characterize a NOR gate: output voltage over time (the transient response) and output voltage versus input voltage (the voltage transfer characteristic).
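The gate behavior described above can be checked by enumerating the truth table; a minimal Python sketch (the `nor` helper is illustrative, not from the document):

```python
def nor(*inputs):
    """NOR: output is 1 (high) only when every input is 0 (low)."""
    return 0 if any(inputs) else 1

# Enumerate the two-input truth table.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, nor(a, b))  # rows: 001, 010, 100, 110
```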
Signal Integrity - A Crash Course [R Lott] - Ryan Lott
This document provides an introduction to signal integrity for interconnects. It discusses typical interconnects like PCB traces, cables, and connectors and the signal integrity problems they can cause, such as loss, reflections, crosstalk, and ringing. It also introduces concepts like characteristic impedance, frequency-dependent loss, and how signals propagate as electromagnetic waves. Measurement techniques like S-parameters and using a vector network analyzer are discussed as ways to characterize devices in the frequency domain.
This document provides a reference architecture for implementing a Virtual SAN Ready Node environment using Dell hardware and VMware software. It describes the physical and logical architecture, including networking, storage, and server node components. Specific hardware models are recommended, such as Dell R730 servers and Dell networking switches. The architecture supports VMware Horizon, including hybrid deployments with Horizon Air.
Cloud Computing & Its Impact on Project Management - VSR *
This document discusses the potential of cloud computing and project management in India. It notes that India is expected to see $1 trillion in infrastructure investments over the next decade. However, many projects currently fail due to issues like inadequate planning, scope creep, and lack of real-time information. Cloud computing could help by providing reliable computing power and project management applications on an on-demand, pay-per-use basis without large upfront capital costs. This could minimize technology divides and allow for improved infrastructure delivery, business innovation, and learning opportunities. However, security, availability, performance, and payment models must still be addressed carefully with cloud computing.
CKAN is an open-source data management solution for open data. It provides a platform for publishing and exposing metadata through an API and front-end interface. Major governments and communities use CKAN to organize large numbers of datasets. While it has advantages like organizing data in a structured way and providing APIs, its data model does not work for all use cases and there are no strict guidelines for dataset publishing. Extensions allow additional functionality and it can be deployed in various ways.
The document discusses Timothy Spann, a Senior Solutions Engineer at Cloudera who has been running Big Data meetups in Princeton since 2015. It provides links to his profiles on various websites and details some of the topics he has spoken about at conferences, including Apache NiFi, Deep Learning, and Streaming. The rest of the document focuses on use cases for data integration and movement using Apache NiFi, as well as concepts like blockchain, distributed data stores, and accessing blockchain and Ethereum data.
The document discusses the benefits of using tape storage for backup and archiving large amounts of data. Tape provides low cost, high capacity storage when compared to disk and flash alternatives. Features such as air gaps between live systems and offline tape backups provide strong protection against ransomware and other cyber threats. With continued improvements in areal density, a single tape cartridge can now hold over 200 terabytes of data, growing cheaper and more scalable over time. Tape remains a critical technology for cost-effectively storing the massive amounts of cold and archived data being generated.
SimfiaNeo - Workbench for Safety Analysis powered by Sirius - Obeo
The document discusses the dependability engineering tool SimfiaNeo. It defines dependability as a system property where users have justified confidence in the system's services. Dependability engineering involves predicting failures, assessing risks, and mitigating consequences. SimfiaNeo allows modeling systems using the AltaRica language, validating models for consistency, simulating models, generating cuts of failure scenarios, and producing documentation reports. Existing components from Simfia like the AltaRica engine are reused in SimfiaNeo.
This document provides an overview of an "Analog VLSI Design" course. The goals of the course are to introduce principles of analog integrated circuit design and CMOS technology. Students will learn about CMOS layout design using CAD tools and complete a design project. The course covers topics including CMOS technology, resistors, capacitors, MOSFETs, current mirrors, amplifiers, and data converters. Assessment includes homework, a project, and a final exam.
Travis Cox from Inductive Automation and Arlen Nipper from Cirrus Link Solutions discuss the various ways that tag data can be leveraged through cloud services provided by Amazon Web Services and Microsoft Azure. These experts will also show you different ways to get data up to the cloud in a simple, efficient, and secure manner.
Learn more about cloud services such as:
- Machine learning
- Analytics
- Business intelligence
- Data lakes
- Cloud databases
- And more
Lessons Learned Running Hadoop and Spark in Docker Containers - BlueData, Inc.
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed big data applications like Apache Hadoop and Apache Spark. This session at Strata + Hadoop World in New York City (September 2016) explores various solutions and tips to address the challenges encountered while deploying multi-node Hadoop and Spark production workloads using Docker containers.
Some of these challenges include container life-cycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is "all in" on Docker containers—with a specific focus on big data applications. BlueData has learned firsthand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy big data workloads using Docker.
This session by Thomas Phelan, co-founder and chief architect at BlueData, discusses how to securely network Docker containers across multiple hosts and discusses ways to achieve high availability across distributed big data applications and hosts in your data center. Since we’re talking about very large volumes of data, performance is a key factor, so Thomas shares some of the storage options implemented at BlueData to achieve near bare-metal I/O performance for Hadoop and Spark using Docker as well as lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52042
Hardware Acceleration for Machine Learning - CastLab KAIST
This document provides an overview of a lecture on hardware acceleration for machine learning. The lecture will cover deep neural network models like convolutional neural networks and recurrent neural networks. It will also discuss various hardware accelerators developed for machine learning, including those designed for mobile/edge and cloud computing environments. The instructor's background and the agenda topics are also outlined.
VMworld 2017 - Top 10 things to know about vSAN - Duncan Epping
In this session Cormac Hogan and I go over the top 10 things to know about vSAN. This is based on two years of questions/answers from our field and customers. Useful for any VMware vSAN customer!
#STO1264BU #STO1264BE
This document discusses built-in self-test (BIST) techniques for integrated circuits. It provides an overview of BIST architecture, which includes a test pattern generator, test application to the circuit under test, and a response verification component. The document outlines different methods for test pattern generation, such as exhaustive, pseudo-exhaustive, pseudo-random, and test pattern augmentation. It also describes various response compaction techniques like parity testing, one counting, transition counting, and signature analysis, which are used to compact the circuit response because of the large amount of test data produced. Benefits of BIST include reduced testing costs and the ability to test at operating speed, while costs include increased chip area and the need to test the BIST hardware itself.
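Signature analysis, one of the compaction techniques mentioned above, can be sketched as a single-input LFSR that folds a long response stream into a short signature; the register width and tap positions below are assumed for illustration, not taken from the document:

```python
def signature(bitstream, width=4, taps=(3, 0)):
    """Compact a response bit-stream into a short signature using a
    single-input LFSR (signature analysis)."""
    state = 0
    for bit in bitstream:
        feedback = bit
        for t in taps:                 # XOR the input bit with the tap bits
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & ((1 << width) - 1)
    return state

good_response = [1, 0, 1, 1, 0, 0, 1, 0]  # response of a fault-free circuit
bad_response  = [1, 0, 1, 1, 0, 1, 1, 0]  # same stream with one flipped bit
# A faulty response (usually) yields a different signature, so comparing
# one short signature replaces comparing the full response stream.
print(signature(good_response), signature(bad_response))
```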
ATPG technology can be used for applications beyond just generating tests. It can be used for design verification and optimization during the design process. Specifically, ATPG can be used for delay fault testing, noise fault testing, logic optimization, design verification through equivalence checking, property checking, and timing analysis. It generates test vectors to detect timing defects, noise issues like power supply and crosstalk faults, and can help identify redundant logic for optimization.
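The core ATPG idea, finding an input vector on which the fault-free and faulty circuits disagree, can be sketched by exhaustive search on a toy circuit (the circuit and fault site below are hypothetical examples, not from the document):

```python
from itertools import product

def good(a, b, c):
    return (a & b) | c   # fault-free circuit: y = (a AND b) OR c

def faulty(a, b, c):
    return 0 | c         # internal net "a AND b" stuck-at-0

# ATPG by exhaustive search: keep the vectors whose good and faulty
# outputs differ; each such vector detects the fault.
tests = [v for v in product((0, 1), repeat=3) if good(*v) != faulty(*v)]
print(tests)  # only (a=1, b=1, c=0) both activates and propagates the fault
```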
Introduction to CMOS VLSI Design:
This presentation is designed to provide a basic summary of CMOS VLSI design.
This presentation was made at Eutectics.blogspot.in.
The structure of the presentation is as follows:
2: Outline
3: Introduction
4: MOS Capacitor
5: Terminal Voltage
6: nMOS Cutoff
7: nMOS Linear
8: nMOS Saturation
9: I-V Characteristics
10: Channel Charge
11: Carrier Velocity
12: nMOS Linear I-V
13: nMOS Saturation
14: nMOS I-V Summary
15: Example
16: pMOS I-V
17: Capacitance
18: Gate Capacitance
19: Diffusion Capacitance
20: Pass Transistor
21: Pass Transistor Circuits
22: Effective Resistance
23: RC Delay Model
24: RC Values
25: Inverter Delay Estimate
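The outline above ends with the RC delay model and an inverter delay estimate: a gate's delay is approximated as an effective resistance charging a load capacitance. A minimal sketch of that estimate, using assumed unit values rather than numbers from the slides:

```python
# RC delay model: the driver is modeled as an effective resistance R_eff
# charging the load capacitance C_load; the 50% step-response delay of a
# single RC stage is t_pd = ln(2) * R * C ≈ 0.69 * R * C.
R_eff = 10e3      # ohms: assumed effective "on" resistance of the driver
C_load = 10e-15   # farads: assumed gate + diffusion load capacitance
t_pd = 0.69 * R_eff * C_load
print(f"estimated propagation delay: {t_pd * 1e12:.1f} ps")  # 69.0 ps
```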
This document discusses Xen cache coloring and real-time performance in embedded systems. It introduces cache interference between virtual machines and the hypervisor's solution of cache partitioning via cache coloring: each VM is allocated its own portion of the cache sets to prevent interference. Benchmark results show that with cache coloring, a motor-control application's execution time and interrupt response time remain stable even under heavy interference, whereas without coloring performance degrades significantly. Cache coloring effectively isolates workloads and enables mixed-criticality systems on the same device.
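The partitioning described above can be illustrated numerically: a page's color comes from the set-index bits that overlap its page frame number, so pages of different colors occupy disjoint cache sets. A minimal Python sketch with an assumed cache geometry (the sizes are not from the talk):

```python
PAGE, LINE, SETS = 4096, 64, 1024   # illustrative: 4 KiB pages, 64 B lines
COLORS = (LINE * SETS) // PAGE      # = 16 usable page colors

def set_index(addr):
    """Cache set a physical address maps to."""
    return (addr // LINE) % SETS

def page_color(frame):
    """Color of a physical page frame number."""
    return frame % COLORS

def sets_of(color):
    """All cache sets touched by pages of a given color."""
    return {set_index(f * PAGE + off)
            for f in range(64) if page_color(f) == color
            for off in range(0, PAGE, LINE)}

# VMs given disjoint color ranges cannot evict each other's cache lines:
print(sets_of(0).isdisjoint(sets_of(1)))  # True
```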
The document discusses VLSI design methodologies and limitations using CAD tools. It provides an overview of different VLSI design methodologies such as full custom design, semi-custom design, gate array design, standard cell design, FPGA-based design and CPLD-based design. It also discusses the evolution of VLSI design flows from past to present technologies. Furthermore, it describes the complexities in VLSI design and how CAD tools help manage these complexities and automate the design process. Finally, it summarizes different types of VLSI CAD tools and compares various open source and licensed CAD tool vendors.
AMD Chiplet Architecture for High-Performance Server and Desktop Products - AMD
This document discusses AMD's chiplet architecture for high-performance server and desktop processors. Key points include:
- AMD partitions the system-on-a-chip design, using 7nm technology for CPU cores while leaving I/O interfaces in older process nodes. This improves performance and lowers costs.
- CPU dies ("chiplets") are connected using high-speed SerDes links both on-package and between dies. This allows for more chiplets and cores than traditional monolithic designs.
- Innovations in packaging, power distribution, and operating system scheduling were required to enable the multi-chiplet design and improve performance.
Week 2 in the OpenHPI course on parallel programming concepts is about foundational aspects of concurrency.
Find the whole course at http://bit.ly/1l3uD4h.
Week 5 in the OpenHPI course on parallel programming concepts is about parallel applications in distributed systems.
Find the whole course at http://bit.ly/1l3uD4h.
The document provides an introduction to data science at scale and distributed thinking. It discusses the motivation for data science at scale due to increasing data volumes, varieties, and velocities. It distinguishes between data science, which focuses on accuracy, and data engineering, which focuses on scale, performance, and reliability. The document then provides a crash course on data engineering concepts like distributed computation and the SMACK stack. It introduces Spark as a framework that can scale data processing. Finally, it discusses probabilistic algorithms as an approach for processing large datasets that may be inexact but use less resources than exact algorithms.
The document provides an overview of accelerator technology and OpenCL. It discusses how accelerators like GPUs use SIMD parallelism to speed up computations by processing multiple data items in parallel. GPUs have thousands of lightweight threads that hide memory latency. OpenCL provides a standardized programming model to access the parallel capabilities of CPUs, GPUs, and other accelerators. It executes kernels across a problem domain for data-parallel applications.
This document provides an agenda and overview for a hands-on introduction to multi-threaded programming and Pthreads. The tutorial will cover fundamental concepts of concurrency and multi-threading, and illustrate these concepts through simple C programs that utilize Pthreads. Attendees will learn about thread creation, synchronization methods like mutexes and barriers, and how to compile and run basic Pthreads applications.
Intro to open source observability with grafana, prometheus, loki, and tempo(...LibbySchulze
This document provides an introduction to open source observability tools including Grafana, Prometheus, Loki, and Tempo. It summarizes each tool and how they work together. Prometheus is introduced as a time series database that collects metrics. Loki is described as a log aggregation system that handles logs at scale without high costs. Tempo is explained as a tracing system that allows tracing from logs, metrics, and between services. The document emphasizes that these tools can be run together to gain observability across an entire system from logs to metrics to traces.
CK: from ad hoc computer engineering to collaborative and reproducible data s...Grigori Fursin
Designing novel computer systems and optimizing their software is becoming too tedious, ad hoc, time consuming and error prone due to enormous number of available design and optimization choices. Empirical autotuning combined with run-time adaptation and machine learning has been demonstrating some potential to address above challenges for several decades but is still far from the widespread production. The main reasons include unbearably long exploration and training times, ever changing tools and their interfaces, lack of a common experimental methodology, lack of diverse and representative benchmarks, and lack of unified mechanisms for knowledge building and exchange apart from publications where reproducibility and reusability of results is often not even considered.
I will present our community-driven solution to above problems based on our open-source Collective Knowledge technology (CK) that can gradually organize, exchange and reuse knowledge and experience in computer engineering. CK helps share various artifacts (benchmarks, data sets, libraries, tools) as unified, reusable and Python-based components with JSON meta description via GITHUB. Researchers can then quickly prototype and crowdsource various experimental workflows such as performance and energy autotuning, design space exploration and run-time adaptation. At the same time, CK continuously analyzes and extrapolates all collected knowledge using powerful data science techniques to automatically model computer systems' behavior, predict better optimizations or hardware configurations, and eventually enable faster, more power efficient, reliable and self-tuning software and hardware. Furthermore, CK can record any unexpected behavior in a reproducible way and expose it to an interdisciplinary community to find missing features and improve models. Live demo of our approach is available at http://cknowledge.org/repo .
This document provides an overview of computer science as an academic discipline. It discusses how computer science involves rigorous problem solving and follows the scientific method, though it may not be considered a science in the same way as biology or chemistry. The document outlines three main themes of computer science: hardware, software, and theory. It also describes several subfields of computer science like algorithms and data structures, architecture, operating systems and networks, software engineering, and artificial intelligence.
The document discusses computer architecture and its objectives. It aims to help students understand the basic structure and operation of computers. Some key topics covered include the functional units of a computer system, arithmetic operations, the processor and control unit, parallelism, memory systems, and input/output devices. The document outlines 5 units that will be covered, including basic computer structure, arithmetic, the processor, parallelism, and memory and I/O systems.
Data Science in Production: Technologies That Drive Adoption of Data Science ...Nir Yungster
Critical to a data science team’s ability to drive impact is its effectiveness in incorporating its solutions into new or existing products. When collaborating with other engineering teams, and especially when solutions must operate at scale, technological choices can be critical factors in determining what type of outcome you'll have. We walk through strategies and specific technologies - Airflow, Docker, Kubernetes - that can help promote successful collaboration between data science and engineering.
OpenCL & the Future of Desktop High Performance Computing in CADDesign World
Modern desktop computers have more compute capabilities than ever before. Most of these systems include both a central processing unit (CPU) and a graphics processing unit (GPU), each consisting of multiple computing cores providing tremendous processing power. To date, harnessing the total processing power of a desktop workstation, fully utilizing both the CPU and GPU, has proven difficult for software developers. CPUs and GPUs have few similarities in both design and programming models. OpenCL is the tool that bridges the gap for software developers and enables them to fully tap into the power of both processors with a single software programming interface.
This presentation will examine the details of CPUs and GPUs, explore their differences and similarities, and highlight the computing power they can provide. We will also take a look OpenCL, what it is, what it does, and how this new computing interface will change the way software developers create software and help end users fully realize the compute power contained within today’s modern desktop computers.
The document provides information about available HPC resources at CSUC. It summarizes the hardware facilities which include the Canigó and Pirineus II clusters totaling 3,888 cores and 391 TFlops of computing power. It describes the working environment including the Slurm workload manager, storage units, and development tools available. It also outlines how users can access the services through RES projects, pricing, and the EuroCC Spain testbed for national HPC competence.
This document discusses instrumentation and analysis of the NAS Parallel Benchmarks (NPB) application using the Extrae tracing library. It summarizes the tests performed on local and remote machines using 2, 4, 8, 16, and 32 processes. Key metrics like computation time, communication time, load imbalance, and bottlenecks are measured. The analysis shows the NPB application scales well on the remote server but not the local laptop beyond 16 processes due to increased communication and wait times.
Introduction to various data science. From the very beginning of data science idea, to latest designs, changing trends, technologies what make then to the application that are already in real world use as we of now.
The document discusses parallel algorithms and their analysis. It introduces a simple parallel algorithm for adding n numbers using log n steps. Parallel algorithms are analyzed based on their time complexity, processor complexity, and work complexity. For adding n numbers in parallel, the time complexity is O(log n), processor complexity is O(n), and work complexity is O(n log n). The document also discusses models of parallel computation like PRAM and designs of parallel architectures like meshes and hypercubes.
This document provides an introduction to data science. It discusses why data science is important and covers key techniques like statistics, data mining, and visualization. It also reviews popular tools and platforms for data science like R, Hadoop, and real-time systems. Finally, it discusses how data science can be applied across different business domains such as financial services, telecom, retail, and healthcare.
This document discusses parallel computing. It provides examples of tasks that can be solved faster through parallel processing by dividing the work among multiple processors. The key benefits of parallel computing are speeding up tasks and solving problems too large for a single processor. It also discusses limits of parallel computing such as load balancing and Amdahl's law, which places theoretical limits on speedup from additional processors.
Dr. Peter Tröger discusses cloud standards and virtualization. He outlines three basic cloud service models and describes challenges with cloud dependability from the customer and provider perspectives. The document also reviews several standards organizations and specifications relevant to cloud computing, including OVF, OCCI, CDMI, and the Cloud Security Alliance. It emphasizes that while infrastructure standards are maturing, more work remains regarding platforms, software, data models, and billing standards.
Distributed Resource Management Application API (DRMAA) Version 2Peter Tröger
The document describes the Distributed Resource Management Application API (DRMAA) version 2. It provides an overview of DRMAA and its goals of providing a standardized API for distributed resource management systems. It discusses the key components involved in DRMAA including distributed resource management systems, DRMAA implementations/libraries, submission hosts, and execution hosts. It also summarizes the success of DRMAA version 1 and outlines the status and design approach of the new DRMAA version 2 specification.
Design of Software for Embedded SystemsPeter Tröger
This document provides an overview of the Design of Software for Embedded Systems (SWES) course. It discusses the course organization, project requirements, and introduces some basic concepts and terminology related to embedded systems and real-time software. Specifically, it describes the challenges in embedded system design, different types of hardware platforms, characteristics of embedded software, issues related to timeliness and real-time scheduling, and how real-time operating systems address these issues. The document aims to equip students with foundational knowledge on embedded systems and real-time systems engineering.
Human users should not be forced to edit XML documents. Sometimes, they may want to read it.
The presentation persists some arguments I stated about this topic again and again in the past. Discussions and opinions are more than welcome.
What activates a bug? A refinement of the Laprie terminology model.Peter Tröger
The document proposes refinements to the Laprie terminology model for describing software bugs. It introduces concepts of a fault model describing faulty code, a fault condition model describing enabling system states, and an error model describing states where faults are activated and may lead to failures. A failure automaton is presented with states for disabled, dormant, and active faults, as well as detected errors and outages. Events are defined for when fault conditions are fulfilled or no longer fulfilled, faulty code is executed, and failures occur. The refinement aims to separately consider investigated software layers and their environment in order to better describe what activates bugs.
This document provides an overview of dependability and dependable systems. It defines dependability as an umbrella term that includes reliability, availability, maintainability, and other attributes that allow systems to be trusted. Dependability addresses how systems can continue operating correctly even when faults occur. Key topics covered include fault tolerance techniques, error processing, failure modes, and modeling approaches for analyzing dependability. The goal of the course is to understand how to design systems that can be relied upon to deliver their services as specified, even in the presence of faults or unexpected events.
Dependable Systems - Hardware Dependability with Redundancy (14/16)Peter Tröger
1) The document discusses hardware dependability through the use of redundancy. It provides examples of static redundancy like voting and N-modular redundancy as well as dynamic redundancy using techniques like back-up sparing and duplex systems.
2) IBM's zSeries mainframe computers are highlighted as an example of a highly redundant system, using techniques like machine check handling, error correction codes, unit deletion for degradation, and fully redundant I/O subsystems.
3) Redundancy comes at a cost but can effectively improve reliability through techniques that either mask faults or allow systems to reconfigure around faults. The level of redundancy must be weighed against associated costs and design complexity.
Dependable Systems -Fault Tolerance Patterns (4/16)Peter Tröger
The document discusses various patterns for achieving fault tolerance in dependable systems. It covers architectural patterns like units of mitigation and error containment barriers. It also discusses detection patterns such as fault correlation, system monitoring, acknowledgments, voting, and audits. Finally, it discusses error recovery patterns like quarantine, concentrated recovery, and checkpointing to avoid data loss during recovery. The patterns provide reusable solutions for commonly occurring problems in building fault tolerant systems.
Dependable Systems -Dependability Means (3/16)Peter Tröger
This document provides an overview of dependability and dependable systems. It defines dependability as the trustworthiness of a system such that reliance can be placed on the service it delivers. The key aspects of dependability discussed include fault prevention, fault tolerance, fault removal, and fault forecasting. Fault tolerance techniques aim to provide service even in the presence of faults through methods like redundancy, error detection, error processing through recovery, and fault treatment. Dependable system design involves assessing risks, adding redundancy, and designing error detection and recovery capabilities.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Dr. Vinod Kumar Kanvaria
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
2. Course Content
■ Overview of theoretical and practical concepts
■ This course is for you if …
□ … you have skills in software development, regardless of the programming language.
□ … you want to get an overview of parallelization concepts.
□ … you want to assess the feasibility of parallel hardware, software, and libraries for your parallelization problem.
■ This course is not for you if …
□ … you have no practical experience with software development at all.
□ … you want a solution for a specific parallelization problem.
□ … you want to learn one specific parallel programming tool or language in detail.
OpenHPI | Parallel Programming Concepts | Dr. Peter Tröger
4. Course Organization
■ Six lecture weeks, final exam in week 7
■ Several lecture units per week, per unit:
□ Video, slides, non-graded self-test
□ Sometimes mandatory and optional readings
□ Sometimes optional programming tasks
□ Week finished with a graded assignment
■ Six graded assignments sum up to max. 90 points
■ Graded final exam with max. 90 points
■ OpenHPI certificate awarded for getting ≥90 points in total
■ Forum can be used to discuss with other participants
■ FAQ is constantly updated
7. Computer Markets
■ Embedded and Mobile Computing
□ Cars, smartphones, entertainment industry, medical devices, …
□ Power/performance and price as relevant issues
■ Desktop Computing
□ Price/performance ratio and extensibility as relevant issues
■ Server Computing
□ Business service provisioning as typical goal
□ Web servers, banking back-end, order processing, ...
□ Performance and availability as relevant issues
■ Most software benefits from having better performance
■ The computer hardware industry is constantly delivering it
9. Three Ways Of Doing Anything Faster
[Pfister]
■ Work harder (clock speed)
□ Hardware solution
□ No longer feasible
■ Work smarter (optimization, caching)
□ Hardware solution
□ No longer feasible as only solution
■ Get help (parallelization)
□ Hardware + Software in cooperation
(Figure: an application as a stream of instructions over time)
10. Parallel Programming Concepts
OpenHPI Course
Week 1 : Terminology and fundamental concepts
Unit 1.2: Moore’s Law and the Power Wall
Dr. Peter Tröger + Teaching Team
11. Processor Hardware
■ First computers had fixed programs (e.g. electronic calculator)
■ Von Neumann architecture (1945)
□ Instructions for central processing unit (CPU) in memory
□ Program is treated as data
□ Loading of code during runtime, self-modification
■ Multiple such processors: Symmetric multiprocessing (SMP)
(Figure: von Neumann architecture: a CPU with control unit and arithmetic logic unit, connected to memory and input/output over a bus)
12. Moore’s Law
■ “...the number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially, doubling approximately every two years. ...” (Gordon Moore, 1965)
□ CPUs contain different hardware parts, such as logic gates
□ Parts are built from transistors
□ Rule of exponential growth for the number of transistors on one CPU chip
□ Meanwhile a self-fulfilling prophecy
□ Applied not only in the processor industry, but also in other areas
□ Sometimes misinterpreted as a performance indication
□ May still hold for the next 10-20 years
[Wikipedia]
14. Moore’s Law vs. Software
■ Nathan P. Myhrvold, “The Next Fifty Years of Software”, 1997
□ “Software is a gas. It expands to fit the container it is in.”
◊ Constant increase in the amount of code
□ “Software grows until it becomes limited by Moore’s law.”
◊ Software often grows faster than hardware capabilities
□ “Software growth makes Moore’s Law possible.”
◊ The software and hardware markets stimulate each other
□ “Software is only limited by human ambition & expectation.”
◊ People will always find ways for exploiting performance
■ Jevons paradox:
□ “Technological progress that increases the efficiency with which a resource is used tends to increase (rather than decrease) the rate of consumption of that resource.”
16. A Physics Problem
■ Power: Energy needed to run the processor
■ Static power (SP): Leakage in transistors while being inactive
■ Dynamic power (DP): Energy needed to switch a transistor
■ Moore’s law: N goes up exponentially, C goes down with size
■ Power dissipation demands cooling
□ Power density: Watt/cm2
■ Make dynamic power increase less dramatic:
□ Bringing down V reduces energy consumption, quadratically!
□ Don’t use N only for logic gates
■ Industry was able to increase the frequency (F) for decades
DP ≈ Number of Transistors (N) × Capacitance (C) × Voltage² (V²) × Frequency (F)
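The quadratic voltage term is what makes voltage scaling so attractive. A minimal numerical sketch of the DP formula above (all values invented for illustration, not measurements):

```python
def dynamic_power(n_transistors, capacitance, voltage, frequency):
    """Approximate dynamic power: DP ≈ N * C * V^2 * F."""
    return n_transistors * capacitance * voltage ** 2 * frequency

# Illustrative baseline (invented values)
base = dynamic_power(n_transistors=1e9, capacitance=1e-15,
                     voltage=1.2, frequency=3e9)

# Lowering the supply voltage by 25% (1.2 V -> 0.9 V), everything else fixed
reduced = dynamic_power(n_transistors=1e9, capacitance=1e-15,
                        voltage=0.9, frequency=3e9)

ratio = reduced / base  # (0.9 / 1.2)^2 ≈ 0.5625, i.e. ~44% less dynamic power
```

A 25% voltage reduction cuts dynamic power almost in half, which is why the industry reduced V for decades while raising F.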
17. Processor Supply Voltage
(Figure: processor supply voltage in volts, 1970-2010, log scale) [Moore, ISSCC]
18. Power Density
■ Growth of watts per square centimeter in microprocessors
■ Higher temperatures: Increased leakage, slower transistors
(Figure: microprocessor power density, 1992-2005, rising past the “hot plate” level toward the air-cooling limit)
19. Power Density
[Kevin Skadron, 2007]
“Cooking-Aware” Computing?
20. Second Problem: Leakage Increase
(Figure: processor power in watts, active and leakage components, 1960-2010, log scale) [www.ieeeghn.org]
■ Static leakage today: Up to 40% of CPU power consumption
21. The Power Wall
■ Air cooling capabilities are limited
□ Maximum temperature of 100-125 °C, hot spot problem
□ Static and dynamic power consumption must be limited
■ Power consumption increases with Moore‘s law, but growth in hardware performance is still expected
■ Further reducing the voltage as compensation
□ We can’t do that endlessly; the lower limit is around 0.7 V
□ Below that, strange physical effects occur
■ Next-generation processors need to use even less power
□ Lower the frequencies, scale them dynamically
□ Use only parts of the processor at a time (‘dark silicon’)
□ Build energy-efficient special purpose hardware
■ No chance for faster processors through frequency increase
22. The Free Lunch Is Over
■ Clock speed curve flattened in 2003
□ Heat, power, leakage
■ Speeding up serial instruction execution through clock speed improvements no longer works
■ Additional issues
□ ILP wall
□ Memory wall
[Herb Sutter, 2009]
23. Parallel Programming Concepts
OpenHPI Course
Week 1 : Terminology and fundamental concepts
Unit 1.3: ILP Wall and Memory Wall
Dr. Peter Tröger + Teaching Team
24. Three Ways Of Doing Anything Faster
[Pfister]
■ Work harder (clock speed)
□ Hardware solution
! Power wall problem
■ Work smarter (optimization, caching)
□ Hardware solution
■ Get help (parallelization)
□ Hardware + Software in cooperation
(Figure: an application as a stream of instructions over time)
25. Instruction Level Parallelism
■ Increasing the frequency is no longer an option
■ Provide smarter instruction processing for better performance
■ Instruction level parallelism (ILP)
□ Processor hardware optimizes low-level instruction execution
□ Instruction pipelining
◊ Overlapped execution of serial instructions
□ Superscalar execution
◊ Multiple units of one processor are used in parallel
□ Out-of-order execution
◊ Reorder instructions that do not have data dependencies
□ Speculative execution
◊ Control flow speculation and branch prediction
■ Today’s processors are packed with such ILP logic
26. The ILP Wall
■ No longer cost-effective to dedicate new transistors to ILP mechanisms
■ Deeper pipelines make the power problem worse
■ High ILP complexity effectively reduces the processing speed for a given frequency (e.g. misprediction)
■ More aggressive ILP technologies are too risky due to unknown real-world workloads
■ No ground-breaking new ideas
→ “ILP wall”
■ OK, let’s use the transistors for better caching
[Wikipedia]
27. Caching
■ von Neumann architecture
□ Instructions are stored in main memory
□ Program is treated as data
□ For each instruction execution, data must be fetched
■ When the frequency increases, main memory becomes a performance bottleneck
■ Caching: Keep data copy in very fast, small memory on the CPU
(Figure: von Neumann architecture extended with a cache between the CPU and the bus)
29. Memory Hardware Hierarchy
(Figure: four CPU cores, each with a private L1 cache; pairs of cores share an L2 cache; all cores share an L3 cache; the levels are connected by buses. L = Level)
30. Caching for Performance
■ Well established optimization technique for performance
■ Caching relies on data locality
□ Some instructions are often used (e.g. loops)
□ Some data is often used (e.g. local variables)
□ Hardware keeps a copy of the data in the faster cache
□ On read attempts, data is taken directly from the cache
□ On write, data is cached and eventually written to memory
■ Similar to ILP, the potential is limited
□ Larger caches do not help automatically
□ At some point, all data locality in the code is already exploited
□ Manual vs. compiler-driven optimization
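The dependence on data locality can be made concrete with a toy cache model. This is a sketch, not a real cache: a direct-mapped, word-addressed cache with invented sizes, counting how often an access misses:

```python
def cache_misses(addresses, num_lines=64, line_size=8):
    """Count misses in a toy direct-mapped cache (word-addressed)."""
    lines = [None] * num_lines          # which memory block each cache line holds
    misses = 0
    for addr in addresses:
        block = addr // line_size       # memory block containing this word
        index = block % num_lines       # direct-mapped line for that block
        if lines[index] != block:       # block not cached: miss, load it
            misses += 1
            lines[index] = block
    return misses

n = 4096
sequential = cache_misses(range(n))            # good locality: 512 misses
strided = cache_misses(range(0, n * 64, 64))   # poor locality: every access misses
```

Sequential access touches each 8-word block eight times and misses only on the first touch; the strided access pattern touches a new block every time, so the same cache gives no benefit at all. This is the limit the slide describes: once the code has no locality left, a larger cache does not help.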
[arstechnica.com]
31. Memory Wall
■ If caching is limited, we simply need faster memory
■ The problem: Shared memory is ‘shared’
□ Interconnect contention
□ Memory bandwidth
◊ Memory transfer speed is limited by the power wall
◊ Memory transfer size is limited by the power wall
■ Transfer technology cannot keep up with GHz processors
■ Memory is too slow; its effects cannot be hidden through caching completely
→ “Memory wall”
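A back-of-envelope calculation illustrates why shared memory becomes the bottleneck (all numbers are invented for illustration, not taken from the slides):

```python
# Hypothetical machine: 8 cores at 3 GHz, each issuing one 8-byte load per cycle
cores = 8
frequency_hz = 3e9          # 3 GHz clock
bytes_per_access = 8        # one 64-bit word
accesses_per_cycle = 1      # assumed per-core memory access rate

# Bandwidth the memory system would need if no access were served by a cache
demand_gb_s = cores * frequency_hz * accesses_per_cycle * bytes_per_access / 1e9
```

With these assumptions the cores demand 192 GB/s, far beyond what a shared memory bus of that era could deliver; caches absorb most of this demand, but as the slide notes, not all of it.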
[dell.com]
32. Problem Summary
■ Hardware perspective
□ Number of transistors N is still increasing
□ Building larger caches no longer helps (memory wall)
□ ILP is out of options (ILP wall)
□ Voltage / power / frequency is at the limit (power wall)
◊ Some help with dynamic scaling approaches
□ Remaining option: Use N for more cores per processor chip
■ Software perspective
□ Performance must come from the utilization of this increasing core count per chip, since F is now fixed
□ Software must tackle the memory wall
33. Three Ways Of Doing Anything Faster
[Pfister]
■ Work harder (clock speed)
! Power wall problem
! Memory wall problem
■ Work smarter (optimization, caching)
! ILP wall problem
! Memory wall problem
■ Get help (parallelization)
□ More cores per single CPU
□ Software needs to exploit them in the right way
! Memory wall problem
(Figure: a problem split across the cores of one CPU)
34. Parallel Programming Concepts
OpenHPI Course
Week 1 : Terminology and fundamental concepts
Unit 1.4: Parallel Hardware Classification
Dr. Peter Tröger + Teaching Team
35. Parallelism [Mattson et al.]
■ Task
□ A parallel program breaks a problem into tasks
■ Execution unit
□ Representation of a concurrently running task (e.g. thread)
□ Tasks are mapped to execution units
■ Processing element (PE)
□ Hardware element running one execution unit
□ Depends on the scenario: logical processor vs. core vs. machine
□ Execution units run simultaneously on processing elements, controlled by some scheduler
■ Synchronization: mechanism to order the activities of parallel tasks
■ Race condition: the program result depends on the scheduling order
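This vocabulary maps directly onto thread-pool APIs. A minimal sketch using Python's standard library (the chunk size and worker count are arbitrary choices for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def task(chunk):
    """One task: sum a slice of the overall problem."""
    return sum(chunk)

data = list(range(1000))
# Break the problem into four tasks ...
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

# ... and map the tasks onto execution units (threads); the scheduler
# runs those on the available processing elements
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(task, chunks))

total = sum(partials)  # 499500, independent of the scheduling order
```

Because each task works on its own slice and the results are combined only afterwards, no synchronization is needed and there is no race condition: the result is the same for every scheduling order.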
36. Faster Processing through Parallelization
(Figure: a program decomposed into concurrent tasks)
37. Flynn‘s Taxonomy (1966)
■ Classify parallel hardware architectures according to their capabilities in the instruction and data processing dimensions
(Figure: the four classes SISD, SIMD, MISD, and MIMD, each drawn as a processing step that consumes one or more instructions and data items and produces output)
38. Flynn's Taxonomy (1966)
■ Single Instruction, Single Data (SISD)
□ No parallelism in the execution
□ Old single processor architectures
■ Single Instruction, Multiple Data (SIMD)
□ Multiple data streams processed with one instruction stream
at the same time
□ Typical in graphics hardware and GPU accelerators
□ Special SIMD machines in high-performance computing
■ Multiple Instructions, Single Data (MISD)
□ Multiple instructions applied to the same data in parallel
□ Rarely used in practice, only for fault tolerance
■ Multiple Instructions, Multiple Data (MIMD)
□ Every modern processor, compute clusters
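The four classes can be mimicked with ordinary functions. A toy sketch (all names invented here) that models an "instruction" as a function and a "data stream" as a list; it illustrates the classification only, not real hardware:

```python
# Toy model of Flynn's taxonomy: an "instruction" is a function,
# a "data stream" is a list of items.

def sisd(instr, item):                 # one instruction, one data item
    return instr(item)

def simd(instr, items):                # one instruction, many data items
    return [instr(x) for x in items]

def misd(instrs, item):                # many instructions, one data item
    return [f(item) for f in instrs]

def mimd(instrs, items):               # many instructions, many data items
    return [f(x) for f, x in zip(instrs, items)]

double = lambda x: 2 * x
square = lambda x: x * x

print(sisd(double, 3))                 # 6
print(simd(double, [1, 2, 3]))         # [2, 4, 6]: GPU-style data parallelism
print(misd([double, square], 5))       # [10, 25]
print(mimd([double, square], [3, 4]))  # [6, 16]
```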
39. Parallelism on Different Levels
[Figure: programs are broken into tasks, and tasks are mapped to processing elements (PEs); several PEs share the memory of one node, and nodes are connected by a network]
40. Parallelism on Different Levels
■ A processor chip (socket)
□ Chip multi-processing (CMP)
◊ Multiple CPUs per chip, called cores
◊ Multi-core / many-core
□ Simultaneous multi-threading (SMT)
◊ Interleaved execution of tasks on one core
◊ Example: Intel Hyperthreading
□ Chip multi-threading (CMT) = CMP + SMT
□ Instruction-level parallelism (ILP)
◊ Parallel processing of single instructions per core
■ Multiple processor chips in one machine (multi-processing)
□ Symmetric multi-processing (SMP)
■ Multiple processor chips in many machines (multi-computer)
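These levels are partly visible from software. A small sketch under the assumption of a standard CPython installation: `os.cpu_count()` reports logical processors (so SMT threads count individually), and a process pool maps tasks onto them:

```python
# Sketch: what the OS exposes of CMP/SMT, and mapping tasks onto it.
import os
import multiprocessing as mp

def task(x):
    return x * x

if __name__ == "__main__":
    # Logical PEs: on an SMT chip (e.g. Intel Hyperthreading) this is
    # typically a multiple of the physical core count.
    print("logical processing elements:", os.cpu_count())

    # One worker process per logical PE by default; the pool's scheduler
    # maps the tasks onto the available execution units.
    with mp.Pool() as pool:
        print(pool.map(task, range(6)))   # [0, 1, 4, 9, 16, 25]
```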
41. Parallelism on Different Levels
[Figure from arstechnica.com: a chip multi-processing (CMP) architecture in which every core additionally provides ILP and SMT]
42. Parallel Programming Concepts
OpenHPI Course
Week 1 : Terminology and fundamental concepts
Unit 1.5: Memory Architectures
Dr. Peter Tröger + Teaching Team
43. Parallelism on Different Levels
[Figure: programs are broken into tasks, and tasks are mapped to processing elements (PEs); several PEs share the memory of one node, and nodes are connected by a network]
44. Shared Memory vs. Shared Nothing
■ Organization of parallel processing hardware as …
□ Shared memory system
◊ Tasks can directly access a common address space
◊ Implemented as memory hierarchy with different cache levels
□ Shared nothing system
◊ Tasks can only access local memory
◊ Global coordination of parallel execution by explicit
communication (e.g. messaging) between tasks
□ Hybrid architectures possible in practice
◊ Cluster of shared memory systems
◊ Accelerator hardware in a shared memory system
● Dedicated local memory on the accelerator
● Example: SIMD GPU hardware in SMP computer system
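The distinction maps directly onto threads vs. processes in everyday programming. A minimal sketch, as an analogy in Python rather than a hardware model: threads share one address space, while a child process has private memory and must send results back explicitly:

```python
# Shared memory vs. shared nothing, illustrated with threads and processes.
import threading
import multiprocessing as mp

# Shared memory: both execution units see the same list object.
shared = []

def thread_worker():
    shared.append("hello")       # change is directly visible to all threads

t = threading.Thread(target=thread_worker)
t.start()
t.join()
print(shared)                    # ['hello']: no message needed

# Shared nothing: the child process has its own memory; coordination
# happens by explicit communication over a message queue.
def process_worker(queue):
    queue.put("hello")           # send a message instead of writing memory

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=process_worker, args=(q,))
    p.start()
    print(q.get())               # 'hello': arrived via message passing
    p.join()
```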
45. Shared Memory vs. Shared Nothing
■ Pfister: “shared memory” vs. “distributed memory”
■ Foster: “multiprocessor” vs. “multicomputer”
■ Tanenbaum: “shared memory” vs. “private memory”
[Figure: left, tasks on processing elements directly access data in one shared memory; right, tasks on processing elements each hold private data and coordinate by exchanging messages]
46. Shared Memory
■ Processing elements act independently
■ Use the same global address space
■ Changes are visible for all processing elements
■ Uniform memory access (UMA) system
□ Equal access time for all PEs to all memory locations
□ Default approach for SMP systems of the past
■ Non-uniform memory access (NUMA) system
□ Delay on memory access according to the accessed region
□ Typically due to core / processor interconnect technology
■ Cache-coherent NUMA (CC-NUMA) system
□ NUMA system that keeps all caches consistent
□ Transparent hardware mechanisms
□ Became the standard approach with recent x86 chips
47. UMA Example
■ Two dual-core processor chips in an SMP system
■ Level 1 cache (fast, small), Level 2 cache (slower, larger)
■ Hardware manages cache coherency among all cores
[Figure: two sockets, each a dual-core chip with per-core L1 caches and a shared L2 cache, attached via the system bus to a chipset / memory controller and RAM]
48. NUMA Example
■ Eight cores on 2 sockets in an SMP system
■ Memory controllers + chip interconnect realize a single memory
address space for the software
[Figure: two sockets, each with four cores (per-core L1 and L2 caches, a shared L3 cache) and its own memory controller with local RAM; a chip interconnect joins the sockets into one address space]
49. NUMA Example: 4-way Intel Nehalem SMP
[Figure: four Nehalem sockets, each with four cores, an L3 cache, an integrated memory controller with local memory, and I/O links; the sockets are fully connected by QPI links]
50. Shared Nothing
■ Processing elements no longer share a common global memory
■ Easy scale-out by adding machines to the messaging network
■ Cluster computing: Combine machines with cheap interconnect
□ Compute cluster: Speedup for an application
◊ Batch processing, data parallelism
□ Load-balancing cluster: Better throughput for some service
□ High Availability (HA) cluster: Fault tolerance
■ Cluster to the extreme
□ High Performance Computing (HPC)
□ Massively Parallel Processing (MPP) hardware
□ TOP500 list of the fastest supercomputers
54. Example: Cluster of Nehalem SMPs
[Figure: several Nehalem SMP nodes connected by a network]
55. The Parallel Programming Problem
■ Execution environment has a particular type
(SIMD, MIMD, UMA, NUMA, …)
■ Execution environment may be configurable (number of resources)
■ Parallel application must be mapped to available resources
[Figure: a parallel application with a flexible configuration must match the type and configuration of the execution environment]
56. Parallel Programming Concepts
OpenHPI Course
Week 1 : Terminology and fundamental concepts
Unit 1.6: Speedup and Scaleup
Dr. Peter Tröger + Teaching Team
57. Which One Is Faster ?
■ Usage scenario
□ Transporting a fridge
■ Usage environment
□ Driving through a forest
■ Perception of performance
□ Maximum speed
□ Average speed
□ Acceleration
■ We need some kind of
application-specific benchmark
58. Parallelism for …
■ Speedup – compute faster
■ Throughput – compute more in the same time
■ Scalability – compute faster / more with additional resources
■ …
[Figure: scaling up adds processing elements to the main memory of one machine; scaling out adds further machines]
59. Metrics
■ Parallelization metrics are application-dependent,
but follow a common set of concepts
□ Speedup: Adding more resources leads to less time for
solving the same problem.
□ Linear speedup:
n times more resources → n times speedup
□ Scaleup: Adding more resources solves a larger version of the
same problem in the same time.
□ Linear scaleup:
n times more resources → n times larger problem solvable
■ The most important goal depends on the application
□ Throughput demands scalability of the software
□ Response time demands speedup of the processing
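The two metrics can be written down as tiny formulas. A sketch with invented example numbers (the function names are not from the course):

```python
# Speedup and scaleup as formulas.

def speedup(t_1, t_n):
    """Same problem: time on 1 resource divided by time on n resources."""
    return t_1 / t_n

def scaleup(work_n, work_1):
    """Same time: problem size solvable with n resources divided by
    the size solvable with 1 resource."""
    return work_n / work_1

# Linear speedup: n = 3 times more resources, 3 times faster.
print(speedup(12, 4))      # 3.0
# Linear scaleup: n = 3 times more resources, 3 times larger problem.
print(scaleup(300, 100))   # 3.0
```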
60. Speedup
■ Idealized assumptions
□ All tasks are equal sized
□ All code parts can run in parallel
[Figure: an application of 12 tasks, executed sequentially and on three processing elements]
Tasks: v=12, Processing elements: N=1, Time needed: T1=12
Tasks: v=12, Processing elements: N=3, Time needed: T3=4
(Linear) Speedup: T1/T3 = 12/4 = 3
61. Speedup with Load Imbalance
■ Assumptions
□ Tasks have different size,
best-possible speedup depends
on optimized resource usage
□ All code parts can run in parallel
Application
2
3
4
5
6
7
8
9
10
11
12
t t
1
2
3
4
1
5
6
7
8
9
10
11
12
Tasks: v=12
Processing elements: N= 3
Time needed: T3= 6
Speedup: T1/T3=16/6=2.67
Tasks: v=12
Processing elements: N=1
Time needed: T1=16
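The slide's numbers can be reproduced with a greedy scheduler. A sketch with hypothetical task sizes chosen to sum to 16 (the individual sizes are not given on the slide):

```python
# Greedy (longest-processing-time) scheduling of unequal tasks onto PEs.
# The parallel time is the makespan: the load of the busiest PE.

def greedy_makespan(task_sizes, n_pe):
    loads = [0] * n_pe
    for size in sorted(task_sizes, reverse=True):   # biggest tasks first
        i = loads.index(min(loads))                 # pick the least-loaded PE
        loads[i] += size
    return max(loads)

tasks = [3, 2, 2] + [1] * 9       # 12 tasks, hypothetical sizes, total 16
t1 = sum(tasks)                   # serial time: 16
t3 = greedy_makespan(tasks, 3)    # parallel time on 3 PEs: 6
print(t1, t3, round(t1 / t3, 2))  # 16 6 2.67
```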
62. Speedup with Serial Parts
■ Each application has inherently non-parallelizable serial parts
□ Algorithmic limitations
□ Shared resources acting as bottleneck
□ Overhead for program start
□ Communication overhead in shared-nothing systems
[Figure: execution timeline alternating serial phases (tSER1, tSER2, tSER3) and parallel phases (tPAR1, tPAR2) while processing the 12 tasks]
63. Amdahl’s Law
■ Gene Amdahl. “Validity of the single processor approach to achieving
large scale computing capabilities”. AFIPS 1967
□ Serial parts TSER = tSER1 + tSER2 + tSER3 + …
□ Parallelizable parts TPAR = tPAR1 + tPAR2 + tPAR3 + …
□ Execution time with one processing element:
T1 = TSER+TPAR
□ Execution time with N parallel processing elements:
TN >= TSER + TPAR / N
◊ Equal only on perfect parallelization,
e.g. no load imbalance
□ Amdahl’s Law for maximum speedup with N processing elements
S = T1 / TN = (TSER + TPAR) / (TSER + TPAR / N)
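Amdahl's formula is easy to explore numerically. A sketch (the function name is invented) showing how 90% parallelizable code stays below 10x speedup no matter how many processing elements are added:

```python
# Amdahl's Law: S = (TSER + TPAR) / (TSER + TPAR / N)

def amdahl_speedup(t_ser, t_par, n):
    return (t_ser + t_par) / (t_ser + t_par / n)

# 10% serial, 90% parallelizable (T1 = 1):
print(amdahl_speedup(0.1, 0.9, 10))      # ~5.26 with 10 PEs
print(amdahl_speedup(0.1, 0.9, 1_000))   # ~9.91 with 1000 PEs
print(amdahl_speedup(0.1, 0.9, 10**9))   # approaches 1/TSER = 10
```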
65. Amdahl’s Law
■ Speedup through parallelism is hard to achieve
■ For unlimited resources, speedup is bound by the serial parts:
□ Assume T1=1
■ Parallelization problem relates to all system layers
□ Hardware offers some degree of parallel execution
□ Speedup gained is bound by serial parts:
◊ Limitations of hardware components
◊ Necessary serial activities in the operating system,
virtual runtime system, middleware and the application
◊ Overhead for the parallelization itself
S(N→∞) = T1 / T(N→∞) = 1 / TSER (assuming T1 = 1)
66. Amdahl’s Law
■ “Everyone knows Amdahl’s law, but quickly forgets it.”
[Thomas Puzak, IBM]
■ 90% parallelizable code leads to not more than 10x speedup
□ Regardless of the number of processing elements
■ Parallelism is only useful …
□ … for a small number of processing elements
□ … for highly parallelizable code
■ What’s the sense in big parallel / distributed hardware setups?
■ Relevant assumptions
□ Put the same problem on different hardware
□ Assumption of fixed problem size
□ Only consideration of execution time for one problem
67. Gustafson-Barsis’ Law (1988)
■ Gustafson and Barsis: People are typically not interested in the
shortest execution time
□ Rather solve a bigger problem in reasonable time
■ Problem size could then scale with the number of processors
□ Typical in simulation and farmer / worker problems
□ Leads to larger parallel fraction with increasing N
□ Serial part is usually fixed or grows slower
■ Maximum scaled speedup by N processors:
■ Linear speedup now becomes possible
■ Software needs to ensure that serial parts remain constant
■ Other models exist (e.g. Work-Span model, Karp-Flatt metric)
S = (TSER + N · TPAR) / (TSER + TPAR)
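The scaled-speedup formula can be contrasted with Amdahl's directly. A sketch (the function name is invented): with a fixed serial part, the scaled speedup grows almost linearly in N:

```python
# Gustafson-Barsis: S = (TSER + N * TPAR) / (TSER + TPAR)

def gustafson_speedup(t_ser, t_par, n):
    return (t_ser + n * t_par) / (t_ser + t_par)

# 10% serial, 90% parallel; the problem size grows with N:
print(gustafson_speedup(0.1, 0.9, 10))    # ~9.1, near-linear
print(gustafson_speedup(0.1, 0.9, 100))   # ~90.1
```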
68. Summary: Week 1
■ Moore’s Law and the Power Wall
□ Processing element speed no longer increases
■ ILP Wall and Memory Wall
□ Memory access is not fast enough for modern hardware
■ Parallel Hardware Classification
□ From ILP to SMP, SIMD vs. MIMD
■ Memory Architectures
□ UMA vs. NUMA
■ Speedup and Scaleup
□ Amdahl’s Law and Gustafson’s Law
Since we need parallelism for speedup,
how can we express it in software?