We live in an era where the atomic building blocks of silicon computers, e.g., transistors and wires, are no longer visible under traditional optical microscopes, and their sizes are measured in just tens of Ångströms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other effects, has led to stagnating processor clock frequencies. The current trend taken by the industry is to add more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area.
Programmable Exascale Supercomputer
1. Building affordable and programmable exascale capable computers
SURFsara, 18 April 2019
Georgi Gaydadjiev, Director of Maxeler IoT-Labs BV, Delft
Honorary Visiting Professor at the Department of Computing, Imperial College London
2. Think Ångströms not nanometers in 2019
Ideally we should steer the movements of almost every individual electron to solve our problems.
0.1 nm = 1 Å; 14 nm = 140 Å. 100s of Si atoms fit in 14 nm, while a 3 nm (30 Å) feature is only 6 to 12 atoms wide.
[Figure: logarithmic length scale from 1 Å to 10^4 Å, placing the C-C bond, glucose, hemoglobin, ribosome, DNA and cells, with the light-microscope resolution limit marked far above today's feature sizes.]
3. The power challenge
Moving data on-chip will use as much energy as computing with it. Moving data off-chip will use 200x more energy, and is much slower as well.

                                    Today*       2020
  Double-precision float op         ~20 pJ      <10 pJ
  Moving data on-chip: 1 mm           6 pJ
  Moving data on-chip: 20 mm        120 pJ
  Moving data to off-chip memory  4,000 pJ    2,000 pJ

The data movement challenge: next-generation computer systems should take data movement at all levels very seriously.
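The numbers in this table support a quick back-of-the-envelope energy model. The sketch below uses the slide's "Today" pJ figures; the kernel size (1,000 operations on 1,000 words) is an invented example, not from the deck:

```python
# Energy cost per event, in picojoules ("Today" column of the slide).
FLOP_PJ = 20          # one double-precision floating-point operation
ON_CHIP_1MM_PJ = 6    # moving one word 1 mm across the chip
OFF_CHIP_PJ = 4000    # moving one word to off-chip memory

def kernel_energy_pj(n_ops, n_words_moved, off_chip=True):
    """Total energy for n_ops FLOPs plus n_words_moved data transfers."""
    move_pj = OFF_CHIP_PJ if off_chip else ON_CHIP_1MM_PJ
    return n_ops * FLOP_PJ + n_words_moved * move_pj

# A hypothetical kernel doing 1,000 FLOPs on 1,000 words:
streamed = kernel_energy_pj(1000, 1000, off_chip=False)  # data kept on-chip
thrashing = kernel_energy_pj(1000, 1000, off_chip=True)  # every word from DRAM
print(streamed, thrashing)  # 26000 4020000
```

At 4,000 pJ per off-chip access versus ~20 pJ per FLOP, a kernel that fetches every operand from DRAM spends roughly 200x more energy per word moved than per operation computed, which is exactly why the deck argues for keeping streams on-chip.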
4. Data movement challenge confirmed
Wires that carry the data (and instructions) become more and more important. (Courtesy: NVIDIA and ITRS)
BUT ...
5. Why is all of this important?
"... without dramatic increases in efficiency, ICT industry could use 20% of all electricity and emit up to 5.5% of the world's carbon emissions by 2025."
"We have a tsunami of data approaching. Everything which can be is being digitalised. It is a perfect storm."
"... a single $1bn Apple data centre planned for Athenry in Co Galway, expects to eventually use 300MW of electricity, or over 8% of the national capacity and more than the daily entire usage of Dublin. It will require 144 large diesel generators as back up for when the wind does not blow."
6. Build Computers for your Problem and Data
Computing in Time: follow a recipe step by step, one at a time.
Computing in Space: build a "recipe-specific" factory with multiple paths performed simultaneously, producing one result per clock cycle.
Efficient, predictable, reliable "mass production" of huge amounts of data.
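The contrast between the two models can be reduced to pipeline arithmetic. A minimal sketch with an invented input count, and a stage count in the range MaxCompiler graphs reach (slide 10 mentions graphs with 10,000s of stages):

```python
def cycles_in_time(n_inputs, n_stages):
    """Sequential ("in time"): each input walks through every step
    before the next input starts."""
    return n_inputs * n_stages

def cycles_in_space(n_inputs, n_stages):
    """Pipelined ("in space"): after the pipeline fills,
    one result emerges per clock cycle."""
    return n_stages + n_inputs - 1

n, s = 1_000_000, 10_000
print(cycles_in_time(n, s))   # 10,000,000,000 cycles
print(cycles_in_space(n, s))  # 1,009,999 cycles: ~s of fill, then 1 result/cycle
```

For long streams the fill cost is amortised away, which is why the deck can promise a result on every cycle regardless of how deep the "factory" is.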
7. Programming a Dataflow "mass production" Engine
1. Describe Conjugate Gradient as a dataflow graph
2. Compile the dataflow structure and load it to hardware
3. Stream data through the custom accelerator
Create customized mega-accelerators with massive inherent throughputs.
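The three steps can be mimicked in a few lines of plain Python: describe the computation as a graph of operations, "compile" it into a streaming pipeline, then stream data through it. This is only a toy analogue of what MaxCompiler does; the graph here is a trivial two-node kernel (the x*x + 30 example from the last slide), not Conjugate Gradient:

```python
# Step 1: describe the computation as a dataflow graph (nodes = operations).
graph = [
    ("square", lambda a: a * a),
    ("add30", lambda a: a + 30),
]

# Step 2: "compile" the graph into a single streaming pipeline function.
def compile_graph(nodes):
    def pipeline(stream):
        for value in stream:          # Step 3: stream data through it
            for _name, op in nodes:
                value = op(value)
            yield value
    return pipeline

accelerator = compile_graph(graph)
print(list(accelerator([1, 2, 3])))   # [31, 34, 39]
```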
8. Solving Computing Problems Vertically
[Figure: the system stack from Problem through Algorithm, Program in HL Language, Machine Architecture, Implementation and Circuits, down to Devices, with solutions spanning the layers.]
Co-optimise the HW and the SW stack for the performance-critical areas of the application.
9. From Equations to Dataflow Hardware
[Figure: a set of governing equations and their mapping onto dataflow hardware.]
10. Real data flow graph as generated by MaxCompiler
4,866 nodes; 10,000s of stages/cycles.
Full customization in: Space, Value and Time (SVT).
11. Easy it is not (and not really new)
Slotnick's law (of effort): "The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage."
Daniel Slotnick (1931-1985), Chief Architect of Illiac IV
12. Programing in Space basics
12
• Control and Data-flows are decoupled
– Both are fully programmable
• Operations exist in space and by default run in parallel
– Their number is limited only by the available space
• All operations can be customized at various levels
– e.g., from algorithm down to the number representation
• Multiple operations constitute kernels
• Data streams through the operations / kernels
• The data transport and processing can be balanced
• All resources work all of the time for max performance
• The In/Out data rates determine the operating frequency
Equally spread the available “forces” and move no faster than required by the application
12
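The last bullet can be made concrete with a back-of-envelope calculation (numbers illustrative):

```python
# Sketch of "the In/Out data rates determine the operating frequency":
# a kernel consuming bytes_per_cycle each cycle need run no faster than
# the link can feed it. Illustrative numbers, not a Maxeler spec.

def required_frequency_hz(link_bytes_per_sec, bytes_per_cycle):
    return link_bytes_per_sec / bytes_per_cycle

# A 2 GB/s input stream feeding a kernel that reads one 8-byte value
# per cycle only needs a 250 MHz clock.
print(required_frequency_hz(2e9, 8) / 1e6, "MHz")  # → 250.0 MHz
```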
13. The Computational Model
• Dataflow sub-system (DataFlow Engine, DFE)
– Spatial arithmetic chip “hardware” technology with flexible arithmetic units
and programmable interconnect (resembles an FPGA, but is not limited to one)
– Programmable Static Dataflow
– Systolic Execution at kernel level
– Streaming Custom Computing at system level
– Implicit GALS* IO and kernel-to-kernel communication
• Dedicated software (MaxCompiler, MaxelerOS and SLiC)
– compilation toolchain and design methodology
– Incorporated simulation and debug environment for rapid development
– Linux fully integrated runtime system and low level software support
– Help designer focus on the data/algorithm and the system architecture
• Only three basic memory types (explicitly exposed)
– Scalars (exposed to the CPU)
– Fast Memory (FMEM): small and fast (on-chip)
– Large Memory (LMEM): large and slow (off-chip)
* GALS – Globally Asynchronous Locally Synchronous
13
14. Maxeler’s DataFlow Engine (DFE, MAX4)
[DFE block diagram: a reconfigurable compute fabric (dataflow cores & FMEM, Fast Memory) connected to LMEM (Large Memory, 4-96GB) over a high-bandwidth memory link; MaxRing interconnect links to other DFEs; link to the main data network (e.g., PCIe, Infiniband)]
• 48GB DRAM (LMEM)
• Stratix V D8
• MaxRing interconnect
• 4,000 multipliers
• 700K logic cells
• 6.25MB of FMEM
14
16. MaxJ: Moving Average of three numbers
Dataflow computing in hardware using a language you know
16
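The kernel code for this slide did not survive extraction. As an illustrative software model (plain Python, not MaxJ), a three-point moving average over a stream computes, for each interior element, the mean of its neighbours and itself:

```python
# Illustrative software model of what a three-point moving-average kernel
# computes over a stream: each output is the mean of the previous,
# current and next input value. Not MaxJ; boundary handling omitted.

def moving_average_3(stream):
    out = []
    for i in range(1, len(stream) - 1):
        out.append((stream[i - 1] + stream[i] + stream[i + 1]) / 3)
    return out

print(moving_average_3([1, 2, 3, 4, 5]))  # → [2.0, 3.0, 4.0]
```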
17. Simple example: y = x² + 30
[Dataflow graph: x feeds both inputs of a multiplier; the product is added to the constant 30 to produce y]
DFEVar x = io.input("x", dfeFloat(10,31));
DFEVar result = x * x + 30;
io.output("y", result, dfeFloat(10,31));
17
18. MaxJ example: Control in Space
[Dataflow graph: both x+1 and x-1 are computed in space; the comparison x>10 selects which result becomes y]
class SimpleKernel extends Kernel {
  SimpleKernel() {
    DFEVar x = io.input("x", dfeInt(24));
    DFEVar result = (x > 10) ? x + 1 : x - 1;
    io.output("y", result, dfeInt(25));
  }
}
18
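In hardware, a ternary like the one in the kernel above typically becomes a multiplexer: both branches are computed in space every cycle and the comparison selects between them. A plain-Python model of that behavior (illustrative, not MaxJ):

```python
# Software model of "control in space": both branches compute every cycle
# (two always-active arithmetic units) and the comparison drives a
# multiplexer that picks one result per stream element.

def simple_kernel(stream):
    out = []
    for x in stream:
        inc = x + 1                         # adder, always active
        dec = x - 1                         # subtractor, always active
        out.append(inc if x > 10 else dec)  # multiplexer on x > 10
    return out

print(simple_kernel([5, 10, 11, 20]))  # → [4, 9, 12, 21]
```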
19. Non Traditional Design Process
[Design loop: ANALYSE → ARCHITECT → PROGRAM → GENERATE DATAFLOW → SIMULATE AND DEBUG → OK? → build Custom HW; each iteration takes many hours]
Used to build real systems; however, very difficult to learn/educate
19
20. Optimizations at all levels
Multiple scales of computing: important features for optimization
• complete system level: balance compute, storage and IO
• parallel node level: maximize utilization of compute and interconnect
• microarchitecture level: minimize data movement
• arithmetic level: trade off range, precision and accuracy (= discretize in Time, Space and Value)
• bit level: encode and add redundancy
• transistor level: manipulate ‘0’ and ‘1’
and more, e.g., trade/hide Communication (Time) for/behind Computation (Space)
20
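The arithmetic-level tradeoff (range vs. precision vs. accuracy) can be illustrated with a tiny fixed-point quantization sketch (plain Python, illustrative only):

```python
# Sketch of the arithmetic-level tradeoff: quantize values to a
# fixed-point format with frac_bits fractional bits. Fewer bits mean
# less hardware, but a larger rounding error (accuracy traded for area).

def quantize(x, frac_bits):
    scale = 1 << frac_bits
    return round(x * scale) / scale

value = 3.14159265
for bits in (4, 8, 16):
    q = quantize(value, bits)
    print(bits, q, abs(value - q))  # error shrinks as bits grow
```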
21. Some of the challenges
1. Higher chip / system price compared to microprocessors
2. Long design lead times (3 months in the best case)
a. Complex numerical transformations
b. Non-trivial area and data movement optimizations
3. “Painful” Place & Route times (12 to 24 hours)
a. Expensive vendor-specific tools
b. Serious developments call for dedicated build clusters
4. Need to compete at 200MHz with processors at 3GHz
5. Current HW technology is sub-optimal
• On-chip memory not built for stream processing
• On-chip interconnect overdesigned for Dataflow
6. Long learning curve (Tools and Methods needed)
7. Designer’s productivity should improve (Tools and Methods)
8. …
Ongoing effort on improved methodologies and tools
21
22. Multiple platforms, single DFE abstraction
[Same DFE block diagram as on slide 14]
Application and MaxJ
gen4 (Intel based) + gen5 (Xilinx based)
Performance Portable Migration
22
23. • MaxCompiler generates VHDL
ready for FPGA vendor tools
• Synthesis transforms VHDL into
logical “netlist” – sets of basic logic
expressions
• Map fits basic logic into N-input
look-up tables
• Place puts LUTs, DSPs, RAMs etc
at specific locations on chip
• Route sets up wiring between
blocks
Substrate Agnostic Compilation
[Tool flow: MaxCompiler compilation → VHDL → Synthesis → netlist → Map → LUTs → Place → placed FPGA → Route → complete FPGA → Generate Maxfile]
23
25. • Allows you to see which lines of code are using which resources, and to focus optimization
• Separate reports for each kernel and for the manager
DFE Resource Usage Reporting
LUTs FFs BRAMs DSPs : MyKernel.java
727 871 1.0 2 : resources used by this file
0.24% 0.15% 0.09% 0.10% : % of available
71.41% 61.82% 100.00% 100.00% : % of total used
94.29% 97.21% 100.00% 100.00% : % of user resources
:
: public class MyKernel extends Kernel {
: public MyKernel (KernelParameters parameters) {
: super(parameters);
1 31 0.0 0 : DFEVar p = io.input("p", dfeFloat(8,24));
2 9 0.0 0 : DFEVar q = io.input("q", dfeUInt(8));
: DFEVar offset = io.scalarInput("offset", dfeUInt(8));
8 8 0.0 0 : DFEVar addr = offset + q;
18 40 1.0 0 : DFEVar v = mem.romMapped("table", addr,
: dfeFloat(8,24), 256);
139 145 0.0 2 : p = p * p;
401 541 0.0 0 : p = p + v;
: io.output("r", p, dfeFloat(8,24));
: }
: }
Different operations use different resources (LUTs/FFs, Block RAMs, DSP blocks, IO blocks)
25
26. • MaxCompiler gives detailed latency and area annotation
back to the programmer
• Evaluate precise effect of code
on latency and chip area
Optimization Feedback
12.8ns + 6.4ns = 19.2ns (total compute latency)
26
27. Pilot System Deployed at Jülich
Small pilot system deployed in Oct 2017
• one 1U MPC-X with 8 MAX5 DFEs
• one 1U AMD EPYC based server
• one 1U login head node
Scaling using Amazon AWS cloud
• MAX5 fully compatible with F1 instances
• Elastic scaling between private and public
[System diagram: remote users connect to a head/build node (10 Gbps links, ipmi); the MPC-X node (MAX5 DFEs) and a Supermicro EPYC node (EPYC CPU, 1TB DDR4) are linked by 2x Infiniband @ 56 Gbps]
http://www.prace-ri.eu/pcp/
27
31. Global Weather Simulation with DFEs in China
⬥L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X.
Huang, Y. Zhang, and G. Yang, Accelerating
solvers for global atmospheric equations
through mixed-precision data flow engine,
published at FPL 2013
⬥Joint research with Imperial College and
Tsinghua University
⬥Simulating the atmosphere using the
shallow water equation
An order of magnitude improvement over the Linpack-driven supercomputer technology
Platform         Speedup   Efficiency
6-core CPU       1x        1x
Tianhe-1A node   23x       15x
Maxeler MPC-X    330x      145x
31
32.
• A (fancy) name does not help with solving the problem at hand
• Cloud, (Intelligent) Edge, Fog are just names like … Maria
• FPGA is just a technology that can help bridge the gap to something
better (Spatial Computing Acceleration HW, Quantum Processing, …)
• just focus on building the best computer for the given job
• Learn, think, pioneer and stay always critical
• abstraction is powerful but quite often not needed
• use it with great care and remember Dan Slotnick
• We are turning Earth into a heterogeneous, planet-wide computer
• so we should try to not kill it in the process
• There is a lot of interest in this topic
Conclusions
34. Some links with more information
Maxeler Multiscale Dataflow Computing:
https://www.maxeler.com/technology/dataflow-computing/
Computing in Space explained by Mike Flynn:
http://www.openspl.org/what-is-openspl/
Computing in Space Course at Imperial College:
http://cc.doc.ic.ac.uk/openspl16/
Exciting Applications for DFEs (and JDFEs):
http://appgallery.maxeler.com
Maxeler DFEs on AWS EC2 F1:
https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-c66ab2ba202a
Maxeler and Xilinx Alveo collaboration:
https://www.xilinx.com/products/boards-and-kits/alveo.html
34
35. Maxeler Applications Gallery
Dataflow Engine (DFE) Ecosystem
⬥ With over 150 universities in our university program, we
decided to create an app gallery to enable the community
to share applications, examples, demos, …
⬥ The App Gallery is complemented by a teaching program,
with the first successful course taught at Imperial College in
2014; see http://cc.doc.ic.ac.uk/openspl14
⬥ Top 10 APPS:
➢ Correlation: in real-time, pairwise, on 6,000 streams
➢ 100% Guaranteed Packet Capture
➢ Webserver, cache and load balancing
➢ HESTON Option pricer
➢ N-body simulation
➢ Regex matching (e.g. for Security)
➢ Brain network simulation
➢ Quantum Chromo-Dynamics kernel
➢ Seismic Imaging
➢ Realtime Classification
Dataflow Apps and Analytics for Machine Learning http://appgallery.maxeler.com/
35
36. Peer Reviewed Dataflow Publications
2008: Seismic Imaging with Dataflow Engines 25x faster, An Implementation of the Acoustic
Wave Equation, T. Nemeth et al, Chevron, Society of Exploration Geophysicists, Nov 2008.
2010: Credit Derivatives Valuation and Risk, from 8 hours to 2 minutes, American Finance
Technology Award, with JP Morgan.
2011: Modeling and Imaging with Schlumberger, Beyond Traditional Microprocessors for
Geoscience High-Performance Computing Applications, O. Lindtjörn et al, Schlumberger,
IEEE Micro, vol. 31, no. 2, March/April 2011.
2012: Weather Imaging with CRS4, 60x faster, Acceleration of a Meteorological Limited Area
Model with Dataflow Engines, Diego Oriato†, Simon Tilbury†, Marino Marrocu§, Gabrielle
Pusceddu§ (†Maxeler, §CRS4), 2012 Symposium on Application Accelerators in HPC.
2013: Convergence of Risk and Trading in partnership with CME Group and birth of OpenSPL
industry standard (www.openspl.org), In Cloud Computing it’s the Era of Convergence, Open
Markets Magazine, Ari Studnitzer, CME Group.
2014: Brain Simulation with Erasmus, Real-Time Olivary Neuron Simulations on Dataflow
Computing Machines, Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis,
Catalin Ciobanu, Oskar Mencer, and Chris I. De Zeeuw, Supercomputing; Springer, 487-497
2017: High Energy Physics with Imperial, Using MaxCompiler for the high level synthesis
of trigger algorithms, S. Summers, A. Rose and P. Sanders, Journal of Instrumentation,
Volume 12, IOP Publishing.
36
38. Maxeler Trophy Cabinet
Academic History since 2005
• Imperial College Research Excellence Award
• Top EPSRC Advanced Fellowship
• Two Best Paper Awards
• Early Dataflow paper by Maxeler’s Founder has been recognized as one of the most
influential papers at the FPL conference in the last 25 years.
Recent Commercial Awards
• HPCwire Editors Choice Award, November 2011.
• American Finance Technology Awards, New York, winner,
“Most Cutting Edge IT Initiative,” December 2011.
• Golden Arrow, “...for revolutionizing Computers, ”
COM-SULT, January 2012.
• Gartner “Cool Vendor of the Year,” March 2012.
• Frost and Sullivan “Most innovative IT vendor, ” Dec 2013.
• CIO Review, 20 Most Promising Networking Companies, March 2014.
• CIO Review, 20 Most Promising HPC Companies, March 2015.
38