A focus on the use of FPGAs by cloud service providers, including Microsoft's Azure Catapult project, Google's Tensor Processing Units, and Amazon EC2 F1 instances. Also includes background information on how to get started with FPGAs.
This document discusses how HPC infrastructure is being transformed with AI. It summarizes that cognitive systems use distributed deep learning across HPC clusters to speed up training times. It also outlines IBM's hardware portfolio expansion for AI training, inference, and storage capabilities. The document discusses software stacks for AI like Watson Machine Learning Community Edition that use containers and universal base images to simplify deployment.
OpenPOWER Webinar on Machine Learning for Academic Research Ganesan Narayanasamy
The document discusses machine learning and deep learning techniques. It provides examples of different machine learning algorithms like decision trees, linear regression, neural networks and deep learning models. It also discusses applications of machine learning in areas like computer vision, natural language processing and bioinformatics. Finally, it talks about technologies that can help democratize machine learning like distributed computing frameworks and open source libraries.
The document discusses using temporal shift modules (TSM) for efficient video recognition, where TSM enables temporal modeling in 2D CNNs with no additional computation cost; TSM models achieve better performance than 3D CNNs and previous methods while using less computation, and can be used for applications like online video understanding, low-latency deployment on edge devices, and large-scale distributed training on supercomputers.
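The shift operation at the heart of TSM can be sketched in a few lines of plain Python. This is a minimal illustration of the idea (a fraction of channels borrowed from neighboring frames, zero-padded at the boundaries, as in the offline TSM variant); the function name and the 1/4 shift fraction are illustrative choices, not the paper's exact configuration.

```python
def temporal_shift(frames, shift_frac=4):
    """Shift a fraction of channels along the time axis (TSM sketch).

    frames: list over time of channel lists, shape [T][C].
    The first C//shift_frac channels are taken from the next frame,
    the next C//shift_frac from the previous frame; the rest stay put.
    Vacated boundary positions are zero-padded.
    """
    t_len = len(frames)
    c_len = len(frames[0])
    fold = c_len // shift_frac
    out = [[0.0] * c_len for _ in range(t_len)]
    for t in range(t_len):
        for c in range(c_len):
            if c < fold:            # shift left in time: read the next frame
                src = t + 1
            elif c < 2 * fold:      # shift right in time: read the previous frame
                src = t - 1
            else:                   # unshifted channels
                src = t
            if 0 <= src < t_len:
                out[t][c] = frames[src][c]
    return out

# 3 frames, 4 channels; value 10*t + c encodes (frame, channel) for clarity
frames = [[10 * t + c for c in range(4)] for t in range(3)]
shifted = temporal_shift(frames)
```

Because the shift is pure data movement, it adds no multiply-accumulate work to the 2D convolutions around it, which is the source of TSM's "zero extra computation" claim.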
Snap ML is a machine learning framework for fast training of generalized linear models (GLMs) that can scale to large datasets. It uses multi-level parallelism across nodes and GPUs. Snap ML implementations include snap-ml-local for single nodes, snap-ml-mpi for multi-node HPC environments, and snap-ml-spark for Apache Spark clusters. Experimental results show Snap ML can train a logistic regression model on a 3TB Criteo dataset within 1.5 minutes using 16 GPUs.
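For context on what a GLM trainer like Snap ML computes, here is a toy logistic-regression loop in pure Python. This is only a sketch of the objective, not Snap ML's API or solver: Snap ML's contribution is performing this kind of training at scale across GPUs and nodes, while the update rule below is plain SGD on the log loss.

```python
import math
import random

def train_logreg(data, labels, lr=0.1, epochs=200, seed=0):
    """Toy logistic-regression trainer (the kind of GLM Snap ML accelerates).

    data: list of feature vectors; labels: list of 0/1 targets.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    b = 0.0
    idx = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            z = b + sum(wj * xj for wj, xj in zip(w, data[i]))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - labels[i]        # dLoss/dz for the log loss
            w = [wj - lr * g * xj for wj, xj in zip(w, data[i])]
            b -= lr * g
    return w, b

# Linearly separable toy data: label is 1 when x0 > x1
xs = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
ys = [0, 1, 0, 1]
w, b = train_logreg(xs, ys)
pred = [1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0 for x in xs]
```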
This document discusses IBM's involvement in artificial intelligence and deep learning. It includes:
- An introduction to IBM's Cognitive Systems team working in AI.
- A brief history of IBM's AI projects including Deep Blue, Blue Gene, and Watson.
- Explanations of concepts like machine learning, deep learning, and how they relate to high performance computing.
- Details of IBM's current hardware, software, and services for AI workloads including the Power9 processor, PowerAI tools, and storage solutions.
The document provides an overview of IBM's expertise and offerings in the field of artificial intelligence.
Transparent Hardware Acceleration for Deep Learning Indrajit Poddar
This document provides an overview of transparent hardware acceleration for deep learning using IBM's PowerAI platform. It discusses how PowerAI leverages POWER CPUs and NVIDIA GPUs connected via NVLink to dramatically accelerate deep learning model training and inference. Using this approach, IBM has achieved significant performance improvements over x86 platforms, including faster training times, support for larger models, and more efficient distributed training across multiple servers.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Bill Jenkins, Senior Product Specialist for High Level Design Tools at Intel, presents the "Accelerating Deep Learning Using Altera FPGAs" tutorial at the May 2016 Embedded Vision Summit.
While large strides have recently been made in the development of high-performance systems for neural networks based on multi-core technology, significant challenges in power, cost, and performance scaling remain. Field-programmable gate arrays (FPGAs) are a natural choice for implementing neural networks because they can combine computing, logic, and memory resources in a single device. Intel's Programmable Solutions Group has developed a scalable convolutional neural network reference design for deep learning systems using the OpenCL programming language, built with its SDK for OpenCL. The design's performance is being benchmarked using several popular CNN benchmarks: CIFAR-10, ImageNet and KITTI.
Building the CNN with OpenCL kernels allows true scaling of the design from smaller to larger devices and from one device generation to the next. New designs can be sized using different numbers of kernels at each layer. Performance scaling from one generation to the next also benefits from architectural advancements, such as floating-point engines and frequency scaling. Thus, you achieve greater than linear performance and performance per watt scaling with each new series of devices.
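The layer such an OpenCL kernel implements is a stack of multiply-accumulate reductions, one per output pixel. The sketch below shows that arithmetic in plain Python (a "valid" convolution with no padding, written as cross-correlation as most CNN frameworks do); because every output pixel is independent, the loop nest maps naturally onto parallel OpenCL work-items or a pipelined FPGA datapath.

```python
def conv2d_valid(image, kernel):
    """Direct 2-D convolution (cross-correlation, as in most CNNs).

    Each output pixel is an independent multiply-accumulate reduction,
    which is why this layer parallelizes well in hardware.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out

edge = [[1.0, -1.0]]                      # horizontal gradient filter
img = [[0.0, 0.0, 1.0, 1.0],
       [0.0, 0.0, 1.0, 1.0]]
grad = conv2d_valid(img, edge)
```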
Review state-of-the-art techniques that use neural networks to synthesize motion, such as mode-adaptive neural networks and phase-functioned neural networks. See how next-generation GPUs with reinforcement learning can offer better performance.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/xilinx/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nick Ni, Director of Product Marketing at Xilinx, presents the "Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability" tutorial at the May 2019 Embedded Vision Summit.
AI inference demands orders-of-magnitude more compute capacity than what today's SoCs offer. At the same time, neural network topologies are changing too quickly to be addressed by ASICs that take years to go from architecture to production. In this talk, Ni introduces the Xilinx AI Engine, which complements the dynamically-programmable FPGA fabric to enable ASIC-like performance via custom data flows and a flexible memory hierarchy. This combination provides an orders-of-magnitude boost in AI performance along with the hardware architecture flexibility needed to quickly adapt to rapidly evolving neural network topologies.
A short survey of the current state of field-programmable gate array (FPGA) usage in deep learning at companies such as Intel (Nervana), comparing FPGAs and Google's TPUs (tensor processing units) against GPUs in terms of energy consumption and performance.
The document discusses strategies for improving application performance on POWER9 processors using IBM XL and open source compilers. It reviews key POWER9 features and outlines common bottlenecks like branches, register spills, and memory issues. It provides guidelines on using compiler options and coding practices to address these bottlenecks, such as unrolling loops, inlining functions, and prefetching data. Tools like perf are also described for analyzing performance bottlenecks.
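The loop transformations named above (unrolling, inlining, prefetching) are applied by the compiler to native code, but the shape of loop unrolling can be sketched in any language. Below is a conceptual illustration: the unrolled version performs the same reduction with fewer loop tests per element and independent partial sums that a CPU can execute in parallel, plus the cleanup loop the transformation always requires. Function names here are illustrative, not from the slides.

```python
def dot(a, b):
    """Straightforward dot product: one multiply-accumulate per iteration."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_unrolled4(a, b):
    """The same reduction unrolled by 4 with independent partial sums,
    mirroring what compiler unrolling does: fewer branch tests per
    element and multiple accumulators, at the cost of a cleanup loop."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    tail = 0.0
    while i < n:                 # cleanup when n is not a multiple of 4
        tail += a[i] * b[i]
        i += 1
    return s0 + s1 + s2 + s3 + tail

v = [float(i) for i in range(10)]
```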
Axel Koehler from Nvidia presented this deck at the 2016 HPC Advisory Council Switzerland Conference.
“Accelerated computing is transforming the data center that delivers unprecedented throughput, enabling new discoveries and services for end users. This talk will give an overview about the NVIDIA Tesla accelerated computing platform including the latest developments in hardware and software. In addition it will be shown how deep learning on GPUs is changing how we use computers to understand data.”
In related news, the GPU Technology Conference takes place April 4-7 in Silicon Valley.
Watch the video presentation: http://insidehpc.com/2016/03/tesla-accelerated-computing/
See more talks in the Swiss Conference Video Gallery:
http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter:
http://insidehpc.com/newsletter
The document provides details about an OpenPOWER and AI workshop being held on June 18-19, 2018 at the Barcelona Supercomputing Center.
Day 1 will provide an introduction to AI and cover topics like Power9 and PowerAI features, large model support, and use case demonstrations. Day 2 will focus on deeper hands-on exercises and industry use cases built on Power9 features like distributed deep learning.
The agenda lists out the schedule and topics to be covered each day, including welcome sessions, technical presentations, breaks and wrap-up discussions.
TAU Performance System and the Extreme-scale Scientific Software Stack (E4S) aim to improve productivity for HPC and AI workloads. TAU provides a portable performance evaluation toolkit, while E4S delivers modular and interoperable software stacks. Together, they lower barriers to using software tools from the Exascale Computing Project and enable performance analysis of complex, multi-component applications.
This document discusses three key artificial intelligence capabilities of IBM's Power9 architecture:
1) Large Memory Support enables processing of high-definition images and large models that exceed GPU memory limits.
2) Distributed Deep Learning allows scaling to multiple servers for faster and more accurate training on large datasets.
3) PowerAI Vision provides tools for labeling data, training models for computer vision tasks, and deploying models for production use.
Everything is changing, from healthcare to the automotive and financial markets to every type of engineering: products are no longer created by an individual, or at best a single team, but developed and perfected using AI and hundreds of computers. Even AI is no longer something we can run on a single computer, no matter how powerful. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we will discuss AI, HPC, the IBM Power architecture, and how it can help deliver better healthcare, better automobiles, better financial services, and better everything that runs on them.
This document discusses using OpenMP 4.5 directives and CUDA to accelerate computational fluid dynamics (CFD) simulations on GPUs using OpenPOWER platforms. It describes porting an open-source CFD code called Code Saturne to leverage GPUs for tasks like linear algebra kernels and algebraic multigrid. It shows how OpenMP 4.5 data environments can be used to manage data movement between the host and device without modifying the code. Profiling results indicate that directive-based programming models can achieve speedups and improve programmer productivity when porting existing CPU codes to accelerate tasks on GPUs.
The document discusses IBM AI solutions on Power systems. It provides an overview of key features including OpenPOWER collaboration, IBM machine learning and deep learning solutions designed for faster results, and Power9 servers adopted by research institutions. It then discusses specific IBM Power systems like the IBM Power AC922 that are optimized for AI workloads through features like CPU-GPU NVLink and large model support in TensorFlow.
Distributed Deep Learning At Scale On Apache Spark With BigDL Yulia Tell
This document provides an agenda and details for a co-hosted meetup between Intel and Databricks on March 23, 2017 about BigDL. The agenda includes opening remarks, two tech talks (one from Intel and one from Databricks), and a mingling session. It also provides WiFi access details and background on Intel's Big Data Technologies group and BigDL. BigDL is an open-source distributed deep learning library for Apache Spark that allows users to run deep learning applications on Spark.
Gary Paek from Intel presented this deck at the HPC User Forum in Tucson.
Learn more: https://software.intel.com/en-us/tags/18892
and
http://hpcuserforum.com
Watch the video presentation: http://wp.me/p3RLHQ-fdt
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Amazon EC2 F1 is a new compute instance with programmable hardware for application acceleration. With F1, you can directly access custom FPGA hardware on the instance in a few clicks.
Learning Objectives:
• Learn about the capabilities, features, and benefits of the new F1 instances
• Develop your FPGA using the F1 Hardware Developer Kit and FPGA Developer AMI
• Deploy your FPGA acceleration code using F1 instances
• Use F1 instances for hardware acceleration in your applications
• Learn how to offer pre-packaged Amazon FPGA Machine Images (AFIs) to your customers through the AWS Marketplace
Deep Learning Accelerator Design Techniques Mindos Cheng
The document discusses various design techniques for deep learning accelerators (DLA). It covers topics such as convolution layers, fully-connected layers, CNN accelerators, filter decomposition, model compression through pruning and retraining, tensor cores, systolic arrays, burst fetching, analog computing, thermal management, memory bandwidth optimization, and zero-copy techniques.
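Of the techniques listed, model compression through pruning and retraining is easy to show concretely. The sketch below is a minimal magnitude-pruning step in plain Python (the function name and the 50% sparsity target are illustrative): rank weights by absolute value, zero the smallest fraction, and keep a mask so retraining can hold pruned connections at zero.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Returns the pruned weights and a 0/1 mask; during retraining the
    mask is reapplied after each update so pruned weights stay zero.
    """
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.1]
pruned, mask = prune_by_magnitude(w, 0.5)
```

Pruning buys an accelerator two things: zeroed weights can be skipped (less compute) and stored compressed (less memory bandwidth), which is why it appears alongside zero-skipping and bandwidth optimization in DLA design.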
IBM AI Solutions on Power Systems is a presentation about IBM's AI solutions. It introduces IBM Visual Insights for tasks like image classification, object detection, and segmentation. A use case demo shows breast cancer classification in under one second with high accuracy. Another demo detects diabetic retinopathy in eye images. The presentation discusses open issues in medical imaging AI and IBM's response to COVID-19, including an X-ray demo to detect COVID-19 in lung images. It calls for collaboration to share medical data and models.
The document discusses IBM's PowerAI software for large model support and distributed deep learning. It describes how PowerAI uses large model support (LMS) to enable processing of high-definition images, large models, and higher batch sizes that don't fit in GPU memory. It provides examples of using LMS with Caffe and TensorFlow. It also describes IBM's distributed deep learning library (DDL) for scaling deep learning training across multiple servers and GPUs, and how tools like ddlrun automatically handle tasks like topology detection and mpirun options.
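The core arithmetic of synchronous distributed deep learning is a gradient allreduce, which can be simulated in a few lines. This is only a sketch of the math, not DDL's API: DDL and ddlrun optimize how this reduction travels the network topology, but the result each replica receives is just the element-wise mean below.

```python
def allreduce_average(grads_per_worker):
    """Average gradients across workers (the core of synchronous
    data-parallel training).

    Each worker computes gradients on its own data shard; the allreduce
    replaces every worker's gradient with the element-wise mean so all
    model replicas take the identical update step.
    """
    n_workers = len(grads_per_worker)
    n_params = len(grads_per_worker[0])
    mean = [
        sum(g[p] for g in grads_per_worker) / n_workers
        for p in range(n_params)
    ]
    return [list(mean) for _ in range(n_workers)]  # one copy per replica

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # 3 workers, 2 parameters
reduced = allreduce_average(grads)
```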
End-to-End Big Data AI with Analytics Zoo Jason Dai
The document discusses Analytics Zoo, an open-source software platform for building end-to-end big data AI applications. It provides distributed deep learning frameworks like TensorFlow and PyTorch on Apache Spark. Analytics Zoo allows seamless scaling of AI models from laptop to distributed big data and includes features like automated machine learning, time series forecasting, and serving models in production. It aims to simplify development of end-to-end big data AI solutions.
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...) Jason Dai
This document summarizes a CVPR 2020 tutorial on the Analytics Zoo platform for automated machine learning workflows for distributed big data using Apache Spark. The tutorial covers an overview of Analytics Zoo and the BigDL distributed deep learning framework. It demonstrates distributed training of deep learning models using TensorFlow and PyTorch on Spark, and features of Analytics Zoo like end-to-end pipelines, ML workflow for automation, and model deployment with cluster serving. Real-world use cases applying Analytics Zoo at companies like SK Telecom, Midea, and MasterCard are also presented.
FPGAs for Supercomputing: The Why and How Desmond Yuen
Excellent presentation by Hal Finkel (hfinkel@anl.gov), Kazutomo Yoshii, and Franck Cappello as to why FPGAs are a competitive HPC accelerator technology.
This document discusses three options for implementing digital designs: microcontrollers, ASICs, and FPGAs. It provides details on the differences between FPGAs and microcontrollers, and between FPGAs and ASICs. FPGAs offer reconfigurable hardware, faster speeds than microcontrollers due to parallel processing, and more flexible I/O. However, ASICs are best for high volume manufacturing due to lower costs. The document also provides information on the internal architecture of FPGAs, including configurable logic blocks, look up tables, programmable interconnects, and I/O blocks.
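The lookup tables mentioned above are what make FPGA logic reconfigurable, and the mechanism is simple enough to model directly. The sketch below (function names are illustrative) builds an n-input LUT the way synthesis does: any boolean function of n inputs is just 2**n stored bits indexed by the input pattern, so "reprogramming the hardware" means rewriting that table.

```python
def make_lut(truth_fn, n_inputs):
    """Build a lookup table for an n-input boolean function.

    This models an FPGA logic element: the function is evaluated once
    per input pattern and the results are stored; evaluation afterward
    is a pure table read, regardless of how complex the function was.
    """
    table = []
    for pattern in range(2 ** n_inputs):
        bits = [(pattern >> i) & 1 for i in range(n_inputs)]
        table.append(truth_fn(*bits))
    return table

def lut_eval(table, *bits):
    """Read the table entry selected by the input bits."""
    index = sum(b << i for i, b in enumerate(bits))
    return table[index]

# A 3-input majority gate mapped to an 8-entry LUT
maj_lut = make_lut(lambda a, b, c: int(a + b + c >= 2), 3)
```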
This document provides an introduction to electronic design automation (EDA) tools and discusses different types of programmable logic devices including field programmable gate arrays (FPGAs) and complex programmable logic devices (CPLDs). It describes the basic architecture of FPGAs including logic blocks, interconnects, and input/output blocks. The advantages of FPGAs such as shorter development time and flexibility are also summarized.
This document provides an introduction to FPGA design fundamentals including:
- Programmable logic devices like PLDs, CPLDs, and FPGAs which allow for reconfigurable logic circuits.
- The basic architecture of FPGAs including configurable logic blocks (CLBs), input/output blocks (IOBs), and a programmable interconnect structure.
- Verilog and VHDL as common hardware description languages used for FPGA design entry and simulation.
- A simple example of designing a half-adder circuit in VHDL, including entity, architecture, and behavioral modeling style.
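The half-adder mentioned in that summary reduces to two gates: sum is the XOR of the inputs and carry is their AND. A behavioral sketch in Python (a model of the circuit, not the VHDL itself):

```python
def half_adder(a, b):
    """Behavioral model of a half-adder: returns (sum, carry)."""
    return a ^ b, a & b

# Exhaustive check of the truth table, as a testbench would do
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        assert s == (a + b) % 2 and c == (a + b) // 2
```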
Small introduction to FPGA acceleration and the impact of the new High Level Synthesis toolchains to their programmability
Video here: https://www.linkedin.com/posts/marcobarbone_can-my-application-benefit-from-fpga-acceleration-activity-6848674747375460352-0fua
An FPGA (field-programmable gate array) is an integrated circuit designed to be configured by a customer after manufacturing. FPGAs contain programmable logic blocks and a hierarchy of reconfigurable interconnects that allow the blocks to be wired together in different configurations. This flexibility allows FPGAs to implement any logical function that an ASIC could perform, with advantages including the ability to reprogram functionality after shipping and lower engineering costs than an ASIC. Common applications of FPGAs include digital signal processing, software-defined radio, medical imaging, and more.
The document is a seminar report on FPGA technology in outer space applications. It discusses the history and evolution of FPGA technology over time, including increasing gate densities and falling prices. It describes typical FPGA architecture which includes configurable logic blocks, interconnects, and I/O pads. Modern FPGAs integrate additional resources like memory blocks, DSP slices, and soft processor cores. The document highlights applications of FPGAs in aerospace, including COTS boards and development kits. It also outlines future potential for FPGAs in more complex roles in space systems.
This document discusses Field Programmable Gate Arrays (FPGAs), including their history, components, applications, and advantages. FPGAs allow logic functions to be programmed in the field after manufacturing and consist of configurable logic blocks, input/output blocks, and a routing matrix. They are used widely in embedded systems, consumer electronics, communications, and more due to their flexibility, short development times, and ability to be updated in the field. FPGAs provide advantages over traditional ICs like long-term availability, field updates/upgrades, extremely short time to market, and massively parallel processing capabilities.
The document discusses using heterogeneous computing with GPUs, FPGAs, and ARM processors to solve complex problems. Specifically, it proposes combining NVIDIA's Tegra K1 system-on-module with Altera FPGAs using OpenCL and CUDA to create powerful and scalable systems from small edge nodes to large HPC clusters. It provides examples of potential applications and challenges in integrating the different components.
2.FPGA for dummies: modern FPGA architectureMaurizio Donna
The document discusses the architecture and components of field programmable gate arrays (FPGAs). It describes the basic building blocks of FPGAs, including look-up tables (LUTs), flip-flops (FFs), wires, and input/output pads. It notes that modern FPGAs also include additional elements like embedded memory, phase-locked loops, high-speed transceivers, memory controllers, multiply-accumulate blocks, and embedded processors. The document provides details on these individual components, such as the use of block RAM for memory, PLLs and DLLs for clocking, transceivers for high-speed communication, and DSP blocks for arithmetic functions.
Coral is a framework that allows the distributed acceleration of large data sets across clusters of FPGA resources using simple programming models. It is designed to scale up from single devices to multiple FPGAs, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of FPGAs, each of which may be prone to failures.
Coral abstracts FPGA resources (device, memory), enabling fault-tolerant heterogeneous distributed systems to easily be built and run effectively.
It allows:
- instant scalability to multiple FPGAs
- seamless virtualization of the FPGA cluster
The document describes an IBM workshop on CAPI and OpenCAPI technologies. It provides an overview of FPGA acceleration using SNAP, including how SNAP simplifies FPGA programming using a C/C++ based approach. Examples of use cases for FPGA acceleration like video processing and machine learning inference are also presented.
Hardware for deep learning includes CPUs, GPUs, FPGAs, and ASICs. CPUs are general purpose but support deep learning through instructions like AVX-512 and libraries. GPUs like NVIDIA and AMD models are commonly used due to high parallelism and memory bandwidth. FPGAs offer high efficiency but require specialized programming. ASICs like Google's TPU are customized for deep learning and provide high performance but limited flexibility. Emerging hardware aims to improve efficiency and better match neural network computations.
FROM FPGA TO ASIC IMPLEMENTATION OF AN OPENRISC BASED SOC FOR VOIP APPLICATION - ieijjournal
ASIC (Application Specific Integrated Circuit) design verification takes as long as the designers take to describe, synthesize, and implement the design. A hybrid approach, where the design is first prototyped on an FPGA (Field-Programmable Gate Array) platform for functional validation and then implemented as an ASIC, allows earlier defect detection in the design process and thus significant time savings. This paper presents a CMOS standard-cell ASIC implementation of a SoC (System on Chip) based on the OpenRISC processor for a Voice over IP (VoIP) application, adopting this hybrid approach. The architecture of the design is based mainly on the reuse of IP cores described at the RTL level. This RTL code is technology-independent, so the design can be ported easily from FPGA to ASIC. Results show that the SoC occupies an area of 2.64 mm²; for power consumption, an RTL power estimate is given.
The document discusses the architecture of CPLDs and FPGAs. It begins by explaining the problems with using basic logic gates on PCBs and introduces programmable logic devices as a solution. It then describes different types of PLDs including PLA, PAL, GAL, CPLD and FPGA. CPLDs have a complexity between FPGAs and basic PLDs, containing non-volatile memory and supporting larger logic than PLDs. FPGAs contain logic cells, interconnects, and can implement thousands of gates. The document provides examples of implementing logic with different PLDs and describes the architecture and programming of CPLDs and FPGAs.
FPGAs were introduced in 1984 as a programmable alternative to PLDs. They fill the gap between discrete logic and smaller PLDs on the low end and more expensive ASICs on the high end. The basic elements of an FPGA are configurable logic blocks (CLBs), configurable I/O blocks (IOBs), and a programmable interconnect. FPGAs from vendors like Xilinx and Altera have a regular architecture of CLBs surrounded by IOBs and connected via a hierarchy of programmable interconnects.
Similar to A Primer on FPGAs - Field Programmable Gate Arrays (20)
5. 5 Cloud Saturday Atlanta
About Me
Project Manager
Systems Engineer
3 yr. residence at a multi-national pharmaceutical co.
Virtualization & storage consulting
NetApp A-Team
Renewed passion for embedded systems (IoT)
Passion for the BigData ecosystem
Chief Architect – DevOpsy kinda role
Hobbies: maker, high-power model rocketry, building & flying drones
Contact Info:
http://www.waflhouse.com
@triggan on Twitter
github.com/triggan
6.
What is an FPGA? Basic architecture
FPGAs and their use by cloud service providers
Use cases and application
Where do you get started / tool chain
Agenda
7.
Field Programmable Gate Array
"Field Programmable" – the architecture can be changed after deployment (mostly).
"Gate Array" –
Gate – short for a transistor logic gate. The most common form is a NAND gate; chained together, gates create Transistor-Transistor Logic (TTL) circuits.
Array – a whole lot of them.
What is an FPGA?
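The slide's point about chaining NAND gates goes further than TTL: NAND is functionally complete, so NOT, AND, and OR can all be derived from it. A quick illustrative Python sketch (not how an FPGA is actually built, just the logic):

```python
def nand(a, b):
    """A single NAND gate over 0/1 inputs."""
    return 0 if (a and b) else 1

def not_(a):        # NOT from one NAND with its inputs tied together
    return nand(a, a)

def and_(a, b):     # AND = NOT(NAND)
    return not_(nand(a, b))

def or_(a, b):      # OR via De Morgan: NAND of the inverted inputs
    return nand(not_(a), not_(b))

# Sanity-check the derived gates against Python's bitwise operators
for a in (0, 1):
    for b in (0, 1):
        assert and_(a, b) == (a & b)
        assert or_(a, b) == (a | b)
        assert not_(a) == (1 - a)
```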
8.
An integrated circuit – a “chip”
Programmable logic – not literally a bunch of logic gates; it is composed of look-up tables and a few other components.
What is an FPGA?
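Those look-up tables are just small memories addressed by the input bits: an N-input LUT stores 2^N output bits and can therefore implement any N-input Boolean function. A minimal Python sketch of the idea (illustrative only, not any vendor's bitstream format):

```python
def make_lut(truth_bits):
    """Build an N-input LUT from its truth table.
    truth_bits[i] is the output for the input pattern whose
    binary encoding is i (inputs packed LSB-first)."""
    def lut(*inputs):
        index = 0
        for bit_pos, value in enumerate(inputs):
            index |= (value & 1) << bit_pos
        return truth_bits[index]
    return lut

# A 2-input LUT "programmed" as XOR: outputs for inputs 00, 01, 10, 11
xor_lut = make_lut([0, 1, 1, 0])
assert [xor_lut(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```

Reprogramming the FPGA amounts to loading different truth bits into the same physical structure, which is why the hardware can take on arbitrary logic functions.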
9.
95% of the market is made up of these 3 vendors; many other niche players.
Xilinx is the market leader (50%).
Intel/Altera (39%, but gaining ground with the Intel acquisition; purchased for $16.7B in 2015).
Current Market / Manufacturers
10.
ASICs – Application Specific Integrated Circuits – custom-designed and printed silicon.
Sometimes susceptible to low yield rates.
No one wants to do VLSI (Very Large Scale Integration) chip layout.
An ASIC without the mess…
15.
Cloud Adoption – Use Cases
According to Intel executive vice president Diane Bryant, a third of all
servers used by major cloud providers will utilize FPGAs by 2020.
16.
Started 2010.
Originally targeted at accelerating Bing search queries; now used for a number of different use cases.
FPGAs deployed in nearly every production server within Bing and Azure – known as the "Configurable Cloud".
Microsoft is also partnering with Baidu in China to bring FPGAs to their datacenters.
Microsoft Project Catapult
17.
FPGA inserted on the PCIe bus with "loopback" connectivity to the Network Interface Card; all network traffic flows through the FPGA.
FPGAs from different servers are connected into a "torus" of 48 FPGAs.
FPGA-to-FPGA communication enables future functionality in the deep learning and AI space: neural networks.
Microsoft Project Catapult
18.
Tensor Processing Units
Went a different route – developed their own ASICs.
Claimed FPGAs were too power hungry.
Also connected via a PCIe interface, but not directly network connected.
Tailored specifically for machine learning (TensorFlow).
Architecture still a closely guarded secret within Google.
Google TPUs
19.
Built for end-user consumption.
In preview since Dec. 2016.
Development tools available via the AWS Marketplace.
Built on Xilinx UltraScale+ FPGAs.
F1.16xlarge – similar architecture to Catapult; both PCIe and bi-directional link connectivity between FPGAs.
2M logic cells per FPGA!!
Amazon EC2 F1 Instances
20.
AWS pre-configured FPGA shell – makes I/O access to the FPGA much easier.
Pre-built AFIs (similar to existing AMI images) – offer the ability to create a library of FPGA configurations. They also offer another security layer – the bit-stream is encrypted.
Amazon EC2 F1 Instances
21.
Xilinx tools (Vivado and SDAccel) are built into the EC2 F1 AMI.
Amazon EC2 F1 Instances
22.
Ability to republish AFIs (FPGA images) back to the AWS Marketplace.
Can also create solutions that leverage FPGAs but do not expose that functionality to the end user.
Amazon EC2 F1 Instances
23.
Traditional analytics – MapReduce paper on using FPGAs.
Genomics research – highly parallel.
Encryption/cryptography offload – in-line network encryption/decryption.
Network analysis – deep packet inspection.
Machine/deep learning – using networks of FPGAs in unison to mimic neural networks.
Financial analysis – market analysis; requires near real-time computation.
Video manipulation – high throughput. (Go watch "Implementation of MITM Attack on HDCP-Secured Links".)
Cloud-Based Use Cases
27.
Include the pin and signal mapping files:
Xilinx – (UCF) User Constraints File.
Altera – (SDC) Synopsys Design Constraints.
Code Packages
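For a sense of what those two formats look like, here are illustrative fragments (the pin names, signal names, and clock period below are made up, not taken from any real board):

```
# Xilinx UCF: map logical signals to physical pins, set I/O standards
NET "clk"    LOC = "P56" | IOSTANDARD = LVCMOS33;
NET "led<0>" LOC = "P34";

# Altera SDC (Synopsys Design Constraints): Tcl-style timing commands
create_clock -name clk -period 20.0 [get_ports clk]
set_input_delay -clock clk 2.0 [get_ports btn*]
```

The key difference: UCF is a Xilinx-specific declarative format, while SDC is an industry-standard Tcl dialect shared with ASIC tools.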
28.
Be cautious of order – remember this is hardware, not software.
Compile time – you'll gain a new appreciation for the speed of a C++ or Java compiler; map & route can take 5 minutes or 5 hours depending on the complexity of the device.
Have a good understanding of signaling – pull-up or pull-down inputs and outputs; tri-state I/O; high-impedance I/O.
Know basic microcontroller programming – most dev tools use a uC to bootstrap the FPGA.
Gotchas to Watch Out For…
29.
Provides GPU programmers easy access to FPGA programming.
Now available from the major manufacturers – both Xilinx and Altera have OpenCL SDKs.
Works OK, but can be cumbersome with smaller devices (not as efficient).
Good for rapid prototyping and for comparing performance between a GPU and an FPGA.
Using OpenCL for FPGAs
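OpenCL expresses work as a grid of work-items, each running the same kernel body on a different global id; a GPU runs them in parallel, while an FPGA toolchain compiles the kernel into a hardware pipeline. A rough Python sketch of that execution model (toy stand-ins, not the real OpenCL API):

```python
def vector_add_kernel(gid, a, b, out):
    """What one OpenCL work-item does: read its get_global_id(0)
    and operate on that element."""
    out[gid] = a[gid] + b[gid]

def enqueue_nd_range(kernel, global_size, *args):
    """Toy stand-in for clEnqueueNDRangeKernel: run every work-item.
    The serial loop here is where the parallel/pipelined hardware goes."""
    for gid in range(global_size):
        kernel(gid, *args)

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * 4
enqueue_nd_range(vector_add_kernel, len(a), a, b, out)
assert out == [11, 22, 33, 44]
```

Because the kernel is written per-element with no loop-carried dependence, the compiler (GPU or FPGA) is free to choose how much of the grid to execute concurrently.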
31.
Intel has a good 4-hour video tutorial that leverages the DE0-Nano platform.
"Learning FPGAs" – a book to be released later this spring; tutorials use the Mojo v3 platform.
Altera released an "FPGAs for Dummies" book 2-3 years ago; it is available for free online.
Other Resources
34.
See website for more info and directions
https://cloudsaturdayatlanta.com/
Continue the conversation…
Editor's Notes
Made up of Configuration Logic Blocks (CLBs) along with programmable routing between CLBs. CLB shown in right side image.
Components – CLBs, configurable routing, tons of I/O blocks (lots of these chips are BGA (ball grid array) – machine-soldered only), PLLs – increase clock speed from 20 MHz up to near 1 GHz.
CPUs and Microcontrollers – good at doing things that don't repeat very often. Built for sequential operations, not necessarily parallel ops. CPUs are built for large, complex operating systems – handling many different processing threads.
DSP – not really built for general use case – very specific.
GPUs – great at tough math problems (Bitcoin mining), but difficult to use otherwise.
Combinations of these can be very powerful…
FPGA PCIe cards are relatively inexpensive – starting around $200-300. About the cost of a decent graphics card. I/O interfaces vary. Most form factors are built for Network Connectivity (lots of use cases in the form of network/packet manipulation).
Novena laptop – Bunnie Huang (first to crack Xbox encryption). Bunnie is THE hardware geek; he has done talks on reverse-engineering the microcontrollers in SD cards. He also used an FPGA for HD video manipulation (more on that in a bit).
Heterogeneous systems are really what we're seeing more of today. Cell phones and mobile devices do this very well – a combination of microprocessors and microcontrollers. CPUs handle a lot of the control functions and operating system overhead while FPGAs do application-specific functions. Best of both worlds.
Most hyperscalers are using FPGAs in this form factor – integrating them into x86 platforms. Not custom built appliances.
Adoption of FPGAs by hyperscalers is picking up at a break neck pace. Moore’s law is on life support, if not already dead. We have to find other means to accelerate applications.
Spartan-6 that I have has just shy of 10,000 logic cells.
Typically accessing an FPGA via PCIe has been a bit cumbersome. Sending data requires DMA access and knowing how best to do that programmatically.
Two schools of thought – use a high-level language: great, but could be costly in terms of resources. You really should try using an HDL (so you know what is happening on the chip).
Called an HDL – originally created to document ASICs (hence "description language"). Later it was decided to use the language to automate the creation of the hardware itself.
Easter egg – can you tell what the major difference is between both counters?
Maybe you're not sure whether a GPU or an FPGA would be a good fit for a particular application. Building it in OpenCL is a good start to figure out which might work best.