Hardware for deep learning spans CPUs, GPUs, FPGAs, and ASICs. CPUs are general-purpose but support deep learning through instructions such as AVX-512 and optimized libraries. GPUs (from NVIDIA and AMD) are the most common choice thanks to their high parallelism and memory bandwidth. FPGAs offer high energy efficiency but require specialized programming. ASICs such as Google's TPU are customized for deep learning and deliver high performance at the cost of flexibility. Emerging hardware aims to improve efficiency and better match neural network computations.
This webinar by Dov Nimratz (Senior Solution Architect, Consultant, GlobalLogic) was delivered at Embedded Community Webinar #1 on July 7, 2020.
Webinar agenda:
- CPU / GPU / TPU architectures
- Historical context
- CPUs and their variations
- GPU, or a genie in a bottle for artificial intelligence tasks
- TPU architecture: a specialized artificial intelligence accelerator
- What's next in technology
More details and presentation: https://www.globallogic.com/ua/about/events/embedded-community-webinar-1/
Build a Deep Learning Video Analytics Framework | SIGGRAPH 2019 Technical Ses... - Intel® Software
Explore how to build a unified framework based on FFmpeg and GStreamer to enable video analytics on all Intel® hardware, including CPUs, GPUs, VPUs, FPGAs, and in-circuit emulators.
The need for intelligent, personalized experiences powered by AI is ever-growing. Our devices are producing more and more data that could help improve our AI experiences. How do we learn and efficiently process all this data from edge devices while maintaining privacy? On-device learning rather than cloud training can address these challenges. In this presentation, we’ll discuss:
- Why on-device learning is crucial for providing intelligent, personalized experiences without sacrificing privacy
- Our latest research in on-device learning, including few-shot learning, continuous learning, and federated learning
- How we are solving system and feasibility challenges to move from research to commercialization
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/xilinx/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nick Ni, Director of Product Marketing at Xilinx, presents the "Xilinx AI Engine: High Performance with Future-proof Architecture Adaptability" tutorial at the May 2019 Embedded Vision Summit.
AI inference demands orders-of-magnitude more compute capacity than what today's SoCs offer. At the same time, neural network topologies are changing too quickly to be addressed by ASICs that take years to go from architecture to production. In this talk, Ni introduces the Xilinx AI Engine, which complements the dynamically-programmable FPGA fabric to enable ASIC-like performance via custom data flows and a flexible memory hierarchy. This combination provides an orders-of-magnitude boost in AI performance along with the hardware architecture flexibility needed to quickly adapt to rapidly evolving neural network topologies.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/12/vitis-and-vitis-ai-application-acceleration-from-cloud-to-edge-a-presentation-from-xilinx/
For more information about edge AI and computer vision, please visit:
https://www.edge-ai-vision.com
Vinod Kathail, Fellow and Chief Architect at Xilinx, presents the “Vitis and Vitis AI: Application Acceleration from Cloud to Edge” tutorial at the September 2020 Embedded Vision Summit.
Xilinx SoCs and FPGAs provide significant advantages in throughput, latency, and energy efficiency for production deployments of compute-intensive applications when compared to CPUs and GPUs. Over the last decade, FPGAs have evolved into highly configurable devices that provide on-chip heterogeneous multi-core CPUs, domain-specific programmable accelerators and “any-to-any” interface connectivity.
Today, the Xilinx Vitis Unified Software Platform supports high-level programming in C, C++, OpenCL, and Python, enabling developers to build and seamlessly deploy applications on Xilinx platforms including Alveo cards, FPGA instances in the cloud, and embedded devices. Moreover, Vitis enables the acceleration of large-scale data processing and machine learning applications using familiar high-level frameworks such as TensorFlow and Spark. This presentation provides an overview of the Vitis software platform and the accelerated Vitis Vision Library, which enables customizable functions such as image signal processing, adaptable AI inference, 3D reconstruction, and motion analysis.
This is a talk at AI Nextcon Seattle on Feb 12, 2020.
An overview of TensorFlow Lite and various resources for helping you deploy TFLite models to mobile and edge devices. We walk through an example of end-to-end on-device ML: train a model from scratch, convert it to TFLite, and deploy it.
AI firsts: Leading from research to proof-of-concept - Qualcomm Research
AI has made tremendous progress over the past decade, with many advancements coming from fundamental research from many decades ago. Accelerating the pipeline from research to commercialization has been daunting since scaling technologies in the real world faces many challenges beyond the theoretical work done in the lab. Qualcomm AI Research has taken on the task of not only generating novel AI research but also being first to demonstrate proof-of-concepts on commercial devices, enabling technology to scale in the real world. This presentation covers:
- The challenges of deploying cutting-edge research on real-world mobile devices
- How Qualcomm AI Research is solving system and feasibility challenges with full-stack optimizations to quickly move from research to commercialization
- Examples where Qualcomm AI Research has had industrial or academic firsts
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which may lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how GPUs work, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
It was roughly 30 years ago that AI was not only a topic for science-fiction writers but also a major research field surrounded by huge hopes and investments. But the over-inflated expectations ended in a crash, followed by a period of absent funding and interest: the so-called AI winter. However, the last 3 years changed everything, again. Deep learning, a machine learning technique inspired by the human brain, successfully crushed one benchmark after another, and tech companies like Google, Facebook, and Microsoft started to invest billions in AI research. "The pace of progress in artificial general intelligence is incredibly fast" (Elon Musk, CEO of Tesla & SpaceX), leading to an AI that "would be either the best or the worst thing ever to happen to humanity" (Stephen Hawking, physicist).
What sparked this new hype? How is deep learning different from previous approaches? Are the advancing AI technologies really a threat to humanity? Let's look behind the curtain and unravel the reality. This talk will explore why Sundar Pichai (CEO of Google) recently announced that "machine learning is a core transformative way by which Google is rethinking everything they are doing" and explain why "deep learning is probably one of the most exciting things that is happening in the computer industry" (Jen-Hsun Huang, CEO of NVIDIA).
Either a new AI "winter is coming" (Ned Stark, House Stark) or this new wave of innovation might turn out to be the "last invention humans ever need to make" (Nick Bostrom, AI philosopher). Or maybe it's just another great technology helping humans achieve more.
The field of artificial intelligence (AI) has witnessed tremendous growth in recent years with the advent of Deep Neural Networks (DNNs) that surpass humans in a variety of cognitive tasks.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/altera/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Bill Jenkins, Senior Product Specialist for High Level Design Tools at Intel, presents the "Accelerating Deep Learning Using Altera FPGAs" tutorial at the May 2016 Embedded Vision Summit.
While large strides have recently been made in the development of high-performance systems for neural networks based on multi-core technology, significant challenges in power, cost, and performance scaling remain. Field-programmable gate arrays (FPGAs) are a natural choice for implementing neural networks because they can combine computing, logic, and memory resources in a single device. Intel's Programmable Solutions Group has developed a scalable convolutional neural network reference design for deep learning systems using the OpenCL programming language, built with our SDK for OpenCL. The design's performance is being benchmarked using several popular CNN benchmarks: CIFAR-10, ImageNet, and KITTI.
Building the CNN with OpenCL kernels allows true scaling of the design from smaller to larger devices and from one device generation to the next. New designs can be sized using different numbers of kernels at each layer. Performance scaling from one generation to the next also benefits from architectural advancements, such as floating-point engines and frequency scaling. Thus, you achieve greater than linear performance and performance per watt scaling with each new series of devices.
This is a presentation I gave as a short overview of LSTMs. The slides are accompanied by two examples which apply LSTMs to Time Series data. Examples were implemented using Keras. See links in slide pack.
FPGAs for Supercomputing: The Why and How - DESMOND YUEN
Excellent presentation by Hal Finkel (hfinkel@anl.gov), Kazutomo Yoshii, and Franck Cappello as to why FPGAs are a competitive HPC accelerator technology.
Implementing AI: High Performance Architectures: Large scale HPC hardware in ... - KTN
The Implementing AI: High Performance Architectures webinar, hosted by KTN and eFutures, was the fourth event in the Implementing AI webinar series.
The focus of the webinar was the impact of processing AI data on data centres, particularly from the technology perspective. Prof. Simon McIntosh-Smith, Professor in High Performance Computing, University of Bristol, covered "Large scale HPC hardware in the age of AI".
Session ID: SFO17-509
Session Name: Deep Learning on ARM Platforms
Speaker: Jammy Zhou
★ Session Summary ★
A new era of deep learning is arriving, driven by evolving algorithms, powerful computing platforms, and the availability of large datasets. This session focuses on existing and potential heterogeneous accelerator solutions (GPU, FPGA, DSP, etc.) for ARM platforms and the work ahead from a platform perspective.
---------------------------------------------------
★ Resources ★
Event Page: http://connect.linaro.org/resource/sfo17/sfo17-509/
---------------------------------------------------
★ Event Details ★
Linaro Connect San Francisco 2017 (SFO17)
25-29 September 2017
Hyatt Regency San Francisco Airport
In this deck from the UK HPC Conference, Gunter Roeth from NVIDIA presents: Hardware & Software Platforms for HPC, AI and ML.
"Data is driving the transformation of industries around the world and a new generation of AI applications are effectively becoming programs that write software, powered by data, vs by computer programmers. Today, NVIDIA’s tensor core GPU sits at the core of most AI, ML and HPC applications, and NVIDIA software surrounds every level of such a modern application, from CUDA and libraries like cuDNN and NCCL embedded in every deep learning framework and optimized and delivered via the NVIDIA GPU Cloud to reference architectures designed to streamline the deployment of large scale infrastructures."
Watch the video: https://wp.me/p3RLHQ-l2Y
Learn more: http://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/uk-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Argonne Training Program on Extreme-Scale Computing 2019, Howard Pritchard from LANL and Simon Hammond from Sandia present: NNSA Explorations: ARM for Supercomputing.
"The Arm-based Astra system at Sandia will be used by the National Nuclear Security Administration (NNSA) to run advanced modeling and simulation workloads for addressing areas such as national security, energy and science.
"By introducing Arm processors with the HPE Apollo 70, a purpose-built HPC architecture, we are bringing powerful elements, like optimal memory performance and greater density, to supercomputers that existing technologies in the market cannot match,” said Mike Vildibill, vice president, Advanced Technologies Group, HPE. “Sandia National Laboratories has been an active partner in leveraging our Arm-based platform since its early design, and featuring it in the deployment of the world’s largest Arm-based supercomputer, is a strategic investment for the DOE and the industry as a whole as we race toward achieving exascale computing.”
Watch the video: https://wp.me/p3RLHQ-l29
Learn more: https://insidehpc.com/2018/06/arm-goes-big-hpe-builds-petaflop-supercomputer-sandia/
and
https://extremecomputingtraining.anl.gov/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Building and operating an HPC-based AI computing environment inside the Gwangju Institute of Science and Technology.
To use any part of these slides, please cite "Narantuya Jargalsaikhan, GIST AI-X Computing Cluster, 2021".
Thank you!
In this session, I explain what the Hortonworks and IBM Power solutions are and how they can deliver significant business value through the prompt adoption of open innovation for future cognitive workloads. In addition, I introduce the unique added value that the IBM and Hortonworks partnership provides from the viewpoint of storage, analytics, data science, and streaming analysis.
The Implementing AI: High Performance Architectures webinar, hosted by KTN and eFutures, was the fourth event in the Implementing AI summer webinar series.
Every business is increasing the use of artificial intelligence to gain efficiency and to make better decisions. These new demands for data processing are not well delivered by traditional computer architectures. Enterprises, developers, data scientists, and researchers need new platforms that unify all AI workloads, simplifying infrastructure and accelerating ROI. This has led to the development of high performance and specialised hardware devices to meet these new demands.
The focus of this webinar was the impact of processing AI data on data centres - particularly from the technology perspective. The webinar had four presentations from experts, covering the opportunities, implementation techniques and Case Studies, followed by a panel Q&A session.
The objective of this talk is to explain how to develop on small machines that boot fast, are fanless, etc. It's convenient and often more realistic than working in virtual machines, especially when developing drivers.
Willy will list all the items necessary to set up an optimal environment and be as efficient as possible, among them the importance of investing time to speed up the boot, etc.
He will also speak a bit about crosstool-ng, which is used in this environment.
Backend.AI Technical Introduction (19.09 / 2019 Autumn) - Lablup Inc.
This slide introduces technical specs and details about Backend.AI 19.09.
* On-premise clustering / container orchestration / scaling on cloud
* Container-level fractional GPU technology that lets one physical GPU serve as many GPUs across many containers at the same time.
* NVIDIA GPU Cloud integrations
* Enterprise features
A deeper talk on the Transformer architecture, from the webinar at NTR
https://www.ntr.ai/webinar/transformery
Google slides version: https://docs.google.com/presentation/d/1dIadh_nIszxXG8-672vJmvFGT6jBp0mOqzNV4g3e2Lc/edit?usp=sharing
A talk on Transformers at GDG DevParty
27.06.2020
Link to Google Slides version: https://docs.google.com/presentation/d/1N7ayCRqgsFO7TqSjN4OWW-dMOQPT5DZcHXsZvw8-6FU/edit?usp=sharing
Discussing two papers on applying deep learning to biology:
BERTology Meets Biology: Interpreting Attention in Protein Language Models
https://arxiv.org/abs/2006.15222
Evaluating Protein Transfer Learning with TAPE
https://arxiv.org/abs/1906.08230
Modern neural net architectures - Year 2019 version - Grigory Sapunov
Slides from a talk at the UseData 2019 conference. Describes what happened in the NN architecture space in the last two years, with a focus on production-ready things. Other interesting but more research-related topics (like graph networks) are not covered here.
The most significant (not purely scientific) results in AI in the last year (2018-2019).
Disclaimer: may be very subjective :)
Slides from a set of lectures given in Feb-Apr 2019.
This one was delivered at Atlas Biomed Group on 2019-04-26.
A Practical Approach to Choosing a Domain-Adaptive NMT - Grigory Sapunov
Choosing the right machine translation (MT) engine for a specific project is extremely difficult: engine quality varies widely across domains and language pairs and changes constantly as models are updated. With domain-adaptive MT the situation is even more complicated: requirements on the training corpus come into play, and the total cost of ownership becomes less transparent.
In this talk we examine a hybrid approach to evaluating the quality of cloud-based domain-adaptive neural machine translation (NMT) services, in which automatic evaluation is compared against expert evaluation using a single shared metric. We also discuss the details of both approaches, share the evaluation results, and offer thoughts on how the two methods compare.
We answer the questions of how good domain-adaptive NMT is, what the cost of owning such a solution is, how long it takes to train, how much data it requires, how secure it is, and how to use it.
Introduction to Neural Network Architectures / HighLoad++ 2016 - Grigory Sapunov
Slides from the HighLoad++ 2016 conference.
An introduction to neural network architectures (in Russian).
http://www.highload.ru/2016/abstracts/2454.html
Video recording of the talk:
https://www.youtube.com/watch?v=XY5AczPW7V4
2. Executive Summary :)
Most hardware is focused on DL, which requires a lot of computation:
● There's much more diversity in CPUs now, not only x86.
● GPUs (mostly NVIDIA) are the most popular choice. Intel and AMD may offer interesting alternatives this year.
● Some ASIC alternatives are already available: Google TPU (in cloud only), Graphcore, Huawei Ascend.
● More ASICs are coming into this field: Cerebras, Habana, etc.
● Some companies are using FPGAs and making them available in the cloud (Microsoft, AWS).
● Edge AI is everywhere already! More to come!
● Neuromorphic computing is on the rise (IBM TrueNorth, Tianjic, memristors, etc.)
● Quantum computing can potentially benefit machine learning as well (but it probably won't be a desktop or in-house server solution)
4. The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design
https://arxiv.org/abs/1911.05289
5. Typically multi-core even on the desktop market:
● usually from 2 to 10 cores in modern Core i3-i9 Intel CPUs
● up to 18 cores/36 threads in high-end Intel CPUs (i9-7980XE/9980XE/10980XE) [https://en.wikipedia.org/wiki/List_of_Intel_Core_i9_microprocessors]
● up to 64 cores/128 threads in AMD Ryzen Threadripper (Ryzen Threadripper 3990X, Ryzen Threadripper Pro 3995WX)
x86: Desktops
6. On the server market:
● Intel Xeon: up to 56 cores/112 threads (Xeon Platinum 9282 processor)
● AMD EPYC: up to 64 cores/128 threads (EPYC 7702/7742)
● usually having more cores than desktop processors plus some other useful capabilities (support for more RAM, multi-processor configurations, ECC, etc.)
x86: Servers
7. AVX-512: Fused Multiply-Add (FMA) core instructions enabling lower-precision operations. List of CPUs with AVX-512 support: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
VNNI (Vector Neural Network Instructions): multiply-and-add for integers, etc., designed to accelerate convolutional neural network-based algorithms. https://en.wikichip.org/wiki/x86/avx512vnni
DL Boost: AVX512-VNNI + the brain floating-point format (bfloat16), designed for inference acceleration. https://en.wikichip.org/wiki/brain_floating-point_format
x86: ML instructions (SIMD)
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
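A quick way to check whether a given x86 machine exposes these instruction sets is to inspect the CPU flags the kernel reports. A minimal sketch, assuming a Linux host (the flag names follow /proc/cpuinfo conventions):

```python
# Minimal sketch: check /proc/cpuinfo for AVX-512 / DL Boost support (Linux only).
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AVX-512F:   ", "avx512f" in flags)      # AVX-512 foundation
print("AVX512-VNNI:", "avx512_vnni" in flags)  # integer multiply-add (DL Boost)
print("AVX512-BF16:", "avx512_bf16" in flags)  # bfloat16 extension (DL Boost)
```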
8. ● BigDL: a distributed deep learning library for Apache Spark https://github.com/intel-analytics/BigDL
● Deep Neural Network Library (DNNL): an open-source performance library for deep learning applications. Layer primitives, etc. https://intel.github.io/mkl-dnn/
● PlaidML: an advanced and portable tensor compiler for enabling deep learning on laptops, embedded devices, and other devices. Supports Keras, ONNX, and nGraph. https://github.com/plaidml/plaidml
● OpenVINO Toolkit: for computer vision https://docs.openvinotoolkit.org/
x86: Optimized ML Libraries
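As an illustration of how one of these libraries plugs into an existing workflow, PlaidML documents a Keras backend that is installed at import time. A minimal sketch, assuming plaidml and a standalone keras are installed:

```python
# Minimal sketch: route Keras computations through the PlaidML backend.
import plaidml.keras
plaidml.keras.install_backend()  # must run before importing keras itself

import keras
from keras.layers import Dense

model = keras.models.Sequential([Dense(10, activation="softmax", input_shape=(784,))])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
model.summary()  # runs on whatever device PlaidML selected (e.g., a laptop GPU)
```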
9. Some CPU-optimized DL libraries:
● Caffe Con Troll (research project, latest commit in 2016) https://github.com/HazyResearch/CaffeConTroll
● Intel Caffe (optimized for Xeon): https://github.com/intel/caffe
● Intel DL Boost can be used in many popular frameworks: TensorFlow, PyTorch, MXNet, PaddlePaddle, Intel Caffe https://www.intel.ai/increasing-ai-performance-intel-dlboost/
x86: Optimized ML Libraries
10. ● nGraph: an open-source C++ library, compiler, and runtime for deep learning. Frameworks using the nGraph Compiler stack to execute workloads have shown up to a 45X performance boost compared to native framework implementations. https://www.ngraph.ai/
Graph compilers
12. ● #Cores
● PCIe bandwidth
● PCIe generation (gen3, gen4)
● PCIe lanes (x16, x8, etc) at the processor/chipset side
● Memory type (DDR4, DDR3, etc)
● Memory speed (2133, 2666, 3200, etc)
● Memory channels (1, 2, 4, …)
● Memory size
● Memory speed/bandwidth
● ECC support
● Power usage (Watts)
● Price
● ...
Important dimensions
https://blog.inten.to/cpu-hardware-for-deep-learning-b91f53cb18af
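Several of these dimensions combine into one derived number that often dominates CPU-side DL performance: theoretical peak memory bandwidth. A small worked example (the configurations are illustrative, not taken from the slides):

```python
# Illustrative: peak memory bandwidth = channels * transfer rate * bus width.
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """DDR4 has a 64-bit (8-byte) bus per channel; rate is in megatransfers/s."""
    return channels * mt_per_s * bus_bytes / 1000.0  # MB/s -> GB/s

print(peak_bandwidth_gb_s(2, 3200))  # 51.2 GB/s, a typical dual-channel desktop
print(peak_bandwidth_gb_s(8, 3200))  # 204.8 GB/s, an 8-channel server part
```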
13. ● Single-board computers: Raspberry Pi, and as part of Jetson Nano and Google Coral Dev Board.
● Mobile: Qualcomm, Apple A11, etc.
● Server: Marvell ThunderX, Ampere eMAG, Amazon A1 instances, etc.; NVIDIA announced GPU-accelerated Arm-based servers.
● Laptops: Apple M1, Microsoft Surface Pro X
● ARM also has the Ethos NPU for ML/AI and the Mali GPU
ARM
14. ● ARM announces Neoverse N1 platform (scales up to 128 cores)
https://www.networkworld.com/article/3342998/arm-introduces-neoverse-high-performance-cpus-for-servers-5g.html
● Qualcomm manufactured an ARM server processor for cloud applications called Centriq 2400 (48 single-thread cores, 2.2GHz). The project was stopped.
https://www.tomshardware.com/news/qualcomm-server-chip-exit-china-centriq-2400,38223.html
● Ampere Altra is the first 80-core ARM-based server processor
https://venturebeat.com/2020/03/03/ampere-altra-is-the-first-80-core-arm-based-server-processor/
● Ampere announces 128-core Arm server processor
https://www.networkworld.com/article/3564514/ampere-announces-128-core-arm-server-processor.html
● Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz)
https://amperecomputing.com/product/, https://en.wikichip.org/wiki/ampere_computing/emag
● Marvell ThunderX ARM Processors (up to 48 cores, up to 2.5 GHz)
https://www.marvell.com/server-processors/thunderx-arm-processors/
● Amazon Graviton ARM processor (16 cores, 2.3GHz)
https://en.wikichip.org/wiki/annapurna_labs/alpine/al73400
https://aws.amazon.com/blogs/aws/new-ec2-instances-a1-powered-by-arm-based-aws-graviton-processors/
● Huawei Kunpeng 920 ARM Server CPU (64 cores, 2.6 GHz)
https://www.huawei.com/en/press-events/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu
ARM: Servers
16. Current architecture is POWER9:
● 12 cores x 8 threads or 24 cores x 4 threads (96 threads).
● PCIe v.4, 48 PCIe lanes
● Nvidia NVLink 2.0: the industry’s only CPU-to-GPU Nvidia NVLink connection
● CAPI 2.0, OpenCAPI 3.0 (for heterogeneous computing with FPGA/ASIC)
IBM POWER
17.
18. An open-source hardware instruction set architecture.
Examples:
● SiFive U5, U7 and U8 cores
https://www.anandtech.com/show/15036/sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip
● Alibaba's RISC-V processor Xuantie 910 with a vector engine for AI acceleration: 12nm, 64-bit, 16 cores clocked at up to 2.5GHz, the fastest RISC-V processor to date
https://www.theregister.co.uk/2019/07/27/alibaba_risc_v_chip/
● Western Digital SweRV Core, designed for embedded devices supporting data-intensive edge applications.
https://www.westerndigital.com/company/innovations/risc-v
● Manticore: a 4096-core RISC-V chiplet architecture for ultra-efficient floating-point computing https://arxiv.org/abs/2008.06502
● Esperanto Technologies is building an AI chip with 1k+ cores
https://www.esperanto.ai/technology/
RISC-V
22. ● Peak performance (GFLOPS) at FP32/16/...
● #Cores (+Tensor Cores)
● Memory size
● Memory speed/bandwidth
● Precision support
● Can connect using NVLink?
● Power usage (Watts)
● Price
● GFLOPS/USD
● GFLOPS/Watt
● Form factor (for desktop or server?)
● ECC memory
● Legal restrictions (e.g. GeForce is not allowed to be used in datacenters)
Important dimensions
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
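Two of the derived metrics above are simple ratios, which makes shortlisting cards easy to script. A tiny sketch with placeholder spec numbers (not vendor figures):

```python
# Illustrative: rank GPUs by GFLOPS/USD and GFLOPS/Watt (placeholder specs).
cards = {
    "card_a": {"fp32_gflops": 14_000, "price_usd": 700,  "tdp_watts": 220},
    "card_b": {"fp32_gflops": 19_500, "price_usd": 1500, "tdp_watts": 300},
}

for name, c in cards.items():
    print(name,
          f"GFLOPS/USD={c['fp32_gflops'] / c['price_usd']:.1f}",
          f"GFLOPS/W={c['fp32_gflops'] / c['tdp_watts']:.1f}")
```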
23.
24. ● FP64 (64-bit float), not used for DL
● FP32 — the most commonly used for training
● FP16 or Mixed precision (FP32+FP16) — becoming the new default
● INT8 — usually for inference
● INT4, INT1 — experimental modes for inference
Precision
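As an example of what FP32+FP16 mixed precision looks like in practice, PyTorch ships automatic mixed precision (AMP). A minimal training-step sketch, assuming a CUDA device is available (the model and data are toy placeholders):

```python
# Minimal sketch: one mixed-precision (FP32+FP16) training step with PyTorch AMP.
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid FP16 underflow

x = torch.randn(32, 512, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

with torch.cuda.amp.autocast():        # ops run in FP16 where safe, FP32 otherwise
    loss = torch.nn.functional.cross_entropy(model(x), y)

scaler.scale(loss).backward()          # backward pass on the scaled loss
scaler.step(optimizer)                 # unscales gradients, then steps
scaler.update()
```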
25. bfloat16 is now supported on Ampere GPUs and TPU gen 3, and will be supported on AMD GPUs and Intel CPUs.
Precision: bfloat16
https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
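To see what the format trades away, one can round-trip values through bfloat16; a small sketch using TensorFlow's bfloat16 dtype:

```python
# Small sketch: bfloat16 keeps FP32's exponent range but only ~8 bits of mantissa.
import tensorflow as tf

x = tf.constant([1.0, 1.001, 1e30, 3.14159265], dtype=tf.float32)
roundtrip = tf.cast(tf.cast(x, tf.bfloat16), tf.float32)
print(roundtrip.numpy())
# 1.001 rounds away (coarse mantissa), while 1e30 survives (wide exponent),
# which is exactly the trade-off that makes bfloat16 attractive for DL.
```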
27. Not only FLOPS: Roofline Performance Model
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
28. Roofline Performance Model: Example
https://blog.inten.to/hardware-for-deep-learning-part-3-gpu-8906c1644664
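The roofline model itself reduces to one line of arithmetic: attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. A tiny illustrative calculation (the hardware numbers are placeholders):

```python
# Illustrative roofline: attainable = min(peak_compute, intensity * bandwidth).
PEAK_GFLOPS = 15_000    # placeholder peak FP32 throughput
BANDWIDTH_GB_S = 900    # placeholder memory bandwidth

def attainable(intensity_flops_per_byte: float) -> float:
    return min(PEAK_GFLOPS, intensity_flops_per_byte * BANDWIDTH_GB_S)

for ai in (0.5, 2, 8, 32, 128):  # FLOPs performed per byte moved
    print(f"intensity={ai:>5}: {attainable(ai):,.0f} GFLOP/s")
# Low-intensity ops (elementwise, small batches) are memory-bound;
# large matrix multiplications are compute-bound.
```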
29. Separate cards can be joined using NVLink; SLI is not relevant for DL, it's for graphics.
NVSwitch: the fully connected NVLink.
NCCL 1: a library of multi-GPU collective communication primitives.
NVIDIA: Single-machine Multi-GPU
30. Distributed training is now a commodity (but scaling is sublinear).
NCCL 2: a library of multi-node collective communication primitives.
NVIDIA: Distributed Multi-GPU
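For instance, PyTorch's DistributedDataParallel uses NCCL as its GPU communication backend. A minimal per-process sketch, assuming RANK/WORLD_SIZE/LOCAL_RANK are set by a launcher such as torchrun or torch.distributed.launch:

```python
# Minimal sketch: data-parallel training over NCCL with PyTorch DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")          # reads rank/world size from the env
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 10).cuda()
ddp_model = DDP(model, device_ids=[local_rank])  # gradients all-reduced via NCCL

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
x = torch.randn(32, 512, device="cuda")
ddp_model(x).sum().backward()                    # NCCL all-reduce happens here
optimizer.step()
dist.destroy_process_group()
```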
31. Intel offered the following performance numbers, given as peak GFLOPS of FP32 math using the OpenCL-based CLPeak benchmark.
GPU: Intel Xe
https://www.anandtech.com/show/16018/intel-xe-hp-graphics-early-samples-offer-42-tflops-of-fp32-performance
32. Peak Performance:
● 46.1 TFLOPs Single Precision Matrix (FP32)
● 23.1 TFLOPs Single Precision (FP32)
● 184.6 TFLOPs Half Precision (FP16)
● 11.5 TFLOPs Double Precision (FP64)
● 92.3 TFLOPs bfloat16
● 184.6 TOPs INT8 and INT4
32 GB HBM2, Up to 1228.8 GB/s
300W
Announced support for TensorFlow, PyTorch, etc!
AMD Instinct MI100
https://www.amd.com/en/products/server-accelerators/instinct-mi100
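On the software side, ROCm builds of PyTorch reuse the CUDA-style device API, so existing framework code mostly runs unchanged. A hedged sketch, assuming a ROCm-enabled PyTorch install on an AMD GPU:

```python
# Sketch: on ROCm builds of PyTorch, AMD GPUs appear through the torch.cuda API.
import torch

print(torch.cuda.is_available())       # True on a working ROCm install
print(torch.cuda.get_device_name(0))   # reports the AMD device
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).shape)                   # matmul runs on the AMD GPU
```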
35. Problems
Serious problems with the current processors (CPU/GPU) are:
● Energy efficiency:
○ The version of AlphaGo that played against Lee Sedol used 1,920 CPUs and 280 GPUs (https://en.wikipedia.org/wiki/AlphaGo)
○ Its estimated power consumption was approximately 1 MW (200 W per CPU and 200 W per GPU), compared to only 20 watts used by the human brain (https://jacquesmattheij.com/another-way-of-looking-at-lee-sedol-vs-alphago/)
● Architecture:
○ good for matrix multiplication (still the essence of DL)
○ but not well-suited for brain-like computations
37. FPGA
● An FPGA (field-programmable gate array) is an integrated circuit designed to be configured by a customer or a designer after manufacturing.
● Both FPGAs and ASICs (see later) are usually much more energy-efficient than general-purpose processors (so more productive in terms of GFLOPS per Watt). FPGAs are usually used for inference, not training.
● OpenCL can be the development language for FPGAs (C/C++ can be as well), and some ML/DL libraries use OpenCL too (for example, Caffe). So an easy way to do low-level ML on FPGAs may appear.
● For high-level ML there are vendor tools and graph compilers (inference only).
● You can use FPGAs in the cloud!
● See also MLIR (mentioned earlier).
● The learning curve for FPGAs is still too steep :(
38. FPGA in production
There is some interesting movement to FPGA:
● Amazon has FPGA F1 instances https://aws.amazon.com/ec2/instance-types/f1/
● Alibaba has FPGA F3 instances in the cloud https://www.alibabacloud.com/blog/deep-dive-into-alibaba-cloud-f3-fpga-as-a-service-instances_594057
● Yandex uses FPGAs for its own DL inference.
● Microsoft ran (starting in 2015) Project Catapult, which uses clusters of FPGAs
https://blogs.msdn.microsoft.com/msr_er/2015/11/12/project-catapult-servers-available-to-academic-researchers/
https://www.microsoft.com/en-us/research/project/project-catapult/
● Microsoft Project Brainwave: AI inference on FPGA
https://www.microsoft.com/en-us/research/project/project-brainwave/
● Microsoft Azure allows deploying pretrained models on FPGA (!).
https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-fpga-web-service
● Baidu has FPGA instances https://cloud.baidu.com/product/fpga.html
● ...
39. Two main manufacturers: Intel (formerly Altera) and Xilinx.
The "world's largest" FPGA chips:
● Intel Stratix 10 GX 10M: >10.2 million logic cells, 43.3B transistors
https://www.techpowerup.com/260906/intel-unveils-worlds-largest-fpga
● Xilinx Virtex UltraScale+ VU19P: 9M system logic cells, 35B transistors
https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-vu19p.html
Intel has a hybrid Xeon+FPGA chip https://www.top500.org/news/intel-ships-xeon-skylake-processor-with-integrated-fpga/
Intel has FPGA acceleration cards
https://www.intel.com/content/www/us/en/programmable/solutions/acceleration-hub/platforms.html
FPGA chips
40. Adaptive compute acceleration platform (ACAP)
Xilinx Versal ACAP is a fully software-programmable, heterogeneous compute platform that combines Scalar Engines, Adaptable Engines, and Intelligent Engines.
The Intelligent Engines are an array of VLIW and SIMD processing engines and memories, all interconnected with hundreds of terabits per second of interconnect and memory bandwidth. These permit 5X-10X performance improvement for ML and DSP applications.
https://www.xilinx.com/products/silicon-devices/acap/versal-premium.html
https://www.xilinx.com/products/silicon-devices/acap/versal-ai-core.html
https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf
41. FPGA: Xilinx Vitis AI
Vitis AI is Xilinx's development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
It consists of optimized IP, tools, libraries, models, and example designs.
Xilinx ML Suite is now deprecated.
https://github.com/Xilinx/Vitis-AI
42. FPGA: Intel OpenVINO toolkit
The OpenVINO (Open Visual Inference and Neural Network Optimization) toolkit offers software developers a single toolkit to accelerate their solutions across multiple hardware platforms, including FPGAs.
https://software.intel.com/content/www/us/en/develop/tools/openvino-toolkit.html
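The retargeting happens by changing a device name string. A minimal sketch using the 2020-era Inference Engine Python API (the model paths are placeholders):

```python
# Minimal sketch: load an OpenVINO IR model and run inference on one device;
# swapping "CPU" for another plugin retargets the same code (e.g., to an FPGA).
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder paths
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
dummy = np.zeros(net.input_info[input_name].input_data.shape, dtype=np.float32)
result = exec_net.infer({input_name: dummy})
print({name: out.shape for name, out in result.items()})
```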
44. ASIC custom chips
An ASIC (application-specific integrated circuit) is an integrated circuit customized for a particular use, rather than intended for general-purpose use.
There is a lot of movement to ASIC right now:
● Google has Tensor Processing Units (TPU v2/v3) in the cloud; v4 exists too.
● Intel acquired Habana, Mobileye, Movidius, and Nervana, and has processors for training and inference.
● Graphcore has its second-generation IPU.
● AWS has its own chips for training and inference.
● Alibaba Hanguang 800
● Huawei Ascend 310, 910
● Bitmain Sophon, Cerebras, Groq, and many many others...
Many ASICs are built for multi-chip and supercomputer configurations!
https://blog.inten.to/hardware-for-deep-learning-part-4-asic-96a542fe6a81
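As an example of how such an accelerator is consumed from a framework, Cloud TPUs are addressed in TensorFlow 2.x through a distribution strategy. A minimal sketch (the TPU name is a placeholder):

```python
# Minimal sketch: target a Cloud TPU from TensorFlow via TPUStrategy.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # placeholder
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)   # experimental.TPUStrategy pre-TF 2.3

with strategy.scope():  # variables and the training step are placed on TPU cores
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(784,))])
    model.compile(optimizer="sgd", loss="mse")
```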
47. A “TPU v3 pod”: 100+ petaflops, 32 TB HBM, 2-D toroidal mesh network
Many ASICs are built for multi-chip configurations
48. ASIC: Intel (Nervana) NNP-T [discontinued]
A processor for training. PODs can be built from it (say, a 10-rack POD with 480 NNP-Ts)
● 24 Tensor Processing Cluster (TPC)
● PCIe Gen 4 x16 accelerator card, 300W
● OCP Accelerator Module, 375W
● 119 TOPS bfloat16
● 32 GB HBM2
https://www.intel.ai/nervana-nnp/nnpt/
https://en.wikichip.org/wiki/nervana/microarchitectures/spring_crest
49. ASIC: Intel (Nervana) NNP-I [discontinued]
Processor for inference using mixed-precision math, with a special emphasis on low-precision computations using INT8.
● 12 inference compute engines (ICE) + 2 Intel architecture cores (AVX+VNNI)
● M.2 form factor (1 chip): 12W, up to 50 TOPS.
● PCIe card (2 chips): 75W, up to 170 TOPS.
https://www.intel.ai/nervana-nnp/nnpi
https://en.wikichip.org/wiki/intel/microarchitectures/spring_hill
51. ASIC: Graphcore IPU
Graphcore IPU: for both training and inference.
Allows new and emerging machine intelligence
workloads to be realized.
Colossus MK2 GC200 IPU:
● 59.4B transistors, 1472 independent processor cores running 8832 independent
parallel program threads
● 250 TFLOPS mixed precision
● 900MB in-processor mem, 47.5TB/s memory bandwidth
● 8TB/s on-chip exchange between cores, 320GB/s chip-to-chip bandwidth
● IPU-M2000 systems with 4xIPU (1 PFLOPS FP16)
IPU on Azure
https://www.graphcore.ai/posts/microsoft-and-graphcore-collaborate-to-accelerate-artificial-intelligence
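On the software side, Graphcore exposes the IPU to PyTorch through its PopTorch wrapper; a minimal sketch (model and shapes are illustrative only):

```python
import torch
import poptorch

model = torch.nn.Sequential(torch.nn.Linear(128, 10),
                            torch.nn.Softmax(dim=1))

# Compile the model for the IPU; Options() can also request multiple IPUs.
opts = poptorch.Options()
ipu_model = poptorch.inferenceModel(model, opts)

out = ipu_model(torch.randn(4, 128))  # executes on the IPU
```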
52. ASIC: Cerebras
● Cerebras Systems Wafer Scale Engine (WSE), an AI chip that measures
8.46x8.46 inches, making it almost the size of an iPad.
● The WSE has 1.2 trillion transistors. For comparison, NVIDIA’s A100 GPU
contains 54 billion transistors, 22x fewer!
● 400,000 computing cores and 18 gigabytes of memory with 9 PB/s memory
bandwidth.
● Cerebras CS-1 is a system built on the WSE.
● The 2nd-gen WSE is announced: 850,000 AI-optimized cores, 2.6 trillion
transistors.
https://cerebras.net/
53. ASIC: AWS Inferentia Chips
AWS Inferentia chips are designed to accelerate the inference.
● 64 TOPS on 16-bit floating point (FP16 and BF16) and mixed-precision data.
● 128 TOPS on 8-bit integer (INT8) data.
● Up to 16 chips in the largest instance (inf1.24xlarge)
https://aws.amazon.com/machine-learning/inferentia/
https://github.com/aws/aws-neuron-sdk
https://aws.amazon.com/blogs/aws/amazon-ec2-update-inf1-instances-with-aws-inferentia-chips-for-high-performance-cost-effective-inferencing/
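Models reach Inferentia through the Neuron SDK's ahead-of-time compiler; a sketch of the PyTorch flow (torchvision's ResNet-50 is just an example model, and file names are placeholders):

```python
import torch
import torch_neuron  # registers the torch.neuron namespace
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example = torch.zeros(1, 3, 224, 224)

# Ahead-of-time compile the graph for NeuronCores, then save the artifact.
neuron_model = torch.neuron.trace(model, example_inputs=[example])
neuron_model.save("resnet50_neuron.pt")
# On an Inf1 instance the artifact is loaded back with torch.jit.load.
```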
54. ASIC: AWS Trainium
(December 1st, 2020) Amazon announced its AWS Trainium chip; it will become available in 2021.
https://aws.amazon.com/machine-learning/trainium/
55. ASIC: Huawei Ascend 310
● 22 TOPS INT8
● 11 TFLOPS FP16
● 8W of power consumption.
Atlas 300I Inference Card:
● 32 GB LPDDR4X with a bandwidth of 204.8 GB/s
● PCIe x16 Gen3.0 device, max 67 W
● A single card provides up to 88 TOPS INT8
56. ASIC: Huawei Ascend 910
● 32 built-in Da Vinci AI Cores and 16 TaiShan Cores
● 320 TFLOPS (FP16), 640 TOPS (INT8). It’s pretty
close to NVIDIA’s A100 BF16 peak performance of
312 TFLOPS
Atlas 300T Training Card
● 32 GB HBM or 16GB DDR4 2933
● PCIe x16 Gen4.0
● up to 300W power consumption
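The usual software route to Ascend NPUs is Huawei's MindSpore framework; a minimal sketch, assuming a machine with an Ascend device attached (layer sizes are arbitrary):

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor, context

# Select the Ascend NPU as the execution target ("CPU"/"GPU" also work).
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

net = nn.Dense(128, 10)
out = net(Tensor(np.zeros((4, 128), dtype=np.float32)))
```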
57. ASIC: Bitmain Sophon
Tensor Computing Processors:
● BM1680 (1st gen, 2 TFLOPS FP32, 32MB SRAM, 25W)
● BM1682 (2nd gen, 3 TFLOPS FP32, 16MB SRAM)
● BM1684 (3rd gen, 2.2TFLOPS FP32, 17.6 TOPS INT8, 32 MB SRAM)
● BM1880 (1 TOPS INT8).
There are Deep Learning Acceleration PCIe Cards:
● SC3 with a BM1682 chip (8 GB DDR memory, 65W)
● SC5 and SC5H with a BM1684 chip and 12 GB RAM
(up to 16 GB) with 30W max power consumption
● SC5+ with 3x BM1684 and 36 GB memory (up to 48 GB)
with 75W max power consumption.
https://www.sophon.ai/product/introduce/bm1684.html
58. ASIC: Alibaba Hanguang 800
An AI inference chip; its performance is independent of batch size.
59. ASIC: Baidu Kunlun
● 14nm chip
● 16GB HBM memory with
512 GB/s bandwidth
● up to 260 TOPS INT8
(that’s twice the INT8 performance of NVIDIA TESLA T4)
● 64 TFLOPS INT16/FP16 at 150W.
● This chip looks like an inference chip. The processor can be
accessed via Baidu Cloud.
September 2020, Baidu announced Kunlun 2. The new chip uses 7
nm process technology and its computational capability is over three
times that of the previous generation. The mass production of the
chip is expected to begin in early 2021.
60. ASIC: Groq TSP
Groq develops its own Tensor Streaming Processor (TSP).
Groq’s CEO, Jonathan Ross, previously co-founded
Google’s first TPU project.
● 14nm chip, 26.8B transistors
● 220MB SRAM with 80TB/s on-die memory bandwidth
● no board memory
● PCIe Gen4 x16 with 31.5 GB/s in each direction.
● up to 1000 TOPS INT8 and 250 TFLOPS FP16 (with FP32 acc).
For comparison, NVIDIA A100 has 312 TFLOPS on dense FP16
calculations with FP32 acc, and 624 TOPS INT8. It’s even larger
than the 825 TOPS INT8 of Alibaba’s Hanguang 800.
61. ASIC: Others
● Qualcomm Cloud AI 100 (inference)
https://www.qualcomm.com/products/cloud-artificial-intelligence/cloud-ai-100
● Wave Computing Dataflow Processing Unit (DPU)
https://wavecomp.ai/products/
● ARM ML inference NPU Ethos-N78
https://www.arm.com/products/silicon-ip-cpu/machine-learning/arm-ml-processor
● SambaNova came out of stealth-mode in December 2020 with their
Reconfigurable Dataflow Architecture (RDA) delivering “100s of TFLOPS”.
https://sambanova.ai/
● Mythic focuses on Compute-in-Memory, Dataflow Architecture, and Analog
Computing https://www.mythic-ai.com/technology/
● Intel eASIC: an intermediary technology between FPGAs and standard-cell
ASICs with lower unit-cost and faster time-to-market
https://www.intel.com/content/www/us/en/products/programmable/asic/easic-devices.html
● ...
62. ASIC: Summary
● Very diverse field!
● Hard to directly compare different solutions based on their raw characteristics
(the architectures can be too different).
● You can use a common benchmark like https://mlperf.org/
● DL framework support is usually limited; some solutions use their own
frameworks/libraries.
64. AI at the edge
● NVIDIA Jetson TK1/TX1/TX2/Xavier/Nano
○ 192/256/256/512/128 CUDA cores
○ 4/4/6/8/4-core ARM CPU, 2/4/8/16/4 GB memory
● Tablets, Smartphones
○ Qualcomm Snapdragon 845/855, Apple A11/12/Bionic, Huawei Kirin 970/980/990 etc
● Raspberry Pi 4 (1.5 GHz 4-core, 4 GB mem)
● Movidius Neural Compute Stick, Stick 2
● Google Edge TPU
65. Mobile AI: Qualcomm Snapdragon 888
(Nov 25, 2020) “Our brand-new 6th gen Qualcomm AI Engine includes the
Qualcomm® Hexagon™ 780 Processor with a fused AI-accelerator architecture,
plus the Tensor Accelerator with 2 times the compute capacity. This Qualcomm AI
Engine astonishes with up to 26 TOPS performance.”
https://www.qualcomm.com/products/snapdragon-888-5g-mobile-platform
66. Mobile AI: Huawei Kirin 970, 980, 990 (NPU)
“HUAWEI’s self-developed Da Vinci architecture NPU delivers better power efficiency,
stronger processing capabilities and higher accuracy. The powerful Big-Core plus ultra-low
consumption Tiny-Core contribute to an enormous boost in AI performance. In AI face
recognition, the efficiency of NPU Tiny-Core can be enhanced up to 24x than the Big-Core.
With 2 Big-Core plus 1 Tiny-Core, the NPU of Kirin 990 5G is ready to unlock the magic
of the future.”
https://consumer.huawei.com/en/campaign/kirin-990-series/
“Huawei intends to scale this AI processing block from servers to smartphones. It supports both INT8
and FP16 on both cores, whereas the older Cambricon design could only perform INT8 on one core.
There’s also a new ‘Tiny Core’ NPU. It’s a smaller version of the Da Vinci architecture focused on power
efficiency above all else, and it can be used for polling or other applications where performance isn’t
particularly time critical. The 990 5G will have two “big” NPU cores and a single Tiny Core, while the Kirin
990 (LTE) has one big core and one tiny core.”
https://www.extremetech.com/mobile/298028-huaweis-kirin-990-soc-is-the-first-chip-with-an-integrated-5g-modem
67. Mobile AI: Apple (Neural Engine)
(Sep 15, 2020) Apple unveils A14 Bionic processor with 40% faster CPU and
11.8 billion transistors
“The chip has a 16-core neural engine that can execute 11 trillion
AI operations per second. The neural engine core count is twice
the previous chip, and can perform machine learning computations
10 times faster. The A14 has six CPU cores and four graphics
processing unit (GPU) cores.”
https://venturebeat.com/2020/09/15/apple-unveils-a14-bionic-processor-with-40-faster-cpu-and-11-8-billion-transistors/
68. Mobile AI: Samsung (NPU)
(January 12, 2021) Samsung sets new standard for flagship mobile processors
with Exynos 2100
“AI capabilities will also enjoy a significant boost with the Exynos 2100.
The newly-designed tri-core NPU has architectural enhancements such as
minimizing unnecessary operations for high effective utilization and support for
feature-map and weight compression. Exynos 2100 can perform up to
26-trillion-operations-per-second (TOPS) with more than twice the power efficiency than
the previous generation. With on-device AI processing and support for advanced
neural networks, users will be able to enjoy more interactive and smart features as
well as enhanced computer vision performance in applications such as imaging.”
https://www.samsung.com/semiconductor/minisite/exynos/newsroom/pressrelease/samsung-sets-new-standard-for-flagship-mobile-processors-with-exynos-2100/
69. Mobile AI: MediaTek (APU)
(Aug 7, 2019) MediaTek Announces Dimensity 1000 ARM Chip With Integrated
5G Modem
“The Dimensity 1000 doesn’t just bring new branding; it’s also
sporting four Cortex A77 CPU cores and four Cortex A55 CPU
cores, all built on a 7nm process node. There’s also a 9-core Mali GPU, a 5-core
ISP, and a 6-core AI processor.
The MediaTek AI Processing Unit APU 3.0 is a brand new architecture. It
houses six AI processors (two big cores, three small cores and a single tiny core).
The new APU 3.0 brings devices a significant performance boost at 4.5 TOPS.”
https://www.extremetech.com/extreme/302712-mediatek-announces-dimensity-1000-arm-chip-with-integrated-5g-modem
https://i.mediatek.com/mediatek-5g
70. AI at the Edge: Jetson Nano
Price: $99 ($59 for the 2 GB version)
NVIDIA Jetson Nano Developer Kit is a small, powerful
computer that lets you run multiple neural networks in parallel
for applications like image classification, object detection, segmentation,
and speech processing. All in an easy-to-use platform that runs in as little as 5
watts.
● 128-core Maxwell GPU + Quad-core ARM A57, 472 GFLOPS
● 4 GB 64-bit LPDDR4 25.6 GB/s
https://developer.nvidia.com/embedded/jetson-nano-developer-kit
See also Jetson TX1, TX2, Xavier: https://developer.nvidia.com/embedded/develop/hardware
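On Jetson boards, models are typically optimized with TensorRT before deployment; a hedged sketch of building an FP16 engine from an ONNX file with the TensorRT 7-era Python API (the model file name is a placeholder):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder model file
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 suits the small embedded GPU
engine = builder.build_engine(network, config)  # API of TensorRT 7; newer versions differ
```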
71. AI at the Edge: Movidius Neural Compute Stick 2 (~$70)
The latest generation of Intel® VPUs includes 16
powerful processing cores (called SHAVE cores) and
a dedicated deep neural network hardware accelerator for high-performance
vision and AI inference applications, all at low power.
● Supports convolutional neural networks (CNNs)
● Framework support: TensorFlow, Caffe, Apache MXNet, ONNX, PyTorch, and
PaddlePaddle via an ONNX conversion
● Processor: Intel Movidius Myriad X Vision Processing Unit (VPU)
● Connectivity: USB 3.0 Type-A
https://software.intel.com/en-us/neural-compute-stick
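The NCS2 is programmed through the same OpenVINO IECore flow sketched earlier for FPGAs; only the device string changes (model files are again placeholders):

```python
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
# "MYRIAD" targets the Myriad X VPU inside the Neural Compute Stick 2.
exec_net = ie.load_network(network=net, device_name="MYRIAD")
```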
72. AI at the Edge: Google Edge TPU
The Edge TPU is a small ASIC designed by Google that provides
high performance ML inferencing for low-power devices. For
example, it can execute state-of-the-art mobile vision models such
as MobileNet V2 at 400 FPS in a power-efficient manner.
The on-board Edge TPU coprocessor performs 4 TOPS, drawing 0.5 watts
per TOPS (2 TOPS per watt).
TensorFlow Lite models can be compiled to run on the Edge TPU.
USB/Mini PCIe/M.2 A+E key/M.2 B+M key/SoM/Dev Board
https://cloud.google.com/edge-tpu/
https://coral.ai/products/
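A rough sketch of running an Edge-TPU-compiled TFLite model with the tflite_runtime delegate API (the model file name is a placeholder; libedgetpu must be installed on the host):

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Load a model already compiled by the Edge TPU compiler (placeholder name).
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2_edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
interpreter.invoke()
out = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])
```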
73. AI at the Edge: Others
● Sophon Neural Network Stick (NNS)
https://www.sophon.ai/product/introduce/nns.html
● Xilinx Edge AI (FPGA!)
https://www.xilinx.com/applications/industrial/analytics-machine-learning.html
● The Hailo-8 M.2 Module
https://hailo.ai/product-hailo/hailo-8-m2-module/
● More:
https://github.com/crespum/edge-ai
75. Problems
Even with FPGA/ASIC and edge devices:
● Energy efficiency:
○ Better than CPU/GPU, but still far from the ~20 watts used by the human brain
● Architecture:
○ Even more specialized for ML/DL computations, but...
○ Still far from brain-like computations
77. Neuromorphic chips
● Neuromorphic computing - brain-inspired computing - has emerged as a new
technology to enable information processing at very low energy cost using
electronic devices that emulate the electrical behaviour of (biological) neural
networks.
● Neuromorphic chips attempt to model in silicon the massively parallel way the
brain processes information as billions of neurons and trillions of synapses
respond to sensory inputs such as visual and auditory stimuli.
● DARPA SyNAPSE program (Systems of Neuromorphic Adaptive Plastic
Scalable Electronics)
● IBM TrueNorth; Stanford Neurogrid; HRL neuromorphic chip; Human Brain
Project SpiNNaker and HICANN.
https://www.technologyreview.com/s/526506/neuromorphic-chips/
78. Neuromorphic chips: IBM TrueNorth
● 1M neurons, 256M synapses, 4096 neurosynaptic
cores on a chip, est. 46B synaptic ops per sec per W
● Uses 70 mW; power density is 20 milliwatts per cm², almost 1/10,000th
that of most modern microprocessors
● “Our sights are now set high on the ambitious goal of
integrating 4,096 chips in a single rack with 4B neurons and 1T synapses while
consuming ~4kW of power”.
● Currently IBM is making plans to commercialize it.
● (2016) Lawrence Livermore National Lab got a cluster of 16 TrueNorth chips
(16M neurons, 4B synapses, for context, the human brain has 86B neurons).
When running flat out, the entire cluster will consume a grand total of 2.5 watts.
http://spectrum.ieee.org/tech-talk/computing/hardware/ibms-braininspired-computer-chip-comes-from-the-future
79. Neuromorphic chips: IBM TrueNorth
● (03.2016) IBM Research demonstrated convolutional neural nets with close to
state of the art performance:
“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”, http://arxiv.org/abs/1603.08270
80. Neuromorphic chips: Intel Loihi
● Fully asynchronous neuromorphic many core mesh that
supports a wide range of sparse, hierarchical and recurrent
neural network topologies
● Each neuromorphic core includes a learning engine that can
be programmed to adapt network parameters during
operation, supporting supervised, unsupervised,
reinforcement and other learning paradigms.
● Fabricated on Intel’s 14 nm process technology.
● A total of 130,000 neurons and 130 million synapses.
● Development and testing of several algorithms with high
algorithmic efficiency for problems including path planning,
constraint satisfaction, sparse coding, dictionary learning,
and dynamic pattern learning and adaptation.
https://newsroom.intel.com/editorials/intels-new-self-learning-chip-promises-accelerate-artificial-intelligence/
https://techcrunch.com/2018/01/08/intel-shows-off-its-new-loihi-ai-chip-and-a-new-49-qubit-quantum-chip/
https://ieeexplore.ieee.org/document/8259423
https://en.wikichip.org/wiki/intel/loihi
81. Neuromorphic chips: Intel Pohoiki Beach
(Jul 15, 2019) “Intel announced that an 8 million-neuron
neuromorphic system comprising 64 Loihi research chips
— codenamed Pohoiki Beach — is now available to the broader
research community. With Pohoiki Beach, researchers can
experiment with Intel’s brain-inspired research chip, Loihi, which
applies the principles found in biological brains to computer
architectures. ”
https://newsroom.intel.com/news/intels-pohoiki-beach-64-chip-neuromorphic-system-delivers-breakthrough-results-research-tests/
83. Neuromorphic chips: Intel Loihi
“Using Intel's Loihi neuromorphic research chip and
ABR's Nengo Deep Learning toolkit, we analyze the
inference speed, dynamic power consumption, and
energy cost per inference of a two-layer neural
network keyword spotter trained to recognize a single
phrase. We perform comparative analyses of this
keyword spotter running on more conventional
hardware devices including a CPU, a GPU, Nvidia's
Jetson TX1, and the Movidius Neural Compute
Stick.”
Benchmarking Keyword Spotting Efficiency on Neuromorphic Hardware
https://arxiv.org/abs/1812.01739
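To make the Nengo route concrete: a minimal spiking network sketch with nengo_loihi, which maps the model onto Loihi cores when hardware is present and otherwise emulates them (population sizes and signals are arbitrary):

```python
import nengo
import nengo_loihi

with nengo.Network() as net:
    stim = nengo.Node(lambda t: 0.5)                 # constant input signal
    a = nengo.Ensemble(n_neurons=100, dimensions=1)  # spiking population
    b = nengo.Ensemble(n_neurons=100, dimensions=1)
    nengo.Connection(stim, a)
    nengo.Connection(a, b)
    probe = nengo.Probe(b, synapse=0.01)             # filtered spike readout

# nengo_loihi places the ensembles on Loihi neuromorphic cores (or emulates).
with nengo_loihi.Simulator(net) as sim:
    sim.run(0.5)
print(sim.data[probe][-5:])
```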
86. NxTF: a Keras-like API for SNNs on Loihi
“NxTF: An API and Compiler for Deep Spiking Neural Networks on Intel Loihi”
https://arxiv.org/abs/2101.04261
https://github.com/intel-nrc-ecosystem/models/tree/master/nxsdk_modules_ncl/dnn
89. Neuromorphic chips: Tianjic
Tianjic’s unified function core (FCore) combines essential
building blocks for both artificial neural networks and biologically
inspired networks: axon, synapse, dendrite and soma blocks. The 28-nm
chip consists of 156 FCores, containing approximately 40,000
neurons and 10 million synapses in an area of 3.8×3.8 mm².
Tianjic delivers an internal memory bandwidth of more than 610 GB
per second, and a peak performance of 1.28 TOPS per watt for
running artificial neural networks. In the biologically-inspired spiking
neural network mode, Tianjic achieves a peak performance of about
650 giga synaptic operations per second (GSOPS) per watt.
https://medium.com/syncedreview/nature-cover-story-chinese-teams-tianjic-chip-bridges-machine-learning-and-neuroscience-in-f1c3e8a03113
https://www.nature.com/articles/s41586-019-1424-8
90. Neuromorphic chips: Others
● SpiNNaker (1,036,800 ARM9 cores)
http://apt.cs.manchester.ac.uk/projects/SpiNNaker/
● SpiNNaker-2
https://niceworkshop.org/wp-content/uploads/2018/05/2-27-SHoppner-SpiNNaker2.pdf
https://arxiv.org/abs/1911.02385 “SpiNNaker 2: A 10 Million Core Processor System for Brain
Simulation and Machine Learning”
● BrainScaleS, HICANN: 20x 8-inch silicon wafers, each incorporating 50×10⁶
plastic synapses and 200,000 biologically realistic neurons.
https://www.humanbrainproject.eu/en/silicon-brains/how-we-work/hardware/
● Akida NSoC: 1.2 million neurons and 10 billion synapses
https://www.brainchipinc.com/products/akida-neuromorphic-system-on-chip
https://www.nextplatform.com/2020/01/30/neuromorphic-chip-maker-takes-aim-at-the-edge/
https://en.wikichip.org/wiki/brainchip/akida
● Neurogrid: Neurogrid can model a slab of cortex with up to 16x256x256
neurons (>1M) https://web.stanford.edu/group/brainsinsilicon/neurogrid.html
https://web.stanford.edu/group/brainsinsilicon/documents/BenjaminEtAlNeurogrid2014.pdf
93. Other approaches
● Memristors https://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor
● Quantum computing https://ai.googleblog.com/2019/10/quantum-supremacy-using-programmable.html
● Optical computing https://www.nextplatform.com/2019/05/31/startup-looks-to-light-up-machine-learning/
● DNA computing https://www.wired.com/story/finally-a-dna-computer-that-can-actually-be-reprogrammed/
● Unconventional computing: cellular automata, reservoir computing, using
biological cells/neurons, chemical computation, membrane computing, slime
mold computing and much more https://www.springer.com/gp/book/9781493968824
● ...
94. References:
Hardware for Deep Learning series of posts:
https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc
● Part 1: Introduction and Executive summary
● Part 2: CPU
● Part 3: GPU
● Part 4: ASIC
● Part 5: FPGA
● Part 6: Mobile AI
● Part 7: Neuromorphic computing
● Part 8: Quantum computing
More info:
https://www.intel.com/content/www/us/en/products/programmable/fpga.html
https://www.xilinx.com/products/silicon-devices/fpga.html
https://www.techpowerup.com/258678/xilinx-announces-virtex-ultrascale-the-worlds-largest-fpga
https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/sg/product-catalog.pdf
https://www.intel.com/content/www/us/en/products/programmable/fpga/arria-10/features.html