In this talk, we evaluate training of deep recurrent neural networks with half-precision floats on Pascal and Volta GPUs. We implement a distributed, data-parallel, synchronous training algorithm by integrating TensorFlow with CUDA-aware MPI, enabling execution across multiple GPU nodes over high-speed interconnects. We introduce a learning rate schedule that facilitates neural network convergence at up to O(100) workers.
Strong scaling tests performed on GPU clusters show linear runtime scaling and logarithmic communication time scaling for both single- and mixed-precision training modes. Performance is evaluated on a scientific dataset taken from the Joint European Torus (JET) tokamak, containing multi-modal time series of sensor measurements leading up to deleterious events called plasma disruptions. Half precision significantly reduces memory and network bandwidth requirements, allowing training of state-of-the-art models with over 70 million trainable parameters while achieving test set performance comparable to single precision.
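The core recipe behind mixed-precision training (FP16 compute, an FP32 master copy of weights, and loss scaling) can be sketched with today's TensorFlow Keras mixed-precision API; this is an illustrative sketch, not the FRNN code, and the layer sizes and the 14-channel input width are assumptions.

import tensorflow as tf

# Compute in float16 while keeping variables (the FP32 "master copy" of weights) in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    # input_shape=(None, 14) is a placeholder: variable-length sequences of 14 sensor channels.
    tf.keras.layers.LSTM(200, input_shape=(None, 14)),
    # Keep the output layer in float32 so the loss is computed at full precision.
    tf.keras.layers.Dense(1, activation="sigmoid", dtype="float32"),
])

# Loss scaling multiplies the loss before backprop and unscales the gradients afterwards,
# preventing small FP16 gradients from underflowing to zero.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))
model.compile(optimizer=optimizer, loss="binary_crossentropy")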
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Databricks
Spark SQL is a very effective distributed SQL engine for OLAP and widely adopted in Baidu production for many internal BI projects. However, Baidu has also been facing many challenges for large scale including tuning the shuffle parallelism for thousands of jobs, inefficient execution plan, and handling data skew.
In this talk, we will explore Intel and Baidu’s joint efforts to address these challenges at large scale and offer an overview of an adaptive execution mode we implemented for Baidu’s Big SQL platform, which is based on Spark SQL. At runtime, adaptive execution can change the execution plan to use a better join strategy and handle skewed joins automatically. It can also change the number of reducers to better fit the data scale. In general, adaptive execution decreases the effort involved in tuning SQL query parameters and improves execution performance by choosing a better execution plan and parallelism at runtime.
We’ll also share our experience of using adaptive execution in Baidu’s production cluster with thousands of servers, where adaptive execution helps improve the performance of some complex queries by 200%. After further analysis we found that several scenarios in Baidu data analysis can benefit from the optimization of choosing a better join type. We saw a 2x performance improvement in a scenario where the user wanted to analyze the costs of 1000+ advertisers across both web and mobile, where each side has a full information table of roughly 10 TB of Parquet files per day. We are now writing probe jobs to detect more such scenarios among our users’ daily jobs. We are also considering exposing the strategy interface to upper-layer users, based on the detailed metrics collected from adaptive execution mode.
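An adaptive execution mode of this kind was later upstreamed into Apache Spark 3.x in a similar form; as a rough, hedged illustration (these are standard Spark 3 settings, not the Baidu/Intel patch described in the talk), enabling it looks like this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("adaptive-execution-example")
         # Re-optimize the query plan at runtime using shuffle statistics.
         .config("spark.sql.adaptive.enabled", "true")
         # Pick the number of reducers from the actual data size instead of a fixed setting.
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         # Split heavily skewed join partitions automatically.
         .config("spark.sql.adaptive.skewJoin.enabled", "true")
         .getOrCreate())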
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin SeyfeDatabricks
Data shuffling is a costly operation. At Facebook, single job shuffles can reach the scale of over 300TB compressed using (relatively cheap) large spinning disks. However, shuffle reads issue large amounts of inefficient, small, random I/O requests to disks and can be a large source of job latency as well as waste of reserved system resources. In order to boost shuffle performance and improve resource efficiency, we have developed Spark-optimized Shuffle (SOS). This shuffle technique effectively converts a large number of small shuffle read requests into fewer large, sequential I/O requests.
In this session, we present SOS’s multi-stage shuffle architecture and implementation. We will also share our production results and future optimizations.
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkDatabricks
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
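As a flavor of what "simple" means here, a minimal Ray sketch (illustrative only, not code from the talk) parallelizes an ordinary Python function across a cluster with the @ray.remote decorator:

import ray

ray.init()  # connect to an existing Ray cluster, or start a local one

@ray.remote
def score(batch):
    # Placeholder for per-batch work such as feature computation or model inference.
    return sum(batch)

# Launch tasks in parallel and gather the results.
futures = [score.remote(list(range(i, i + 10))) for i in range(0, 100, 10)]
print(ray.get(futures))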
A talk by Maksud Ibrahimov, Chief Data Scientist at InfoReady Analytics, on how to maximise the performance of Spark.
As a user of Apache Spark since its very early releases, he generally sees that the framework is easy to start with, but that performance starts to suffer as programs grow. In this talk Maksud will answer the following questions:
- How to reach higher level of parallelism of your jobs without scaling up your cluster?
- Understanding shuffles, and how to avoid disk spills
- How to identify task stragglers and data skews?
- How to identify Spark bottlenecks?
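As a small, hedged illustration of the first two questions above (the values and column names are placeholders, not recommendations from the talk):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-example").getOrCreate()

# Raise shuffle parallelism without adding nodes: more, smaller shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "400")

df = spark.read.parquet("/data/events")   # hypothetical input path
# Repartition on the join/aggregation key so shuffle work is spread evenly
# and individual partitions stay small enough to avoid spilling to disk.
df = df.repartition(400, "user_id")       # "user_id" is a hypothetical column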
Analyzing IOT Data in Apache Spark Across Data Centers and Cloud with NetApp ...Databricks
This session will explain how NetApp simplifies the process of analyzing IoT data, using Apache Spark clusters across data centers and the cloud using NetApp Private Storage (NPS) for AWS/Azure, NetApp Data Fabric and NetApp Connectors for NFS and S3. IoT data originates at the edge in different geographical locations, and it can arrive at different data centers or the cloud depending on sensor location. The challenge is how to combine these different data streams across different datacenters to generate wider insights.
Learn how NetApp Data Fabric helps solve this challenge. In the Data Fabric architecture, the IoT data is ingested via Kafka into an Apache Spark cluster running in AWS/Azure, but the data is stored in an NPS-provisioned NFS share through the NFS Connector. The IoT data in NPS can then be moved to on-prem datacenters, or on-prem IoT data can be moved to NPS or ONTAP Cloud for processing in AWS/Azure using NetApp SnapMirror, FlexClone or the NFS Connector. We’ll also review how NetApp StorageGRID object storage maintains IoT data for archival purposes using an S3 target. These options allow you to analyze IoT data from AWS, StorageGRID, HDFS or NFS, providing a feasible solution for deploying Spark clusters across datacenters.
Takeaways will include identifying Spark challenges that can be remedied by extending your Spark environment to take advantage of NPS; understanding how NPS and StorageGRID can provide a cost-effective alternative for dev/test, DR for Spark analytics; and understanding Spark architecture and deployment options that utilize data from multiple locations, including on-prem and cloud-based repositories.
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsDataWorks Summit
As a Hadoop developer, do you want to quickly develop your Hadoop workflows? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write unit tests for your workflows and add assertions on top of it?
In just a few years, the number of users writing Hadoop/Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production.
We’ve tried to address these issues by creating a testing framework for Hadoop/Spark jobs. The testing framework enables the users to run their jobs in an environment similar to the production environment and on the data which is sampled from the original data. The testing framework consists of a test deployment system, a data generation pipeline to generate the sampled data, a data management system to help users manage and search the sampled data and an assertion engine to validate the test output.
In this talk, we will discuss the motivation behind the testing framework before deep diving into its design. We will further discuss how the testing framework is helping the Hadoop users at LinkedIn to be more productive.
CaffeOnSpark Update: Recent Enhancements and Use CasesDataWorks Summit
By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016.
In this talk, we will update the audience on the recent development of CaffeOnSpark. We will highlight new features and capabilities: a unified data layer that supports multi-label datasets, distributed LSTM training, interleaving testing with training, a monitoring/profiling framework, and Docker deployment.
We plan to share some interesting use cases from Yahoo, including image classification, NSFW image detection, and automatic identification of eSports game highlights. We will offer an interactive demo of image auto captioning using CaffeOnSpark in a Hadoop based notebook.
A Comparative Performance Evaluation of Apache FlinkDongwon Kim
I compare Apache Flink to Apache Spark, Apache Tez, and MapReduce in Apache Hadoop in terms of performance. I run experiments using two benchmarks, Terasort and Hashjoin.
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Deep Dive Into Apache Spark Multi-User Performance Michael Feiman, Mikhail Ge...Databricks
When you run an Apache Spark application on a large cluster, you want to make sure you’re getting the most from that cluster. Any CPU or memory left on the table represents either a waste of money or a lost opportunity to speed up your Spark jobs. What many people don’t realize is how sensitive Spark cluster utilization is to the resource manager. Resource managers decide how to allocate cluster resources among the many users and applications contending for them. In this deep dive session, we will discuss how Spark integrates with two common open source resource managers, YARN and Mesos, as well as a new commercial product called IBM Spectrum Conductor with Spark. You will learn how resource managers arbitrate resources in multi-user/multi-tenant Spark clusters, and how this affects application performance. You will come away with new techniques for tuning Spark resource management to optimize goals like speed and fairness. The session will include a demo of a new open source benchmark designed to help analyse Spark multi-user/multi-tenant performance. The benchmark uses Spark SQL and machine learning jobs to load the cluster, and can be used during a pre-production cycle to tune Spark and resource manager configurations.
How to Automate Performance Tuning for Apache SparkDatabricks
Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How to make sure that my pipeline always finishes on time and meets its SLA?
These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful pieces of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or third parties like the Data Mechanics platform.
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDb for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
The main aim of our research is to find the limit of Amdahl's Law for multicore processors, i.e., the number of cores beyond which adding cores no longer improves the efficiency of the overall CMP (Chip Multiprocessor, a.k.a. multicore processor) architecture. We expect this limit to lie either in the architecture of the multicore processor or in the programming. We surveyed the architectures of multicore processors from various chip manufacturers, namely INTEL™, AMD™ and IBM™, and the techniques they employ to improve multicore performance. We conducted cluster experiments to find this limit, and in this paper we propose an alternate multicore processor design based on the results of those experiments.
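For readers unfamiliar with the limit in question, a short worked example of Amdahl's Law (the numbers are illustrative, not the paper's results):

def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
# The speedup saturates near 1 / (1 - p) = 20x for p = 0.95, no matter how many
# cores are added; this ceiling is the limit the authors probe experimentally.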
SparkNet implements a scalable, distributed algorithm for training deep neural networks that can run on existing batch processing frameworks like MapReduce and Spark.
Work by researchers at UC Berkeley.
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
Oak Ridge National Lab is home of Titan, the largest GPU accelerated supercomputer in the world. This fact alone can be an intimidating experience for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools to successfully port applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/07/efficiently-map-ai-and-vision-applications-onto-multi-core-ai-processors-using-cevas-parallel-processing-framework-a-presentation-from-ceva/
Rami Drucker, Machine Learning Software Architect at CEVA, presents the “Efficiently Map AI and Vision Applications onto Multi-core AI Processors Using CEVA’s Parallel Processing Framework” tutorial at the May 2023 Embedded Vision Summit.
Next-generation AI and computer vision applications for autonomous vehicles, cameras, drones and robots require higher-than-ever computing power. Often, the most efficient way to deliver high performance (especially in cost- and power-constrained applications) is to use multi-core processors. But developers must then map their applications onto the multiple cores in an efficient manner, which can be difficult. To address this challenge and streamline application development, CEVA has introduced the Architecture Planner tool as a new element in CEVA’s comprehensive AI SDK.
In this talk, Drucker shows how the Architecture Planner tool analyzes the network model and the processor configuration (number of cores, memory sizes), then automatically maps the workload onto the multiple cores in an efficient manner. He explains key techniques used by the tool, including symmetrical and asymmetrical multi-processing, partition by sub-graphs, batch partitioning and pipeline partitioning.
An Introduction to TensorFlow architectureMani Goswami
Introduces you to the internals of TensorFlow and deep dives into distributed version of TensorFlow. Refer to https://github.com/manigoswami/tensorflow-examples for examples.
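For orientation, here is a minimal sketch of distributed training with the current TensorFlow 2.x tf.distribute API; the linked examples and the original deck may use the older ClusterSpec-based setup, and the layer sizes here are arbitrary.

import tensorflow as tf

# Synchronous data-parallel training across the GPUs visible on this machine.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
# model.fit(...) now replicates the model and averages gradients across devices.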
Speculative aspects of high-speed processor designssuser7dcef0
- Highest total system speed, e.g., TOP500 speed, application speed of supercomputers
- Highest processor chip performance, e.g., SPEC CPU rate, NAS parallel benchmarks
- Highest single-core performance, e.g., SPEC CPU int, SPEC CPU fp, Dhrystone
Fast switching of threads between cores - Advanced Operating SystemsRuhaim Izmeth
Fast switching of threads between cores is a published research paper on operating systems. This is our attempt to decode the research and present it to the class.
improve deep learning training and inference performances.rohit
Factors affecting GPU performance for machine learning training and inference:
1. Deep Learning Performance Benchmarks
2. Gpu hardware basics
3. Internal data Transfer
4. Models, Datasets and Parallelism
5. Data training pipeline
6. Performance Tuning
7. Deep Learning Load Distribution Strategies.
8. Misc algorithms like Automatic Differentiation etc.
Hardback solution to accelerate multimedia computation through mgp in cmpeSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
NECSTLab works with Machine Learning models and systems in a variety of ways. While some projects use known ML models within their target application for prediction, other projects accelerate prediction via FPGA or on cloud environments.
In the "STeEl" research line, two works use ML models. "Behaviors Identification in Social Individuals" (from G. Muscioni) uses an XGBoost model to identify group of baboons with similar behaviour. Similarly, SeNSE (from P. Cancian et al.) evaluates SVM- and Long-Short Term Memory-based classifiers to translate information from brain signals into gestures, in order to enable designing cheap human-hand prosthesis.
The DReAMS research line targets both embedded and server-class systems. For the embedded world, the project "Deep Learning on PYNQ" (from L. Stornaiuolo) enables parallel access to data via a dedicated core based on the Polimorphic Register File; among other applications, parallel access is especially important for Neural Network models to gather weights and inputs. Instead, SpiNN is a project in the early stages that attempts to simulate neurons with Spiking Neural Network, with a Reinforcement-Learning-based approach. For server-class applications, CONDOR (from N. Raspa, M. Bacis, G. Natale) provides a pre-packaged, tunable IP to deploy Convolutional Neural Networks to FPGAs. CONDOR employs a generic, parameterised architecture to describe the various CNN layers, easing the exploration of performance/area trade-offs with the goal of automatising it in the future.
Finally, in the ORCA research line, PRETZEL accelerates ML inference in cloud settings. PRETZEL (from Y. Lee, A. Scolari, M. Interlandi et al.) challenges the convenient black box approach of deploying ML models in order to gather more information about the models. Since different models can have much state and code in common, PRETZEL relies on these common data to enable memory saving and code optimisations, improving the inference latency and the throughput metrics.
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
https://sites.google.com/view/itri-icl-dla/
(Public Information Share) This is our lightweight DNN inference processor presentation, covering a system solution (from Caffe prototxt to hardware control files), hardware features, and an example of object detection (Tiny YOLO) with RTL simulation results. We modified the open-source NVDLA (small configuration) and developed a RISC-V MCU for this accelerating system.
Similar to Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters with Alexey Svyatkovskiy
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
- Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
- Performing data quality validations using libraries built to work with Spark
- Dynamically generating pipelines that can be abstracted away from users
- Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
- Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
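A hedged sketch of the Spark 3.1 stage-level scheduling API described above, in PySpark; the resource amounts, the discovery script path, and the parse_record/train_partition helpers are hypothetical, and a SparkSession named spark is assumed.

from pyspark.resource import ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder

# Describe the executors and tasks that the GPU training stage needs.
ereqs = (ExecutorResourceRequests()
         .cores(4)
         .memory("16g")
         .resource("gpu", 1, discoveryScript="/opt/spark/getGpus.sh"))
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# CPU-only ETL stage, then a training stage that requests GPU executors.
etl_rdd = spark.sparkContext.textFile("/data/raw").map(parse_record)
trained = etl_rdd.withResources(gpu_profile).mapPartitions(train_partition)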
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
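A short sketch of the converter flow described above, using the Petastorm Spark Dataset Converter API; the cache location, batch size, and the preprocessed_df and model objects are placeholders.

from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Petastorm materializes the DataFrame to a cache directory behind the scenes.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

converter = make_spark_converter(preprocessed_df)    # preprocessed_df: a Spark DataFrame
with converter.make_tf_dataset(batch_size=64) as dataset:
    model.fit(dataset, steps_per_epoch=100, epochs=3)  # any tf.keras model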
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- A demonstration of analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
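A minimal sketch of the "Redis hashes as distributed counters" idea listed above: each Spark partition batches HINCRBY calls through a pipeline instead of relying on Spark accumulators. The host, key names, the status column, and events_df are placeholders, not Adobe's code.

import redis

def count_partition(rows):
    r = redis.Redis(host="redis-host", port=6379)
    pipe = r.pipeline()                 # batch increments to cut network round trips
    for row in rows:
        pipe.hincrby("job:123:counters", row.status, 1)
    pipe.execute()
    # Note: with task retries or speculative execution, increments can be applied
    # more than once; deduplicate by task attempt if exact counts are required.

events_df.rdd.foreachPartition(count_partition)   # events_df: a hypothetical DataFrame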
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is a set of complex ingestion flows over a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
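As a hedged sketch of the kind of Delta Lake upsert such a pipeline runs (the table path, key column, and updates_df are placeholders, not Adobe's schema):

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/profiles")

(target.alias("t")
 .merge(updates_df.alias("u"), "t.profile_id = u.profile_id")
 .whenMatchedUpdateAll()       # update existing profile rows
 .whenNotMatchedInsertAll()    # insert newly seen profiles
 .execute())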
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, typically operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on GPU Clusters with Alexey Svyatkovskiy
1. Alexey Svyatkovskiy (Microsoft/Princeton U/PPPL),
Julian Kates-Harbeck (Harvard U/PPPL),
William Tang (Princeton U/PPPL)
Training neural networks
with low precision floats
#DL2SAIS
2. Outline
• Training neural networks with reduced floating-point precision
– Motivation and challenges
– Half-precision and mixed precision training approaches
• Enabling FP16 training on Pascal and Volta GPUs
– Compression of weights and gradients
– Master copy of weights, loss scaling
– TensorCores
• Fusion Recurrent Neural Net (FRNN) analysis framework
– The physics problem we are trying to solve
– Training LSTMs on long sequences of variable length
– Distributed training at the largest scale
– Results on TSUBAME 3.0 and OLCF Summit supercomputers
• Conclusions
3. Training with reduced floating-point precision: motivation
• Models are getting larger
– Optimize memory and network bandwidths
• Need faster training
– Improve maximum computational throughput, use larger mini-batch size
– Minimize communication overhead during distributed training
NN architecture | Num. layers | Num. trainable parameters | GFLOPS
AlexNet         | 8           | 60 million                | 1.4
GoogLeNet       | 22          | 4 million                 | 4
Inception V3    | 48          | 24 million                | 9
ResNet50        | 152         | 26 million                | 23
Deep Speech 2   | 11          | 38 million                | 465
5. FP16 training: motivation and challenges
Challenges
• IEEE 754 FP16 has a narrow numerical range
– Overflow problem: Inf/NaN cause irreversible damage to training (quite rare)
– “Vanishing gradient“ (quite common)
• Small valued gradients
• Large ratio of weight to weight gradient
FP16 subnormal range: [2^-24, 2^-14]; normal range: [2^-14, 65504]
[Figure: number lines comparing the exponent ranges of float (roughly -127 to 128) and half (-24 to 15).]
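To make the narrow FP16 range above concrete, a quick NumPy check (the values are illustrative, not taken from FRNN training runs):

```python
import numpy as np

grad = np.float32(1e-8)              # a small but meaningful FP32 gradient value
print(np.float16(grad))              # 0.0 -- underflows below the FP16 subnormal minimum (~6e-8)
scaled = np.float16(grad * 1024)     # shift into the representable range before casting
print(np.float32(scaled) / 1024)     # ~1e-8 -- the value survives once unscaled in FP32
print(np.float16(70000.0))           # inf -- overflow above the FP16 maximum of 65504
```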
6. Enabling FP16 training (I)
Mixed precision with master copy of weights
• Store weights, activations and gradients in FP16; perform the weight update in FP32
– Solves the "vanishing gradient" problem
– Disadvantage: conversion between FP32 and FP16 may be slow, and extra memory is needed to keep a copy of the weights in FP32
Mixed arithmetic precision
• FP16 matrix multiply, FP32 accumulate, either FP32 or FP16 for activations
• Volta TensorCores enable this at the hardware level
– Solves the "vanishing gradient" problem
– Programmable matrix-multiply-and-accumulate units; 8 TensorCores per streaming multiprocessor
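A minimal NumPy sketch of the master-copy idea (a toy quadratic loss stands in for the real FP16 fprop/bprop; nothing here is taken from the FRNN code):

```python
import numpy as np

rng = np.random.default_rng(0)
master_w = rng.normal(size=4096).astype(np.float32)   # FP32 master copy of the weights

def toy_gradient(w16):
    # Stand-in for the FP16 forward/backward pass: the gradient of 0.5*||w||^2 is simply w
    return w16

for step in range(10):
    w16 = master_w.astype(np.float16)                 # FP16 copy used for compute
    grad16 = toy_gradient(w16)                        # gradients produced in FP16
    master_w -= 0.01 * grad16.astype(np.float32)      # update accumulated in the FP32 master weights
```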
7. Enabling FP16 training (III)
Loss scaling: shift gradients to occupy the higher, otherwise unused bins of the FP16 representable range
– Apply a scalar factor to the loss function before the backpropagation step
– Unscale the weight gradients before the update step to keep the magnitude of the updates the same as for FP32; alternatively, re-tune the hyperparameters
• Scaling factors depend on the neural network and dataset at hand
[Figure: a scale factor of 10 shifts the gradient distribution right by roughly 3 exponent values within the FP16 representable range; everything below that range is irrelevant for training.]
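A sketch of loss scaling in a training step, written with the modern tf.GradientTape API for brevity (the model, optimizer and loss function are placeholders; this is not the FRNN implementation):

```python
import tensorflow as tf

LOSS_SCALE = 10.0                                      # scalar factor applied to the loss

def scaled_train_step(model, optimizer, loss_fn, x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
        scaled_loss = loss * LOSS_SCALE                # scale before backpropagation
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = [g / LOSS_SCALE for g in scaled_grads]     # unscale before the weight update
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```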
8. Enabling FP16 training (II)
Weight and gradient sharing
• Quantize weight and gradient matrices into N bins via clustering
– Weights in the same bin share the same value
– Store the indices of the shared weights in an integer matrix
Dynamic fixed-point representation
• Train in full precision (float)
• Quantize weights and activations
– Gather statistics about the weights and activations (min, max for each layer)
– Convert each float value to a fixed-point format (the Google TPU uses an int8-equivalent representation and arithmetic for this)
S. Han et al., "Deep Compression"
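A toy sketch of the weight-sharing scheme above, using scikit-learn's KMeans for brevity (the bin count and matrix size are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(w, n_bins=16):
    """Quantize a weight matrix into n_bins clusters: weights in a bin share one value,
    and only the integer index matrix plus the small codebook need to be stored."""
    km = KMeans(n_clusters=n_bins, n_init=10).fit(w.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()                   # one shared value per bin
    indices = km.labels_.reshape(w.shape).astype(np.uint8)   # index of the shared weight
    return codebook, indices

w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = share_weights(w)
w_shared = codebook[idx]                                     # reconstructed (quantized) weight matrix
```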
9. Single GPU results
• Volta, no loss scaling required for accurate convergence:
Model        | FP32 baseline | Mixed precision
AlexNet      | 56.77 %       | 56.93 %
VGG-D        | 65.40 %       | 65.43 %
GoogLeNet    | 68.33 %       | 68.43 %
Inception V3 | 73.85 %       | 74.13 %
ResNet50     | 75.92 %       | 76.04 %
• Volta, loss scaling required for accurate convergence:
Model        | FP32 baseline | MP, no loss scaling | MP, loss scaling
Faster R-CNN | 69.1 %        | 68.6 %              | 69.7 %
Multibox SSD | 76.9 %        | diverges            | 77.1 %
Results from: P. Micikevicius et al., Mixed Precision Training, https://arxiv.org/abs/1710.03740
10. OLCF Summit: hardware and setup
• Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With a theoretical peak performance of 200 PF, it is one of the "smartest" computers in the world for deep learning applications, with mixed-precision capability in excess of 3 EF
– Expected to be ranked the #1 supercomputing system worldwide this summer
– 4600 compute nodes; two IBM POWER9 (PPC64LE) processors and 6 Volta V100 accelerators per node, connected with NVLink
– Deployed using Singularity containers
– LSF scheduler, jsrun to execute jobs
– AlpineTDS IBM Spectrum Scale filesystem
– Globus DTN, HPSS
11. Physics problem we are trying to solve
• Fusion energy: a clean source of energy that can be produced on Earth in a tokamak-style fusion reactor
• Tokamak: a type of nuclear fusion device; a torus-shaped vacuum chamber surrounded by magnetic coils
• Existing tokamak-style fusion reactors: JET, DIII-D, NSTX
• Plasma disruptions are large macroscopic instabilities resulting in:
– Loss of confinement – ends the fusion reaction
– Intense radiation – damaging concentration in small areas
– Current quench – high magnetic forces
• Prediction and avoidance of disruptions is a crucial aspect of tokamak operation, especially in the high-performance regime
12. Data: scalar and vector time series
• Scalar (0D) time series: plasma current, locked-mode amplitude
• Vector (1D) time series: current, electron temperature and radiation profiles, all as a function of ρ (normalized flux surface)
[Figure: a raw 1D profile sampled from ρ = 0 to ρ = 1 is passed through 1D convolution and activation followed by max pooling; the resulting salient features are concatenated with the scalar signals into the full feature vector.]
13. Fusion Recurrent Neural Net (FRNN)
• Output: is a disruption coming? An alarm is raised when the output exceeds a threshold
• RNN architecture: LSTM, 3 layers, 300 hidden units per cell; stateful, returns sequences
• CNN architecture (for the 1D data): 2 convolutional layers, 10 filters of size 3, pool size 2
• Time-distributed fully connected (FC) layer: applied to every temporal slice of the LSTM output
[Figure: at each time step T = 0, 1, …, t, the 0D signals and the CNN features extracted from the 1D data are concatenated and fed to the stacked LSTMs, whose internal state is carried forward in time; a time-distributed FC layer on each LSTM output produces the alarm signal.]
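A condensed Keras sketch of the architecture above (the signal counts are made-up placeholders; statefulness and the per-shot state handling described on the next slides are omitted, so this is an illustration rather than the actual FRNN model):

```python
from tensorflow.keras import layers, models

N_SCALAR, N_RHO = 14, 64   # hypothetical number of 0D signals and radial grid points

sig_0d = layers.Input(shape=(None, N_SCALAR))        # scalar time series
sig_1d = layers.Input(shape=(None, N_RHO, 1))        # 1D profiles vs. rho

# CNN applied to every temporal slice: 2 conv layers, 10 filters of size 3, pool size 2
x = layers.TimeDistributed(layers.Conv1D(10, 3, activation='relu'))(sig_1d)
x = layers.TimeDistributed(layers.Conv1D(10, 3, activation='relu'))(x)
x = layers.TimeDistributed(layers.MaxPooling1D(2))(x)
x = layers.TimeDistributed(layers.Flatten())(x)

h = layers.Concatenate()([sig_0d, x])                # full feature vector at each time step
for _ in range(3):                                   # 3 LSTM layers, 300 hidden units, returning sequences
    h = layers.LSTM(300, return_sequences=True)(h)
score = layers.TimeDistributed(layers.Dense(1))(h)   # time-distributed FC layer -> disruption score

model = models.Model([sig_0d, sig_1d], score)
```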
14. Training LSTM on sequences of variable lengths (I)
• Time series lengths in our problem vary by orders of magnitude: a challenge for sequence-based learning tasks
• We approximate gradient computation by truncated backprop through time (BPTT)
– Gradients are computed over a subsequence of time series, the internal states are saved,
and then the next subsequence is fed to the RNN while using the last internal states from
the previous subsequence as the initial internal states of the current subsequence
• Traditional approach of bucketing would not work as the sequence length is strongly correlated
with whether shots are disruptive or non-disruptive and thus individual batches would be biased
• We implement custom approach resetting internal state of individual examples within mini batch
– Since there is a persistent internal state between successive subsequences in time, it is not
possible to use more than one subsequence from a given shot in a given mini-batch
15. Training LSTM on sequences of variable lengths (II)
• To train batch-wise we need M independent (stemming from different shots) time slices of equal
length to feed to the GPU
– Maintain a buffer of M separate shots. At every training
step, the first T time slices of the buffer are fed as the
next batch, then shift the buffer by T time steps
• When shot is finished in the buffer, a new shot
is loaded and the RNN internal states of the
corresponding batch index are reset for training
• Internal states of the other batch indices are
maintained. Thus, the internal state is persisted
during learning for the entire length of any given shot
– Allow the RNN to learn temporal patterns much longer than unrolling length T and potentially as long as
the entire shot
• Random offsets of shots against each other and random shuffling of train set provides a mixture
of disruptive and non-disruptive samples for the network at every batch to stabilize training
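The buffering scheme described above can be sketched as a small NumPy generator (assuming every shot is at least T time slices long and there are at least M shots; the real implementation also applies random offsets and shuffling):

```python
import numpy as np

def stateful_batches(shots, M, T):
    """Yield (batch, reset_mask): batch has shape (M, T, n_features); reset_mask[i] is True
    when buffer slot i just started a new shot, so the RNN internal state for that batch
    index must be reset before the batch is fed to the network."""
    queue = list(shots)
    buffer = [queue.pop(0) for _ in range(M)]            # one shot per batch index
    offsets = [0] * M
    while True:
        reset = np.zeros(M, dtype=bool)
        for i in range(M):
            if offsets[i] + T > len(buffer[i]):          # shot finished in this slot: load the next one
                if not queue:
                    return
                buffer[i], offsets[i], reset[i] = queue.pop(0), 0, True
        batch = np.stack([buffer[i][offsets[i]:offsets[i] + T] for i in range(M)])
        offsets = [o + T for o in offsets]
        yield batch, reset

shots = [np.random.randn(np.random.randint(500, 3000), 14) for _ in range(32)]
for batch, reset in stateful_batches(shots, M=8, T=128):
    pass  # feed batch to the stateful RNN, resetting the states where reset[i] is True
```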
16. Introducing FRNN
• FRNN represents a combined workflow implemented in Python
– Keras for layer definitions
– TensorFlow optimizers
– Custom parameter averaging, global weights update in CUDA-aware MPI
– Not a general purpose tool: geared toward Fusion energy scientific
community, but certain parts can be useful for a wider community
– https://github.com/PPPLDeepLearning/plasma-python
• We demonstrate how to enable FP16 for distributed training with this
combined Keras/TensorFlow/MPI based workflow
– Weight update is sensitive to floating-point precision
– Volta/FP16-optimized algorithms are disabled by default in TF and are chosen by the cuDNN autotuner
17. Distributed training flow
(Step 0) Initialize the network parameters randomly, based on the model configuration
(Step 1) Perform fprop and bprop on each worker (Replica 1, 2, …, N, one GPU worker per replica)
(Step 2) Aggregate the local gradients collected from each worker using MPI_Allreduce, then average them
(Step 3) Update the optimizer internal state and the global weights using the global gradient: w_global ← w_global − λ · g_global
(Step 4) MPI_Broadcast the global parameters after the update back to the workers, then repeat from (Step 1)
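A minimal mpi4py sketch of steps 2-4 (illustrative only; FRNN's actual routines live in the plasma-python repository, and after the Allreduce every worker already holds the same averaged gradient):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

def average_gradients(local_grads):
    """(Step 2) Sum the local gradients from every worker with MPI_Allreduce, then average."""
    global_grads = []
    for g in local_grads:
        buf = np.empty_like(g)
        comm.Allreduce(g, buf, op=MPI.SUM)
        global_grads.append(buf / n_workers)
    return global_grads

def update_and_broadcast(weights, global_grads, lr):
    """(Step 3) Apply the global gradient; (Step 4) broadcast the updated weights from rank 0."""
    for w, g in zip(weights, global_grads):
        w -= lr * g
        comm.Bcast(w, root=0)
    return weights
```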
18. No loss in training accuracy
• Use the validation-level area under the ROC curve (AUC) to characterize the quality of FRNN and the applicability of half-precision training
• The validation AUC as a function of epoch shows similar shapes for both FP16 and FP32
– Reaching a plateau at around AUC = 0.87 by epoch 6
– SGD optimizer with momentum, loss scaling factor = 10
• The test set AUC = 0.96 for both FP16 and FP32
[Figure: validation AUC vs. epoch for FP16 and FP32, for base learning rates λ0 = 0.0004 and λ0 = 0.001.]
19. Performance and scaling properties (I)
• FRNN shows nearly linear runtime scaling and logarithmic scaling of the communication time
[Figure: time per epoch vs. number of GPUs (measured data, scaling model, ideal scaling); training loss and validation AUC vs. scaled walltime for N_GPU = 4, 16, 64, 128, 256; fraction of detected disruptions and test AUC vs. time to disruption.]
20. Performance and scaling properties (II)
• Results for hyperparameter tuning with 10000 GPUs, using parallel random search across 100 models trained on 100 GPUs each
• The time to solution for finding a model of a given validation AUC is compared to the scenario of using a single GPU
– The inset shows the time required for finding the best model when using 1, 100 and 10000 total GPUs
– For 100 GPUs, we distinguish training the 100 models in serial but using 100 GPUs per model ("parallel training") vs. running the 100 models in parallel, trained on 1 GPU each ("parallel tuning"); both achieve nearly 100x speedup
[Figure: speedup over serial vs. validation AUC achieved (ideal vs. measured); inset: runtime in hours for 1 GPU, 10^2 GPUs (parallel training), 10^2 GPUs (parallel tuning), and 10^4 GPUs.]
21. Conclusions
• As neural network models get larger, reduced floating-point precision training approaches become crucial
– They optimize computational throughput and reduce memory and network bandwidth, allowing for faster training
• Developed a domain-specific deep learning framework integrating Keras and TensorFlow with custom parameter averaging and global weight update routines implemented with CUDA-aware MPI, intended for disruption forecasting in tokamaks
– Strong, nearly linear runtime scaling on GPU clusters
– Hyperparameter tuning with O(10000) nodes
• Evaluated training of deep RNNs with half-precision floats on a real scientific dataset
– Used the scientific JET plasma disruption time series dataset
– Used the loss scaling approach to facilitate model convergence at FP16
– Obtained comparable validation-level performance at the end of each epoch for half and single floating-point precision
22. Microsoft Cloud+AI SMART Data Science team is hiring!
• ML-enhanced coding assistant based on a Markov chain model
• Automatic code review generation through deep learning
• Answering documentation questions using a seq-to-seq deep learning model
• AI-infused analytics
• Code risk analysis using graph learning
Contact: neels@microsoft.com
24. NVIDIA Volta architecture: TensorCores
• What are TensorCores?
– Programmable matrix-multiply-and-accumulate units. 8 TensorCores per
streaming multiprocessor
– Each tensor core can perform operations on small 4x4 matrices
– 1 matrix-multiply-and-accumulate operation per GPU clock
– During program execution, multiple TensorCores are used concurrently by a full warp of execution on the GPU
https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
26. Recurrent Neural Nets: basic description
• RNNs are a family of neural networks for processing sequential data
• The feed-forward equations are recurrent:
! " = $ + &ℎ " − 1 + *+(")
ℎ " = tanh ! "
2 " = 3 + 4ℎ(")
L
o
h
x
y
U
WV
Unfold
Notations:
x – input sequence
U – input-to-hidden weight matrix
W – hidden-to-hidden weight matrix
V – hidden-to-output weight matrix
b, c – biases
tanh() – activation function (non-linearity)
o – output sequence
L – loss; y – target values
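For concreteness, a minimal NumPy implementation of the forward pass above (toy dimensions and random weights, purely illustrative):

```python
import numpy as np

def rnn_forward(x, U, W, V, b, c):
    """Vanilla RNN: a(t) = b + W h(t-1) + U x(t), h(t) = tanh(a(t)), o(t) = c + V h(t)."""
    h = np.zeros(W.shape[0])          # initial hidden state
    outputs = []
    for x_t in x:                     # x has shape (T, n_in)
        a = b + W @ h + U @ x_t
        h = np.tanh(a)
        outputs.append(c + V @ h)
    return np.array(outputs), h

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                              # 5 time steps, 3 input features
U, W, V = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
b, c = np.zeros(4), np.zeros(2)
o, h_last = rnn_forward(x, U, W, V, b, c)                # o has shape (5, 2)
```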
27. Model convergence at large N
• Learning rate schedule is crucial to facilitate model convergence during
distributed training when the effective mini-batch size is large
• We use exponential learning rate decay with adjustable base learning rate
• Effective batch size is multiplied by the
number of workers
• In a distributed regime, the base
learning rate is reduced as the number
of workers N is increased
• Learning rate is clipped
!" = !$ %"
!$ (', )) =
!$
1.0 + '/)
#DL2SAIS 27
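A small sketch of such a schedule (the decay rate γ, divisor c and clipping floor are illustrative placeholders, not FRNN's actual hyperparameters):

```python
def learning_rate(epoch, n_workers, base_lr=1e-3, gamma=0.97, c=16.0, lr_min=1e-5):
    lr0 = base_lr / (1.0 + n_workers / c)   # reduce the base rate as the number of workers grows
    lr = lr0 * gamma ** epoch               # exponential decay over epochs
    return max(lr, lr_min)                  # clip the learning rate
```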
28. Half-precision training (I)
Half-precision training:
– Store weights, activations and gradients in FP16
– Perform matrix and element-wise multiplications and reduction ops in FP16 (including during fprop, bprop and the weight update)
• FP16 math, arithmetic and comparisons are already possible on NVIDIA GPUs with compute capability 5.3 and greater
– CUDA half intrinsics (round to nearest even): http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__HALF.html#group__CUDA__MATH__INTRINSIC__HALF
– TensorFlow has a registered DT_HALF type: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/types.cc
– We add a custom MPI datatype of 2 contiguous bytes: https://github.com/PPPLDeepLearning/plasma-python/blob/master/plasma/primitives/ops.py
• This talk:
– Review techniques to enable FP16 and mixed precision training
– Introduce the FRNN framework for disruption forecasting in tokamak fusion plasmas; discuss strong scaling
– Evaluate FP16 training with the FRNN framework on a real scientific dataset from the JET experiment
– Evaluate FP16 training on the benchmark IMDB dataset
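A sketch of how such a 2-byte datatype and an FP16 reduction could look in mpi4py (the actual datatype definition lives in plasma/primitives/ops.py; MPI has no built-in FP16 sum, so a user-defined op is assumed here):

```python
from mpi4py import MPI
import numpy as np

# Two contiguous bytes, used to describe FP16 buffers to MPI
mpi_float16 = MPI.BYTE.Create_contiguous(2).Commit()

def _sum_f16(inbuf, inoutbuf, datatype):
    # Reinterpret the raw byte buffers as float16 and accumulate in place
    a = np.frombuffer(inbuf, dtype=np.float16)
    b = np.frombuffer(inoutbuf, dtype=np.float16)
    b += a

sum_f16 = MPI.Op.Create(_sum_f16, commute=True)

# Example: allreduce an FP16 gradient buffer across workers
comm = MPI.COMM_WORLD
grad = np.zeros(1024, dtype=np.float16)
total = np.empty_like(grad)
comm.Allreduce([grad, mpi_float16], [total, mpi_float16], op=sum_f16)
```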
29. RNN Predictions
[Figure: a true-positive example; plotted against time to disruption (TTD, ms), in arbitrary units, are the current increase, locked-mode spike, density drop, internal inductance spike, stored diamagnetic energy drop, and the resulting spike in the FRNN output signal.]