FlexFlow is a deep learning engine that automatically finds efficient parallelization strategies for arbitrary DNN models and hardware configurations. It defines a comprehensive search space called SOAP, covering parallelization in the sample, operator, attribute, and parameter dimensions, and searches this space with a Markov Chain Monte Carlo algorithm. Candidate strategies are evaluated with an execution simulator, using either full simulation or the more efficient delta simulation, to identify high-performing strategies. On real-world DNN models and hardware, FlexFlow outperformed expert-designed strategies and other automated frameworks, achieving up to 3.3x higher training throughput.
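At its core, the search is a Metropolis-style random walk over parallelization strategies guided by simulated execution cost. The toy sketch below illustrates only that idea; `propose` and `simulate_cost` are hypothetical stand-ins for FlexFlow's strategy proposals and execution simulator, not its actual implementation.

```python
import math
import random

def mcmc_search(initial_strategy, propose, simulate_cost, steps=1000, beta=0.5):
    """Toy Metropolis search: always accept a proposal that lowers the
    simulated cost; accept a worse one with probability exp(-beta * increase)."""
    current, current_cost = initial_strategy, simulate_cost(initial_strategy)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = propose(current)        # e.g. re-parallelize one operator
        cost = simulate_cost(candidate)     # stand-in for the execution simulator
        if cost <= current_cost or random.random() < math.exp(beta * (current_cost - cost)):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost

# Toy usage: integer "strategies" with a quadratic cost, minimized at 7.
best, cost = mcmc_search(0, lambda s: s + random.choice([-1, 1]),
                         lambda s: (s - 7) ** 2, steps=500)
```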
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16 (MLconf)
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as they require training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forests, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time: the Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
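As a rough illustration of the stacking idea described above (not the H2O Ensemble API), the scikit-learn sketch below trains diverse base learners and combines their cross-validated predictions with a logistic-regression metalearner; the data and settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# Base learners are cross-validated and their predictions feed a metalearner,
# which is the "metalearning" step of the Super Learner / stacking algorithm.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("gbm", GradientBoostingClassifier())],
    final_estimator=LogisticRegression(),
    cv=5,  # the cross-validation behind the computational cost noted above
)
stack.fit(X, y)
print(stack.score(X, y))
```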
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016 (MLconf)
Alex Smola is the Manager of the Cloud Machine Learning Platform at Amazon. Prior to his role at Amazon, Smola was a Professor in the Machine Learning Department of Carnegie Mellon University and cofounder and CEO of Marianas Labs. Prior to that he worked at Google Strategic Technologies, Yahoo Research, and National ICT Australia. Prior to joining CMU, he was a professor at UC Berkeley and the Australian National University. Alex obtained his PhD at TU Berlin in 1998. He has published over 200 papers and written or coauthored 5 books.
Abstract summary
Personalization and Scalable Deep Learning with MXNET: User return times and movie preferences are inherently time dependent. In this talk I will show how such time dependence can be modeled efficiently using deep learning by employing an LSTM (Long Short-Term Memory) network. Moreover, I will show how to train large-scale distributed parallel models efficiently using MXNet. This includes a brief overview of the key components of defining networks and optimization, and a walkthrough of the steps required to allocate machines and train a model.
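For readers new to the setup, the sketch below shows the shape of such a sequence model. It uses Keras purely for brevity (the talk itself uses MXNet), and the dimensions and target are illustrative.

```python
import numpy as np
from tensorflow import keras

# Toy data: 32 users, 10 time steps of history, 8 features per event.
X = np.random.rand(32, 10, 8).astype("float32")
y = np.random.rand(32, 1).astype("float32")   # e.g. time until the next return

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(10, 8)),  # summarizes the event history
    keras.layers.Dense(1),                       # predicts the time-dependent target
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)
```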
Large Scale Deep Learning with TensorFlow (Jen Aman)
Large-scale deep learning with TensorFlow makes it possible to store and compute over large datasets in order to build computer systems that can understand data. Deep learning models like neural networks are loosely based on what is known about the brain and become more powerful with more data, larger models, and more computation. At Google, deep learning is being applied across many products and areas, from speech recognition to image understanding to machine translation. TensorFlow provides an open-source software library for machine learning that has been widely adopted both internally at Google and externally.
Applying your Convolutional Neural Networks (Databricks)
Part 3 of the Deep Learning Fundamentals Series, this session starts with a quick primer on activation functions, learning rates, optimizers, and backpropagation. Then it dives deeper into convolutional neural networks, discussing convolutions (including kernels, local connectivity, strides, padding, and activation functions), pooling (or subsampling to reduce the image size), and fully connected layers. The session also provides a high-level overview of some CNN architectures. The demos included in these slides are running on Keras with a TensorFlow backend on Databricks.
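The building blocks listed above map directly onto a few lines of Keras. The sketch below is a generic example in the spirit of the session's demos, not the demo code itself.

```python
from tensorflow import keras

# Convolutions (kernel size, stride, padding, activation), pooling to shrink
# the image, then fully connected layers: the pieces discussed in the session.
model = keras.Sequential([
    keras.layers.Conv2D(32, kernel_size=3, strides=1, padding="same",
                        activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(pool_size=2),        # subsampling
    keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),    # fully connected layer
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```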
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon University (MLconf)
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
Deep Recurrent Neural Networks for Sequence Learning in Spark by Yves Mabiala (Spark Summit)
Deep recurrent neural networks are well-suited for sequence learning tasks like text classification and generation. The author discusses implementing recurrent neural networks in Spark for distributed deep learning on big data. Two use cases are described: predictive maintenance using sensor data to detect failures, and sentiment analysis of tweets using RNNs which achieve better accuracy than traditional classifiers.
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
Artificial neural networks (ANN) are one of the popular models of machine learning, in particular for deep learning. The models that are used in practice for image classification and speech recognition contain a huge number of weights and are trained with big datasets. Training such models is challenging in terms of computation and data processing. We propose a scalable implementation of deep neural networks for Spark. We address the computational challenge with batch operations, using BLAS for vector and matrix computations and reusing memory to reduce garbage collector activity. Spark provides data parallelism that enables scaling of training. As a result, our implementation is on par with widely used C++ implementations like Caffe on a single machine and scales nicely on a cluster. The developed API makes it easy to configure your own network and to run experiments with different hyperparameters. Our implementation is easily extensible and we invite other developers to contribute new types of neural network functions and layers. Also, the optimizations that we applied and our experience with GPU CUDA BLAS might be useful for other machine learning algorithms being developed for Spark.
The slides were presented at the Spark SF Friends meetup on December 2, 2015, organized by Alex Khrabrov @Nitro. The content is based on my talk at Spark Summit Europe. However, there are a few major updates: more details on the parallelism heuristic, experiments with a larger cluster, and a new slide design.
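This line of work is reflected in Spark ML's multilayer perceptron. As a minimal sketch of the user-facing side (the data path and layer sizes below are hypothetical):

```python
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ann-on-spark").getOrCreate()
df = spark.read.format("libsvm").load("data/mnist.scale")  # hypothetical path

# layers = input size, hidden sizes, output classes. Training is data-parallel
# across the cluster, with BLAS-backed batch operations on each worker.
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10],
                                     blockSize=128, maxIter=100)
model = mlp.fit(df)
```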
DLD meetup 2017, Efficient Deep Learning (Brodmann17)
The document discusses efficient techniques for deep learning on edge devices. It begins by noting that deep neural networks have high computational complexity which makes inference inefficient for edge devices without powerful GPUs. It then outlines the deep learning stack from hardware to libraries to frameworks to algorithms. The document focuses on how algorithms define model complexity and discusses the evolution of CNN architectures from LeNet5 to ResNet which generally increased in complexity. It covers techniques for reducing model size and operations like pruning, quantization, and knowledge distillation. The challenges of real-life applications on edge devices are discussed.
MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures (MLAI2)
MetaPerturb is a meta-learned perturbation function that can enhance generalization of neural networks on different tasks and architectures. It proposes a novel meta-learning framework involving jointly training a main model and perturbation module on multiple source tasks to learn a transferable perturbation function. This meta-learned perturbation function can then be transferred to improve performance of a target model on an unseen target task or architecture, outperforming baselines on various datasets and architectures.
Massively Parallel K-Nearest Neighbor Computation on Distributed Architectures (Intel® Software)
This session discusses the implementation and performance of the K-nearest neighbor (KNN) computation on a distributed architecture using the Intel® Xeon Phi™ processor.
This document discusses Bayesian global optimization and its application to tuning machine learning models. It begins by outlining some of the challenges of tuning ML models, such as the non-intuitive nature of the task. It then introduces Bayesian global optimization as an approach to efficiently search the hyperparameter space to find optimal configurations. The key aspects of Bayesian global optimization are described, including using Gaussian processes to build models of the objective function from sampled points and finding the next best point to sample via expected improvement. Several examples are provided demonstrating how Bayesian global optimization outperforms standard tuning methods in optimizing real-world ML tasks.
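The core loop described above can be sketched compactly with a scikit-learn Gaussian process and the expected-improvement acquisition. The 1-D objective below is a stand-in for a real model-tuning run.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(cand, gp, y_best):
    """How much improvement over the best observed value we expect (minimizing)."""
    mu, sigma = gp.predict(cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):                       # stand-in for training + validating a model
    return np.sin(3 * x) + 0.1 * x ** 2

X = np.array([[-2.0], [0.0], [2.0]])    # initial samples
y = objective(X).ravel()
for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cand = np.linspace(-3, 3, 200).reshape(-1, 1)
    x_next = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
    X, y = np.vstack([X, [x_next]]), np.append(y, objective(x_next))
print(X[np.argmin(y)], y.min())
```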
Pelee: a real time object detection system on mobile devices - Paper Review (LEE HOSEONG)
This document summarizes the Pelee object detection system which uses the PeleeNet efficient feature extraction network for real-time object detection on mobile devices. PeleeNet improves on DenseNet with two-way dense layers, a stem block, dynamic bottleneck layers, and transition layers without compression. Pelee uses SSD with PeleeNet, selecting fewer feature maps and adding residual prediction blocks for faster, more accurate detection compared to SSD and YOLO. The document concludes that PeleeNet and Pelee achieve real-time classification and detection on devices, outperforming existing models in speed, cost and accuracy with simple code.
Narayanan Sundaram, Research Scientist, Intel Labs at MLconf SF - 11/13/15 (MLconf)
GraphMat: Bridging the Productivity-Performance Gap in Graph Analytics: With increasing interest in large-scale distributed graph analytics for machine learning and data mining, more data scientists and developers are struggling to achieve high performance without sacrificing productivity on large graph problems. In this talk, I will discuss our solution to this problem: GraphMat. Using generalized sparse matrix-based primitives, we are able to achieve performance that is very close to hand-optimized native code, while allowing users to write programs using the familiar vertex-centric programming paradigm. I will show how we optimized GraphMat to achieve this performance on distributed platforms and provide programming examples. We have integrated GraphMat with Apache Spark in a manner that allows the combination to outperform all other distributed graph frameworks. I will explain the reasons for this performance and show that our approach achieves very high hardware efficiency in both single-node and distributed environments using primitives that are applicable to many machine learning and HPC problems. GraphMat is open source software and available for download.
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016 (MLconf)
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16 (MLconf)
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, TensorFlow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016 (MLconf)
Applying Deep Learning at Facebook Scale: Facebook leverages Deep Learning for various applications including event prediction, machine translation, natural language understanding and computer vision at a very large scale. More than a billion users log on to Facebook every day, generating thousands of posts per second and uploading more than a billion images and videos daily. This talk will explain how Facebook scaled Deep Learning inference for real-time applications with latency budgets in the milliseconds.
This document provides an overview of next generation analytics with YARN, Spark and GraphLab. It discusses how YARN addressed limitations of Hadoop 1.0 like scalability, locality awareness and shared cluster utilization. It also describes the Berkeley Data Analytics Stack (BDAS) which includes Spark, and how companies like Ooyala and Conviva use it for tasks like iterative machine learning. GraphLab is presented as ideal for processing natural graphs and the PowerGraph framework partitions such graphs for better parallelism. PMML is introduced as a standard for defining predictive models, and how a Naive Bayes model can be defined and scored using PMML with Spark and Storm.
Implementation of linear regression and logistic regression on Spark (Dalei Li)
This presentation was developed for a course project at the Technical University of Madrid. The course, Massively Parallel Machine Learning, was supervised by Alberto Mozo and Bruno Ordozgoiti.
Image Classification Done Simply using Keras and TensorFlow (Rajiv Shah)
This presentation walks through the process of building an image classifier using Keras with a TensorFlow backend. It will give a basic understanding of image classification and show the techniques used in industry to build image classifiers. The presentation will start with building a simple convolutional network, augmenting the data, using a pretrained network, and finally using transfer learning by modifying the last few layers of a pretrained network. The classification will be based on the classic example of classifying cats and dogs. The code for the presentation can be found at https://github.com/rajshah4/image_keras, and the presentation will discuss how to extend the code to your own pictures to make a custom image classifier.
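In the spirit of the final transfer-learning step (and not the repository's exact code), a minimal Keras sketch with a frozen pretrained base might look like this:

```python
from tensorflow import keras

# A pretrained VGG16 base is frozen; only the new classification head trains.
base = keras.applications.VGG16(weights="imagenet", include_top=False,
                                input_shape=(150, 150, 3))
base.trainable = False

model = keras.Sequential([
    base,
    keras.layers.Flatten(),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # cat vs. dog
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```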
https://telecombcn-dl.github.io/2017-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
Slides to support Austin Machine Learning Meetup, 1/19/2015.
Overview of techniques from recent Kaggle code for online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and the hash trick.
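For concreteness, here is a compact sketch of per-coordinate FTRL-proximal with the hash trick, in the style of those Kaggle scripts; the hyperparameter values are illustrative.

```python
import math

D = 2 ** 20                        # hashed feature space (the "hash trick")
alpha, beta, L1, L2 = 0.1, 1.0, 1.0, 1.0
z, n = [0.0] * D, [0.0] * D        # FTRL-proximal accumulators

def hashed(features):              # features like ["site=abc", "hour=14"]
    return [hash(f) % D for f in features]

def predict(idx):
    wTx, w = 0.0, {}
    for i in idx:
        if abs(z[i]) > L1:         # L1 keeps most weights at exactly zero
            w[i] = -(z[i] - math.copysign(L1, z[i])) / (
                (beta + math.sqrt(n[i])) / alpha + L2)
            wTx += w[i]
    return 1.0 / (1.0 + math.exp(-max(min(wTx, 35.0), -35.0))), w

def update(idx, w, p, y):
    g = p - y                      # logistic-loss gradient for binary features
    for i in idx:
        sigma = (math.sqrt(n[i] + g * g) - math.sqrt(n[i])) / alpha
        z[i] += g - sigma * w.get(i, 0.0)
        n[i] += g * g

# One online pass over a stream of (features, label) examples:
for features, label in [(["site=abc", "hour=14"], 1), (["site=xyz", "hour=2"], 0)]:
    idx = hashed(features)
    p, w = predict(idx)
    update(idx, w, p, label)
```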
Snorkel: Dark Data and Machine Learning with Christopher Ré (Jen Aman)
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on GitHub and available from Snorkel.Stanford.edu.
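To make the labeling-function idea concrete, here is a conceptual sketch in plain Python. It deliberately avoids Snorkel's actual API, which learns to weight and denoise the functions rather than taking a simple vote.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    # Weak heuristic: messages with URLs are often spam.
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_message(text):
    # Another weak heuristic: very short messages tend to be benign.
    return HAM if len(text.split()) < 5 else ABSTAIN

def combine(text, lfs):
    # Snorkel fits a generative label model instead of this majority vote.
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(combine("claim your free prize now at https://spam.example today",
              [lf_contains_link, lf_short_message]))
```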
Distributed implementation of an LSTM on Spark and TensorFlow (Emanuel Di Nardo)
Academic project based on developing an LSTM, distributing it on Spark and using TensorFlow for numerical operations.
Source code: https://github.com/EmanuelOverflow/LSTM-TensorSpark
by Vikram Madan, Sr. Product Manager, AWS Deep Learning
In this workshop, we will cover deep learning fundamentals and focus on the powerful and scalable Apache MXNet open source deep learning framework. At the end of this tutorial you’ll be able to train your own deep neural network and fine-tune existing state-of-the-art models for image and object recognition. We’ll also deep dive on setting up your deep learning infrastructure on AWS and model deployment on AWS Lambda.
AI optimizing HPC simulations - presentation from the 6th EULAG Workshop (byteLAKE)
See our presentation from the 6th International EULAG Users Workshop. We talked about taking HPC to "Industry 4.0" by implementing smart techniques to optimize codes for performance and energy consumption. The presentation explains how Machine Learning can dynamically optimize HPC simulations and presents byteLAKE's software autotuning solution.
Find out more about byteLAKE at: www.byteLAKE.com
Slides from Strata+Hadoop Singapore 2016 presenting how Deep Learning can be scaled both vertically and horizontally, when to use CPUs and when to use GPUs.
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Nowadays, HPC systems such as those in the Top500 are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them.
As a PhD student, I am now on a brief research visit in the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
This document provides an overview of deep learning on GPUs. It discusses how GPUs are well-suited for deep learning and other computationally intensive tasks due to their massively parallel architecture. The document then describes what deep learning is, including different types of neural networks commonly used. It also discusses how deep learning can enhance analytics and big data by automating feature extraction. Examples of running deep learning on Spark clusters using frameworks like TensorFlow on Spark are presented.
The document discusses recognizing handwritten digits using a convolutional neural network model with PyTorch on GPUs. It summarizes the dataset used, which contains images of handwritten digits. The methodology describes building and training a CNN model on GPUs using data parallelism across multiple GPUs. Testing was done varying batch sizes and number of GPUs. Results found that using more GPUs did not always improve performance and larger batch sizes did not necessarily yield better accuracy. Overall, optimal GPU utilization and batch size are important for good model performance when using multiple GPUs.
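A minimal sketch of the setup described above, using PyTorch's nn.DataParallel to split each batch across visible GPUs; the model and sizes are illustrative.

```python
import torch
import torch.nn as nn

# A small CNN for 28x28 digit images, replicated across all visible GPUs.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 14 * 14, 10),
)
device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # each batch is split among the GPUs
model = model.to(device)

# The effective per-GPU batch shrinks as GPUs are added, which is one reason
# more GPUs and larger batches do not automatically improve results.
images = torch.randn(64, 1, 28, 28).to(device)
logits = model(images)
```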
This document provides an overview of computer vision techniques including classification and object detection. It discusses popular deep learning models such as AlexNet, VGGNet, and ResNet that advanced the state-of-the-art in image classification. It also covers applications of computer vision in areas like healthcare, self-driving cars, and education. Additionally, the document reviews concepts like the classification pipeline in PyTorch, data augmentation, and performance metrics for classification and object detection like precision, recall, and mAP.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineering, Netflix (MLconf)
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
This document discusses deep learning initiatives at NECSTLab focused on hardware acceleration of convolutional neural networks using FPGAs. It proposes a framework called CNNECST that provides high-level APIs to design CNNs, integrates with machine learning frameworks for training, and generates customized hardware for FPGA implementation through C++ libraries and Vivado. Experimental results show speedups and energy savings for CNNs like LeNet and MNIST on FPGA boards compared to CPU. Challenges and future work include supporting more layer types and reduced precision computations.
(Im2col) Accelerating deep neural networks on low power heterogeneous architectures (Bomm Kim)
This document discusses accelerating deep neural networks on low power heterogeneous architectures. Specifically, it focuses on accelerating the inference time of the VGG-16 neural network on the ODROID-XU4 board, which contains an ARM CPU and Mali GPU. The authors develop parallel versions of VGG-16 using OpenMP for the CPU and OpenCL for the GPU. Several optimizations are explored in OpenCL, including work groups, vector data types, and the CLBlast library. The best OpenCL implementation achieves a 9.4x speedup over the original serial version.
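The im2col transformation in the title is what turns a convolution into one large matrix multiplication, which BLAS libraries and GPU kernels handle efficiently. A minimal NumPy sketch, assuming a single-channel image and 'valid' padding:

```python
import numpy as np

def im2col(x, k):
    """Unroll every k-by-k patch of a 2-D image into a column, so that a
    convolution becomes a single matrix multiplication."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

x = np.arange(25.0).reshape(5, 5)
kernel = np.ones((3, 3))
y = kernel.ravel() @ im2col(x, 3)   # equals a valid 3x3 convolution, flattened
print(y.reshape(3, 3))
```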
Once-for-All: Train One Network and Specialize it for Efficient Deployment (taeseon ryu)
Hello, this is the deep learning paper reading group! The paper we are introducing today is titled Once-for-All: Train One Network and Specialize it for Efficient Deployment.
The paper looks at the situation where a trained model must actually be deployed to hardware. The biggest problem it identifies is that there are far too many hardware environments a model might be deployed to: every device has different resources, so finding a model that fits every piece of hardware is practically impossible.
The usual question is what to do when the optimal network architecture differs for every hardware target. One possible approach is to search for the optimal architecture for each target separately, but this demands so much computation that it is infeasible. Taking the Samsung Note 10 as an example, if an application requires the model to run within 20 ms, then to find which models meet that 20 ms budget and what accuracy they achieve, you would have to evaluate every candidate point, and each point corresponds to one full training run. In practice you would have to run many trainings and then search among them for the optimum. Since this cost grows linearly with the number of deployment scenarios,
finding the optimal network for each hardware target is effectively impossible.
The approach OFA proposes is that once a single network has been trained, there is no need to retrain it for each hardware target: you simply take the sub-network that suits each environment. This is the paper's main approach.
Thanks to Donghyun Kim of the Fundamentals team for today's detailed review!
AI on Greenplum Using Apache MADlib and MADlib Flow - Greenplum Summit 2019 (VMware Tanzu)
This document discusses machine learning and deep learning capabilities in Greenplum using Apache MADlib. It begins with an overview of MADlib, describing it as an open source machine learning library for PostgreSQL and Greenplum Database. It then discusses specific machine learning algorithms and techniques supported, such as linear regression, neural networks, graph algorithms, and more. It also covers scaling of algorithms like SVM and PageRank with increasing data and graph sizes. Later sections discuss deep learning integration with Greenplum, challenges of model management and operationalization, and introduces MADlib Flow as a tool to address those challenges through an end-to-end data science workflow in SQL.
SystemML is an Apache project that provides a declarative machine learning language for data scientists. It aims to simplify the development of custom machine learning algorithms and enable scalable execution on everything from single nodes to clusters. SystemML provides pre-implemented machine learning algorithms, APIs for various languages, and a cost-based optimizer to compile execution plans tailored to workload and hardware characteristics in order to maximize performance.
Performance Optimization of CGYRO for Multiscale Turbulence Simulations (Igor Sfiligoi)
Overview of the recent performance optimization of CGYRO, an Eulerian gyrokinetic fusion plasma solver, with emphasis on multiscale turbulence simulations.
Presented at the joint US-Japan Workshop on Exascale Computing Collaboration and the 6th workshop of the US-Japan Joint Institute for Fusion Theory (JIFT) program (Jan 18th 2022).
Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases (inside-BigData.com)
In this deck from FOSDEM 2020, Frank McQuillan from Pivotal presents: Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases.
"In this session we will present an efficient way to train many deep learning model configurations at the same time with Greenplum, a free and open source massively parallel database based on PostgreSQL. The implementation involves distributing data to the workers that have GPUs available and hopping model state between those workers, without sacrificing reproducibility or accuracy. Then we apply optimization algorithms to generate and prune the set of model configurations to try.
Deep neural networks are revolutionizing many machine learning applications, but hundreds of trials may be needed to generate a good model architecture and associated hyperparameters. This is the challenge of model selection. It is time consuming and expensive, especially if you are only training one model at a time.
Massively parallel processing databases can have hundreds of workers, so can you use this parallel compute architecture to address the challenge of model selection for deep nets, in order to make it faster and cheaper?
It’s possible!
We will demonstrate results from this project using a version of Hyperband, which is a well known hyperparameter optimization algorithm, and the deep learning frameworks Keras and TensorFlow, all running on Greenplum database using Apache MADlib. Other topics will include architecture, scalability results and bright opportunities for the future."
Watch the video: https://wp.me/p3RLHQ-lsQ
Learn more: https://fosdem.org/2020/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Data Parallel and Object Oriented Model (Nikhil Sharma)
All the content is taken from the Advanced Computer Architecture book (sections 10.1.3 and 10.1.4).
This PPT covers the basics of Data-Parallel Model and Object-Oriented Model.
The document discusses IBM's PowerAI software for large model support and distributed deep learning. It describes how PowerAI uses large model support (LMS) to enable processing of high-definition images, large models, and higher batch sizes that don't fit in GPU memory. It provides examples of using LMS with Caffe and TensorFlow. It also describes IBM's distributed deep learning library (DDL) for scaling deep learning training across multiple servers and GPUs, and how tools like ddlrun automatically handle tasks like topology detection and mpirun options.
Prediction as a service with ensemble model in SparkML and Python ScikitLearn (Josef A. Habdank)
Watch the recording of the talk given at Spark Summit Brussels 2016 here:
https://www.youtube.com/watch?v=wyfTjd9z1sY
Data Science with SparkML on Databricks is a perfect platform for applying ensemble learning on a massive scale. This presentation describes a Prediction-as-a-Service platform which can predict trends on 1 billion observed prices daily. In order to train an ensemble model on a multivariate time series in a thousands- or millions-dimensional space, one has to fragment the whole space into subspaces which exhibit significant similarity. To achieve this, the vastly sparse space undergoes dimensionality reduction into a parameter space, which is then used to cluster the observations. The data in the resulting clusters is modeled in parallel using machine learning tools capable of coefficient estimation at massive scale (SparkML and Scikit-Learn). The estimated model coefficients are stored in a database to be used when executing predictions on demand via a web service. This approach enables training models fast enough to complete the task within a couple of hours, allowing daily or even real-time updates of the coefficients. The above machine learning framework is used to predict airfares as a support tool for airline Revenue Management systems.
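The pipeline described above (reduce the sparse space, cluster similar series, fit one model per cluster) can be sketched with scikit-learn; the data and model choices below are illustrative stand-ins for the SparkML/Scikit-Learn production setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((10000, 500))                 # high-dimensional observations
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=10000)

# 1) dimensionality reduction into a parameter space, 2) clustering of similar
# observations, 3) one model per cluster; the coefficients would then be stored
# in a database and looked up when serving predictions on demand.
params = PCA(n_components=10).fit_transform(X)
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(params)
models = {c: LinearRegression().fit(X[clusters == c], y[clusters == c])
          for c in np.unique(clusters)}
```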
2. Outline
• Background
• Existing parallelization strategies
• Automatically generated strategies
• Overview
• Deep Learning Engine “FlexFlow”
• How to find the best strategy
• Evaluation
• Comparison with existing parallelization strategies
• Challenge
3. Training large-scale DNN models is computationally expensive.
Large-scale and complex Deep Neural Network (DNN) models
Background
Reduce training time by parallelizing across devices.
[Figure: Inception v3 model ("models/research/inception at master · tensorflow/models", GitHub, https://github.com/tensorflow/models/tree/master/research/inception, accessed 2019-06-03)]
4. Existing Parallelization Approaches
Data Parallelism: splitting the training data across workers
Model Parallelism: splitting the model across workers
Dean et al. (2012). Large Scale Distributed Deep Networks. In Neural Information Processing Systems Conference.
5. Data Parallelism
• Each device holds a replica of the entire DNN.
• Each device processes a subset of the training data.
• Each device synchronizes the network parameters at the end of each iteration (synchronous).
Dean et al. (2012). Large Scale Distributed Deep Networks. In Neural Information Processing Systems Conference.
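As a rough illustration of this scheme (a toy sketch, not code from the paper; the linear model, the four-way split, and the learning rate are invented for the example):

```python
import numpy as np

# Toy sketch of synchronous data parallelism: every "device" holds the
# same weights, computes a gradient on its own shard of the batch, and
# the gradients are averaged (an all-reduce) at the end of the iteration.
rng = np.random.default_rng(0)
weights = rng.normal(size=(8,))          # replicated on every device
batch = rng.normal(size=(1024, 8))       # one training batch
targets = rng.normal(size=(1024,))

shards = np.array_split(batch, 4)        # 4 devices, 4 data shards
target_shards = np.array_split(targets, 4)

def local_gradient(w, x, y):
    # Gradient of mean squared error for a linear model (illustrative).
    residual = x @ w - y
    return 2.0 * x.T @ residual / len(x)

grads = [local_gradient(weights, x, y) for x, y in zip(shards, target_shards)]
synced_grad = sum(grads) / len(grads)    # the synchronization step
weights -= 0.01 * synced_grad            # identical update on every replica
```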
6. Model Parallelism
• Each device is assigned a disjoint subset of the DNN.
• Eliminates parameter synchronization, but requires data transfers between operators.
Dean et al. (2012). Large Scale Distributed Deep Networks. In Neural Information Processing Systems Conference.
7. ImageNet Competition
(Figure: record ImageNet training times from 2016 through 2019, ending with Yamazaki et al. 2019.)
Yamazaki et al. (2019). Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 Seconds.
8. Present
The optimal parallelization strategy varies with several factors:
• Hardware architecture
• DNN model architecture
• Training data
As a result, specialized parallelization strategies currently have to be designed manually.
9. Automatically Generated Strategies
• ColocRL (Mirhoseini et al., 2017) uses reinforcement learning to learn efficient operator placements for model parallelism.
  • It executes each candidate strategy on the real hardware to obtain reward signals, and takes 12-27 hours to find the best placement.
• OptCNN (Jia et al., 2018) uses dynamic programming to parallelize linear DNNs.
  • It cannot be applied to Recurrent Neural Networks (RNNs).
10. Overview
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
• The deep learning engine “FlexFlow” automatically finds parallelization strategies for arbitrary DNNs and hardware.
• FlexFlow increases training throughput by up to 3.3× over state-of-the-art approaches.
11. Overview of “FlexFlow”
1. Input information
  • Operator Graph
  • Device Topology
2. Search for the optimal parallelization strategy
  • The SOAP search space
  • Strategy generation & simulation
3. Execute the best strategy found
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
12. Overview of “FlexFlow”
1. Input information
  • Operator Graph
  • Device Topology
2. Search for the optimal parallelization strategy
  • The SOAP search space
  • Strategy generation & simulation
3. Execute the best strategy found
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
13. Operator Graph & Device Topology
Operator graph:
• Node = an operator in the DNN (convolution, matrix multiplication, etc.)
• Edge = a tensor (the output of one operator and the input of the next)
Device topology:
• Node = a device (GPU, CPU, etc.)
• Edge = a hardware connection (NVLink, PCI-e, network link, etc.)
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
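A minimal sketch of how these two inputs could be represented; the Python structures, names, and bandwidth figures are illustrative assumptions, not FlexFlow's actual API:

```python
from dataclasses import dataclass, field

# Operator graph: nodes are DNN operators, edges are tensors flowing
# from one operator's output into another operator's input.
@dataclass
class Operator:
    name: str                                        # e.g. "conv1"
    op_type: str                                     # e.g. "Conv2D", "MatMul"
    inputs: list[str] = field(default_factory=list)  # upstream operators

operator_graph = {
    "conv1": Operator("conv1", "Conv2D"),
    "conv2": Operator("conv2", "Conv2D", inputs=["conv1"]),
    "fc1":   Operator("fc1", "MatMul", inputs=["conv2"]),
}

# Device topology: nodes are devices, edges are hardware connections.
# The simulator later estimates a transfer over an edge as
# tensor size / link bandwidth.
devices = {"gpu0": "GPU", "gpu1": "GPU", "cpu0": "CPU"}
links = {
    ("gpu0", "gpu1"): {"type": "NVLink", "bandwidth_GBps": 80.0},
    ("cpu0", "gpu0"): {"type": "PCI-e",  "bandwidth_GBps": 16.0},
}
```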
14. Overview of “FlexFlow”
1. Input information
  • Operator Graph
  • Device Topology
2. Search for the optimal parallelization strategy
  • The SOAP search space
  • Strategy generation & simulation
3. Execute the best strategy found
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
15. The SOAP Search Space
FlexFlow introduces a comprehensive search space called SOAP:
• Sample
• Operator
• Attribute
• Parameter
17. Operator Dimension in SOAP
• Sample … partitioning training samples (data parallelism)
• Operator … partitioning the operators of the DNN
• Attribute
• Parameter
(Figure: three operators, Convolution#1-#3, placed on GPU1-GPU3; axes: sample × parameter.)
18. Attribute Dimension in SOAP
• Sample … partitioning training samples (data parallelism)
• Operator … partitioning the operators of the DNN
• Attribute … partitioning the attributes of a sample
• Parameter
(Figure: a 1D convolution parallelized across GPU1-GPU4 along the attribute dimension; axes: sample × parameter.)
19. Parameter Dimension in SOAP
• Sample … partitioning training samples (data parallelism)
• Operator … partitioning the operators of the DNN
• Attribute … partitioning the attributes of a sample
• Parameter … partitioning the parameters of an operator
(Figure: a 1D convolution parallelized across GPU1-GPU4 along the parameter dimension; axes: sample × parameter.)
20. Parallelizable Dimensions in the SOAP Space
(Figure: the parallelizable dimensions for each operator type.)
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
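To make the search space concrete, here is a hypothetical sketch (not FlexFlow's internal representation) of a single parallelization configuration: a degree of parallelism along each SOAP dimension that applies to the operator, which splits the operator into the product of those degrees of equal-sized tasks:

```python
from itertools import product

# Hypothetical configuration for one operator: degrees of parallelism
# along its applicable SOAP dimensions.
config = {"sample": 2, "attribute": 1, "parameter": 2}

# The operator is split into 2 * 1 * 2 = 4 equal-sized tasks, each of
# which is then assigned to a device.
degrees = [range(d) for d in config.values()]
tasks = [dict(zip(config.keys(), idx)) for idx in product(*degrees)]
print(len(tasks))  # 4
print(tasks[0])    # {'sample': 0, 'attribute': 0, 'parameter': 0}
```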
21. Overview of “FlexFlow”
1. Input information
  • Operator Graph
  • Device Topology
2. Search for the optimal parallelization strategy
  • The SOAP search space
  • Strategy generation & simulation
3. Execute the best strategy found
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
22. How to Search for the Optimal Strategy
A loop: generate strategy → simulate execution → improve strategy.
• Generate strategy: decide the parallelization of each operator.
• Simulate execution: full simulation & delta simulation.
• Improve strategy: Markov Chain Monte Carlo (MCMC) search algorithm.
23. Generate Strategy
Define the parallelizable dimensions for each operator.
One strategy = a combination of parallelization configurations, one for each operator (see the sketch below).
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
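A sketch of strategy generation under the same hypothetical representation as above; the operators and their sets of valid configurations are invented for illustration (in FlexFlow they are derived from each operator's parallelizable dimensions and the device topology):

```python
import random

# Hypothetical valid configurations per operator.
valid_configs = {
    "conv1": [{"sample": 1}, {"sample": 2}, {"sample": 2, "parameter": 2}],
    "fc1":   [{"sample": 1}, {"parameter": 2}, {"parameter": 4}],
}

def random_strategy(valid_configs):
    # One strategy = one configuration choice for every operator.
    return {op: random.choice(cfgs) for op, cfgs in valid_configs.items()}

strategy = random_strategy(valid_configs)
```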
24. Simulate Execution
• Challenge
  • Measuring distributed executions on real hardware is slow.
• Observations
  • The performance of DNN operators is highly predictable, because most DNN operators use dense linear algebra.
  • DNN models use only a small number of distinct operators.
• Execution simulator
  • Measures each distinct operator once.
  • Uses the measurements to estimate the performance of different parallelization strategies (see the sketch below).
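A minimal sketch of this measure-once idea; run_operator is a hypothetical stand-in for executing a kernel on a real device:

```python
import time
from functools import lru_cache

def run_operator(op_type: str, input_shape: tuple) -> None:
    """Hypothetical stand-in for running a DNN kernel on a device."""
    time.sleep(0.001)  # placeholder workload

@lru_cache(maxsize=None)  # each distinct (operator, shape) is measured once
def measured_time(op_type: str, input_shape: tuple) -> float:
    start = time.perf_counter()
    run_operator(op_type, input_shape)
    return time.perf_counter() - start

# Estimating a strategy that reuses these operators now costs only
# cached lookups instead of repeated executions on real hardware.
t1 = measured_time("Conv2D", (64, 224, 224, 3))  # actually measured
t2 = measured_time("Conv2D", (64, 224, 224, 3))  # cache hit, no execution
```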
26. Improve Strategy: Full & Delta Simulation
• Full simulation (the initial simulation)
  • Predicts the execution time of the initial strategy.
• Delta simulation
  • Does not rebuild and re-simulate the task graph from scratch.
• The Markov Chain Monte Carlo search proposes a new strategy by updating the previous strategy.
• New candidates are proposed until one of the following two criteria is satisfied (see the sketch after this list):
  1. The search time budget for the current initial strategy is exhausted.
  2. The search cannot improve the best strategy for half of the search time.
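A minimal sketch of such a search loop with both stopping criteria; the Metropolis-style acceptance rule and the constant beta are assumptions made for illustration, not details taken from the slides:

```python
import math
import random
import time

def mcmc_search(initial, simulate, propose, budget_s, beta=0.05):
    # simulate(strategy) -> predicted execution time;
    # propose(strategy) -> a new candidate derived from the current one.
    current, current_cost = initial, simulate(initial)
    best, best_cost = current, current_cost
    start = last_improvement = time.monotonic()
    while True:
        now = time.monotonic()
        if now - start > budget_s:                 # criterion 1: budget spent
            break
        if now - last_improvement > budget_s / 2:  # criterion 2: no recent improvement
            break
        candidate = propose(current)
        cost = simulate(candidate)
        # Always accept an improvement; accept a regression with a
        # probability that shrinks as the regression grows.
        if cost <= current_cost or random.random() < math.exp(beta * (current_cost - cost)):
            current, current_cost = candidate, cost
        if cost < best_cost:
            best, best_cost = candidate, cost
            last_improvement = now
    return best, best_cost
```

Here simulate would run the full simulation for the very first strategy and the delta simulation for each subsequent proposal, and propose is the random-replacement step shown on the next slide.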
27. Delta Simulation
• An operator in the current parallelization strategy is selected at random, and its parallelization configuration is replaced by a random configuration.
(Figure: task graphs over operators O1-O6 for the previous simulation and the new simulation; only the part of the graph affected by the changed operator needs to be re-simulated.)
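The proposal step from this slide as a sketch, reusing the hypothetical strategy and valid_configs representation from the earlier examples:

```python
import copy
import random

def propose(strategy, valid_configs):
    # Pick one operator at random and replace its parallelization
    # configuration with a random valid one. Everything else is kept,
    # so only the affected part of the task graph must be re-simulated.
    new_strategy = copy.deepcopy(strategy)
    op = random.choice(list(new_strategy))
    new_strategy[op] = random.choice(valid_configs[op])
    return new_strategy
```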
28. Evaluation
The authors evaluate the performance of FlexFlow on six real-world DNN benchmarks with two device topologies.
Software dependencies of FlexFlow:
Software library | Version | Contributors
cuDNN | 7.3 | NVIDIA
cuBLAS | 9.0 | NVIDIA
Legion | 18.02.0 | LANL, NVIDIA, SLAC, Stanford*
(optional) GASNet | 1.28.0 | Lawrence Berkeley National Laboratory
* LANL = Los Alamos National Laboratory; SLAC = SLAC National Accelerator Laboratory
29. Device Topologies in the Experiments
 | The P100 Cluster | The K80 Cluster
Main memory | 56 GB | 256 GB
CPU | 2× Intel 10-core E5-2600 CPUs | 2× Intel 10-core E5-2680 CPUs
GPU | 4× NVIDIA Tesla P100 GPUs | 4× NVIDIA Tesla K80 GPUs
CPU-GPU | shared PCI-e switch | shared PCI-e switch
GPU-GPU | NVLink | separate PCI-e switches
Node-Node | 100 Gb/s EDR Infiniband | 56 Gb/s EDR Infiniband
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
30. DNNs in the Experiments
Two of the six DNN benchmarks are presented here.
DNN | Description | Dataset
Convolutional Neural Networks (CNN):
Inception-v3 | A 102-layer CNN with inception modules | ImageNet
Recurrent Neural Networks (RNN):
Neural Machine Translation (NMT) | 4 recurrent layers followed by an attention and a softmax layer | WMT English-German
31. Per-Iteration Training Performance
(Figure: number of samples/second/GPU vs. number of devices; higher is better.)
Expert-designed strategy for CNN: Krizhevsky (2014). Expert-designed strategy for RNN: Wu et al. (2016).
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
32. Comparison of Parallelization Performance
Parallelization performance for NMT on 64 K80 GPUs (16 nodes).
(Figure: lower is better.)
Expert-designed strategy: Wu et al. (2016).
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
33. Comparison with Different Automated Frameworks
(Figure: higher is better.)
Z. Jia, M. Zaharia, A. Aiken. (2019). Beyond Data and Model Parallelism for Deep Neural Networks. In SysML Conference.
34. Full simulation & Delta simulation 33
Search performance with the full and delta simulation algorithms for
the NMT model on 16 P100 GPUs ( 4 nodes )
Z. Jia, M. Zaharia, A. Aiken. ( 2019 ). Beyond Data and Model Parallelism for
Deep Neural Networks. In sysMLConference.
Lower is better
35. Simulation time & Real execution time 34
Z. Jia, M. Zaharia, A. Aiken. ( 2019 ). Beyond Data and Model Parallelism for
Deep Neural Networks. In sysMLConference.
36. Challenges
• FlexFlow does not consider memory constraints.
• MCMC may not be the best search algorithm.
• The simulator's assumptions might need to be relaxed or even eliminated:
  • data transfer time = tensor size / channel bandwidth (see the sketch below);
  • execution time is independent of the data.
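The first assumption written out as code, for concreteness (the tensor size and bandwidth below are illustrative):

```python
def transfer_time_s(tensor_bytes: float, bandwidth_bytes_per_s: float) -> float:
    # FlexFlow's simulator assumes a transfer takes exactly
    # tensor size / channel bandwidth: the theoretical optimum,
    # with no latency, contention, or protocol overhead.
    return tensor_bytes / bandwidth_bytes_per_s

# e.g. a 100 MB tensor over a 16 GB/s PCI-e link:
print(transfer_time_s(100e6, 16e9))  # 0.00625 s in theory; real links are slower
```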
37. Conclusion
• Deep learning engine “FlexFlow”
  • Automatically finds parallelization strategies for arbitrary DNNs and hardware.
  • Increases training throughput by up to 3.3× over state-of-the-art approaches.
• Challenges of FlexFlow
  • Memory constraints
  • Search algorithm
  • Simulator assumptions
Now that the task graph has been built, let us turn to how it is actually simulated.
Only the very first simulation is a full simulation; from the second simulation onward, a method called delta simulation is used.
The delta simulation is the key part of FlexFlow. It comes from the observation that, when considering a new strategy, the task graph does not need to be rebuilt and simulated from scratch.
FlexFlow proposes a new strategy by updating part of the current one, and repeats this to keep improving the strategy, using the Markov Chain Monte Carlo search algorithm.
Delta simulation and strategy improvement continue until one of the following two conditions is met.
The first is when the preset search time budget is exhausted; this budget is set by a human.
The second is when no parallelization strategy better than the current best is found after spending half of the search time.
(In the full simulation, the execution time is computed with Dijkstra's algorithm; in the delta simulation, it is computed with a Bellman-Ford-style algorithm.)
Predict the execution time when existing strategies (data parallelism, expert-designed strategies, etc.) are used as the initial strategies.
Finally, let me discuss the open challenges of FlexFlow.
FlexFlow does not consider memory constraints during simulation, so even if it discovers an optimal parallelization strategy, that strategy may not be executable if it exceeds the memory limits.
Also, although strategies are updated with the Markov Chain Monte Carlo method as explained, the authors themselves state that there is no guarantee that MCMC is the best algorithm.
In addition, FlexFlow makes four assumptions when simulating execution, and those assumptions can easily fail to hold: the data transfer time being the tensor size divided by the channel bandwidth is only a theoretical value, and it is hard to claim that execution time is independent of the data. The assumptions in FlexFlow may therefore need to be revisited.
Simulation gives very good insight into which strategies are worth spending time executing.
As you can see, it is quite complex.
Vertical axis = sample, horizontal axis = parameter.
Compared to data parallelism, this strategy reduces the parameter synchronization costs by 75% and the per-iteration execution time by 12%.
For the same Inception-v3 model parallelized on four K80 GPUs, we observe that the best discovered strategy tends to parallelize operators on adjacent GPUs with a direct connection, in order to reduce the communication costs.
Parallelizing NMT on four P100 GPUs:
First, for layers with a large number of parameters and little computation (e.g., embed layers), FlexFlow performs the computation on a single GPU to eliminate parameter synchronization. Second, for layers with a large number of parameters and heavy computation (e.g., softmax layers), FlexFlow uses parallelism in the parameter dimension and assigns the computation for a subset of the parameters to each task; this reduces parameter synchronization costs while maintaining load balance. Third, for multiple recurrent layers (e.g., LSTM and attention layers), FlexFlow uses concurrency among different layers as well as parallelism within each operator to reduce parameter synchronization costs while balancing load.