Effective collaboration among data scientists results in high-quality and efficient machine learning (ML) workloads. In a collaborative environment, such as Kaggle or Google Colaboratory, users typically re-execute or modify published scripts to recreate or improve a result. This introduces many redundant data processing and model training operations. Reusing the data generated by these redundant operations leads to more efficient execution of future workloads. However, existing collaborative environments lack a data management component for storing and reusing the results of previously executed operations.
In this paper, we present a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results. We utilize a so-called Experiment Graph (EG) that stores the artifacts, i.e., raw and intermediate data or ML models, as vertices and the operations of ML workloads as edges. In theory, the size of EG can become unnecessarily large, while the storage budget might be limited. At the same time, for some artifacts, the combined storage and retrieval cost might outweigh the recomputation cost. To address this issue, we propose two algorithms for materializing artifacts based on their likelihood of future reuse. Given the materialized artifacts inside EG, we devise a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads. Our reuse algorithm incurs only a negligible overhead and scales to the large number of incoming ML workloads in collaborative environments. Our experiments show that we improve the run time by one order of magnitude for repeated execution of workloads and by 50% for the execution of modified workloads in collaborative environments.
Link to the paper: https://dl.acm.org/doi/abs/10.1145/3318464.3389715
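The lookup-before-compute pattern at the heart of the abstract can be illustrated with a toy sketch, assuming hypothetical names (`ExperimentGraph`, `run_op`); the real EG additionally handles ML-specific artifacts and materialization budgets, which are omitted here:

```python
# Toy sketch of experiment-graph reuse: artifacts are vertices, operations
# are edges; before running an operation we check whether its result was
# already materialized by a previous workload. Names are hypothetical.

class ExperimentGraph:
    def __init__(self):
        self.materialized = {}   # (op_name, input_key) -> artifact
        self.hits = 0            # reused operations
        self.misses = 0          # recomputed operations

    def run_op(self, op_name, op_fn, input_key, input_data):
        key = (op_name, input_key)
        if key in self.materialized:         # reuse a previous result
            self.hits += 1
            return self.materialized[key]
        result = op_fn(input_data)           # recompute and materialize
        self.misses += 1
        self.materialized[key] = result
        return result

eg = ExperimentGraph()
normalize = lambda xs: [x / max(xs) for x in xs]

# A first workload computes; a re-executed workload reuses the artifact.
a = eg.run_op("normalize", normalize, "dataset-v1", [2, 4, 8])
b = eg.run_op("normalize", normalize, "dataset-v1", [2, 4, 8])
print(eg.hits, eg.misses)  # one reuse, one recomputation
```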
The document discusses network performance profiling of Hadoop jobs. It presents results from running two common Hadoop benchmarks, Terasort and Ranked Inverted Index, on different Amazon EC2 instance configurations. The results show that the shuffle phase accounts for a significant portion (25-29%) of total job runtime. The authors aim to reproduce existing findings that network performance is a key bottleneck for shuffle-intensive Hadoop jobs. Some questions are also raised about inconsistencies in reported network bandwidth capabilities for EC2.
KnittingBoar - Toronto Hadoop User Group, Nov 27 2012 (Adam Muise)
This document discusses machine learning and parallel iterative algorithms. It provides an introduction to machine learning and Mahout. It then describes Knitting Boar, a system for parallelizing stochastic gradient descent on Hadoop YARN. Knitting Boar partitions data among workers that perform online logistic regression in batches. The workers send gradient updates to a master node, which averages the updates to produce a new global model. Experimental results show Knitting Boar achieves roughly linear speedup. The document concludes by discussing developing YARN applications and the Knitting Boar codebase.
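The worker/master exchange described above can be sketched as parameter averaging over data partitions; the helper names are illustrative, and Knitting Boar's actual YARN plumbing is omitted:

```python
import math

# Sketch of the Knitting Boar pattern: each worker runs online logistic
# regression over its own partition of the training data, then a master
# averages the workers' parameter vectors into a new global model.
# Function names are hypothetical, not from the Knitting Boar codebase.

def worker_sgd(weights, partition, lr=0.1):
    w = list(weights)
    for x, y in partition:                       # one online pass
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))           # sigmoid prediction
        for i, xi in enumerate(x):
            w[i] += lr * (y - p) * xi            # gradient step
    return w

def master_average(worker_weights):
    n = len(worker_weights)
    return [sum(ws[i] for ws in worker_weights) / n
            for i in range(len(worker_weights[0]))]

partitions = [
    [([1.0, 0.0], 1), ([0.0, 1.0], 0)],          # worker 1's shard
    [([1.0, 1.0], 1), ([0.0, 0.0], 0)],          # worker 2's shard
]
global_w = [0.0, 0.0]
for _ in range(5):                               # superstep: fan out, average
    updated = [worker_sgd(global_w, p) for p in partitions]
    global_w = master_average(updated)
print(global_w)  # first weight grows positive: feature 0 predicts y=1
```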
The document outlines the process for developing a MapReduce application including:
1) Writing map and reduce functions with unit tests, then a driver program to run on test data.
2) Running the program on a cluster with the full dataset and fixing issues.
3) Tuning the program for performance after it is working correctly.
A Scalable Implementation of Deep Learning on Spark - Alexander Ulanov (Spark Summit)
This document summarizes research on implementing deep learning models using Spark. It describes:
1) Implementing a multilayer perceptron (MLP) model for digit recognition in Spark using batch processing and matrix optimizations to improve efficiency.
2) Analyzing the tradeoffs of computation and communication in parallelizing the gradient calculation for batch training across multiple nodes to find the optimal number of workers.
3) Benchmark results showing Spark MLP achieves similar performance to Caffe on a single node and outperforms it by scaling nearly linearly when using multiple nodes.
Sketching Data with T-Digest in Apache Spark: Spark Summit East talk by Erik ... (Spark Summit)
Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications ranging from visualization, optimizing data encodings, estimating quantiles, data synthesis and imputation. The T-Digest is a versatile sketching data structure. It operates on any numeric data, models tricky distribution tails with high fidelity, and most crucially it works smoothly with aggregators and map-reduce.
T-Digest is a perfect fit for Apache Spark; it is single-pass and intermediate results can be aggregated across partitions in batch jobs or aggregated across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimations and data synthesis.
Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
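The "works smoothly with aggregators" property is the key point, and can be shown with a drastically simplified mergeable sketch: keep at most k (mean, count) centroids, and funnel both per-record updates and sketch merges through one compress step. This omits t-digest's quantile-dependent centroid sizing entirely; it only illustrates the single-pass, partition-mergeable shape that makes the structure fit Spark's batch and streaming aggregation.

```python
import random

# Simplified mergeable quantile sketch in the spirit of t-digest (NOT the
# real algorithm): at most K (mean, count) centroids, compressed by merging
# the closest adjacent pair. Updates and merges share the compress step,
# which is what lets partial sketches combine across partitions or windows.

K = 32

def compress(centroids):
    cs = sorted(centroids)
    while len(cs) > K:                    # merge the closest adjacent pair
        i = min(range(len(cs) - 1), key=lambda j: cs[j + 1][0] - cs[j][0])
        (m1, c1), (m2, c2) = cs[i], cs[i + 1]
        merged = ((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)
        cs[i:i + 2] = [merged]
    return cs

def update(cs, x):                        # single-pass per-record update
    return compress(cs + [(x, 1)])

def merge(cs1, cs2):                      # combine partial sketches
    return compress(cs1 + cs2)

def quantile(cs, q):                      # approximate quantile readout
    total = sum(c for _, c in cs)
    target, seen = q * total, 0
    for mean, count in cs:
        seen += count
        if seen >= target:
            return mean
    return cs[-1][0]

random.seed(0)
data = [random.gauss(0, 1) for _ in range(2000)]
# Sketch each "partition" independently, then merge, as an aggregator would.
parts = [data[:1000], data[1000:]]
sketches = []
for part in parts:
    s = []
    for x in part:
        s = update(s, x)
    sketches.append(s)
combined = merge(sketches[0], sketches[1])
print(round(quantile(combined, 0.5), 2))  # near 0 for a standard normal
```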
H2O World - Sparkling Water on the Spark Notebook: Interactive Genomes Clust... (Sri Ambati)
Spark is a distributed computing framework that can handle large scale data processing. The Spark notebook provides an interactive environment for working with Spark. ADAM is a data format and API for genomic data on Spark that optimizes for large datasets. Sparkling Water integrates H2O machine learning with Spark to enable techniques like deep learning on genomic data in a distributed manner using the Spark notebook. Data scientists and developers can collaborate using these tools to access, manipulate, and analyze massive genomic datasets.
This work investigates the performance of Big Data applications in virtualized Hadoop environments. An evaluation and comparison of the performance of applications running on a virtualized Hadoop cluster with separated data and computation layers against standard Hadoop installation is presented.
http://clds.sdsc.edu/wbdb2014.de/program
[Paper Reading] Orca: A Modular Query Optimizer Architecture for Big Data (PingCAP)
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.
In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with our own original research, resulting in a modular and portable optimizer architecture.
In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
Distributed GLM with H2O - Atlanta Meetup (Sri Ambati)
The document outlines a presentation about H2O's distributed generalized linear model (GLM) algorithm. The presentation includes background on H2O.ai the company, an overview of the H2O software, a 30-minute section explaining H2O's distributed GLM in detail, a 15-minute GLM demo, and a question-and-answer period. The distributed GLM section covers the algorithm, input parameters, outputs, runtime costs, and best practices.
Max-kernel search: How to search for just about anything?
Nearest neighbor search is a well studied and widely used task in computer science and is quite pervasive in everyday applications. While search is not synonymous with learning, search is a crucial tool for the most nonparametric form of learning. Nearest neighbor search can directly be used for all kinds of learning tasks — classification, regression, density estimation, outlier detection. Search is also the computational bottleneck in various other learning tasks such as clustering and dimensionality reduction. Key to nearest neighbor search is the notion of “near”-ness or similarity. Mercer kernels form a class of general nonlinear similarity functions and are widely used in machine learning. They can define a notion of similarity between pairs of objects of any arbitrary type and have been successfully applied to a wide variety of object types — fixed-length data, images, text, time series, graphs. I will present a technique to do nearest neighbor search with this class of similarity functions provably efficiently, hence facilitating faster learning for larger data.
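A brute-force baseline makes the max-kernel search problem concrete: given a Mercer kernel k(q, r), return the reference object most similar to the query. The talk's contribution is doing this provably faster than the O(n) scan below; the RBF kernel is chosen here only for illustration.

```python
import math

# Brute-force max-kernel search: scan all references and keep the one
# maximizing kernel similarity to the query. This linear scan is the
# baseline that the talk's technique accelerates.

def rbf_kernel(q, r, gamma=1.0):
    d2 = sum((qi - ri) ** 2 for qi, ri in zip(q, r))
    return math.exp(-gamma * d2)           # similarity decays with distance

def max_kernel_search(query, references, kernel):
    return max(references, key=lambda r: kernel(query, r))

refs = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
best = max_kernel_search([0.9, 1.1], refs, rbf_kernel)
print(best)  # under the RBF kernel this is the nearest reference
```

Because the RBF kernel is monotone in Euclidean distance, max-kernel search here coincides with nearest neighbor search; for general Mercer kernels (e.g. on graphs or text) no such distance exists, which is what makes the general problem harder.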
The document discusses Dremel, an interactive query system for analyzing large-scale datasets. Dremel uses a columnar data storage format and a multi-level query execution tree to enable fast querying. It evaluates Dremel's performance on interactive queries, showing it can count terms in a field within seconds using 3000 workers, while MapReduce takes hours. Dremel also scales linearly and handles stragglers well. Today, similar systems like Google BigQuery and Apache Drill use Dremel-like techniques for interactive analysis of web-scale data.
Dremel: Interactive Analysis of Web-Scale Datasets (robertlz)
The document describes Dremel, an interactive analysis system for web-scale datasets. Dremel uses a columnar data storage model and tree-based query serving architecture to enable interactive analysis of trillion record datasets distributed across thousands of nodes. It provides an SQL-like interface and can process queries orders of magnitude faster than traditional MapReduce systems by avoiding record assembly costs. Experiments show Dremel can analyze tens to hundreds of billions of records interactively on commodity hardware.
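The record-assembly cost the summary mentions comes from the row layout; a tiny comparison shows why column-striped storage helps analytical scans. The record fields here are illustrative, not from the paper.

```python
# Why column-striped storage speeds up analytical scans, as in Dremel:
# an aggregate over one field reads only that field's column instead of
# deserializing whole records. Field names are illustrative.

records = [
    {"url": "a.com", "clicks": 3, "country": "DE"},
    {"url": "b.com", "clicks": 7, "country": "US"},
    {"url": "c.com", "clicks": 2, "country": "US"},
]

# Row layout: summing one field still touches every whole record.
row_sum = sum(r["clicks"] for r in records)

# Column layout: strip each field into its own contiguous column once...
columns = {k: [r[k] for r in records] for k in records[0]}
# ...then a scan over "clicks" never touches urls or countries.
col_sum = sum(columns["clicks"])
print(row_sum, col_sum)  # both 12; the column scan reads far less data
```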
This document discusses modeling algorithms using the MapReduce framework. It outlines types of learning that can be done in MapReduce, including parallel training of models, ensemble methods, and distributed algorithms that fit the statistical query model (SQM). Specific algorithms that can be implemented in MapReduce are discussed, such as linear regression, naive Bayes, logistic regression, and decision trees. The document provides examples of how these algorithms can be formulated and computed in a MapReduce paradigm by distributing computations across mappers and reducers.
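The statistical query model formulation can be sketched for linear regression: mappers emit partial sufficient statistics (the pieces of XᵀX and Xᵀy) over their splits, a reducer sums them, and the driver solves the normal equations. The function names below are illustrative, not from the document; a single feature plus intercept keeps the algebra short.

```python
# Linear regression in the statistical-query-model style: each mapper
# computes partial sums over its data split, a reducer adds the partials,
# and the driver solves the normal equations for y = a + b*x.

def mapper(split):
    # sufficient statistics for [1, x] features: sums of 1, x, x^2, y, x*y
    n = len(split)
    sx = sum(x for x, _ in split)
    sxx = sum(x * x for x, _ in split)
    sy = sum(y for _, y in split)
    sxy = sum(x * y for x, y in split)
    return (n, sx, sxx, sy, sxy)

def reducer(partials):
    # element-wise sum of the mappers' statistics
    return tuple(sum(col) for col in zip(*partials))

def solve(n, sx, sxx, sy, sxy):
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    a = (sy - b * sx) / n                           # intercept
    return a, b

splits = [
    [(0.0, 1.0), (1.0, 3.0)],   # split 1: points on y = 1 + 2x
    [(2.0, 5.0), (3.0, 7.0)],   # split 2: same line
]
a, b = solve(*reducer([mapper(s) for s in splits]))
print(a, b)  # recovers intercept 1 and slope 2
```

Because the statistics are additive, the reducer's result is identical no matter how the data is partitioned, which is exactly what makes SQM algorithms MapReduce-friendly.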
Enhancement of Map Function Image Processing System Using DHRF Algorithm on B...AM Publications
Cloud computing distributes work and processes it over the internet; it is often described as service on demand, always available in a pay-and-use mode. Processing big data such as MRI and DICOM images takes considerable compute time, and hard tasks like this can be handled with MapReduce, which combines a Map function that splits or divides the data with a Reduce function that integrates the Map outputs to produce the result. In this proposed work, the Map function applies two different image processing techniques to the input data, using Java Advanced Imaging (JAI). The intermediate data produced by the Map function is sent to the Reduce function for further processing. The Dynamic Handover Reduce Function (DHRF) algorithm is introduced in the Reduce function to reduce waiting time while processing the intermediate data and to produce the final output. The enhanced MapReduce concept and the proposed optimized algorithm run on Euca2ool (a cloud tool) and produce better output compared with previous work in the field of cloud computing and big data.
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)
Artificial neural networks (ANNs) are one of the popular models of machine learning, in particular for deep learning. The models used in practice for image classification and speech recognition contain a huge number of weights and are trained with big datasets. Training such models is challenging in terms of computation and data processing. We propose a scalable implementation of deep neural networks for Spark. We address the computational challenge with batch operations, using BLAS for vector and matrix computations and reusing memory to reduce garbage collector activity. Spark provides data parallelism that enables scaling of training. As a result, our implementation is on par with widely used C++ implementations like Caffe on a single machine and scales nicely on a cluster. The developed API makes it easy to configure your own network and to run experiments with different hyperparameters. Our implementation is easily extensible, and we invite other developers to contribute new types of neural network functions and layers. The optimizations we applied and our experience with GPU CUDA BLAS might also be useful for other machine learning algorithms being developed for Spark.
The slides were presented at the Spark SF Friends meetup on December 2, 2015, organized by Alex Khrabrov @Nitro. The content is based on my talk at Spark Summit Europe, with a few major updates: more details on the parallelism heuristic, experiments with a larger cluster, and a new slide design.
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust... (PingCAP)
This paper proposes interleaving with coroutines for any type of index join. It showcases the proposal on SAP HANA by implementing binary search and CSB+-tree traversal for an instance of index join related to dictionary compression. Coroutine implementations not only perform similarly to prior interleaving techniques, but also resemble the original code closely, while supporting both interleaved and non-interleaved execution. Thus, the paper claims that coroutines make interleaving practical for use in real DBMS codebases.
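The interleaving idea can be sketched with Python generators standing in for the paper's C++ coroutines: each binary search suspends at the point where a real DBMS would issue a prefetch, and a round-robin scheduler resumes the searches in turn so that one search's memory stall overlaps with another's computation. The scheduler below is a toy; the paper's contribution is making this pattern cheap and unobtrusive in real engine code.

```python
# Generators as stand-in coroutines: each lower-bound binary search yields
# before probing arr[mid] (the would-be prefetch point), and the scheduler
# round-robins between several in-flight searches.

def binary_search(arr, key):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        yield mid                 # suspension point: prefetch arr[mid] here
        if arr[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo                     # index of first element >= key

def run_interleaved(searches):
    results = {}
    pending = {i: s for i, s in enumerate(searches)}
    while pending:
        for i in list(pending):
            try:
                next(pending[i])            # resume one step of this search
            except StopIteration as stop:   # search finished
                results[i] = stop.value     # generator's return value
                del pending[i]
    return [results[i] for i in range(len(searches))]

arr = [1, 3, 5, 7, 9, 11]
keys = [5, 8, 1]
print(run_interleaved([binary_search(arr, k) for k in keys]))  # [2, 4, 0]
```

Crucially, the search body reads exactly like its non-interleaved version plus one `yield`, which mirrors the paper's claim that coroutine code closely resembles the original.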
Paper: http://www.vldb.org/pvldb/vol11/p230-psaropoulos.pdf
Follow PingCAP on Twitter: https://twitter.com/PingCAP
Follow PingCAP on LinkedIn: https://www.linkedin.com/company/13205484/
HPC + AI: Machine Learning Models in Scientific Computing (inside-BigData.com)
In this video from the 2019 Stanford HPC Conference, Steve Oberlin from NVIDIA presents: HPC + Ai: Machine Learning Models in Scientific Computing.
"Most AI researchers and industry pioneers agree that the wide availability and low cost of highly-efficient and powerful GPUs and accelerated computing parallel programming tools (originally developed to benefit HPC applications) catalyzed the modern revolution in AI/deep learning. Clearly, AI has benefited greatly from HPC. Now, AI methods and tools are starting to be applied to HPC applications to great effect. This talk will describe an emerging workflow that uses traditional numeric simulation codes to generate synthetic data sets to train machine learning algorithms, then employs the resulting AI models to predict the computed results, often with dramatic gains in efficiency, performance, and even accuracy. Some compelling success stories will be shared, and the implications of this new HPC + AI workflow on HPC applications and system architecture in a post-Moore’s Law world considered."
Watch the video: https://youtu.be/SV3cnWf39kc
Learn more: https://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Dremel: Interactive Analysis of Web-Scale Datasets (Carl Lu)
Dremel is an interactive query system that can analyze large web-scale datasets containing trillions of records in seconds. It uses a columnar data layout and multi-level query execution trees to distribute queries across thousands of servers. Dremel's nested data model and column-striped storage allows it to efficiently retrieve and analyze only the necessary columns from large datasets. Experimental results demonstrated Dremel's ability to process queries over datasets containing trillions of records and petabytes of data in seconds using a cluster of thousands of servers.
Roots Tech 2013: Big Data at Ancestry (3-22-2013) - no animations (William Yetman)
This was one of my first presentations on Big Data at Ancestry.com. The audience was split between family historians interested in the technology and developers interested in our Big Data story, so the presentation is a mix. I think there is plenty for someone with an interest in technology, and enough meat for a "technologist".
Keep this in mind as you look at this presentation.
Thanks,
-Bill-
This document provides an overview of next generation analytics with YARN, Spark and GraphLab. It discusses how YARN addressed limitations of Hadoop 1.0 like scalability, locality awareness and shared cluster utilization. It also describes the Berkeley Data Analytics Stack (BDAS) which includes Spark, and how companies like Ooyala and Conviva use it for tasks like iterative machine learning. GraphLab is presented as ideal for processing natural graphs and the PowerGraph framework partitions such graphs for better parallelism. PMML is introduced as a standard for defining predictive models, and how a Naive Bayes model can be defined and scored using PMML with Spark and Storm.
Knitting Boar - Toronto and Boston HUGs, Nov 2012 (Josh Patterson)
1) The document discusses machine learning and parallel iterative algorithms like stochastic gradient descent. It introduces the Mahout machine learning library and describes an implementation of parallel SGD called Knitting Boar that runs on YARN.
2) Knitting Boar parallelizes Mahout's SGD algorithm by having worker nodes process partitions of the training data in parallel while a master node merges their results.
3) The author argues that approaches like Knitting Boar and IterativeReduce provide better ways to implement machine learning algorithms for big data compared to traditional MapReduce.
HFSP: Bringing Size-Based Scheduling to Hadoop
Dynamic Resource Allocation Algorithm Using Containers (IRJET Journal)
1) The document proposes a dynamic resource allocation algorithm using containers to optimize resource utilization in server farms.
2) It uses Docker to deploy applications in lightweight containers instead of virtual machines to reduce overhead. A node selection algorithm uses fuzzy logic to determine the most suitable node for container deployment based on resource availability and workload.
3) The proposed approach is tested on a small cluster using Docker, Hadoop and the node selection algorithm to process queries. Results show increased processing speed and better resource utilization compared to traditional virtualization methods.
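The node-selection step can be sketched with a plain weighted score standing in for the paper's fuzzy-logic controller: reject nodes that cannot host the container, then pick the feasible node with the most resource headroom. Node fields and weights below are illustrative, not from the paper.

```python
# Sketch of container placement: score each node by free CPU and memory
# relative to the container's request and pick the best fit. A weighted
# score replaces the paper's fuzzy-logic inference; fields are illustrative.

def score(node, req, w_cpu=0.5, w_mem=0.5):
    if node["free_cpu"] < req["cpu"] or node["free_mem"] < req["mem"]:
        return -1.0                          # cannot host the container
    return (w_cpu * (node["free_cpu"] - req["cpu"]) / node["total_cpu"]
            + w_mem * (node["free_mem"] - req["mem"]) / node["total_mem"])

def select_node(nodes, req):
    best = max(nodes, key=lambda n: score(n, req))
    return best["name"] if score(best, req) >= 0 else None

nodes = [
    {"name": "n1", "free_cpu": 2, "total_cpu": 8, "free_mem": 4,  "total_mem": 32},
    {"name": "n2", "free_cpu": 6, "total_cpu": 8, "free_mem": 16, "total_mem": 32},
]
print(select_node(nodes, {"cpu": 4, "mem": 8}))  # node with most headroom
```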
1. The document discusses Microsoft's SCOPE analytics platform running on Apache Tez and YARN. It describes how Graphene was designed to integrate SCOPE with Tez to enable SCOPE jobs to run as Tez DAGs on YARN clusters.
2. Key components of Graphene include a DAG converter, Application Master, and tooling integration. The Application Master manages task execution and communicates with SCOPE engines running in containers.
3. Initial experience running SCOPE on Tez has been positive though challenges remain around scaling to very large workloads with over 15,000 parallel tasks and optimizing for opportunistic containers and Application Master recovery.
This work investigates the performance of Big Data applications in virtualized Hadoop environments. An evaluation and comparison of the performance of applications running on a virtualized Hadoop cluster with separated data and computation layers against standard Hadoop installation is presented.
http://clds.sdsc.edu/wbdb2014.de/program
[Paper Reading]Orca: A Modular Query Optimizer Architecture for Big DataPingCAP
The performance of analytical query processing in data management systems depends primarily on the capabilities of the system's query optimizer. Increased data volumes and heightened interest in processing complex analytical queries have prompted Pivotal to build a new query optimizer.
In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. Orca is a comprehensive development uniting state-of-the-art query optimization technology with own original research resulting in a modular and portable optimizer architecture.
In addition to describing the overall architecture, we highlight several unique features and present performance comparisons against other systems.
Distributed GLM with H2O - Atlanta MeetupSri Ambati
The document outlines a presentation about H2O's distributed generalized linear model (GLM) algorithm. The presentation includes sections about H2O.ai the company, an overview of the H2O software, a 30 minute section explaining H2O's distributed GLM in detail, a 15 minute demo of GLM, and a question and answer period. The document provides background on H2O.ai and H2O, and outlines the topics that will be covered in the distributed GLM section, including the algorithm, input parameters, outputs, runtime costs, and best practices.
Max-kernel search: How to search for just about anything?
Nearest neighbor search is a well studied and widely used task in computer science and is quite pervasive in everyday applications. While search is not synonymous with learning, search is a crucial tool for the most nonparametric form of learning. Nearest neighbor search can directly be used for all kinds of learning tasks — classification, regression, density estimation, outlier detection. Search is also the computational bottleneck in various other learning tasks such as clustering and dimensionality reduction. Key to nearest neighbor search is the notion of “near”-ness or similarity. Mercer kernels form a class of general nonlinear similarity functions and are widely used in machine learning. They can define a notion of similarity between pairs of objects of any arbitrary type and have been successfully applied to a wide variety of object types — fixed-length data, images, text, time series, graphs. I will present a technique to do nearest neighbor search with this class of similarity functions provably efficiently, hence facilitating faster learning for larger data.
The document discusses Dremel, an interactive query system for analyzing large-scale datasets. Dremel uses a columnar data storage format and a multi-level query execution tree to enable fast querying. It evaluates Dremel's performance on interactive queries, showing it can count terms in a field within seconds using 3000 workers, while MapReduce takes hours. Dremel also scales linearly and handles stragglers well. Today, similar systems like Google BigQuery and Apache Drill use Dremel-like techniques for interactive analysis of web-scale data.
Dremel: Interactive Analysis of Web-Scale Datasets robertlz
The document describes Dremel, an interactive analysis system for web-scale datasets. Dremel uses a columnar data storage model and tree-based query serving architecture to enable interactive analysis of trillion record datasets distributed across thousands of nodes. It provides an SQL-like interface and can process queries orders of magnitude faster than traditional MapReduce systems by avoiding record assembly costs. Experiments show Dremel can analyze tens to hundreds of billions of records interactively on commodity hardware.
This document discusses modeling algorithms using the MapReduce framework. It outlines types of learning that can be done in MapReduce, including parallel training of models, ensemble methods, and distributed algorithms that fit the statistical query model (SQM). Specific algorithms that can be implemented in MapReduce are discussed, such as linear regression, naive Bayes, logistic regression, and decision trees. The document provides examples of how these algorithms can be formulated and computed in a MapReduce paradigm by distributing computations across mappers and reducers.
Enhancement of Map Function Image Processing System Using DHRF Algorithm on B...AM Publications
Cloud computing is the concept of distributing a work and also processing the same work over the internet. Cloud
computing is called as service on demand. It is always available on the internet in Pay and Use mode. Processing of the Big
Data takes more time to compute MRI and DICOM data. The processing of hard tasks like this can be solved by using the
concept of MapReduce. MapReduce function is a concept of Map and Reduce functions. Map is the process of splitting or
dividing data. Reduce function is the process of integrating the output of the Map’s input to produce the result. The Map
function does two various image processing techniques to process the input data. Java Advanced Imaging (JAI) is introduced
in the map function in this proposed work. The processed intermediate data of the Map function is sent to the Reduce function
for the further process. The Dynamic Handover Reduce Function (DHRF) algorithm is introduced in the reduce function in
this work. This algorithm is implemented in the Reduce function to reduce the waiting time while processing the intermediate
data. The DHRF algorithm gives the final output by processing the Reduce function. The enhanced MapReduce concept and
proposed optimized algorithm is made to work on Euca2ool (a Cloud tool) to produce an effective and better output when
compared with the previous work in the field of Cloud Computing and Big Data.
A Scalable Implementation of Deep Learning on Spark (Alexander Ulanov)Alexander Ulanov
Artificial neural networks (ANN) are one of the popular models of machine learning, in particular for deep learning. The models that are used in practice for image classification and speech recognition contain huge number of weights and are trained with big datasets. Training such models is challenging in terms of computation and data processing. We propose a scalable implementation of deep neural networks for Spark. We address the computational challenge by batch operations, using BLAS for vector and matrix computations and reusing the memory for reducing garbage collector activity. Spark provides data parallelism that enables scaling of training. As a result, our implementation is on par with widely used C++ implementations like Caffe on a single machine and scales nicely on a cluster. The developed API makes it easy to configure your own network and to run experiments with different hyper parameters. Our implementation is easily extensible and we invite other developers to contribute new types of neural network functions and layers. Also, optimizations that we applied and our experience with GPU CUDA BLAS might be useful for other machine learning algorithms being developed for Spark.
The slides were presented at the Spark SF Friends meetup on December 2, 2015, organized by Alex Khrabrov @Nitro. The content is based on my talk at Spark Summit Europe. However, there are a few major updates: an updated and more detailed parallelism heuristic, experiments with a larger cluster, and a new slide design.
[Paper reading] Interleaving with Coroutines: A Practical Approach for Robust... (PingCAP)
This paper proposes interleaving with coroutines for any type of index join. It showcases the proposal on SAP HANA by implementing binary search and CSB+-tree traversal for an instance of index join related to dictionary compression. Coroutine implementations not only perform similarly to prior interleaving techniques, but also resemble the original code closely, while supporting both interleaved and non-interleaved execution. Thus, this paper claims that coroutines make interleaving practical for use in real DBMS codebases.
Paper: http://www.vldb.org/pvldb/vol11/p230-psaropoulos.pdf
Follow PingCAP on Twitter: https://twitter.com/PingCAP
Follow PingCAP on LinkedIn: https://www.linkedin.com/company/13205484/
HPC + AI: Machine Learning Models in Scientific Computing (inside-BigData.com)
In this video from the 2019 Stanford HPC Conference, Steve Oberlin from NVIDIA presents: HPC + AI: Machine Learning Models in Scientific Computing.
"Most AI researchers and industry pioneers agree that the wide availability and low cost of highly-efficient and powerful GPUs and accelerated computing parallel programming tools (originally developed to benefit HPC applications) catalyzed the modern revolution in AI/deep learning. Clearly, AI has benefited greatly from HPC. Now, AI methods and tools are starting to be applied to HPC applications to great effect. This talk will describe an emerging workflow that uses traditional numeric simulation codes to generate synthetic data sets to train machine learning algorithms, then employs the resulting AI models to predict the computed results, often with dramatic gains in efficiency, performance, and even accuracy. Some compelling success stories will be shared, and the implications of this new HPC + AI workflow on HPC applications and system architecture in a post-Moore’s Law world considered."
Watch the video: https://youtu.be/SV3cnWf39kc
Learn more: https://nvidia.com
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Dremel: Interactive Analysis of Web-Scale Datasets (Carl Lu)
Dremel is an interactive query system that can analyze large web-scale datasets containing trillions of records in seconds. It uses a columnar data layout and multi-level query execution trees to distribute queries across thousands of servers. Dremel's nested data model and column-striped storage allows it to efficiently retrieve and analyze only the necessary columns from large datasets. Experimental results demonstrated Dremel's ability to process queries over datasets containing trillions of records and petabytes of data in seconds using a cluster of thousands of servers.
RootsTech 2013 Big Data at Ancestry (3-22-2013) - no animations (William Yetman)
This was one of my first presentations on Big Data at Ancestry.com. The audience was split between Family Historians interested in the technology and Developers interested in our Big Data story, so the presentation is a mix. I think there is plenty for someone with an interest in technology and enough meat for a "technologist".
Keep this in mind as you look at this presentation.
Thanks,
-Bill-
This document provides an overview of next generation analytics with YARN, Spark and GraphLab. It discusses how YARN addressed limitations of Hadoop 1.0 like scalability, locality awareness and shared cluster utilization. It also describes the Berkeley Data Analytics Stack (BDAS) which includes Spark, and how companies like Ooyala and Conviva use it for tasks like iterative machine learning. GraphLab is presented as ideal for processing natural graphs and the PowerGraph framework partitions such graphs for better parallelism. PMML is introduced as a standard for defining predictive models, and how a Naive Bayes model can be defined and scored using PMML with Spark and Storm.
Knitting Boar - Toronto and Boston HUGs - Nov 2012 (Josh Patterson)
1) The document discusses machine learning and parallel iterative algorithms like stochastic gradient descent. It introduces the Mahout machine learning library and describes an implementation of parallel SGD called Knitting Boar that runs on YARN.
2) Knitting Boar parallelizes Mahout's SGD algorithm by having worker nodes process partitions of the training data in parallel while a master node merges their results.
3) The author argues that approaches like Knitting Boar and IterativeReduce provide better ways to implement machine learning algorithms for big data compared to traditional MapReduce.
HFSP: Bringing Size-Based Scheduling to Hadoop
Dynamic Resource Allocation Algorithm using Containers (IRJET Journal)
1) The document proposes a dynamic resource allocation algorithm using containers to optimize resource utilization in server farms.
2) It uses Docker to deploy applications in lightweight containers instead of virtual machines to reduce overhead. A node selection algorithm uses fuzzy logic to determine the most suitable node for container deployment based on resource availability and workload.
3) The proposed approach is tested on a small cluster using Docker, Hadoop and the node selection algorithm to process queries. Results show increased processing speed and better resource utilization compared to traditional virtualization methods.
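The fuzzy-logic node selection described above can be sketched as a weighted suitability score over per-resource load; the membership functions, weights, and node data below are illustrative assumptions, not the paper's actual rules:

```python
def membership_low_load(load):
    # Triangular fuzzy membership: fully "low" at 0 load,
    # no longer "low" once load reaches 0.8.
    return max(0.0, 1.0 - load / 0.8)

def node_score(cpu_load, mem_load, net_load):
    # Combine per-resource memberships; the weights are illustrative.
    return (0.5 * membership_low_load(cpu_load)
            + 0.3 * membership_low_load(mem_load)
            + 0.2 * membership_low_load(net_load))

def select_node(nodes):
    # Pick the most suitable node for the next container deployment.
    return max(nodes, key=lambda name: node_score(*nodes[name]))

nodes = {"node-a": (0.7, 0.5, 0.2), "node-b": (0.2, 0.3, 0.1)}
print(select_node(nodes))  # node-b (lightest loaded)
```

A real implementation would defuzzify over several linguistic variables per resource, but the shape of the decision is the same: score each node, deploy on the best one.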
1. The document discusses Microsoft's SCOPE analytics platform running on Apache Tez and YARN. It describes how Graphene was designed to integrate SCOPE with Tez to enable SCOPE jobs to run as Tez DAGs on YARN clusters.
2. Key components of Graphene include a DAG converter, Application Master, and tooling integration. The Application Master manages task execution and communicates with SCOPE engines running in containers.
3. Initial experience running SCOPE on Tez has been positive though challenges remain around scaling to very large workloads with over 15,000 parallel tasks and optimizing for opportunistic containers and Application Master recovery.
Parsl: Pervasive Parallel Programming in Python (Daniel S. Katz)
The document summarizes Parsl, a Python library for pervasive parallel programming. Parsl allows users to naturally express parallelism in Python programs and execute tasks concurrently across different computing platforms while respecting data dependencies. It supports various use cases from small machine learning workloads to extreme-scale simulations involving millions of tasks and thousands of nodes. Parsl provides simple, scalable, and flexible parallel programming while hiding complexity of parallel execution.
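Parsl expresses this as decorated "apps" whose calls return futures, with execution ordered by data dependencies. A rough stdlib analogy of that futures-based dataflow style (this is not Parsl's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def total(values):
    return sum(values)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Independent tasks launch concurrently; each submit returns a future.
    squares = [pool.submit(square, i) for i in range(5)]
    # The dependent task runs once its inputs resolve.
    result = pool.submit(total, [f.result() for f in squares]).result()

print(result)  # 30
```

Parsl generalizes this pattern across executors, from local threads to HPC schedulers, while keeping the program itself plain Python.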
AI on Greenplum Using Apache MADlib and MADlib Flow - Greenplum Summit 2019 (VMware Tanzu)
This document discusses machine learning and deep learning capabilities in Greenplum using Apache MADlib. It begins with an overview of MADlib, describing it as an open source machine learning library for PostgreSQL and Greenplum Database. It then discusses specific machine learning algorithms and techniques supported, such as linear regression, neural networks, graph algorithms, and more. It also covers scaling of algorithms like SVM and PageRank with increasing data and graph sizes. Later sections discuss deep learning integration with Greenplum, challenges of model management and operationalization, and introduces MADlib Flow as a tool to address those challenges through an end-to-end data science workflow in SQL.
Software tools, crystal descriptors, and machine learning applied to material... (Anubhav Jain)
This talk introduces several open-source software tools for accelerating materials design efforts:
- Atomate enables high-throughput DFT simulations through automated workflows. It has been used to generate large datasets for the Materials Project.
- Rocketsled uses machine learning to suggest the most informative calculations to optimize a target property faster than random searches.
- Matminer provides features to represent materials for machine learning and connects to data mining tools and databases.
- Automatminer develops machine learning models automatically from raw input-output data without requiring feature engineering by users.
- Robocrystallographer analyzes crystal structures and describes them in an interpretable text format.
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015 (Deanna Kosaraju)
Optimal Execution Of MapReduce Jobs In Cloud
Anshul Aggarwal, Software Engineer, Cisco Systems
Session Length: 1 Hour
Tue March 10 21:30 PST
Wed March 11 0:30 EST
Wed March 11 4:30:00 UTC
Wed March 11 10:00 IST
Wed March 11 15:30 Sydney
Voices 2015 www.globaltechwomen.com
We use the MapReduce programming paradigm because it lends itself well to most data-intensive analytics jobs run on the cloud these days, given its ability to scale out and leverage several machines to process data in parallel. Research has demonstrated that existing approaches to provisioning other applications in the cloud are not immediately relevant to MapReduce-based applications. Provisioning a MapReduce job entails requesting the optimum number of resource sets (RS) and configuring MapReduce parameters such that each resource set is maximally utilized.
Each application has a different bottleneck resource (CPU, disk, or network) and a different bottleneck resource utilization, and thus needs a different combination of these parameters based on the job profile, such that the bottleneck resource is maximally utilized.
The problem at hand is thus defining a resource provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as optimal resource utilization with minimum incurred cost, lower execution time, energy awareness, automatic handling of node failures, and a highly scalable solution.
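The core trade-off can be sketched as a small search over resource-set counts: more resource sets shorten the runtime but each costs money. This is a toy cost model assuming ideal scale-out; the function, names, and numbers are illustrative, not the talk's framework:

```python
def provision(job_work_hours, deadline_hours, cost_per_rs_hour, max_rs=64):
    # Enumerate candidate resource-set (RS) counts and keep the cheapest
    # configuration that still meets the deadline. Assumes ideal scale-out
    # (runtime = total work / RS count), which real jobs only approximate.
    best = None
    for n_rs in range(1, max_rs + 1):
        runtime = job_work_hours / n_rs
        if runtime > deadline_hours:
            continue  # misses the performance goal
        cost = n_rs * runtime * cost_per_rs_hour
        if best is None or cost < best[2]:
            best = (n_rs, runtime, cost)
    return best

n_rs, runtime, cost = provision(job_work_hours=100, deadline_hours=10,
                                cost_per_rs_hour=0.5)
print(n_rs, runtime, cost)
```

A real provisioner would plug in a measured job profile and per-resource bottleneck utilization instead of the ideal-scaling assumption.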
This document discusses building a social analytics tool using MongoDB from a developer's perspective. It covers using MongoDB for its schema-less data and ability to handle fast read-write operations. Key topics include using aggregation queries to gain insights from data by chaining queries together and filtering/manipulating results at each stage. JavaScript capabilities in MongoDB allow applying business logic directly to data. Examples demonstrate removing garbage data and stopwords. Indexes, current progress, and tips/tricks learned around cloning collections and removing vs dropping are also covered, with a demo planned.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. MapReduce divides applications into parallelizable map and reduce tasks that process key-value pairs across large datasets in a reliable and fault-tolerant manner. HDFS stores multiple replicas of data blocks for reliability and allows processing of data in parallel on nodes where the data is located. Hadoop can reliably store and process petabytes of data on thousands of low-cost commodity hardware nodes.
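The map, shuffle, and reduce phases described above can be illustrated with a minimal in-memory word count (pure Python, no Hadoop; the phase functions are illustrative stand-ins for what the framework does across a cluster):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) key-value pairs from each input split.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
```

In Hadoop the map and reduce tasks run in parallel on the nodes holding the data blocks, and the shuffle moves intermediate pairs over the network, which is why it so often dominates job runtime.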
Optimizing Performance - Clojure Remote - Nikola Peric
When a project approaches production questions about performance always surface. This talk tackles several real-world problems that have occurred while bringing a data-driven project to production, and walks through the problem solving approach to each.
This document discusses scaling extract, transform, load (ETL) processes with Apache Hadoop. It describes how data volumes and varieties have increased, challenging traditional ETL approaches. Hadoop offers a flexible way to store and process structured and unstructured data at scale. The document outlines best practices for extracting data from databases and files, transforming data using tools like MapReduce, Pig and Hive, and loading data into data warehouses or keeping it in Hadoop. It also discusses workflow management with tools like Oozie. The document cautions against several potential mistakes in ETL design and implementation with Hadoop.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce and HDFS to parallelize tasks, distribute data storage, and provide fault tolerance. Applications of Hadoop include log analysis, data mining, and machine learning using large datasets at companies like Yahoo!, Facebook, and The New York Times.
Visit http:aws.amazon.com/hpc for more information about HPC on AWS.
High Performance Computing (HPC) allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, low latency networking, and very high compute capabilities. AWS allows you to increase the speed of research by running high performance computing in the cloud and to reduce costs by providing Cluster Compute or Cluster GPU servers on-demand without large capital investments. You have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications.
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu (Databricks)
XGBoost (https://github.com/dmlc/xgboost) is a library designed and optimized for tree boosting. XGBoost attracts users from a broad range of organizations in both industry and academia, and more than half of the winning solutions in machine learning challenges hosted at Kaggle adopt XGBoost.
While being one of the most popular machine learning systems, XGBoost is only one of the components in a complete data analytic pipeline. The data ETL/exploration/serving functionalities are built up on top of more general data processing frameworks, like Apache Spark. As a result, users have to build a communication channel between Apache Spark and XGBoost (usually through HDFS) and face the difficulties/inconveniences in data navigating and application development/deployment.
We (Distributed (Deep) Machine Learning Community) develop XGBoost4J-Spark (https://github.com/dmlc/xgboost/tree/master/jvm-packages), which seamlessly integrates Apache Spark and XGBoost.
The communication channel between Spark and XGBoost is established based on RDDs/DataFrame/Datasets, all of which are standard data interfaces in Spark. Additionally, XGBoost can be embedded into Spark MLLib pipeline and tuned through the tools provided by MLLib. In this talk, I will cover the motivation/history/design philosophy/implementation details as well as the use cases of XGBoost4J-Spark. I expect that this talk will share the insights on building a heterogeneous data analytic pipeline based on Spark and other data intelligence frameworks and bring more discussions on this topic.
Many-Objective Performance Enhancement in Computing Clusters (Tarik Reza Toha)
In a heterogeneous computing cluster, cluster objectives conflict with each other. Selecting the right combination of machines is necessary to enhance cluster performance and to optimize all the cluster objectives. In this paper, we perform empirical performance analyses of a real cluster with our year-long collected data, formulate a new many-objective optimization problem for clusters, and integrate a greedy approach with the existing NSGA-III algorithm to solve this problem. From our experimental results, we find that our approach performs better than existing optimization approaches.
Materialization and Reuse Optimizations for Production Data Science Pipelines (Behrouz Derakhshan)
Many companies and businesses train and deploy machine learning (ML) pipelines to answer prediction queries. In many applications, new training data continuously becomes available. A typical approach to ensure that ML models are up-to-date is to retrain the ML pipelines following a schedule, e.g., every day on the last seven days of data. Several use cases, such as A/B testing and ensemble learning, require many pipelines to be deployed in parallel. Existing solutions train each pipeline separately, which generates redundant data processing. Our goal is to eliminate redundant data processing in such scenarios using materialization and reuse optimizations. Our solution comprises two main parts. First, we propose a materialization algorithm that, given a storage budget, materializes the subset of the artifacts that minimizes the run time of subsequent executions. Second, we design a reuse algorithm to generate an execution plan by combining the pipelines into a directed acyclic graph (DAG) and reusing the materialized artifacts when appropriate. Our experiments show that our solution can reduce the training time by up to an order of magnitude for different deployment scenarios.
BIGDATA - Survey on Scheduling Methods in Hadoop MapReduce Framework (Mahantesh Angadi)
This seminar presentation provides an overview of scheduling methods in the Hadoop MapReduce framework. It begins with motivations for using distributed computing for big data and introduces Hadoop and MapReduce. The presentation then surveys several proposed scheduling methods, including static methods like FIFO and adaptive methods like the Fair Scheduler. It summarizes five research papers on scheduling, discussing proposed approaches like optimizing job completion time and learning node capabilities for heterogeneous clusters. The presentation concludes that scheduling algorithms should improve data locality and use prediction to efficiently schedule jobs on heterogeneous Hadoop clusters.
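The contrast between static FIFO scheduling and the adaptive Fair Scheduler can be sketched with two toy schedulers (illustrative simplifications, not Hadoop's implementations):

```python
from collections import deque

def fifo_schedule(jobs):
    # FIFO: run tasks strictly in job-arrival order.
    order = []
    for name, tasks in jobs:
        order.extend([name] * tasks)
    return order

def fair_schedule(jobs):
    # Fair: round-robin one task slot per job, so small jobs finish early
    # instead of waiting behind a large job's entire task queue.
    queues = deque((name, deque(range(tasks))) for name, tasks in jobs)
    order = []
    while queues:
        name, tasks = queues.popleft()
        tasks.popleft()
        order.append(name)
        if tasks:
            queues.append((name, tasks))
    return order

jobs = [("big", 4), ("small", 1)]
print(fifo_schedule(jobs))  # ['big', 'big', 'big', 'big', 'small']
print(fair_schedule(jobs))  # ['big', 'small', 'big', 'big', 'big']
```

The surveyed schedulers add locality awareness and runtime prediction on top of this basic ordering decision.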
Big Data Analytics (ML, DL, AI) hands-on (Dony Riyanto)
This is an additional slide to the Big Data Analytics introduction material (in the following file), which gets us started hands-on with several topics related to Machine/Deep Learning, Big Data (batch/streaming), and AI using TensorFlow.
This document provides an introduction to machine learning concepts and tools. It begins with an overview of what will be covered in the course, including machine learning types, algorithms, applications, and mathematics. It then discusses data science concepts like feature engineering and the typical steps in a machine learning project, including collecting and examining data, fitting models, evaluating performance, and deploying models. Finally, it reviews common machine learning tools and terminologies and where to find datasets.
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam (Flink Forward)
http://flink-forward.org/kb_sessions/no-shard-left-behind-dynamic-work-rebalancing-in-apache-beam/
The Apache Beam (incubating) programming model is designed to support several advanced data processing features such as autoscaling and dynamic work rebalancing. In this talk, we will first explain how dynamic work rebalancing not only provides a general and robust solution to the problem of stragglers in traditional data processing pipelines, but also how it allows autoscaling to be truly effective. We will then present how dynamic work rebalancing works as implemented in Google Cloud Dataflow, and which path other Apache Beam runners like Apache Flink can follow to benefit from it.
Similar to Optimizing Machine Learning Pipelines in Collaborative Environments
The Ipsos - AI - Monitor 2024 Report.pdf (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily lives in the past 3-5 years.
Natural Language Processing (NLP), RAG and its applications .pptx (fkyes25)
In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
2. Behrouz Derakhshan et al. Optimizing Machine Learning Workloads in Collaborative Environments
Introduction
[Venn diagram: Data Science and Machine Learning sits at the intersection of Math and Statistics, Computer Science, and Domain Knowledge, supported by the Data Science and ML Toolkit]
High-quality DS and ML applications require effective collaboration
5. Collaborative DS and ML
• Jupyter Notebooks enable sharing code and results
• Containerized environments enable execution and deployment of the notebooks [1]
• Platforms, such as Google Colab [2] and Kaggle [3], enable effective collaboration among data scientists
1. https://www.docker.com/
2. https://colab.research.google.com/
3. https://www.kaggle.com/
9. Problems
• Lack of management for the generated data artifacts
• Re-execution of existing notebooks
The top 3 notebooks of the Home Credit Default Risk Kaggle competition1 generate 100s of GBs of intermediate data artifacts and are copied 10,000 times.
1. https://www.kaggle.com/c/home-credit-default-risk
The lack of data management in existing collaborative environments leads to 1000s of hours of redundant data processing and model training.
13. Collaborative ML Workload Optimizer
[Architecture figure: the client submits a Workload DAG; on the server, the Parser produces an Annotated DAG, the Optimizer an Optimized DAG, and the Materializer and Executor operate over the Experiment Graph]
Experiment Graph
○ Union of all the Workload DAGs
○ Vertices: data artifacts
○ Edges: operations
Materializer
○ Stores data artifacts with a high likelihood of future reuse
Optimizer
○ Linear-time reuse algorithm
○ Finds the optimal execution DAG
Materialization Problem

Given a storage budget, materialize a subset of the artifacts in order to minimize
the execution cost.

Challenges:
1. Future workloads are unknown
2. Even if future workloads are known a priori, the problem is NP-hard (Bhattacherjee, 2015)
3. Accommodating large graphs and a high rate of incoming workloads
Materialization Algorithm

For every artifact, compute a utility value:

utility ∝ (run-time × frequency × potential) / size

[Figure: the Materializer annotates the Experiment Graph with per-artifact (size, potential) pairs and per-edge run-times, and marks each artifact as materialized or un-materialized.]
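The utility-driven selection can be sketched as a greedy, knapsack-style heuristic. This is an illustrative sketch, not the paper's exact algorithm: the `Artifact` fields, the multiplicative utility form, and the greedy tie-breaking are assumptions based on the slide.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    name: str
    size: float       # storage cost (e.g., MB)
    run_time: float   # cost to recompute the artifact
    frequency: int    # how often it appeared in past workloads
    potential: float  # quality of the best model reachable from it

def utility(a: Artifact) -> float:
    # Utility grows with recomputation cost, reuse frequency, and
    # potential, and shrinks with the storage footprint.
    return a.run_time * a.frequency * a.potential / a.size

def materialize(artifacts, budget):
    """Greedily materialize the highest-utility artifacts that fit."""
    chosen = []
    for a in sorted(artifacts, key=utility, reverse=True):
        if a.size <= budget:
            chosen.append(a)
            budget -= a.size
    return chosen
```

A small but heavily reused feature table (small size, high run-time and frequency) is materialized before a large artifact that is cheap to recompute.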
Storage-aware Materialization

● Many duplicated columns in intermediate data artifacts
○ Feature selection operations
○ Feature generation operations
● Apply a column deduplication strategy

[Figure: a feature selection operation copies a subset of the input artifact's columns into the output artifact.]

Storage-aware Materialization:
while budget is not exhausted:
    1. run the materialization algorithm
    2. deduplicate the materialized artifacts
    3. update the sizes of the unmaterialized artifacts

Improves storage utilization and run-time
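The deduplication step can be sketched by content-hashing columns so that artifacts sharing columns (e.g., the output of a feature selection) reference a single stored copy. A minimal sketch in pandas; the `dedup_store` helper and the hashing scheme are illustrative, not the paper's implementation.

```python
import pandas as pd

def dedup_store(artifacts):
    """Store each distinct column once; artifacts keep references.

    Columns are keyed by a content hash, so an output artifact that
    copies columns from its input (e.g., via feature selection) adds
    almost no extra storage.
    """
    column_store = {}    # content hash -> column data
    artifact_refs = {}   # artifact name -> list of (column name, hash)
    for name, df in artifacts.items():
        refs = []
        for col in df.columns:
            key = pd.util.hash_pandas_object(df[col], index=False).sum()
            column_store.setdefault(key, df[col])
            refs.append((col, key))
        artifact_refs[name] = refs
    return column_store, artifact_refs
```

For a workload where `t_subset` selects columns of `train`, the shared columns are stored only once, which is why deduplication frees budget for further materialization.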
Reuse Problem

Given EG and a new workload, find the set of workload artifacts to reuse from
EG and the set of artifacts to compute.

Challenges:
1. Exponential time complexity of exhaustive search
2. Accommodating large graphs and a high rate of incoming workloads; the state of
the art has polynomial time complexity O(|V|·|E|²) (Helix, 2018)
Reuse Algorithm

A linear-time algorithm computes the optimal execution plan with a forward and a
backward pass over the workload DAG.

[Figure: example workload DAG with artifacts train, t_subset, y, top_feats, train_2, top_feats_2, merged_feats, and model_3; materialized artifacts are distinguished from artifacts that are un-materialized or do not exist in EG.]

Forward pass:
• Accumulate run-times
• For every materialized vertex, decide whether to load it or compute it

Backward pass:
• Prune unnecessary loaded vertices and the ancestors of loaded artifacts
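The two passes can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it assumes the DAG's vertices are given in topological order, and the accumulated cost may double-count shared ancestors; all names are hypothetical.

```python
def optimize(dag, terminal, run_time, load_cost, materialized):
    """Forward/backward reuse pass over a workload DAG.

    dag maps each vertex to its list of parents (in topological order);
    run_time[v] is the cost of the operation producing v; load_cost[v]
    is the cost of loading v from the Experiment Graph.
    """
    acc, plan = {}, {}
    # Forward pass: accumulate recomputation cost; at every
    # materialized vertex, load if that is cheaper than recomputing.
    for v in dag:
        cost = run_time[v] + sum(acc[p] for p in dag[v])
        if v in materialized and load_cost[v] < cost:
            acc[v], plan[v] = load_cost[v], "load"
        else:
            acc[v], plan[v] = cost, "compute"
    # Backward pass: keep only vertices reachable from the terminal,
    # stopping at loaded vertices (their ancestors are pruned).
    needed, stack = set(), [terminal]
    while stack:
        v = stack.pop()
        if v in needed:
            continue
        needed.add(v)
        if plan[v] == "compute":
            stack.extend(dag[v])
    return {v: plan[v] for v in needed}
```

When a materialized artifact is cheaper to load than to recompute, the backward pass drops its whole upstream chain, which is where the run-time savings come from.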
Evaluation

Workload:
● Kaggle
○ Kaggle Home Credit Default Risk competition1
○ 9 source datasets
○ 5 real and 3 generated workloads
○ 130 GB of artifacts in total
1. https://www.kaggle.com/c/home-credit-default-risk

Baselines:
● Helix
○ State-of-the-art iterative ML framework
○ Materialization algorithm: only execution time is considered; nodes are not prioritized
○ Reuse algorithm: utilizes the Edmonds-Karp max-flow algorithm, which runs in O(|V|·|E|²)
● Naïve
○ No optimization
End-to-end Run-time

[Figure: repeated executions of the Kaggle workloads (materialization budget = 16 GB).]

[Figure: execution of the Kaggle workloads in sequence (materialization budget = 16 GB).]

Optimizing ML workloads improves run-time by up to one order of magnitude for
repeated executions and by 50% for different workloads.
Materialization Impact

[Figure: total run-time of the Kaggle workloads with different materialization strategies and budgets.]

Exploiting artifact characteristics, such as utility and duplication rate, improves
the materialization process and reduces run-time by 50%.
Reuse Overhead

[Figure: cumulative overhead (seconds) of the Colab-reuse and Helix reuse algorithms over 10,000 synthetically generated workloads; x-axis: number of workloads (10^0 to 10^4), y-axis: cumulative overhead (0 to 3k s).]

Our linear-time reuse algorithm incurs a negligible overhead in real collaborative
environments, where 1000s of workloads are executed.
Summary

[Architecture figure: the client-side Parser turns an ML Workload into a Workload DAG; the server-side Optimizer, Materializer, and Executor process it against the Experiment Graph, producing annotated and optimized DAGs.]

● Optimization of ML workloads in collaborative environments through
materialization and reuse, while incurring negligible overhead
● Things not covered in the talk:
○ API and DAG construction
○ Quality-based materialization
○ Model warmstarting
References

● Bhattacherjee, Souvik, et al. "Principles of dataset versioning: Exploring the recreation/storage tradeoff." Proceedings of the VLDB Endowment 8.12 (2015).
● Xin, Doris, et al. "Helix: Holistic optimization for accelerating iterative machine learning." Proceedings of the VLDB Endowment 12.4 (2018): 446-460.
● Edmonds, Jack, and Richard M. Karp. "Theoretical improvements in algorithmic efficiency for network flow problems." Journal of the ACM 19.2 (1972): 248-264. https://doi.org/10.1145/321694.321699
● Kluyver, Thomas, et al. "Jupyter Notebooks – a publishing format for reproducible computational workflows." Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press, 2016. 87-90.
● Burns, Brendan, et al. "Borg, Omega, and Kubernetes." Queue 14.1 (2016): 70-93.
DAG Annotation

● Edge run-time
● Vertex size
● Vertex potential
○ Quality of the best reachable model

[Figure: example workload DAG (train, ad_desc, t_subset, y, cnt_vect, cnt_head, top_feats, X, model_1, model_2) annotated with edge run-times (0.05 s to 5 s), vertex sizes (0.5 MB to 50 MB), and vertex potentials: each vertex takes the accuracy of the best model it reaches (model_1: accuracy 0.9, model_2: accuracy 0.91); a vertex that reaches no model has potential 0.0.]
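Vertex potential can be computed with a single reverse-topological sweep that propagates the best reachable model quality backward through the DAG. A sketch under the assumption that terminal model qualities are known; the function and variable names are illustrative.

```python
def annotate_potential(dag, model_quality):
    """Vertex potential = quality of the best model reachable from it.

    dag maps each vertex to its list of children and is assumed to be
    in topological order; model_quality gives the accuracy of model
    vertices. Non-model vertices with no reachable model get 0.0.
    """
    potential = {}
    # Reverse topological sweep: children are annotated before parents.
    for v in reversed(list(dag)):
        reachable = [potential[c] for c in dag[v]]
        own = model_quality.get(v, 0.0)
        potential[v] = max([own] + reachable)
    return potential
```

In the slide's example, every ancestor of model_2 inherits potential 0.91, while a dead-end vertex such as one feeding no model keeps potential 0.0.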
Model Materialization and Warmstarting

[Figure: run-time of the model benchmarking scenario.]

Impact of Model Warmstarting on Accuracy

[Figure: model accuracy with and without warmstarting.]
Reuse Algorithm (walkthrough)

A linear-time algorithm computes the optimal execution plan with a forward and a
backward pass over the workload DAG. The example DAG contains the artifacts train,
t_subset, y, top_feats, train_2, top_feats_2, merged_feats, and model_3; some are
materialized in EG, the rest are un-materialized or do not exist in EG.

Forward pass:
1. Traverse from the source
2. Accumulate run-times
3. For every materialized node, compare the accumulated run-time with the load time
and mark the node to be loaded whenever loading is cheaper (in the example,
accumulated costs of 0.1, 0.05, and 2.1 are compared against load costs of 0.06
and 1.5)

Backward pass:
1. Traverse backward from the terminal
2. For every loaded artifact, stop the traversal of its parents
3. Prune artifacts that are not visited

In the example, loading a materialized artifact from EG makes its ancestors train
and t_subset unnecessary, so they are pruned from the execution plan.
Storage-aware Materialization

● Many duplicated columns in intermediate data artifacts
● Apply a column deduplication strategy

train = pd.read_csv('train.csv')  # columns: [ad_desc,ts,u_id,price,y]
ad_desc = train['ad_desc']
t_subset = train[['ts','u_id','price']]
y = train['y']

Storage-aware Materialization:
while budget is not exhausted:
    1. run the materialization algorithm
    2. deduplicate the materialized artifacts
    3. update the sizes of the unmaterialized artifacts

Improves storage utilization and run-time