This document summarizes scaling machine learning to billions of parameters using Spark and a parameter server architecture. It describes the requirements for supporting both batch and sequential optimization at web scale. It then outlines the Spark + Parameter server approach, leveraging Spark for distributed processing and the parameter server for synchronizing model updates. Examples of distributed L-BFGS and Word2Vec training are presented to illustrate batch and sequential optimization respectively.
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
Apache Spark MLlib provides scalable implementation of popular machine learning algorithms, which lets users train models from big dataset and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications require solving problems with billions of parameters on a huge amount of data such as Ads CTR prediction and deep neural network. This requirement far exceeds the capacity of exisiting MLlib algorithms many of who use L-BFGS as the underlying solver. In order to fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters in the Spark SQL framework where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with real-world dataset and requirement. We will also discuss how this approach could be applied to other ML algorithms.
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ...Spark Summit
Apache Spark MLlib provides scalable implementation of popular machine learning algorithms, which lets users train models from big dataset and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications require solving problems with billions of parameters on a huge amount of data such as Ads CTR prediction and deep neural network. This requirement far exceeds the capacity of exisiting MLlib algorithms many of who use L-BFGS as the underlying solver. In order to fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters in the Spark SQL framework where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with real-world dataset and requirement. We will also discuss how this approach could be applied to other ML algorithms.
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
Random Walks on Large Scale Graphs with Apache Spark with Min ShenDatabricks
Random Walks on graphs is a useful technique in machine learning, with applications in personalized PageRank, representational learning and others. This session will describe a novel algorithm for enumerating walks on large-scale graphs that benefits from the several unique abilities of Apache Spark.
The algorithm generates a recursive branching DAG of stages that separates out the “closed” and “open” walks. Spark’s shuffle file management system is ingeniously used to accumulate the walks while the computation is progressing. In-memory caching over multi-core executors enables moving the walks several “steps” forward before shuffling to the next stage.
See performance benchmarks, and hear about LinkedIn’s experience with Spark in production clusters. The session will conclude with an observation of how Spark’s unique and powerful construct opens new models of computation, not possible with state-of-the-art, for developing high-performant and scalable algorithms in data science and machine learning.
Operational Tips For Deploying Apache SparkDatabricks
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Deep Dive of ADBMS Migration to Apache Spark—Use Cases SharingDatabricks
eBay has been using enterprise ADBMS for over a decade, and our team is working on batch workload migration from ADBMS to Spark in 2018. There has been so many experiences and lessons we got during the whole migration journey (85% auto + 15% manual migration) - during which we exposed many unexpected issues and gaps between ADBMS and Spark SQL, we made a lot of decisions to fulfill the gaps in practice and contributed many fixes in Spark core in order to unblock ourselves. It will be a really interesting and should be helpful sharing for many folks especially data/software engineers to plan and execute their migration work. And during this session we will share many very specific issues each individually we encountered and how we resolve & work-around with team in real migration processes.
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowDatabricks
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
Recent Developments In SparkR For Advanced AnalyticsDatabricks
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
ROCm and Distributed Deep Learning on Spark and TensorFlowDatabricks
ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with Tensorflow on ROCm, from Horovod to HopsML to Databrick's Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the Data Scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on Hopsworks with ROCm.
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
While the performance delivered by Spark has enabled data scientists to undertake sophisticated analyses on big and complex data in actionable timeframes, too often, the process of manually configuring the underlying Spark jobs (including the number and size of the executors) can be a significant and time consuming undertaking. Not only it does this configuration process typically rely heavily on repeated trial-and-error, it necessitates that data scientists have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement, and develop algorithms that can be used to automatically tune Spark jobs with minimal user involvement,
In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations being used in the flow, the cluster size, configuration and real-time utilization, to automatically determine the optimal Spark job configuration for peak performance.
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Build, Scale, and Deploy Deep Learning Pipelines with EaseDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
This webinar is the first of a series in which we survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
1. It has a simple API that integrates well with enterprise Machine Learning pipelines.
2. It automatically scales out common Deep Learning patterns, thanks to Spark.
3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this webinar, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
* how to build deep learning models in a few lines of code;
* how to scale common tasks like transfer learning and prediction; and
* how to publish models in Spark SQL.
Willump: Optimizing Feature Computation in ML InferenceDatabricks
Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference.
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop, to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models but also ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.
Improving the Spark SQL usability and computing efficiency is one of the missions for Linkedin’s Spark team. In this talk, we will present the Spark SQL ecosystem and roadmaps at Linkedin, and introduce the highlighted projects we are working on, such as:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2d join algorithm to Spark SQL
* Fixing join skewness with adaptive execution
* Enhancing the cost-optimizer with a history-based learning approach
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
When making machine learning applications in Uber, we identified a sequence of common practices and painful procedures, and thus built a machine learning platform as a service. We here present the key components to build such a scalable and reliable machine learning service which serves both our online and offline data processing needs.
Low Latency Polyglot Model Scoring using Apache ApexApache Apex
Data science is fast becoming a complementary approach and process to solve business challenges today. The explosion of frameworks to help data scientists build models bears a testimony to this. However when a model needs to be turned into a production version in very low latency and enterprise grade environments, there are a very few choices with each one having their own strengths and weaknesses. Adding to this is the current disconnect between a data scientists world which is all about modelling and an engineers world which is about SLAs and service guarantees. A framework like Apache Apex can complement each of these roles and provide constructs for both these worlds. This would help enterprises to drastically cut down the cost of model deployment to production environments.
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
Random Walks on Large Scale Graphs with Apache Spark with Min ShenDatabricks
Random Walks on graphs is a useful technique in machine learning, with applications in personalized PageRank, representational learning and others. This session will describe a novel algorithm for enumerating walks on large-scale graphs that benefits from the several unique abilities of Apache Spark.
The algorithm generates a recursive branching DAG of stages that separates out the “closed” and “open” walks. Spark’s shuffle file management system is ingeniously used to accumulate the walks while the computation is progressing. In-memory caching over multi-core executors enables moving the walks several “steps” forward before shuffling to the next stage.
See performance benchmarks, and hear about LinkedIn’s experience with Spark in production clusters. The session will conclude with an observation of how Spark’s unique and powerful construct opens new models of computation, not possible with state-of-the-art, for developing high-performant and scalable algorithms in data science and machine learning.
Operational Tips For Deploying Apache SparkDatabricks
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Deep Dive of ADBMS Migration to Apache Spark—Use Cases SharingDatabricks
eBay has been using enterprise ADBMS for over a decade, and our team is working on batch workload migration from ADBMS to Spark in 2018. There has been so many experiences and lessons we got during the whole migration journey (85% auto + 15% manual migration) - during which we exposed many unexpected issues and gaps between ADBMS and Spark SQL, we made a lot of decisions to fulfill the gaps in practice and contributed many fixes in Spark core in order to unblock ourselves. It will be a really interesting and should be helpful sharing for many folks especially data/software engineers to plan and execute their migration work. And during this session we will share many very specific issues each individually we encountered and how we resolve & work-around with team in real migration processes.
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowDatabricks
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
Recent Developments In SparkR For Advanced AnalyticsDatabricks
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
ROCm and Distributed Deep Learning on Spark and TensorFlowDatabricks
ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with Tensorflow on ROCm, from Horovod to HopsML to Databrick's Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the Data Scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a Tensorflow application embedded in a Spark pipeline written in a Jupyter notebook on Hopsworks with ROCm.
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
While the performance delivered by Spark has enabled data scientists to undertake sophisticated analyses on big and complex data in actionable timeframes, too often, the process of manually configuring the underlying Spark jobs (including the number and size of the executors) can be a significant and time consuming undertaking. Not only it does this configuration process typically rely heavily on repeated trial-and-error, it necessitates that data scientists have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement, and develop algorithms that can be used to automatically tune Spark jobs with minimal user involvement,
In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations being used in the flow, the cluster size, configuration and real-time utilization, to automatically determine the optimal Spark job configuration for peak performance.
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
Build, Scale, and Deploy Deep Learning Pipelines with EaseDatabricks
Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
This webinar is the first of a series in which we survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
1. It has a simple API that integrates well with enterprise Machine Learning pipelines.
2. It automatically scales out common Deep Learning patterns, thanks to Spark.
3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this webinar, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
* how to build deep learning models in a few lines of code;
* how to scale common tasks like transfer learning and prediction; and
* how to publish models in Spark SQL.
Willump: Optimizing Feature Computation in ML InferenceDatabricks
Systems for performing ML inference are increasingly important, but are far slower than they could be because they use techniques designed for conventional data serving workloads, neglecting the statistical nature of ML inference. As an alternative, this talk presents Willump, an optimizer for ML inference.
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop, to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models but also ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.
Improving the Spark SQL usability and computing efficiency is one of the missions for Linkedin’s Spark team. In this talk, we will present the Spark SQL ecosystem and roadmaps at Linkedin, and introduce the highlighted projects we are working on, such as:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2d join algorithm to Spark SQL
* Fixing join skewness with adaptive execution
* Enhancing the cost-optimizer with a history-based learning approach
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
When making machine learning applications in Uber, we identified a sequence of common practices and painful procedures, and thus built a machine learning platform as a service. We here present the key components to build such a scalable and reliable machine learning service which serves both our online and offline data processing needs.
Low Latency Polyglot Model Scoring using Apache ApexApache Apex
Data science is fast becoming a complementary approach and process to solve business challenges today. The explosion of frameworks to help data scientists build models bears a testimony to this. However when a model needs to be turned into a production version in very low latency and enterprise grade environments, there are a very few choices with each one having their own strengths and weaknesses. Adding to this is the current disconnect between a data scientists world which is all about modelling and an engineers world which is about SLAs and service guarantees. A framework like Apache Apex can complement each of these roles and provide constructs for both these worlds. This would help enterprises to drastically cut down the cost of model deployment to production environments.
This talk was held at the 10th meeting on February 3rd 2014 by Sean Owen.
Having collected Big Data, organizations are now keen on data science and “Big Learning”. Much of the focus has been on data science as exploratory analytics: offline, in the lab. However, building from that a production-ready large-scale operational analytics system remains a difficult and ad-hoc endeavor, especially when real-time answers are required. Design patterns for effective implementations are emerging, which take advantage of relaxed assumptions, adopt a new tiered "lambda" architecture, and pick the right scale-friendly algorithms to succeed. Drawing on experience from customer problems and the open source Oryx project at Cloudera, this session will provide examples of operational analytics projects in the field, and present a reference architecture and algorithm design choices for a successful implementation.
Modern Data Warehousing with the Microsoft Analytics Platform SystemJames Serra
The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data. How can you prevent this from happening? Enter the modern data warehouse, which is able to handle and excel with these new trends. It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle “big data” and provide fast queries. Is there one appliance that can support this modern data warehouse? Yes! It is the Analytics Platform System (APS) from Microsoft (formally called Parallel Data Warehouse or PDW) , which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1). In this session I will dig into the details of the modern data warehouse and APS. I will give an overview of the APS hardware and software architecture, identify what makes APS different, and demonstrate the increased performance. In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.
Auto-Train a Time-Series Forecast Model With AML + ADBDatabricks
Supply Chain, Healthcare, Insurance, and Finance often require highly accurate forecasting models in an enterprise large-scale fashion. With Azure Machine Learning on Azure Databricks, the scale and speed to large-scale many-models can be achieved and time-to-product decreases drastically. The better-together story poses an enterprise approach to AI/ML.
Azure AutoML offers an elegant solution efficiently to build forecasting models on Azure Databricks compute solving sophisticated business problems. The presentation covers the Azure Machine Learning + Azure Databricks approach (see slides attached) while the demo covers a hands-on business problem building a forecasting model in Azure Databricks using Azure Machine Learning. The AI/ML better-together story is elevated as MLFlow for Data Science Lifecycle Management and Hyperopt for distributed model execution completes AI/ML enterprise readiness for industry problems.
DataMass Summit - Machine Learning for Big Data in SQL ServerŁukasz Grala
Sesja pokazująca zarówno Machine Learning Server (czyli algorytmy uczenia maszynowego w językach R i Python), ale także możliwość korzystania z danych JSON w SQL Server, czy też łączenia się do danych znajdujących się na HDFS, HADOOP, czy Spark poprzez Polybase w SQL Server, by te dane wykorzystywać do analizy, predykcji poprzez modele w językach R lub Python.
Scaling AI in production using PyTorchgeetachauhan
Slides from my talk at MLOps World' 21
Deploying AI models in production and scaling the ML services is still a big challenge. In this talk we will cover details of how to deploy your AI models, best practices for the deployment scenarios, and techniques for performance optimization and scaling the ML services. Come join us to learn how you can jumpstart the journey of taking your PyTorch models from Research to production.
Emerging technologies /frameworks in Big DataRahul Jain
A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.
Applied Machine learning using H2O, python and R WorkshopAvkash Chauhan
Note: Get all workshop content at - https://github.com/h2oai/h2o-meetups/tree/master/2017_02_22_Seattle_STC_Meetup
Basic knowledge of R/python and general ML concepts
Note: This is bring-your-own-laptop workshop. Make sure you bring your laptop in order to be able to participate in the workshop
Level: 200
Time: 2 Hours
Agenda:
- Introduction to ML, H2O and Sparkling Water
- Refresher of data manipulation in R & Python
- Supervised learning
---- Understanding liner regression model with an example
---- Understanding binomial classification with an example
---- Understanding multinomial classification with an example
- Unsupervised learning
---- Understanding k-means clustering with an example
- Using machine learning models in production
- Sparkling Water Introduction & Demo
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do you deploy these ML model to a production environment? How do you embed what you’ve learned into customer facing data applications?
In this talk I will discuss best practices on how data scientists productionize machine learning models, do a deep dive with actual case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Building Analytic Apps for SaaS: “Analytics as a Service”Amazon Web Services
TIBCO Jaspersoft® for AWS is a business intelligence suite that helps you deliver stunning interactive reports and dashboards inside your app that make it easy for your customers to get answers. Purpose-built for AWS, our reporting and analytics server quickly and easily connects to Amazon Relational Database Service (RDS), Amazon Redshift, and Amazon EMR. It includes ad-hoc reporting, dashboards, data analysis, data visualization, and data blending. In less than 10 minutes, you can be analyzing and reporting on your data. You get a full Cloud BI server starting at less than $1/hour, with no user or data limits and no additional fees.
This webinar deck shows how embeddable analytics with TIBCO Jaspersoft for AWS gives you the power to create the experience your end users demand and how to scale and manage that experience across your customer base with AWS.
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)Sascha Wenninger
Provides an overview of popular integration approaches, maps them to SAP's integration tools and concludes with some lessons learnt in their application.
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ...Chris Fregly
Chris Fregly, Founder @ PipelineAI, will walk you through a real-world, complete end-to-end Pipeline-optimization example. We highlight hyper-parameters - and model pipeline phases - that have never been exposed until now.
While most Hyperparameter Optimizers stop at the training phase (ie. learning rate, tree depth, ec2 instance type, etc), we extend model validation and tuning into a new post-training optimization phase including 8-bit reduced precision weight quantization and neural network layer fusing - among many other framework and hardware-specific optimizations.
Next, we introduce hyperparameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU).
Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.
Bio
Chris Fregly is Founder and Applied AI Engineer at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.
He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, author of the O’Reilly Training and Video Series titled, "High Performance TensorFlow in Production with Kubernetes and GPUs."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
Data Scientists and Machine Learning practitioners, nowadays, seem to be churning out models by the dozen and they continuously experiment to find ways to improve their accuracies. They also use a variety of ML and DL frameworks & languages , and a typical organization may find that this results in a heterogenous, complicated bunch of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production" ? How does an organization scale inference engines out & make them available for real-time applications without significant latencies ? There needs to be different techniques for batch (offline) inferences and instant, online scoring. Data needs to be accessed from various sources and cleansing, transformations of data needs to be enabled prior to any predictions. In many cases, there maybe no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorizations built in, approval processes and still support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes based offering for the Private Cloud and optimized for the HortonWorks Hadoop Data Platform. DSX essentially brings in typical software engineering development practices to Data Science, organizing the dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies and even rollback models & custom scorers as well as how API based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
Similar to Scaling Machine Learning To Billions Of Parameters (20)
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on github and available from Snorkel.Stanford.edu.
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
Spark Summit East Keynote by Ion Stoica
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
4. 1.2 1.2 -78
6.3 -8.1
5.4 -8
4.2 2.3 -3.4
-1.1
2.3 4.9 7.4
4.5 2.1 -15
2.3 2.3
0.5 1.2 -0.9
-24
-1.3 -2.2 1.8
-4.9 -2.1 1.2
Web Scale ML
Billions of features
Hundredsofbillionsofexamples
Big Model
BigData
Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples
5. 1.2 1.2 -78
6.3 -8.1
5.4 -8
4.2 2.3 -3.4
-1.1
2.3 4.9 7.4
4.5 2.1 -15
2.3 2.3
0.5 1.2 -0.9
-24
-1.3 -2.2 1.8
-4.9 -2.1 1.2
Web Scale ML
Billions of features
Hundredsofbillionsofexamples
Big Model
BigData
Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples
Store Store Store
6. 1.2 1.2 -78
6.3 -8.1
5.4 -8
4.2 2.3 -3.4
-1.1
2.3 4.9 7.4
4.5 2.1 -15
2.3 2.3
0.5 1.2 -0.9
-24
-1.3 -2.2 1.8
-4.9 -2.1 1.2
Web Scale ML
Billions of features
Hundredsofbillionsofexamples
Big Model
BigData
Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples
Worker
Worker
Worker
Store Store Store
7. 1.2 1.2 -78
6.3 -8.1
5.4 -8
4.2 2.3 -3.4
-1.1
2.3 4.9 7.4
4.5 2.1 -15
2.3 2.3
0.5 1.2 -0.9
-24
-1.3 -2.2 1.8
-4.9 -2.1 1.2
Web Scale ML
Billions of features
Hundredsofbillionsofexamples
Big Model
BigData
Ex: Yahoo word2vec - 120 billion parameters and 500 billion samples
Worker
Worker
Worker
Store Store Store
Each example depends only on a tiny fraction of the model
8. Two Optimization Strategies
Model
Multiple epochs…
BATCH
Example: Gradient Descent, L-BFGS
Model
Model
Model
SEQUENTIAL
Multiple random samples…
Example: (Minibatch) stochastic gradient method,
perceptron
Examples
9. Two Optimization Strategies
Model
Multiple epochs…
BATCH
Example: Gradient Descent, L-BFGS
Model
Model
Model
SEQUENTIAL
Multiple random samples…
Example: (Minibatch) stochastic gradient method,
perceptron
• Small number of model updates
• Accurate
• Each epoch may be expensive.
• Easy to parallelize.
Examples
10. Two Optimization Strategies
Model
Multiple epochs…
BATCH
Example: Gradient Descent, L-BFGS
Model
Model
Model
SEQUENTIAL
Multiple random samples…
Example: (Minibatch) stochastic gradient method,
perceptron
• Small number of model updates
• Accurate
• Each epoch may be expensive.
• Easy to parallelize.
• Requires lots of model updates.
• Not as accurate, but often good enough
• A lot of progress in one pass* for big data.
• Not trivial to parallelize.
*also optimal in terms of generalization error (often with a lot of tuning)
Examples
13. Requirements
✓ Support both batch and sequential optimization
✓ Sequential training: Handle frequent updates to the model
14. Requirements
✓ Support both batch and sequential optimization
✓ Sequential training: Handle frequent updates to the model
✓ Batch training: 100+ passes each pass must be fast.
15. Parameter Server (PS)
Client
Data
Client
Data
Client
Data
Client
Data
Training state stored in PS shards, asynchronous updates
PS Shard PS ShardPS Shard
ΔM
Model Update
M
Model
Early work: Yahoo LDA by Smola and Narayanamurthy based on memcached (2010),
Introduced in Google’s Distbelief (2012), refined in Petuum / Bösen (2013), Mu Li et al (2014)
17. ML in Spark alone
Executor Executor
CoreCore Core Core Core
Driver
Holds model
18. ML in Spark alone
Executor Executor
CoreCore Core Core Core
Driver
Holds model
MLlib optimization
19. ML in Spark alone
• Sequential:
– Driver-based communication limits frequency of model updates.
– Large minibatch size limits model update frequency, convergence suffers.
Executor Executor
CoreCore Core Core Core
Driver
Holds model
MLlib optimization
20. ML in Spark alone
• Sequential:
– Driver-based communication limits frequency of model updates.
– Large minibatch size limits model update frequency, convergence suffers.
• Batch:
– Driver bandwidth can be a bottleneck
– Synchronous stage wise processing limits throughput.
Executor Executor
CoreCore Core Core Core
Driver
Holds model
MLlib optimization
21. ML in Spark alone
• Sequential:
– Driver-based communication limits frequency of model updates.
– Large minibatch size limits model update frequency, convergence suffers.
• Batch:
– Driver bandwidth can be a bottleneck
– Synchronous stage wise processing limits throughput.
Executor Executor
CoreCore Core Core Core
Driver
Holds model
MLlib optimization
PS Architecture circumvents both limitations…
23. • Leverage Spark for HDFS I/O, distributed processing, fine-grained
load balancing, failure recovery, in-memory operations
Spark + Parameter Server
24. • Leverage Spark for HDFS I/O, distributed processing, fine-grained
load balancing, failure recovery, in-memory operations
• Use PS to sync models, incremental updates during training, or
sometimes even some vector math.
Spark + Parameter Server
25. • Leverage Spark for HDFS I/O, distributed processing, fine-grained
load balancing, failure recovery, in-memory operations
• Use PS to sync models, incremental updates during training, or
sometimes even some vector math.
Spark + Parameter Server
HDFS
Training state stored in PS shards
Driver
Executor ExecutorExecutor
CoreCore Core Core
PS Shard PS ShardPS Shard
control
control
33. Map PS API
• Distributed key-value store abstraction
• Supports batched operations in addition to usual get and put
• Many operations return a future – you can operate asynchronously or block
34. Matrix PS API
• Vector math (BLAS style operations), in addition to everything Map API provides
• Increment and fetch sparse vectors (e.g., for gradient aggregation)
• We use other custom operations on shard (API not shown)
42. L-BFGS Background
Exact, impractical
Step Size computation
- Needs to satisfy some technical (Wolfe) conditions
- Adaptively determined from data
Inverse Hessian Approximation
(based on history of L-previous gradients and model deltas)
Approximate, practical
Newton’s method
Gradient Descent
Using curvature information,
you can converge faster…
43. L-BFGS Background
Exact, impractical
Step Size computation
- Needs to satisfy some technical (Wolfe) conditions
- Adaptively determined from data
Inverse Hessian Approximation
(based on history of L-previous gradients and model deltas)
Approximate, practical
Newton’s method
Gradient Descent
Using curvature information,
you can converge faster…
44. L-BFGS Background
Exact, impractical
Step Size computation
- Needs to satisfy some technical (Wolfe) conditions
- Adaptively determined from data
Inverse Hessian Approximation
(based on history of L-previous gradients and model deltas)
Approximate, practical
Newton’s method
Gradient Descent
Using curvature information,
you can converge faster…
45. L-BFGS Background
Exact, impractical
Step Size computation
- Needs to satisfy some technical (Wolfe) conditions
- Adaptively determined from data
Inverse Hessian Approximation
(based on history of L-previous gradients and model deltas)
Approximate, practical
Newton’s method
Gradient Descent
Using curvature information,
you can converge faster…
dotprod
axpy (y ← ax + y)
copy
axpy
scal
scal
Vector Math
dotprod
46. L-BFGS Background
Exact, impractical
Step Size computation
- Needs to satisfy some technical (Wolfe) conditions
- Adaptively determined from data
Inverse Hessian Approximation
(based on history of L-previous gradients and model deltas)
Approximate, practical
Newton’s method
Gradient Descent
Using curvature information,
you can converge faster…
47. Executor ExecutorExecutor
HDFS HDFSHDFS
Driver
PS PS PS PS
Distributed LBFGS*
Compute gradient and loss
1. Incremental sparse gradient update
2. Fetch sparse portions of model
Coordinates executor
Step 1: Compute and update Gradient
*Our design is very similar to Sandblaster L-BFGS, Jeff Dean et al, Large Scale Distributed Deep Networks (2012)
state vectors
48. Executor ExecutorExecutor
HDFS HDFSHDFS
Driver
PS PS PS PS
Distributed LBFGS*
Compute gradient and loss
1. Incremental sparse gradient update
2. Fetch sparse portions of model
Coordinates executor
Step 1: Compute and update Gradient
*Our design is very similar to Sandblaster L-BFGS, Jeff Dean et al, Large Scale Distributed Deep Networks (2012)
state vectors
54. Speedup tricks
• Intersperse communication and computation
• Quicker convergence
– Parallel line search for step size
– Curvature for initial Hessian approximation*
*borrowed from vowpal wabbit
55. Speedup tricks
• Intersperse communication and computation
• Quicker convergence
– Parallel line search for step size
– Curvature for initial Hessian approximation*
• Network bandwidth reduction
– Compressed integer arrays
– Only store indices for binary data
*borrowed from vowpal wabbit
56. Speedup tricks
• Intersperse communication and computation
• Quicker convergence
– Parallel line search for step size
– Curvature for initial Hessian approximation*
• Network bandwidth reduction
– Compressed integer arrays
– Only store indices for binary data
• Matrix math on minibatch
*borrowed from vowpal wabbit
57. Speedup tricks
• Intersperse communication and computation
• Quicker convergence
– Parallel line search for step size
– Curvature for initial Hessian approximation*
• Network bandwidth reduction
– Compressed integer arrays
– Only store indices for binary data
• Matrix math on minibatch
0
750
1500
2250
3000
10
20
100
221612
2880
1260
96
MLlib
PS + Spark
1.6 x 108 examples, 100 executors, 10 cores
time(inseconds)perepoch
feature size (millions)
*borrowed from vowpal wabbit
67. Word2vec
• Skipgram with negative sampling:
– training set includes pairs of words and neighbors in corpus,
along with randomly selected words for each neighbor
68. Word2vec
• Skipgram with negative sampling:
– training set includes pairs of words and neighbors in corpus,
along with randomly selected words for each neighbor
– determine w → u(w),v(w) so that sigmoid(u(w)•v(w’)) is close
to (minimizes log loss) the probability that w’ is a neighbor of
w as opposed to a randomly selected word.
69. Word2vec
• Skipgram with negative sampling:
– training set includes pairs of words and neighbors in corpus,
along with randomly selected words for each neighbor
– determine w → u(w),v(w) so that sigmoid(u(w)•v(w’)) is close
to (minimizes log loss) the probability that w’ is a neighbor of
w as opposed to a randomly selected word.
– SGD involves computing many vector dot products e.g.,
u(w)•v(w’) and vector linear combinations
e.g., u(w) += α v(w’).
70. Word2vec Application at Yahoo
• Example training data:
gas_cap_replacement_for_car
slc_679f037df54f5d9c41cab05bfae0926
gas_door_replacement_for_car
slc_466145af16a40717c84683db3f899d0a fuel_door_covers
adid_c_28540527225_285898621262
slc_348709d73214fdeb9782f8b71aff7b6e autozone_auto_parts
adid_b_3318310706_280452370893 auoto_zone
slc_8dcdab5d20a2caa02b8b1d1c8ccbd36b
slc_58f979b6deb6f40c640f7ca8a177af2d
[ Grbovic, et. al. SIGIR 2015 and SIGIR 2016 (to appear) ]
73. Distributed Word2vec
• Needed system to train 200 million 300
dimensional word2vec model using minibatch
SGD
• Achieved in a high throughput and network
efficient way using our matrix based PS server:
74. Distributed Word2vec
• Needed system to train 200 million 300
dimensional word2vec model using minibatch
SGD
• Achieved in a high throughput and network
efficient way using our matrix based PS server:
– Vectors don’t go over network.
75. Distributed Word2vec
• Needed system to train 200 million 300
dimensional word2vec model using minibatch
SGD
• Achieved in a high throughput and network
efficient way using our matrix based PS server:
– Vectors don’t go over network.
– Most compute on PS servers, with clients aggregating
partial results from shards.
83. Distributed Word2vec
• Network lower by factor of #shards/dimension
compared to conventional PS based system
(1/20 to 1/100 for useful scenarios).
84. Distributed Word2vec
• Network lower by factor of #shards/dimension
compared to conventional PS based system
(1/20 to 1/100 for useful scenarios).
• Trains 200 million vocab, 55 billion word search
session in 2.5 days.
85. Distributed Word2vec
• Network lower by factor of #shards/dimension
compared to conventional PS based system
(1/20 to 1/100 for useful scenarios).
• Trains 200 million vocab, 55 billion word search
session in 2.5 days.
• In production for regular training in Yahoo search
ad serving system.
86. Other Projects using Spark + PS
• Online learning on PS
– Personalization as a Service
– Sponsored Search
• Factorization Machines
– Large scale user profiling
94. Summary
• Parameter server indispensable for big models
• Spark + Parameter Server has proved to be very
flexible platform for our large scale computing
needs
• Direct computation on the parameter servers
accelerate training for our use-cases