This document collects talks and slides on scaling out logistic regression with Apache Spark. It opens with the need to classify a large number of websites using machine learning: several approaches to logistic regression were tried, from a single-machine Java implementation to Spark for better scalability. Spark's L-BFGS algorithm was chosen as an out-of-the-box distributed logistic regression solution. Challenges of implementing logistic regression at large scale, such as overfitting, are discussed, along with the methods used to address them: L2 regularization, cross-validation to select the regularization parameter, and extensions made to Spark's L-BFGS implementation.
2014-06-20 Multinomial Logistic Regression with Apache Spark - DB Tsai
Logistic regression can be used to model not only binary outcomes but also, with some extension, multinomial ones. In this talk, DB walks through the basic idea of binary logistic regression step by step and then extends it to the multinomial case. He shows how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications in document classification and computational linguistics are of exactly this type. He discusses how to address this problem by using the L-BFGS optimizer instead of the Newton optimizer.
Bio:
DB Tsai is a machine learning engineer at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits as a PhD student at Stanford.
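To make the parallelization idea concrete, here is a minimal single-machine sketch of the batch gradient step for binary logistic regression. The per-row gradient contributions are independent, which is what Spark distributes across cached RDD partitions before summing the partial gradients on the driver. The dataset, learning rate, and iteration count below are illustrative assumptions, not from the talk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, iters=200):
    """Batch gradient descent for binary logistic regression. Each
    row's contribution to the gradient is independent of the others,
    so the inner loop is what a distributed implementation farms out
    to workers before summing the partial gradients."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

# Tiny illustrative dataset: first column is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.0, 0.0, 1.0, 1.0]
w = train_logistic(X, y)
```

The multinomial extension replaces the sigmoid with a softmax over K classes and makes `w` a matrix; the structure of the per-row computation, and hence the parallelization, is unchanged.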
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin... - DB Tsai
Apache Spark is a cluster computing engine offering a number of advantages over its predecessor, MapReduce. Spark uses an in-memory cache to scale and parallelize iterative algorithms, which makes it ideal for large-scale machine learning. It is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, DB introduces Spark and shows how to use its high-level API in Java, Scala, or Python. He then shows how to use MLlib, the library of machine learning algorithms included with Spark, to do classification, regression, clustering, and recommendation at large scale.
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S... - DB Tsai
Nonlinear methods are widely used because they produce higher accuracy than linear methods; however, they are generally more expensive in model size, training time, and scoring. With proper feature engineering techniques such as polynomial expansion, linear methods can be as competitive as nonlinear ones. In the process of mapping the data into a higher-dimensional space, linear methods become subject to overfitting and instability of coefficients, which can be addressed by penalization methods including Lasso and Elastic-Net. Finally, we'll show how to train linear models with Elastic-Net regularization using MLlib.
Several learning algorithms, such as kernel methods, decision trees, and random forests, are nonlinear approaches widely used for their better performance compared with linear methods. However, with feature engineering techniques like polynomial expansion, which maps the data into a higher-dimensional space, the performance of linear methods can be as competitive as that of nonlinear ones. As a result, linear methods remain very useful: their training time is significantly faster than that of nonlinear methods, and the model is just a small vector, which makes the prediction step very efficient and easy. However, by mapping the data into a higher-dimensional space, linear methods become subject to overfitting and instability of coefficients, issues that can be successfully addressed by penalization methods including Lasso and Elastic-Net. The Lasso method, with an L1 penalty, tends to shrink many coefficients exactly to zero while leaving a few others with comparatively little shrinkage. An L2 penalty tends to result in all coefficients being small but non-zero. Combining the L1 and L2 penalties yields the Elastic-Net method, which tends to give a result in between. In the first part of the talk, we'll give an overview of linear methods, including commonly used formulations and optimization techniques such as L-BFGS and OWLQN. In the second part, we'll show how to train linear models with Elastic-Net using our recent contribution to Spark MLlib. We'll also discuss how linear models are applied in practice to big datasets, and how polynomial expansion can be used to dramatically increase performance.
DB Tsai is an Apache Spark committer and a Senior Research Engineer at Netflix. He has recently been working with the Apache Spark community to add several new algorithms, including linear regression and binary logistic regression with Elastic-Net (L1/L2) regularization, multinomial logistic regression, and the L-BFGS optimizer. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he developed innovative large-scale distributed linear algorithms and contributed them back to the open source Apache Spark project.
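The shrinkage behaviors the abstract describes can be seen directly in the proximal (soft-thresholding) operator of the penalties. This is a sketch of the penalty mechanics only, not of MLlib's solver; the function names are mine.

```python
def soft_threshold(z, gamma):
    """Proximal operator of the L1 penalty: shrink toward zero, and
    snap anything within gamma of zero exactly to zero -- the
    mechanism behind Lasso's sparse solutions."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net_prox(z, lam, alpha):
    """Closed-form single-coefficient update under the Elastic-Net
    penalty lam * (alpha * |w| + (1 - alpha) / 2 * w**2):
    soft-threshold for the L1 part, then shrink for the L2 part."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```

With `alpha = 1` this is pure Lasso (exact zeros), with `alpha = 0` pure ridge (uniform shrinkage, no zeros), and intermediate values give the in-between behavior the talk mentions.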
Stochastic gradient descent and its tuning - Arsalan Qadri
This paper discusses optimization algorithms used for big data applications. We start by explaining the gradient descent algorithm and its limitations. We then delve into stochastic gradient descent and explore methods to improve it by adjusting learning rates.
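One common way to adjust the learning rate, as the abstract hints, is a decaying schedule: large early steps for fast progress, smaller later steps to damp the variance of the noisy gradient estimates. A minimal sketch on a 1-D problem; the schedule and constants are illustrative assumptions.

```python
import random

def sgd(grad_fn, w0, lr0=0.5, decay=0.01, steps=500, seed=0):
    """SGD with a 1/(1 + decay * t) learning-rate schedule."""
    rng = random.Random(seed)
    w = w0
    for t in range(steps):
        lr = lr0 / (1.0 + decay * t)
        w -= lr * grad_fn(w, rng)
    return w

# Noisy gradient of f(w) = (w - 3)^2 / 2, i.e. (w - 3) plus noise,
# standing in for a gradient estimated from a random mini-batch.
def noisy_grad(w, rng):
    return (w - 3.0) + rng.gauss(0.0, 0.1)

w = sgd(noisy_grad, w0=0.0)
```

With a constant learning rate the iterate keeps bouncing in a noise-sized ball around the minimum; the decay shrinks that ball over time.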
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16 - MLconf
Teaching K-Means New Tricks: Over 50 years old, the k-means algorithm remains one of the most popular clustering algorithms. In this talk we’ll cover some recent developments, including better initialization, the notion of coresets, clustering at scale, and clustering with outliers.
Gradient descent optimization with simple examples. Covers SGD, mini-batch, momentum, AdaGrad, RMSProp, and Adam.
Made for people with little knowledge of neural networks.
Slides to support Austin Machine Learning Meetup, 1/19/2015.
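Two of the optimizers those slides cover can be sketched as single-parameter update rules; this illustrative comparison on a toy quadratic uses assumed hyperparameters, not values from the slides.

```python
import math

def momentum_step(w, g, v, lr=0.1, beta=0.9):
    """Classical momentum: accumulate a velocity from past gradients,
    then move along it."""
    v = beta * v + g
    return w - lr * v, v

def adam_step(w, g, m, s, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected estimates of the first (m) and second (s)
    gradient moments give each step an adaptive scale."""
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(s_hat) + eps), m, s

# Minimize f(w) = (w - 2)^2, whose gradient is 2 * (w - 2).
w_m = w_a = 0.0
v = m = s = 0.0
for t in range(1, 501):
    w_m, v = momentum_step(w_m, 2.0 * (w_m - 2.0), v)
    w_a, m, s = adam_step(w_a, 2.0 * (w_a - 2.0), m, s, t)
```

AdaGrad and RMSProp fit the same template: they keep only the second-moment accumulator (summed for AdaGrad, exponentially decayed for RMSProp) and divide the raw gradient by its square root.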
Overview of techniques from recent Kaggle code for online logistic regression with FTRL-proximal (SGD, L1/L2 regularization) and the hashing trick.
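The combination those slides describe can be sketched as follows, in the spirit of the widely circulated Kaggle scripts (McMahan et al.'s FTRL-proximal update plus feature hashing). The class structure, feature encoding, and hyperparameters here are my assumptions for illustration.

```python
import math

D = 2 ** 20  # size of the hashed feature space

def hash_features(raw, D=D):
    """Hashing trick: map raw categorical features straight to column
    indices, with no feature dictionary to build or store."""
    return [hash(f) % D for f in raw]

class FTRLProximal:
    """Per-coordinate FTRL-proximal updates: adaptive per-feature
    learning rates plus L1/L2 regularization, so rarely seen hashed
    features end up with exactly-zero weights."""

    def __init__(self, alpha, beta, l1, l2):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * D
        self.n = [0.0] * D

    def weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps small accumulated gradients at zero
        sign = -1.0 if z < 0 else 1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, idxs):
        wTx = sum(self.weight(i) for i in idxs)
        return 1.0 / (1.0 + math.exp(-max(min(wTx, 35.0), -35.0)))

    def update(self, idxs, p, y):
        g = p - y  # log-loss gradient for binary-valued features
        for i in idxs:
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self.weight(i)
            self.n[i] += g * g

# Toy stream: one informative token per example.
model = FTRLProximal(alpha=0.5, beta=1.0, l1=0.1, l2=1.0)
good, bad = hash_features(["token=good"]), hash_features(["token=bad"])
for _ in range(300):
    for idxs, label in ((good, 1.0), (bad, 0.0)):
        model.update(idxs, model.predict(idxs), label)
```

Note that the weights are reconstructed on demand from the accumulators `z` and `n`; only those two arrays are stored, which is what keeps the method attractive for online, memory-bounded training.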
Parallel External Memory Algorithms Applied to Generalized Linear Models - Revolution Analytics
Presentation by Lee Edlefsen, Revolution Analytics, at JSM 2012, San Diego, CA, July 30, 2012
For the past several decades the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives. To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMAs) provide the foundation for such software. External memory algorithms (EMAs) are those that do not require all data to be in RAM, and are widely available. Parallel implementations of EMAs allow them to run on multiple cores and computers, and to process unlimited rows of data. This paper describes a general approach to efficiently parallelizing EMAs, using an R and C++ implementation of GLM as a detailed example. It examines the requirements for efficient PEMAs; the arrangement of code for automatic parallelization; efficient threading; and efficient inter-process communication. It includes billion-row benchmarks showing linear scaling with rows and nodes, demonstrating that extremely high performance is achievable.
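The classic external-memory trick for linear models is to stream over the data in chunks while accumulating only small sufficient statistics. This sketch shows the idea for least squares; it is my illustration of the EMA/PEMA pattern, not code from the paper.

```python
def solve(a, b):
    """Gauss-Jordan elimination -- fine for the small p x p system."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def chunked_linear_regression(chunks):
    """External-memory least squares: stream over chunks of rows,
    accumulating only the p x p statistic X'X and the p-vector X'y.
    The full data never needs to fit in RAM, and because the chunk
    contributions simply add up, the accumulation can run in parallel
    across cores and nodes -- the essence of a PEMA."""
    p = len(chunks[0][0][0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for chunk in chunks:
        for x, yv in chunk:
            for i in range(p):
                xty[i] += x[i] * yv
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return solve(xtx, xty)

# Two chunks of (features-with-intercept, response) rows; y = 1 + 2x.
chunks = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
beta = chunked_linear_regression(chunks)
```

GLMs with non-identity links follow the same pattern, accumulating the weighted statistics of iteratively reweighted least squares chunk by chunk, one pass per IRLS iteration.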
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16 - MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, since they require training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni... - MLconf
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
Native ads (ads that match the look and feel of the embedding page) have become a multi-billion dollar business in recent years. Gemini native is Yahoo’s native advertisement platform and this talk will overview some of the science behind its ad ranking.
The accurate prediction of an ad’s click-through rate (CTR) for a given impression is a key component of any such ad ranking system as it allows one to rank the ads according to their expected revenue. I will give a short overview of different CTR prediction models and deep dive into the major components of large-scale logistic regression models; a special focus will be given to implementing such a logistic regression model in Apache Spark.
Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark... - Chris Fregly
Advanced Spark and TensorFlow Meetup 08-04-2016
Fundamental Algorithms of Neural Networks including Gradient Descent, Back Propagation, Auto Differentiation, Partial Derivatives, Chain Rule
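The link between the chain rule, auto differentiation, and backpropagation listed above can be shown in a few lines of reverse-mode autodiff. This toy `Var` class is my illustration of the principle, not code from the meetup.

```python
class Var:
    """Minimal reverse-mode automatic differentiation. Each operation
    records its inputs together with the local partial derivative, and
    backward() applies the chain rule along every recorded edge --
    which is all backpropagation does, systematically, in a network."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        # Accumulate: a Var used twice receives gradient from each use.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# d/dx of f(x, y) = x*y + x at (3, 4) is y + 1 = 5; d/dy is x = 3.
x, y = Var(3.0), Var(4.0)
f = x * y + x
f.backward()
```

Production frameworks do the same thing with a topologically ordered tape so each node is visited once, but the gradients they produce are exactly these chain-rule sums over paths.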
GE Aviation Spark application: experience porting analytics into PySpark ML p... - Databricks
GE is a world leader in the manufacture of commercial jet engines, offering products for many of the best-selling commercial airframes. With more than 33,000 engines in service, GE Aviation has a history of developing analytics for monitoring its commercial engines fleets. In recent years, GE Aviation Digital has developed advanced analytic solutions for engine monitoring, with the target of improving detection and reducing false alerts, when compared to conventional analytic approaches. The advanced analytics are implemented in a real-time monitoring system which notifies GE’s Fleet Support team on a per flight basis. These analytics are developed and validated using large, historical datasets.
Analytic tools such as SQL Server and MATLAB were used until recently, when GE’s data was moved to an Apache Spark environment. Consequently, our advanced analytics are now being migrated to Spark, where there should also be performance gains with bigger data sets. In this talk we will share experiences of converting our advanced algorithms to custom Spark ML pipelines, as well as outlining various case studies.
With Honor Powrie and Peter Knight
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME - HONGJOO LEE
A 45-minute talk about collecting home network performance measurements, analyzing and forecasting time series data, and building an anomaly detection system.
In this talk, we will go through the whole process of data mining and knowledge discovery. Firstly we write a script to run speed test periodically and log the metric. Then we parse the log data and convert them into a time series and visualize the data for a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep learning techniques are used for the analysis, including ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
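Before reaching for ARIMA or an LSTM, a rolling z-score is a common statistical baseline for flagging anomalies in a series like periodic speed-test measurements. This sketch and its synthetic series are my illustration, not material from the talk.

```python
from collections import deque
import statistics

def rolling_zscore_anomalies(series, window=20, threshold=3.0):
    """Flag points more than `threshold` standard deviations away from
    the mean of the preceding `window` points."""
    buf = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(series):
        if len(buf) == window:
            mu = statistics.mean(buf)
            sd = statistics.stdev(buf)
            if sd > 0 and abs(x - mu) / sd > threshold:
                anomalies.append(i)
        buf.append(x)
    return anomalies

# Synthetic speed-test series: steady ~50 Mbps with small periodic
# wobble, plus one dropout at index 30.
series = [50.0 + 0.5 * ((i * 7) % 5 - 2) for i in range(60)]
series[30] = 5.0
anomalies = rolling_zscore_anomalies(series)
```

The baseline also makes the limitations obvious: it has no notion of trend or daily seasonality, which is exactly what the forecasting models in the talk add.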
Title: "Understanding PyTorch: PyTorch in Image Processing". Github: https://github.com/azarnyx/PyData_Meetup. The Dataset: https://goo.gl/CWmLWD.
The talk was given at a PyData Meetup held in Munich on 06.03.2019 at the Data Reply office, by Dmitrii Azarnykh, data scientist at Data Reply.
Chap 8. Optimization for training deep models - Young-Geun Choi
Internal lab seminar material, summarizing and excerpting Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press. It introduces the objective-function optimization methods commonly used when training deep neural network models.
Online learning, Vowpal Wabbit and Hadoop - Héloïse Nonne
Online learning, Vowpal Wabbit and Hadoop
Online learning has recently caught a lot of attention, following some competitions, and especially after Criteo released 11GB for the training set of a Kaggle contest.
Online learning makes it possible to process massive data, as the learner consumes data sequentially while using little memory and limited CPU resources. It is also particularly well suited to handling time-evolving data.
Vowpal Wabbit has become quite popular: it is a handy, light, and efficient command-line tool for online learning on gigabytes of data, even on a standard laptop with standard memory. After a reminder of online learning principles, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016 - MLconf
Using Bayesian Optimization to Tune Machine Learning Models: In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We will motivate the problem and give example applications.
We will also talk about our development of a robust benchmark suite for our algorithms including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods. We will discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine.
We will end with an in-depth example of using these methods to tune the features and hyperparameters of a real world problem and give several real world applications.
Implementation of linear regression and logistic regression on SparkDalei Li
This presentation was developed for a course project at Technical University of Madrid. The course is massively parallel machine learning supervised by Alberto Mozo and Bruno Ordozgoiti.
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Spark Summit
This talk tells the story of implementation and optimization of a sparse logistic regression algorithm in spark. I would like to share the lessons I learned and the steps I had to take to improve the speed of execution and convergence of my initial naive implementation. The message isn’t to convince the audience that logistic regression is great and my implementation is awesome, rather it will give details about how it works under the hood, and general tips for implementing an iterative parallel machine learning algorithm in spark. The talk is structured as a sequence of “lessons learned” that are shown in form of code examples building on the initial naive implementation. The performance impact of each “lesson” on execution time and speed of convergence is measured on benchmark datasets. You will see how to formulate logistic regression in a parallel setting, how to avoid data shuffles, when to use a custom partitioner, how to use the ‘aggregate’ and ‘treeAggregate’ functions, how momentum can accelerate the convergence of gradient descent, and much more. I will assume basic understanding of machine learning and some prior knowledge of spark. The code examples are written in scala, and the code will be made available for each step in the walkthrough.
Part 2 of the Deep Learning Fundamentals Series, this session discusses Tuning Training (including hyperparameters, overfitting/underfitting), Training Algorithms (including different learning rates, backpropagation), Optimization (including stochastic gradient descent, momentum, Nesterov Accelerated Gradient, RMSprop, Adaptive algorithms - Adam, Adadelta, etc.), and a primer on Convolutional Neural Networks. The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdfKundjanasith Thonglek
Real-time processing is a fast and prompt processing technology that needs to complete the execution within a limited time constraint almost equal to the input time. Executing such real-time processing needs an efficient auto-scaling system which provides sufficient resources to compute the process within the time constraint. We use Apache Spark framework to build a cluster which supports real-time processing. The major challenge of scaling Apache Spark cluster automatically for the real-time processing is how to handle the unpredictable input data size and also the unpredictable resource availability of the underlying cloud infrastructure. If the scaling-out of the cluster is too slow then the application can not be executed within the time constraint as a result of insufficient resources. If the scaling-in of the cluster is slow, the resources are wasted without being utilized, and it leads less resource utilization. This research follows the real-world scenario where the computing resources are bounded by a certain number of computing nodes due to limited budget as well as the computing time is limited due to the nature of near real-time application. We design an auto-scaling system that applies a deep reinforcement learning technique, DQN (Deep Q-Network), to improve resource utilization efficiently. Our model-based DQN allows to automatically optimize the scaling of the cluster, because the DQN can autonomously learn the given environment features so that it can take suitable actions to get the maximum reward under the limited execution time and worker nodes.
Queuing Theory and the Theory of Constraints are two powerful theories that can increase your velocity. This session explains both theories in simple terms then covers how they can be applied in the real world by agile teams. 21 simple velocity increasing experiments are described that you can immediately use.
The effects of Queuing Theory impact our lives on a daily basis. Scrum uses Queuing Theory at its core and you can amplify those effects.
The Theory of Constraints can identify the one constraint that is preventing your team from increasing its velocity. It also shows us how to remove that constraint in the cheapest way possible.
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
Apache Spark 2.0 offers many enhancements that make continuous analytics quite simple. In this talk, we will discuss many other things that you can do with your Apache Spark cluster. We explain how a deep integration of Apache Spark 2.0 and in-memory databases can bring you the best of both worlds! In particular, we discuss how to manage mutable data in Apache Spark, run consistent transactions at the same speed as state-the-art in-memory grids, build and use indexes for point lookups, and run 100x more analytics queries at in-memory speeds. No need to bridge multiple products or manage, tune multiple clusters. We explain how one can take regulation Apache Spark SQL OLAP workloads and speed them up by up to 20x using optimizations in SnappyData.
We then walk through several use-case examples, including IoT scenarios, where one has to ingest streams from many sources, cleanse it, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in a in-memory store along with history in a data lake and permit interactive analytic queries at this constantly growing data. Rather than stitching together multiple clusters as proposed in Lambda, we walk through a design where everything is achieved in a single, horizontally scalable Apache Spark 2.0 cluster. A design that is simpler, a lot more efficient, and let’s you do everything from Machine Learning and Data Science to Transactions and Visual Analytics all in one single cluster.
Why does big data always have to go through a pipeline? multiple data copies, slow, complex and stale analytics? We present a unified analytics platform that brings streaming, transactions and adhoc OLAP style interactive analytics in a single in-memory cluster based on Spark.
Similar to Scaling out logistic regression with Spark (20)
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
In the ever-evolving landscape of technology, enterprise software development is undergoing a significant transformation. Traditional coding methods are being challenged by innovative no-code solutions, which promise to streamline and democratize the software development process.
This shift is particularly impactful for enterprises, which require robust, scalable, and efficient software to manage their operations. In this article, we will explore the various facets of enterprise software development with no-code solutions, examining their benefits, challenges, and the future potential they hold.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
3. General Background about the company
› The company was founded 8 years ago
› ~300 employees worldwide
› 240 employees in Israel
› Stay updated about our open positions on our website.
You can contact jobs@similarweb.com
› Nir Cohen – nir@similarweb.com
5. Data Size
› 650 servers total
› Several Hadoop clusters – 120 servers in the biggest
› 5 HBase clusters
› Couchbase clusters
› Kafka clusters
› MySQL Galera clusters
› 5 TB of new data every day
› Full data backup to S3
6. Plan for the next hour or so
› The need
› Some history
› Spark-related algorithmic intuitions
› Dive into Spark
› Our additions
› Runtime issues
› Current categorization algorithm
11. Need: How would you classify the Web?
› Crawl the web
› Collect data about each website
› Manually classify a few
› Use machine learning to derive a model
› Classify all the websites we’ve seen
14. Learning Set: Features
› Tag Count Source
– cnn.com | news | 1
– bbc.com | culture | 50
– …
› Html Analyzer Source
– cnn.com | money | 14
– nba.com | nba draft | 2
– …
› A feature is: site | tag | score
› 11 basic sources; some are reintroduced after additional processing – eventually 16 sources
› 18 GB of data
› 4M unique features
15. Our challenge
› Large Scale Logistic Regression
– ~500K site samples
– 4M Unique features
– ~800K features/source
– 246 classes
– Eventually apply model to 400M sites
16. First Logistic Regression Attempt
A single-machine Java logistic regression implementation:
› Highly optimized, multi-threaded
› Manually tuned loss function
› Uses plain arrays and divides “stripes” between threads
› Works on “summed features”
Problems:
› Only scales up
› Pre-combination of features reduces coverage
› Runtime: a few days
› Code is complex, and it is hard to tweak the algorithm
› Bus test
18. Why we chose Spark
› Has an out-of-the-box distributed solution for large-scale multinomial logistic regression
› Simplicity
› Lower production maintenance costs compared to R
› Intent to move to Spark for large, complex algorithmics
20. Basic Regression Method
› We want to estimate the value of y based on samples (x, y)
  y = f(x, β); β – unknown function constants
› Define a loss function l(β) that corresponds with accuracy
  – for example: l(β) ≡ (1/#samples) · Σ_{i ∈ samples} (f(x_i, β) − y_i)²
› Find the β that minimizes l(β)
21. Logistic Regression
› In the case of classification we want to use the logistic function
  y = f(x, β) = P(y|x; β) = e^{βx} / (1 + e^{βx})
› Define a differentiable loss function (log-likelihood)
  l(x, β) = Σ_{i ∈ samples} log P(y_i|x_i; β)
› We cannot find β analytically
› However, l(x, β) is smooth, continuous and convex!
– Has one global minimum
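As a concrete illustration of the formulas above, here is a minimal pure-Python sketch of the logistic function and the log-likelihood for the binary, single-feature case (the function names and the tiny sample format are illustrative, not from the talk):

```python
import math

def sigmoid(z):
    # P(y = 1 | x; beta) = e^(beta*x) / (1 + e^(beta*x)) = 1 / (1 + e^(-beta*x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, samples):
    # l(x, beta) = sum over samples of log P(y_i | x_i; beta), with y_i in {0, 1}
    total = 0.0
    for x, y in samples:
        p = sigmoid(beta * x)
        total += math.log(p if y == 1 else 1.0 - p)
    return total
```

Because l is smooth and concave in β (we minimize its negative), any β that drives the gradient to zero is the global optimum.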
22. Gradient Descent
Generally:
• The value of −∇l(β) is a vector that points in the direction of steepest descent
• In every step: β_{k+1} = β_k − α∇l(β_k)
• α – learning rate
• Converges when ∇l(β) → 0
In Spark:
• rate = α / √(iteration number)
• SGD – stochastic mini-batch GD
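The update rule above is easy to try on a toy problem. A minimal pure-Python sketch (not the talk's code) of gradient descent on the negative log-likelihood of a single-feature binary logistic regression, stopping when the gradient approaches 0; the constant learning rate and the tiny data set are assumptions for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(beta, samples):
    # Gradient of the negative log-likelihood: sum of (sigmoid(beta*x) - y) * x
    return sum((sigmoid(beta * x) - y) * x for x, y in samples)

def gradient_descent(samples, alpha=0.1, tol=1e-6, max_iter=10000):
    beta = 0.0
    for _ in range(max_iter):
        g = grad(beta, samples)
        if abs(g) < tol:                  # converged: gradient -> 0
            break
        beta -= alpha * g                 # beta_{k+1} = beta_k - alpha * grad
    return beta
```

On a small non-separable sample such as `[(1.0, 1), (1.0, 0), (2.0, 1), (-1.0, 0), (-2.0, 0), (-1.0, 1)]` this converges to a finite β with a near-zero gradient.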
23. Line Search – Determining Step Size
An approximate method; at each iteration:
• Find a step size that sufficiently decreases l
• By reducing the range of possible step sizes
Spark:
• StrongWolfeLineSearch
• The sufficiency check is a function of l(β), ∇l(β)
29. Secant Method (Quasi-Newton)
Approximation of the derivative:
  l''(β₁) ≈ (l'(β₁) − l'(β₀)) / (β₁ − β₀)
Newton's iteration becomes:
  β_{k+1} = β_k − l'(β_k) / l''(β_k)
          = β_k − l'(β_k) · (β_k − β_{k−1}) / (l'(β_k) − l'(β_{k−1}))
! The Hessian is not needed !
In our case, we need only ∇l
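In one dimension the secant iteration above is only a few lines. A hedged sketch (illustrative, not from the deck): minimize l by finding a root of l', replacing l'' with the secant slope so that no second derivative is ever computed:

```python
def secant_minimize(dl, b0, b1, tol=1e-10, max_iter=100):
    # Quasi-Newton in 1-D: replace l''(beta_k) with the secant slope
    # (l'(beta_k) - l'(beta_{k-1})) / (beta_k - beta_{k-1}).
    for _ in range(max_iter):
        d0, d1 = dl(b0), dl(b1)
        if d1 == d0:                      # flat secant: cannot divide
            break
        b0, b1 = b1, b1 - d1 * (b1 - b0) / (d1 - d0)
        if abs(dl(b1)) < tol:             # derivative ~ 0: at the minimum
            break
    return b1
```

For l(β) = (β − 3)² the derivative l'(β) = 2(β − 3) is linear, so the secant step lands on the minimizer β = 3 in a single iteration.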
30. Requirements and Convergence Rate
Newton-Raphson:
– Analytical formula for the gradient
– Compute the gradient at each step: O(M × N)
– Analytical formula for the Hessian
– Compute the inverse Hessian at each step: O(M² N)
– Order of convergence: q = 2
Quasi-Newton:
– Analytical formula for the gradient
– Compute the gradient at each step: O(M × N)
– Save the last calculations of the gradient
– Order of convergence: q ≈ 1.6
Which is faster?
Which is cheaper (memory, CPU) in 1000 iterations for M = 100,000 features?
Which of Gradient Descent, Newton or Quasi-Newton should we use?
31. BFGS – Quasi-Newton with Line Search
› Initially guess β₀ and set H₀⁻¹ = I
› In each step k:
– Calculate the direction from the gradient: p_k = −H_k⁻¹ ∇f(β_k)
– Find the step size α_k using line search (with Wolfe conditions)
– Update β_{k+1} = β_k + α_k p_k
– Update H_{k+1}⁻¹ = H_k⁻¹ + updateFunc(H_k⁻¹, ∇f(β_k), ∇f(β_{k+1}), α_k, p_k)
› Stop when the improvement is small enough
› More info: BFGS
33. Challenges Implementing Logistic Regression
› In order to get the values of the gradient we need to instantiate the formula with the learning set
– For every iteration we need to go over the learning set
› If we want to speed this up by parallelization we need to ship the model or the learning set to each thread/process
› Single machine -> the process is CPU bound
› Multiple machines -> network bound
› With a large number of features, memory becomes a problem as well
34. Why we chose to use L-BFGS
› The only out-of-the-box multinomial logistic regression
› Gives good value for money
– A good tradeoff between cost per iteration and number of iterations
› Uses Spark’s GeneralizedLinearModel API
35. L-BFGS
› L stands for Limited Memory
– Replaces the Hessian, which is an M × M matrix, with a few (~10) most recent updates of ∇f(β_k) and f(β_k), which are M-sized vectors
› spark.LBFGS
– A distributed wrapper over breeze.LBFGS
– Mostly, distribution of the gradient calculation; the rest is not distributed
› Ships around the model and collects gradient values
– Uses L2 regularization
– Scales features
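The “few most recent updates” trick is the classic two-loop recursion. A compact pure-Python sketch of the standard algorithm (an illustration, not Spark’s or Breeze’s actual code) that applies the implicit inverse-Hessian approximation to a gradient using only the last few pairs s_i = β_{i+1} − β_i and y_i = ∇f(β_{i+1}) − ∇f(β_i):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(grad, s_hist, y_hist):
    # Two-loop recursion: computes -H_k^{-1} * grad using only M-sized
    # vectors, never materializing the M x M inverse Hessian.
    rhos = [1.0 / dot(y, s) for s, y in zip(s_hist, y_hist)]
    q = list(grad)
    alphas = [0.0] * len(s_hist)
    for i in reversed(range(len(s_hist))):          # newest to oldest
        alphas[i] = rhos[i] * dot(s_hist[i], q)
        q = [qj - alphas[i] * yj for qj, yj in zip(q, y_hist[i])]
    if s_hist:                                      # scale the initial H_0
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    r = [gamma * qj for qj in q]
    for i in range(len(s_hist)):                    # oldest to newest
        b = rhos[i] * dot(y_hist[i], r)
        r = [rj + (alphas[i] - b) * sj for rj, sj in zip(r, s_hist[i])]
    return [-rj for rj in r]                        # descent direction p_k
```

With an empty history this degenerates to plain steepest descent, p = −∇f.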
37. Aggregate & TreeAggregate
Aggregate:
• Each executor holds a portion of the learning set
• Broadcast the model (the weights β) to the executors
• Collect the results (partial gradients) to the driver
TreeAggregate:
• A simple heuristic to add a level
• Performs partial aggregation by shipping results to other executors (by repartitioning)
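The difference can be mimicked in a few lines of plain Python (a toy model of the idea, not Spark's implementation): instead of combining every executor's partial result at the driver in one shot, combine them level by level:

```python
from functools import reduce

def tree_aggregate(partials, comb, fan_in=2):
    # Combine per-executor partial results in rounds of `fan_in` at a time,
    # so the driver receives a handful of results instead of all of them.
    level = list(partials)
    while len(level) > 1:
        level = [reduce(comb, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]
```

Applied to per-partition gradient sums with an element-wise addition as `comb`, this gives the same total gradient as a flat aggregate, but with intermediate fan-in.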
40. Overfitting
› We have more features than samples
› Some features are poorly represented
› For example:
– only one sample with the “carbon” tag
– the sample is labeled “automotive”
› The model would give this feature a high weight for the “automotive” class and 0 for the others
– Do you think that is correct?
› How would you solve this?
41. Regularization
› A solution internal to the regression mechanism
› We introduce regularization into the cost function
  l_total(β, x) = l_model(β, x) + λ · l_reg(β)
  L2 regularization: l_reg(β) = (1/2)‖β‖²
› λ – regularization constant
› What happens if λ is too large?
› What happens if λ is too small?
› Spark’s LBFGS has L2 built in
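The effect of the L2 term on the optimizer is simple: add λ · (1/2)‖β‖² to the loss and λ · β to the gradient. A minimal sketch (the function name is illustrative, not Spark's API):

```python
def l2_total(model_loss, model_grad, beta, lam):
    # l_total(beta) = l_model(beta) + lam * (1/2) * ||beta||^2
    # grad_total    = grad_model    + lam * beta
    loss = model_loss + lam * 0.5 * sum(b * b for b in beta)
    grad = [g + lam * b for g, b in zip(model_grad, beta)]
    return loss, grad
```

A large λ drags every weight toward 0 (underfitting); a tiny λ leaves the rare-feature weights essentially untouched, and we are back to overfitting.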
42. Finding the Best Lambda
› We choose the best λ using cross-validation
– Set aside 30% of the learning set, and use it for testing
› Build a model for every λ and compare precision
› Let’s parallelize? Is there a more efficient way to do this?
– We use the fact that for a large λ, the model is underfitted and converges fast
– Start from a large λ and use its model as the starting point of the next iteration
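The warm-start loop itself is tiny. A sketch of the idea (the `fit` callback and its signature are hypothetical, standing in for one regularized training run):

```python
def lambda_path(fit, lambdas, beta0):
    # Go from the most regularized (underfitted, fast-converging) model to
    # the least, warm-starting each fit from the previous solution.
    beta, models = beta0, {}
    for lam in sorted(lambdas, reverse=True):
        beta = fit(lam, beta)      # hypothetical: train with starting weights
        models[lam] = beta
    return models
```

Each model then only has to move from a nearby solution, which keeps the per-λ iteration counts small.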
43. Choosing the Regularization Parameter
Lambda | Precision | Iterations
25     | 35.06%    | 3
12.5   | 35.45%    | 12
6.25   | 36.68%    | 5
3.125  | 38.41%    | 5
1.563  | Failure!  | –
0.781  | 45.87%    | 13
0.391  | 50.64%    | 10
0.195  | 55.04%    | 13
0.098  | 58.33%    | 17
0.049  | 60.93%    | 19
0.024  | 62.33%    | 21
0.012  | 64.30%    | 25
0.006  | 65.95%    | 42
0.003  | 65.46%    | 38
After choosing the best lambda, we can use the complete learning set to calculate the final model.
Failures can be caused externally or internally.
Average iteration time: 2 sec
44. LBFGS Extensions & Bugfixes
› The Spark layer of LBFGS swallows all failures
– and returns bad weights
› Feature scaling was always on
– Redundant in our case
– Rendered passed-in weights unusable
– Lowered model precision
› Expose the effective number of iterations to external monitoring
› Enable passing starting weights into LBFGS
› More transparency
45. Spark Additions & Bug Fixes
› PoliteLBFGS, an addition to spark.LBFGS
– 3-5% more precise (for our data)
– 30% faster calculation
› Planning to contribute back to Spark

class PoliteLbfgs extends spark.Lbfgs

Was it worth the trouble?
po·lite : pəˈlīt/ having or showing behavior that is respectful and considerate of others. synonyms: well mannered, civil, courteous, mannerly, respectful, deferential, well behaved
48. Hardware
Cluster:
› 110 machines
› 5.20 TB memory
› 6600 VCores
› Yarn
› Block size 128 MB
› The cluster is shared with other MapReduce jobs and HBase
Per machine:
› 60 VCores
› 64 GB memory – ~1 GB per VCore
› 12 cores – 5 VCores per physical core (tuned for MapReduce)
› CentOS 6.6
› cdh-5.4.8
49. Execution – Good Neighboring
› Each source has a different number of samples and features
› Execution profiles for a single learning run:

                         Small       Large
#Samples                 ~50K        500K
Input size               under 1 GB  1-3 GB
#Executors               2           22
Executor memory          2 GB        4 GB
Driver memory            2 GB        18 GB
Yarn driver overhead     2 GB        2 GB
Yarn executor overhead   1 GB        1 GB
#Jobs per profile        200         180
50. Execution Example
Hardware: driver – 2 cores, 20 GB memory; executors – 22 machines × (2 cores, 5 GB memory)
Number of features: 100,000
Number of samples: 500,000
Total number of iterations (trying out 14 different λ): 152
Average iteration time: 18.8 sec
Total learning time: 2863 sec (48 minutes)
Max iterations for a single λ: 30
51. Could you guess the reason for the difference?
run | Phase name        | real time [sec] | iteration time [sec] | iterations
1   | parent-glm-AVTags | 29101           | 153.2                | 190
2   | parent-glm-AVTags | 15226           | 82.3                 | 185
3   | parent-glm-AVTags | 2863            | 18.8                 | 152
• OK, I admit, the cluster was very loaded in the first run
• What about the second?
• org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
• Increase spark.shuffle.memoryFraction=0.5
52. Akka in the Real World
› spark.akka.frameSize = 100
› spark.akka.askTimeout = 200
› spark.akka.lookupTimeout = 200
Response times are slower when the cluster is loaded.
askTimeout seems to be particularly responsible for executor failures when removing broadcasts and unpersisting RDDs.
53. Kryo Stability
› Kryo uses quite a lot of memory
– if the buffer is not sufficient, the process will crash
– spark.kryoserializer.buffer.max.mb = 512
56. Learning Set: Features
› Tag Count Source
– cnn.com | news | 1
– bbc.com | culture | 50
– …
› Html Analyzer Source
– cnn.com | money | 14
– nba.com | nba draft | 2
– …
› A feature is: site | tag | score
› 11 basic sources; some are reintroduced after additional processing – eventually 16 sources
› ~500K site samples
› 18 GB of data
› 4M unique features
› ~800K features/source
57. Need: How would you improve over time?
› We collect different kinds of data:
– Tags
– Links
– User behavior
– …
› How do we identify where to focus collection efforts?
› How do we improve the classification algorithm?
58. Current Approach – Training
› foreach source (16 sources):
– choose the 100K most influential features
– train a model for L1 (25 L1 classes)
– foreach L1 class (avg 9.2 L2 classes per L1):
  › train a model for L2
› foreach source:
– foreach sample in the training set:
  › calculate the probabilities (θ) of belonging to each of the L1 classes
› train a Random Forest using the L1 probabilities set
59. Current Approach – Application
› foreach site to classify:
– foreach source:
  › calculate the probabilities (θ) of belonging to each L1 class
– aggregate the results and estimate L1 (using the RF model)
– given the estimated L1, foreach source:
  › calculate the estimated L2
– choose (by voting) the final L2
60. Other Extensions
› Extend mllib.LogisticRegressionModel to return probabilities instead of the final decision from the “predict” method
› For example:
– Site: nhl.com
– Instead of “is L1=sports”
– We produce:
  › P(news) = 30%
  › P(sports) = 65%
  › P(art) = 5%
› model.advise(p:point)
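A sketch of what such a probability-returning predict looks like for a multinomial model (a plain-Python softmax over per-class scores; illustrative, not the mllib extension's actual code):

```python
import math

def predict_probabilities(class_weights, x):
    # One weight vector per class; return P(class | x) for every class
    # instead of only the argmax decision.
    scores = [sum(w * xi for w, xi in zip(wc, x)) for wc in class_weights]
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

For a site like nhl.com this yields a full distribution over the L1 classes, which is exactly what the per-source Random Forest aggregation consumes.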
61. Summary: This Approach vs Straight Logistic Regression
› Increases precision by using more features
› Increases coverage by using very granular features
› Gives feedback (from the RF) regarding the quality of each source
– Using the out-of-bag error
› Natural parallelization by source
› No need for feature scaling
Editor's Notes
It is important to note that there is a distributed calculation involved here.
Regularization flattens the cost function.
If λ is large, the algorithm converges early but not accurately.
If λ is small, we go back to overfitting.
If starting weights are passed in, they are messed up by the automatic feature scaling, which defeats the whole point of passing weights.
18 GB is a huge issue on a loaded Yarn cluster in FIFO mode.