Parallel Machine Learning




Parallel Machine Learning Document Transcript

Janani Chakkaradhari
Information Technology for Business Intelligence
Technische Universität Berlin
February 13, 2014

Abstract

Scalability is an essential factor in the performance of any computational algorithm. In this Big Data era, gathering large amounts of data is becoming easy. Analyzing Big Data with existing Machine Learning (ML) algorithms is often not feasible, and the algorithms tend to perform poorly, because their computational logic was originally designed to run sequentially. MapReduce [1] offers a way to handle billions of records efficiently. In this report we discuss the basic building block for the computation behind ML algorithms, two different attempts to parallelize machine learning algorithms using MapReduce, and the overhead involved in parallelizing ML algorithms.

1 Introduction

The significance of Machine Learning algorithms is widely known, and their use in various applications benefits both business and the research community. Traditional ML algorithms were built on the assumption that the data fits in memory. Meanwhile, the current distributed infrastructure of Information Systems (IS) lets people easily access and generate data in almost every activity of their day-to-day lives. This perpetual increase in data degrades the performance of ML algorithms that had been shown to produce fast and strong results on smaller datasets, which in turn gives rise to the "curse of modularity" [9]. With the advent of the MapReduce programming model, large volumes of data can be handled efficiently in parallel, since the model follows a divide-and-conquer methodology for execution.

"Learning can become limited by computation time and not by data volume with help of MapReduce and large clusters of machines" [8]. This implies that ML algorithms have to be redesigned in order to run on a parallel architecture. Parallelizing ML algorithms with the MapReduce model should therefore increase the speed of computation, and earlier work on this topic has reported improved performance.

This report presents background on the use of Linear Algebra in ML in Section 2, followed by an overview of a novel approach to parallelizing the Stochastic Gradient Descent algorithm for Matrix Factorization [2] in Section 3, and a brief summary of declarative ML in Section 4. Declarative ML is an attempt to provide a declarative way of executing some ML algorithms and linear algebra primitives on Hadoop, using a system called SystemML [3].
2 Computational Engine for Machine Learning

Mathematics and computer science are like the two rails of a train track: they always run together to ensure a good journey for real-world users. Linear algebra plays a prominent role in ML. Transforming the problem space into linear functions is one of the elementary approaches used in predictive algorithms, and matrices are the means of representing linear functions. In other words, the interactions between two entities of a system can be represented in a two-dimensional form known as a matrix. The elements of the matrix represent the magnitude of the interactions between two finite sets of objects; such data is also known as dyadic data [4]. Analyzing the system with matrix techniques allows one to predict the effect of individual interactions on the overall system. Some prominent applications of linear algebra in ML are listed below:

  • Singular Value Decomposition (SVD) is well known for its applications in image compression, in determining oscillations or damage in structures such as bridges during the design phase, and many more.
  • Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are used as a feature extraction step before classification.
  • Eigenvalues and eigenvectors underpin the PageRank algorithm.
  • Analyses based on dyads, such as topic modeling, keyword search and recommender systems, are based on the Non-negative Matrix Factorization technique [6].

3 Large Scale Matrix Factorization with DSGD

This section gives an overview of the Distributed Stochastic Gradient Descent algorithm. It is preceded by a quick introduction to Matrix Factorization and Stochastic Gradient Descent and a brief review of how Matrix Factorization is optimized using Stochastic Gradient Descent.

3.1 Matrix Factorization

Matrix Factorization is mainly used to extract interaction structure from dyadic data [6]. The interaction structure includes the following [4]:

  • Co-occurrence
  • Strength of preference or association
  • Word clustering, word sense disambiguation and thesaurus construction in text-based information retrieval
  • Modeling of preference and consumption behavior
  • In computer vision applications, the dyad represents the feature observed at a particular image location
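To make the notion of dyadic data concrete, the following minimal sketch turns a list of observed pairs into a sparse co-occurrence matrix. The function name and the (document, word) observations are invented for illustration and do not come from [4] or [6].

```python
from collections import defaultdict


def cooccurrence_matrix(dyads):
    """Count how often each (row_entity, column_entity) pair is observed.

    The result is a sparse matrix stored as {row: {col: count}}; pairs that
    never occur are simply absent instead of being stored as zeros.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for row_entity, col_entity in dyads:
        matrix[row_entity][col_entity] += 1
    return {row: dict(cols) for row, cols in matrix.items()}


# Hypothetical (document, word) observations, invented for illustration.
dyads = [
    ("doc1", "matrix"), ("doc1", "matrix"), ("doc1", "gradient"),
    ("doc2", "gradient"), ("doc2", "hadoop"),
]
counts = cooccurrence_matrix(dyads)
```

Each entry of `counts` is the strength of one interaction, which is the kind of matrix that factorization techniques take as input.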
3.2 Stochastic Gradient Descent (SGD)

Gradient descent is widely used in optimization problems. It helps minimize the cost function of ML algorithms such as linear regression, where the weight (parameter) vector is determined by minimizing the average of the squared errors between the predictions and the actual values in the training set [7]. One main drawback of gradient descent is that it needs the entire training set to compute the average squared error in every step that updates the parameter vector, and it repeats this process until the parameter vector converges. This slows the algorithm down. This variant is also called Batch Gradient Descent. In contrast, Stochastic Gradient Descent picks a single training example at random, updates the parameter vector with respect to that example, and repeats the process until it converges. This eliminates the need to look at the entire data set in each step; the full training set is only traversed across repeated steps of the algorithm.

3.3 Stochastic Gradient Descent for Matrix Factorization

Matrix Factorization helps to reconstruct the original matrix from a partially observed matrix using an approximation technique. For example, in the Netflix recommendation problem [5], the rows represent users and the columns represent movies. The matrix is partially filled with the ratings users have given to movies. Using the existing rating values, Matrix Factorization tries to predict the missing values. In its simplest form, this is done by associating some numbers (factors) with each user and each movie such that the product of these two sets of numbers is as close as possible to the original rating. The discrepancy between the original input matrix and the product of the factors is the cost function. We try to reduce this cost function to obtain the most appropriate factors.
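Written out, the cost function described above might take the following form. This is a sketch under assumptions: Z_{ij} denotes an observed rating, \Omega the set of observed (user, movie) positions, W_{i*} the factor row of user i and H_{*j} the factor column of movie j. The squared error is one common choice of discrepancy and is not necessarily the exact loss used in [2].

```latex
L(W, H) = \sum_{(i,j) \in \Omega} \left( Z_{ij} - W_{i*} H_{*j} \right)^{2}
```

Matrix Factorization then looks for the factors W and H that make L as small as possible.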
One way to do this is to employ the Stochastic Gradient Descent algorithm, which usually performs well in sequential execution. Because the SGD approximation is noisy, the cost function here includes regularization and other information in addition to the prediction error. SGD tries to minimize the sum of the losses over all entries in the matrix. SGD works as follows [2]:

  • Step 1: Take a random entry from the training set
  • Step 2: Evaluate the loss function
  • Step 3: Update the parameters
  • Step 4: Repeat Steps 1 to 3 for all the entries in the matrix

We cannot run this algorithm directly in parallel using MapReduce, for the following reason. Each mapper would run SGD on a subset of the large matrix: it reads the current row and current column of the subset, evaluates the local loss function and updates the parameters (i.e. the rows and columns of the factors) that correspond to that subset. Because SGD runs in parallel, another mapper may be working on a dependent subset of the matrix (for example, the same column but a different row). The second mapper can then read values that the first mapper is updating at the same time. These conflicting accesses prevent the algorithm from running as-is on a parallel architecture.
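The four steps above can be sketched as a small sequential program. This is an illustrative sketch, not the implementation from [2]: the function names, the squared-error loss with L2 regularization, the step size and the random initialization are all assumptions chosen for readability.

```python
import random


def sgd_matrix_factorization(ratings, n_rows, n_cols, rank=2,
                             step=0.02, reg=0.02, epochs=500, seed=0):
    """Factorize a partially observed matrix with plain sequential SGD.

    ratings: list of (i, j, value) triples for the observed entries.
    Returns (W, H): W has n_rows x rank entries, H has rank x n_cols entries.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.0, 0.5) for _ in range(n_cols)] for _ in range(rank)]
    entries = list(ratings)
    for _ in range(epochs):
        rng.shuffle(entries)                  # Step 1: visit entries in random order
        for i, j, value in entries:
            pred = sum(W[i][k] * H[k][j] for k in range(rank))
            err = value - pred                # Step 2: error of this single entry
            for k in range(rank):             # Step 3: update row i of W, column j of H
                w_ik, h_kj = W[i][k], H[k][j]
                W[i][k] += step * (err * h_kj - reg * w_ik)
                H[k][j] += step * (err * w_ik - reg * h_kj)
    return W, H                               # Step 4 is the loop over all entries


def predict(W, H, i, j):
    """Predicted value for position (i, j): row i of W times column j of H."""
    return sum(W[i][k] * H[k][j] for k in range(len(H)))


def squared_error(W, H, ratings):
    """Sum of squared prediction errors over the observed entries."""
    return sum((v - predict(W, H, i, j)) ** 2 for i, j, v in ratings)
```

Every update touches one row of W and one column of H. Two workers that process entries in the same row or column at the same time would therefore read and write the same parameters, which is the dependency problem described above.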
As described by Gemulla et al. [2], not all subsets of the matrix depend on each other. In most cases the subsets are completely independent of each other, so SGD can be run on a subset by locking its rows and columns. This idea forms the basis of parallelized SGD.

3.4 Distributed SGD for Matrix Factorization (DSGD)

DSGD exploits the concept of independent rows and columns. Suppose we have d nodes in the cluster. We split the input matrix (the training set of known ratings) into d × d smaller matrices (blocks) and distribute the blocks across the d nodes such that each node holds the blocks of an entire row, as shown in Figure 1.

Figure 1: Example stratum for 3 cluster nodes

A set of interchangeable (mutually independent) sub-matrices is called a stratum; it represents one part of the underlying matrix dataset. In [2], stratification is performed by permutation, so that d nodes have d! possible combinations of independent blocks. For example, 3 nodes have 6 possible strata, and these 6 strata form a single sequence of strata.

The DSGD algorithm works as follows, assuming that d nodes are available, Z is the training set input matrix, and W and H are the parameter factors of the input matrix.

  • Step 1: Divide the input matrix Z into d × d blocks and distribute them over the cluster. The parameters W and H are distributed in d blocks each, by rows and by columns respectively, so that W has d × 1 blocks and H has 1 × d blocks. Compute the sequence of strata for the input blocks using permutations. For each stratum in the sequence, do Steps 2 and 3.
  • Step 2: Select a stratum, i.e. a set of independent blocks, from the sequence of strata (all possible combinations). An example is the set of blocks along the diagonal (the red boxes in the figure).
  • Step 3: Run SGD on the selected blocks in parallel to find the local minimum of the loss function. Sum the local losses computed at each block and update the corresponding factor matrices W and H.

This is how DSGD runs the SGD algorithm in a distributed manner within a stratum. According to [2], DSGD outperforms the Alternating Least Squares (ALS) method for matrix factorization. DSGD also avoids averaging over loss functions when executed in parallel, which makes the algorithm simpler and more versatile.
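The stratification idea can be sketched in a few lines. This is a single-process illustration under assumptions, not the implementation from [2]: the function names, the loss, the step size and the way rows and columns are mapped to blocks are invented here, and the blocks of a stratum are processed one after another where a cluster would run them on separate nodes.

```python
import random
from itertools import permutations


def all_strata(d):
    """Every way to pick d blocks with no shared block-row or block-column.

    A stratum is a list of (block_row, block_col) pairs; there are d! of them.
    """
    return [list(enumerate(perm)) for perm in permutations(range(d))]


def squared_error(W, H, ratings):
    rank = len(H)
    return sum((v - sum(W[i][k] * H[k][j] for k in range(rank))) ** 2
               for i, j, v in ratings)


def dsgd(ratings, n_rows, n_cols, d, rank=2, step=0.02, reg=0.02,
         sweeps=300, seed=0):
    """Stratified SGD for matrix factorization, simulated in one process."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(n_rows)]
    H = [[rng.uniform(0.0, 0.5) for _ in range(n_cols)] for _ in range(rank)]
    # Step 1: assign every observed entry to one of the d x d blocks.
    blocks = {}
    for i, j, v in ratings:
        key = (i * d // n_rows, j * d // n_cols)
        blocks.setdefault(key, []).append((i, j, v))
    strata = all_strata(d)
    for _ in range(sweeps):
        rng.shuffle(strata)
        for stratum in strata:                # Step 2: pick independent blocks
            # Blocks in one stratum touch disjoint rows of W and columns of H,
            # so each of them could be handled by a different node.
            for key in stratum:               # Step 3: SGD inside each block
                entries = blocks.get(key, [])
                rng.shuffle(entries)
                for i, j, v in entries:
                    err = v - sum(W[i][k] * H[k][j] for k in range(rank))
                    for k in range(rank):
                        w, h = W[i][k], H[k][j]
                        W[i][k] += step * (err * h - reg * w)
                        H[k][j] += step * (err * w - reg * h)
    return W, H
```

For d = 3, `all_strata(3)` returns the 6 strata mentioned above, and the diagonal blocks `[(0, 0), (1, 1), (2, 2)]` are one of them.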
4 Declarative Machine Learning: SystemML

The overhead of parallelizing ML algorithms is easy to appreciate from the simple SGD algorithm discussed in the previous section. It makes clear that researchers have to carefully analyze each ML algorithm that works well sequentially in order to make it parallel and executable in the MapReduce programming model. The cost of implementing algorithms as MapReduce jobs is high, and for better performance the same algorithm sometimes has to be hand-tuned. This leaves little room for optimization within hand-coded MapReduce jobs. For example, in matrix multiplication the order in which the multiplications are executed has a large impact on performance [3]. Researchers from the IBM Almaden and Watson research centers have proposed a new approach to parallelizing ML algorithms that also takes optimization into account; it is called SystemML. SystemML is analogous to Hive, developed at Facebook for executing data warehouse queries on large clusters, where HiveQL queries are converted into MapReduce jobs that the Hive engine executes on Hadoop. Similarly, SystemML provides a declarative platform for expressing ML algorithms and linear algebra primitives and converts the abstract representation into executable MapReduce jobs on Hadoop.

4.1 Application areas of SystemML

In SystemML, ML algorithms are expressed in a high-level language called Declarative Machine learning Language (DML), which is comparable to R. DML supports operations such as matrix transpose, matrix multiplication, iterative algorithms using "for" and "while" constructs, and so on. This lets users focus on writing scripts that state what to compute rather than how to express the computation. SystemML is designed to scale and to tune performance efficiently. It is used in fields such as predictive modeling, recommender systems and search analysis.
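The remark in Section 4 that the order of matrix multiplications matters can be checked with simple arithmetic. The sketch below counts scalar multiplications for the two ways of evaluating a product of three matrices; the function names and the matrix shapes are hypothetical and are not taken from [3].

```python
def multiply_cost(rows, inner, cols):
    """Scalar multiplications for a (rows x inner) times (inner x cols) product."""
    return rows * inner * cols


def chain_costs(a_shape, b_shape, c_shape):
    """Cost of (A B) C versus A (B C) for three conformable matrices."""
    (p, q), (q2, r), (r2, s) = a_shape, b_shape, c_shape
    if q != q2 or r != r2:
        raise ValueError("inner dimensions must match")
    left_first = multiply_cost(p, q, r) + multiply_cost(p, r, s)
    right_first = multiply_cost(q, r, s) + multiply_cost(p, q, s)
    return left_first, right_first


# Hypothetical shapes: A is 1000 x 10, B is 10 x 1000, C is 1000 x 1.
left_first, right_first = chain_costs((1000, 10), (10, 1000), (1000, 1))
# left_first  == 11_000_000  scalar multiplications for (A B) C
# right_first ==     20_000  scalar multiplications for A (B C)
```

Both orders give the same result, yet one needs 550 times fewer multiplications for these shapes. Choosing the order automatically is the kind of decision a declarative system can take off the user's hands.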
4.2 System Architecture of SystemML

SystemML takes the DML script as input and passes it through several components [3], the first of which produces a parsed representation of the script. It supports built-in data types for representing matrices and scalars. The first step in SystemML is to identify the statement blocks, based on the constructs that break the sequential flow of the DML program. Each statement block is then processed by the components described below.

4.3 High-level Operator (HOP)

The HOP component consumes and produces the following:

Input: Parsed statement blocks
Action: The computation in each statement block instantiates one HOP DAG (Directed Acyclic Graph). A HOP DAG represents the basic operations on matrices and scalars, such as an operation or a transformation.
Optimizations: Algebraic rewrites, selection of the physical representation for intermediate matrices, and cost-based optimizations
Output: High-level execution plan (HOP DAGs) representing the dataflow
4.4 Low-level Operator (LOP)

The LOP component follows the HOP component; its input and output are as follows:

Input: High-level execution plan (HOP DAGs)
Action: HOP DAGs are converted into low-level physical plans (LOP DAGs) that can be executed as MapReduce jobs. HOP DAGs are processed from bottom to top. Each HOP DAG is converted into one or more LOP DAGs. The input and output formats of each LOP are key-value pairs. Since a single computation leads to multiple LOPs, SystemML tries to combine these LOPs to fit into a single MapReduce job. This is implemented by a novel algorithm named piggybacking, which reduces the number of scans performed on the input data during the execution of MR jobs. It is described in the section on piggybacking.
Output: Low-level execution plan (LOP DAGs)

4.5 Runtime

The runtime ensures that input matrices are represented as key-value pairs and disregards cells without a value. Since the matrices are inherently sparse, this reduces the size of the input matrix representation. SystemML collects local sparsity information by applying a blocking operation to the input matrix. The input matrix is divided into smaller matrices called blocks. Each block is represented by a block id, and its value holds the cells of the block together with a parameter indicating whether the block is dense or sparse. The block size has a major impact on the number of key-value pairs generated by the runtime [3].

The Generic MapReduce Job (G-MR) is the main execution engine in SystemML; it is instantiated by the piggybacking algorithm (multiple LOPs inside a single MR job). The control module coordinates the execution of MapReduce jobs and is involved in computations such as arithmetic operations, predictive evaluations, and so on.
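The blocking operation described above can be illustrated with a small sketch. It is not SystemML's actual data format: the function name, the dictionary layout and the 50% density threshold are assumptions made for this example.

```python
def to_blocks(cells, block_size):
    """Group the non-empty cells of a sparse matrix into square blocks.

    cells: {(row, col): value}, holding only the cells that have a value.
    Returns {(block_row, block_col): {"cells": {...}, "sparse": bool}}.
    """
    blocks = {}
    for (row, col), value in cells.items():
        block_id = (row // block_size, col // block_size)
        local = (row % block_size, col % block_size)
        blocks.setdefault(block_id, {"cells": {}})["cells"][local] = value
    for block in blocks.values():
        filled = len(block["cells"]) / (block_size * block_size)
        block["sparse"] = filled < 0.5        # hypothetical density threshold
    return blocks


# Five non-empty cells become two key-value pairs with 2 x 2 blocks.
cells = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0, (5, 7): 9.0}
blocks = to_blocks(cells, block_size=2)
```

With `block_size=2` the five cells collapse into two blocks, one dense and one sparse. A larger block size would produce fewer key-value pairs, and this is why the block size influences how many pairs the runtime has to handle.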
Multiple optimizations are performed in the runtime component, decided dynamically based on data characteristics.

4.6 Piggybacking

This algorithm packages multiple LOPs in SystemML into a single MapReduce job by considering the execution location of each LOP at runtime. The execution location identifies whether a LOP can be executed in Map, in Reduce, or requires both Map and Reduce to complete. Figure 2 lists the different LOP operations and their corresponding execution locations. For example, the group operation has to be executed in both the Map and Reduce phases, so it is marked as MapAndReduce.

Figure 2: Execution locations of LOP from [3]

We consider the example in Figure 3 to lay out the logic behind the piggybacking algorithm. The left part of the diagram represents the LOP DAG for the multiplication of a matrix W with its transpose. LOP DAGs are processed bottom-up. The algorithm starts by sorting the LOPs in topological order; the result of the sort is shown in the center of the diagram. The algorithm works iteratively, creating a new MR job at the beginning of each iteration. LOPs are assigned to the MR job in the following order: first the LOPs that require only the Map phase (indicated by the "Map or Reduce" location in Figure 2), then the LOPs that need both phases (MapAndReduce), and finally the LOPs that require only the Reduce phase. The algorithm ensures that another descendant with execution location MapAndReduce is not assigned to the same job.

Figure 3: Example Piggybacking

In our example, since the Data W and Transform LOPs span only a "Map or Reduce" operation, they are assigned to the Map phase of the first MR job. mmcj is the first LOP that spans both the Map and Reduce phases, so it is assigned to both phases of the first MR job. Since the first MR job already has a LOP with location MapAndReduce, the Group LOP, which has the same execution location, cannot be assigned to the first MR job. Hence the iteration ends and the next iteration starts by instantiating a second MR job. Finally, the Group and Aggregation operations are assigned to this second MR job, which completes the piggybacking algorithm for this example.

5 Conclusion

In this report we have seen the need for and the importance of research on the parallelization of ML algorithms, and the role that Linear Algebra plays in ML algorithms. The difficulty of parallelizing ML algorithms was illustrated by the novel approach of the DSGD algorithm, an effort to parallelize SGD for large-scale data on clusters. We also discussed SystemML, which gives users in different fields an easier, declarative platform for executing ML algorithms. SystemML is concise and provides a user-friendly platform for executing limited forms of ML algorithms and some linear algebra primitives such as matrix multiplication, arithmetic operations and MF. However, DML does not support the more complex features of the object-oriented paradigm. It also does not support data structures such as arrays and lists, which are frequently used in ML algorithms. These are available in R, a language that provides a comprehensive set of flexible constructs for statistical and ML algorithms. Apache Mahout, on the other hand, also provides a complete set of Hadoop-based ML algorithm packages, but it still needs to be hand-tuned for different data sets and is more complex from the user's perspective.

References

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[2] Rainer Gemulla, Erik Nijkamp, Peter J Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77. ACM, 2011.
[3] Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 231–242. IEEE, 2011.
[4] Thomas Hofmann, Jan Puzicha, and Michael I Jordan. Learning from dyadic data. Advances in Neural Information Processing Systems, pages 466–472, 1999.
[5] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[6] Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, and Yi-Min Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In Proceedings of the 19th International Conference on World Wide Web, pages 681–690. ACM, 2010.
[7] Andrew Ng. CS229 lecture notes. CS229 Lecture Notes, 1(1):1–3, 2000.
[8] Vijay Narayanan and Milind Bhandarkar. Modeling with Hadoop. Tutorial at KDD 2011.
[9] Charles Parker. Unexpected challenges in large scale machine learning. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, pages 1–6. ACM, 2012.