1. Map-Reduce for Machine Learning on Multi-Core. Gary Bradski; work with Stanford: Prof. Andrew Ng with Cheng-Tao Chu, Yi-An Lin, YuanYuan Yu; Sang Kyun Kim with Prof. Kunle Olukotun; Ben Sapp, Intern.
8. Google's Map-Reduce Programming Model
Input Data: a large collection of records, stored across multiple disks. Example: business addresses.
Map Phase (Filter): process each record independently. Example: determine if a business is a restaurant AND its address is within 25 miles (10s of lines of code).
Reduce Phase: aggregate the filter output. Example: Top(10) (1 line of code).
Goal: find all restaurants within 25 miles of RNB.
9. More Detail: Google Search Index Generation
In reality, this uses 5-10 Map-Reduce stages. Efficiency impacts how often this is run, which impacts the quality of the search.
[Diagram: Web Pages 1..N feed three banks of filters; Aggregator A produces the Lexicon Table, Aggregator B the Web Graph, Aggregator C the Page Rank Table.]
10. More Detail: In Use: Google Search Index Lookup
[Diagram: processing a Google search query. The Google Web Server fans the query out to the Spell Check Server, the Ad Server, and, via Map-Reduce, banks of Index Servers and a Document Server.]
15. Map-Reduce Infrastructure Overview
Building blocks:
- Protocol Buffers: data description language, like XML.
- Google File System: customized file system, optimized for the access pattern and for disk failure.
- Work Queue: manages process creation on a cluster for different jobs.
Programming model and implementation (C/C++ library):
- Map-Reduce (MR-C++): transparent distribution on a cluster (data distribution and scheduling); provides fault tolerance (handles machine failure).
Language:
- Sawzall: domain-specific language.
17. Examples of Google Applications
- Google Search (Search Index): Search Index Generation (offline: minutes to weeks, moderately parallel); Search Index Lookup (online: seconds per query, "embarrassingly" parallel).
- Google News (News Index): News Index Generation (offline); News Index Lookup (online).
- Froogle Search (Product Index): Froogle Index Generation (offline); Froogle Index Lookup (online).
- 100s of small programs (scripts) to monitor, analyze, prototype, etc.: IO-bound.
22. Machine Learning Summation Form: Map-Reduce Example
Simple example: locally weighted linear regression (LWLR) [1], which solves A θ = b with A = X'WX and b = X'Wy.
To parallelize this algorithm, re-write it in "summation form" over the data points:
A = Σ_i w_i x_i x_i'      b = Σ_i w_i x_i y_i
MAP: by writing LWLR as a summation over data points, we can easily parallelize the computation of A and b by dividing the data over the number of threads. For example, with two cores:
A_1 = Σ_{i in core 1's data} w_i x_i x_i'      A_2 = Σ_{i in core 2's data} w_i x_i x_i'
REDUCE: A is obtained by adding A_1 and A_2 (at negligible additional cost). The computation for b is parallelized similarly.
[1] Here, the w_i are "weights." If all weights = 1, this reduces to ordinary least squares.
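A minimal NumPy sketch of this map/reduce split for LWLR (the two-way split, the partial-sum names, and the solve step are illustrative assumptions, not the MR-C++ engine API):

```python
import numpy as np

def lwlr_map(X_chunk, y_chunk, w_chunk):
    """Map: partial sums A_p = sum_i w_i x_i x_i' and b_p = sum_i w_i x_i y_i over one chunk."""
    A_p = (X_chunk * w_chunk[:, None]).T @ X_chunk
    b_p = (X_chunk * w_chunk[:, None]).T @ y_chunk
    return A_p, b_p

def lwlr_reduce(partials):
    """Reduce: add the partial sums and solve A theta = b."""
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

# Example with two "cores": split the data in half and map each chunk.
X = np.random.randn(1000, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * np.random.randn(1000)
w = np.ones(1000)                                   # all weights 1 -> ordinary least squares
halves = [slice(0, 500), slice(500, 1000)]
partials = [lwlr_map(X[s], y[s], w[s]) for s in halves]
theta = lwlr_reduce(partials)
```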
23. ML Fit with Many-Core
To be more clear: we have a set of M data points, each of length N (an M×N data matrix).
Decomposition for many-core: the M rows are divided across threads T_1, T_2, ..., T_P. Each thread T produces an N×N partial result, so each incurs an N×N communication cost.
N << M, or we haven't formulated a machine-learning problem. This is a good fit for Map-Reduce.
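A quick numeric illustration of the communication claim (the sizes here are arbitrary): whatever share of the M rows a thread processes, the partial result it communicates back is an N×N matrix.

```python
import numpy as np

M, N, P = 100_000, 20, 8
chunk = np.random.randn(M // P, N)   # one thread's share of the M x N data
partial = chunk.T @ chunk            # the thread's partial sum, sent to the reducer
print(partial.shape)                 # (20, 20): N x N, independent of M
```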
26. List of Implemented and Analyzed Algorithms
Analyzed (has summation form?):
1. Linear regression: Yes
2. Locally weighted linear regression: Yes
3. Logistic regression: Yes
4. Gaussian discriminant analysis: Yes
5. Naïve Bayes: Yes
6. SVM (without kernel): Yes
7. K-means clustering: Yes
8. EM for mixture distributions: Yes
9. Neural networks: Yes
10. PCA (Principal components analysis): Yes
11. ICA (Independent components analysis): Yes
12. Policy search (PEGASUS): Yes
13. Boosting: Unknown *
14. SVM (with kernel): Unknown
15. Gaussian process regression: Unknown
16. Fisher Linear Discriminant: Yes
17. Multiple Discriminant Analysis: Yes
18. Histogram Intersection: Yes
27. Complexity Analysis
m: number of data points; n: number of features; P: number of mappers (processors/cores); c: iterations or nodes. Assume matrix inversion is O(n^3/P'), where P' counts the iterations that may be parallelized.
For Locally Weighted Linear Regression (LWLR) training (' denotes transpose):
  Map:    X'WX: mn^2              X'Wy: mn
  Reduce: log(P)·n^2              log(P)·n
  Invert: (X'WX)^-1: n^3/P'       (X'WX)^-1 X'Wy: n^2
Serial:   O(mn^2 + n^3/P')
Parallel: O(mn^2/P + n^3/P' + log(P)·n^2)
Basically: linear speedup with increasing numbers of cores.
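To make the comparison concrete, a small illustrative calculation of the predicted speedup from these formulas (constant factors are ignored and the values of m, n, P' are made up for the example):

```python
from math import log2

def lwlr_ops(m, n, P, P_prime):
    """Operation count from the slide's parallel formula: mn^2/P + n^3/P' + log(P)*n^2."""
    return m * n**2 / P + n**3 / P_prime + log2(P) * n**2

m, n, P_prime = 1_000_000, 100, 10       # made-up problem sizes for illustration
serial = lwlr_ops(m, n, 1, P_prime)       # log(1) = 0, so this is the serial cost mn^2 + n^3/P'
for P in (2, 4, 8, 16):
    print(f"P={P:2d}  predicted speedup ~ {serial / lwlr_ops(m, n, P, P_prime):.2f}")
```

With mn^2 dominating, the predicted speedup tracks P almost exactly, matching the "basically linear" claim above.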
31. Map-Reduce for Machine Learning: Summary
[Plot: speedups over 10 datasets; the bold line is the average, the dashed lines the standard deviation, the bars the max/min.]
RESULT: basically linear, or slightly better, speedup.
48. Basic Flow For All Algorithms
main():
- Create the training set and the algorithm instance
- Register map-reduce functions with the Engine class
- Run the algorithm
- Test the result
Algorithm::run(matrix x, matrix y):
- Create the Engine instance & initialize
- Sequential computation 1 (initialize variables, etc.)
- Engine->run("Map-Reduce1")
- Sequential computation 2
- Engine->run("Map-Reduce2")
- etc.
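A minimal sketch of this flow in Python (the Engine class, its register/run interface, and the toy stage are assumptions made for illustration; the actual library is a C++ engine):

```python
from concurrent.futures import ThreadPoolExecutor

class Engine:
    """Toy stand-in for the engine: runs a registered (map, reduce) pair over data chunks."""
    def __init__(self, chunks, n_workers=4):
        self.chunks = chunks
        self.stages = {}
        self.pool = ThreadPoolExecutor(n_workers)

    def register(self, name, map_fn, reduce_fn):
        self.stages[name] = (map_fn, reduce_fn)

    def run(self, name):
        map_fn, reduce_fn = self.stages[name]
        partials = list(self.pool.map(map_fn, self.chunks))   # Map phase, one task per chunk
        return reduce_fn(partials)                             # Reduce phase

def algorithm_run(x_chunks):
    engine = Engine(x_chunks)
    # Sequential computation 1: register the stage(s), initialize variables, etc.
    engine.register("Map-Reduce1", map_fn=lambda c: sum(c), reduce_fn=sum)
    total = engine.run("Map-Reduce1")
    # Sequential computation 2, further Engine.run() calls, etc.
    return total

if __name__ == "__main__":
    chunks = [[1, 2, 3], [4, 5], [6]]   # the "training set", already split into chunks
    print(algorithm_run(chunks))         # 21
```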
Our existing library (PNL) and future (statistical) machine learning libraries divide the pattern recognition space into modeless and model-based machine learning.
Map Operations (also called Filters) operate on each record independently; consequently, they are trivially parallelizable. Typical Map Operations are ones which check whether the record matches a condition OR process a record to extract information (like the outgoing links from a web page).
Aggregators come from a predefined set. All currently supported aggregators are associative and commutative. With aggregators, there are 3 possibilities:
1) The aggregator is a simple operation (like Max), so a sequential implementation suffices.
2) The aggregator can be easily parallelized. For instance, the aggregator counts occurrences per key; in this case the keys can be partitioned using a hash function and the counts performed in parallel (see the sketch after this list).
3) The aggregator is difficult to parallelize. However, since aggregators belong to a small set, each of these can be efficiently implemented in parallel/distributed fashion and hidden from the programmer.
Typical Reduce Operations include: (1) sorting the inputs, (2) collecting an unbiased sample of the inputs, (3) aggregating the inputs, (4) top-N values, (5) quantiles (distribution) of the inputs, (6) unique values, (7) building a graph and running a graph algorithm, (8) running a clustering algorithm.
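A small sketch of possibility 2 above: counting occurrences per key, with the keys partitioned by a hash function so each partition is aggregated in parallel (the record format and worker count are made up for illustration):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_extract_links(record):
    """Map/filter: process one record independently, emitting (key, 1) pairs."""
    page, links = record
    return [(link, 1) for link in links]

def count_partition(pairs):
    """Aggregator worker: count occurrences per key within one hash partition."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts

def count_per_key(records, n_partitions=4):
    # Map phase: run the filter on every record.
    pairs = [p for r in records for p in map_extract_links(r)]
    # Partition the keys with a hash function; disjoint partitions are counted in parallel.
    partitions = [[] for _ in range(n_partitions)]
    for key, n in pairs:
        partitions[hash(key) % n_partitions].append((key, n))
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = pool.map(count_partition, partitions)
    merged = Counter()
    for c in partials:
        merged.update(c)
    return merged

records = [("page1", ["a.com", "b.com"]), ("page2", ["a.com", "c.com"])]
print(count_per_key(records))   # counts: a.com -> 2, b.com -> 1, c.com -> 1
```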
Not sure if they use the Map-Reduce infrastructure for this, but they potentially could.
1. Use a cluster to complete the job quickly.
Parallelism: distribute subtasks to different machines to exploit parallelism.
Locality: place data and schedule subtasks close to the data; avoid heavily loaded machines.
2. Map Phase: for each business, check if it is a restaurant and within 25 miles of RNB (execute the 10s of lines of code of the Map filter for each record).
3. Gather the list of restaurants from the different machines.
Reduce Phase: sort the list by distance (a common subset of Reduce operations like "Sort" would be provided by the library). Generate a web page with the results.
At any stage: recover from faults (machine/disk/network crashes).
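A toy sketch of this query as a map filter plus a sort/top-N reduce (the record fields, the distance helper, and the "RNB" coordinates are made up for illustration; the real library would supply the Sort reduce):

```python
from heapq import nsmallest
from math import dist

RNB = (37.39, -121.96)          # made-up coordinates for the reference point

def map_filter(business):
    """Map: runs independently on each record; keep restaurants within 25 miles."""
    if business["category"] != "restaurant":
        return None
    d = dist(business["location"], RNB) * 69.0   # rough degrees-to-miles conversion
    return (d, business["name"]) if d <= 25.0 else None

def reduce_top_by_distance(matches, n=10):
    """Reduce: sort the filtered output by distance and keep the top n."""
    return nsmallest(n, matches)

businesses = [
    {"name": "Thai Basil", "category": "restaurant", "location": (37.40, -121.95)},
    {"name": "Oil Change Co", "category": "auto", "location": (37.41, -121.97)},
]
matches = [m for m in map(map_filter, businesses) if m is not None]
print(reduce_top_by_distance(matches))
```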
"Google Applications" here refers to the applications Google runs on their clusters, not newer desktop applications like Google Earth and Google Desktop. Similar model: Maps.google.com, Local.google.com, etc.
Red (index generation): can use Map-Reduce, though it might not get all the benefits, because the Reduce operations can be quite complicated (clustering, graph algorithms); these might have to be parallelized explicitly if need be.
Green (index lookup): extremely parallel (by design, so that a response can be generated interactively). Map-Reduce is a good match.
Blue (scripts): IO-bound (might not have indices to speed up the process). Map-Reduce is a good match.
The lookup applications (those that respond to queries at their various sites: search, news, Froogle, etc.) appear to be their most cycle-consuming apps. One of the main metrics Google uses to evaluate a platform is "queries per second per watt."
The goal of the paintable computing project is to jointly develop millimeter-scale computing elements and techniques for self-assembly of distributed processes. The vision is a machine consisting of thousands of sand-grain-sized processing nodes, each fitted with modest amounts of processing and memory, positioned randomly and communicating locally. Individually, the nodes are resource-poor, vanishingly cheap, and necessarily treated as freely expendable. Yet, when mixed together, they cooperate to form a machine whose aggregate compute capacity grows with the addition of more nodes. This approach ultimately recasts computing as a particulate additive to ordinary materials such as building materials or paint. The motivations for this work are low node complexity, ease of deployment, and tight coupling to dense sensor arrays, which all work to enable extraordinary scaling performance and to reduce the overall cost per unit compute capacity, in some cases dramatically. The challenge is the explosive complexity at the system level: the cumulative effect of the unbounded node count, inherent asynchrony, variable topology, and poor unit reliability. In this regime, the limiting problem has consistently been the lack of a programming model capable of reliably directing the global behavior of such a system. The strategy is to admit that we as human programmers will never be able to structure code for unrestricted use on this architecture, and to employ the principles of self-assembly to enable the code to structure itself.
In functional languages, map and fold (reduce) are fundamental operations, typically used in place of for and while loops.
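For illustration, the same pattern using only the Python standard library: map applies a function to each element independently, and functools.reduce folds the results together.

```python
from functools import reduce

squares = map(lambda x: x * x, range(10))        # "map": process each element independently
total = reduce(lambda acc, x: acc + x, squares)  # "fold"/"reduce": aggregate the results
print(total)                                      # 285
```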