Scalable Computing In Education Workshop

987 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
987
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
10
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Scalable Computing In Education Workshop

  1. 2008 NSF Data-Intensive Scalable Computing in Education Workshop
     Module II: Hadoop-Based Course Components
     This presentation includes content © Google, Inc., redistributed under the Creative Commons Attribution 2.5 license. All other contents:
  2. Overview
     - University of Washington Curriculum
       - Teaching Methods
       - Reflections
       - Student Background
       - Course Staff Requirements
     - MapReduce Algorithms
  3. UW: Course Summary
     - Course title: "Problem Solving on Large Scale Clusters"
     - Primary purpose: developing large-scale problem solving skills
     - Format: 6 weeks of lectures + labs, 4-week project
  4. UW: Course Goals
     - Think creatively about large-scale problems in a parallel fashion; design parallel solutions
     - Manage large data sets under memory and bandwidth limitations
     - Develop a foundation in parallel algorithms for large-scale data
     - Identify and understand engineering trade-offs in real systems
  5. Lectures
     - 2 hours, once per week
     - Half formal lecture, half discussion
     - Mostly covered systems and background
     - Included group activities for reinforcement
  6. Classroom Activities
     - Worksheets included pseudo-code programming and working through examples
       - Performed in groups of 2-3
     - Small-group discussions about engineering and systems design
       - Groups of ~10
       - Course staff facilitated, but mostly open-ended
  7. Readings
     - No textbook
     - One academic paper per week
       - E.g., "Simplified Data Processing on Large Clusters"
       - Short homework covered comprehension
     - Formed the basis for discussion
  8. Lecture Schedule
     - Introduction to Distributed Computing
     - MapReduce: Theory and Implementation
     - Networks and Distributed Reliability
     - Real-World Distributed Systems
     - Distributed File Systems
     - Other Distributed Systems
  9. Intro to Distributed Computing
     - What is distributed computing?
     - Flynn's Taxonomy
     - Brief history of distributed computing
     - Some background on synchronization and memory sharing
  10. MapReduce
     - Brief refresher on functional programming
     - MapReduce slides
       - More detailed version of Module I
     - Discussion on MapReduce
  11. Networking and Reliability
     - Crash course in networking
     - Distributed systems reliability
       - What is reliability?
       - How do distributed systems fail?
       - ACID, other metrics
     - Discussion: Does MapReduce provide reliability?
  12. Real Systems
     - Design and implementation of Nutch
     - Tech talk from a Googler on Google Maps
  13. Distributed File Systems
     - Introduced GFS
     - Discussed implementations of NFS and AndrewFS (AFS) for comparison
  14. Other Distributed Systems
     - BOINC: another platform
     - Broader definition of distributed systems
       - DNS
       - One Laptop per Child project
  15. Labs
     - Also 2 hours, once per week
     - Focused on applications of distributed systems
     - Four lab projects over six weeks
  16. Lab Schedule
     - Introduction to Hadoop, Eclipse Setup, Word Count
     - Inverted Index
     - PageRank on Wikipedia
     - Clustering on Netflix Prize Data
  17. Design Projects
     - Final four weeks of the quarter
     - Teams of 1-3 students
     - Students proposed a topic, gathered data, developed software, and presented their solution
  18. Example: Geozette (image © Julia Schwartz)
  19. Example: Galaxy Simulation (image © Slava Chernyak, Mike Hoak)
  20. Other Projects
     - Bayesian Wikipedia spam filter
     - Unsupervised synonym extraction
     - Video collage rendering
  21. Ongoing research: traceroutes
     - Analyze time-stamped traceroute data to model changes in Internet router topology
       - 4.5 GB of data/day × 1.5 years
       - 12 billion traces from 200 PlanetLab sites
     - Calculates prevalence and persistence of routes between hosts
  22. Ongoing research: dynamic program traces
     - Dynamic memory trace data from simulators can reach hundreds of GB
     - Existing work focuses on sampling
     - New capability: record all accesses and post-process with Hadoop
  23. Common Features
     - Hadoop!
     - Used publicly available web APIs for data
     - Many involved reading papers for algorithms and translating them into the MapReduce framework
  24. Background Topics
     - Programming Languages
     - Systems:
       - Operating Systems
       - File Systems
       - Networking
     - Databases
  25. Programming Languages
     - MapReduce is based on the functional programming operations map and fold
     - FP is taught in one quarter, but not reinforced
       - A "crash course" was necessary
       - Worksheets posed short problems in terms of map and fold
       - Immutable data is a key concept
  26. Multithreaded Programming
     - Taught in the OS course at Washington
       - Not a prerequisite!
     - Students need to understand multiple copies of the same method running in parallel
  27. File Systems
     - Necessary to understand GFS
     - Comparison to NFS and other distributed file systems is relevant
  28. Networking
     - TCP/IP
     - Concepts of "connection," network splits, and other failure modes
     - Bandwidth issues
  29. Other Systems Topics
     - Process scheduling
     - Synchronization
     - Memory coherency
  30. Databases
     - Concept of a shared consistency model
     - Consensus
     - ACID characteristics
       - Journaling
       - Multi-phase commit processes
  31. Course Staff
     - Instructor (me!)
     - Two undergrad teaching assistants
       - Helped facilitate discussions, directed labs
     - One student sysadmin
       - Worked only about three hours/week
  32. Preparation
     - Teaching assistants had taken the previous iteration of the course in winter
     - Lectures were retooled based on feedback from that quarter
       - Added a reasonably large amount of background material
     - Ran and solved all labs in advance
  33. The Course: What Worked
     - Discussions
       - Often covered a broad range of subjects
     - Hands-on lab projects
     - "Active learning" in the classroom
     - Independent design projects
  34. Things to Improve: Coverage
     - Algorithms were not reinforced during lecture
       - Students requested much more time be spent on "how to parallelize an iterative algorithm"
     - Background material was very fast-paced
  35. Things to Improve: Projects
     - Labs could have used a moderated/scripted discussion component
       - Just "jumping in" to the code proved difficult
       - No lecture time was devoted to Hadoop itself
       - The clustering lab should be split in two
     - Design projects could have used more time
  36. Part 2: Algorithms
  37. Algorithms for MapReduce
     - Sorting
     - Searching
     - Indexing
     - Classification
     - TF-IDF
     - PageRank
     - Clustering
  38. MapReduce Jobs
     - Tend to be very short, code-wise
       - IdentityReducer is very common
     - "Utility" jobs can be composed
     - Represent a data flow, more so than a procedure
  39. Sort: Inputs
     - A set of files, one value per line
     - Mapper key is (file name, line number)
     - Mapper value is the contents of the line
  40. Sort Algorithm
     - Takes advantage of reducer properties: (key, value) pairs are processed in order by key, and reducers are themselves ordered
     - Mapper: identity function for the value
       - (k, v) -> (v, _)
     - Reducer: identity function
       - (k', _) -> (k', "")
  41. Sort: The Trick
     - (key, value) pairs from mappers are sent to a particular reducer based on hash(key)
     - Must pick the hash function for your data such that k1 < k2 => hash(k1) < hash(k2)
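     One way to satisfy that requirement is a range-based partition function over a known key range; below is a minimal Java sketch assuming non-negative integer keys with a known maximum (Hadoop users would express the same idea as a custom Partitioner; this is an illustration, not the course's code).

     // Order-preserving partition function: k1 < k2 implies partition(k1) <= partition(k2),
     // so concatenating the sorted reducer outputs in partition order gives a global sort.
     public final class RangePartition {
       private RangePartition() {}

       public static int partition(long key, long maxKey, int numPartitions) {
         if (key < 0 || key > maxKey) {
           throw new IllegalArgumentException("key out of range: " + key);
         }
         long bucketSize = (maxKey / numPartitions) + 1;  // each partition owns a contiguous range
         return (int) (key / bucketSize);
       }

       public static void main(String[] args) {
         for (long k : new long[] {3, 249_999, 250_000, 999_999}) {
           System.out.println(k + " -> reducer " + partition(k, 1_000_000L, 4));
         }
       }
     }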
  42. Final Thoughts on Sort
     - Used as a test of Hadoop's raw speed
     - Essentially an "I/O drag race"
     - Highlights the utility of GFS
  43. Search: Inputs
     - A set of files containing lines of text
     - A search pattern to find
     - Mapper key is (file name, line number)
     - Mapper value is the contents of the line
     - The search pattern is sent as a special parameter
  44. Search Algorithm
     - Mapper:
       - Given (filename, some text) and "pattern": if the text matches the pattern, output (filename, _)
     - Reducer:
       - Identity function
  45. Search: An Optimization
     - Once a file is found to be interesting, we only need to mark it that way once
     - Use a Combiner function to fold redundant (filename, _) pairs into a single one
       - Reduces network I/O
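     A Hadoop-free Java sketch of this mapper-plus-combiner idea (the method names searchMap and combine are mine, purely for illustration):

     import java.util.ArrayList;
     import java.util.LinkedHashSet;
     import java.util.List;
     import java.util.Set;

     public class SearchSketch {
       // Mapper logic: emit the filename once per matching line.
       static void searchMap(String filename, String line, String pattern, List<String> output) {
         if (line.contains(pattern)) {
           output.add(filename);                  // emit (filename, _)
         }
       }

       // Combiner logic: collapse duplicate filenames emitted by a single map task.
       static Set<String> combine(List<String> mapOutput) {
         return new LinkedHashSet<>(mapOutput);   // one (filename, _) pair per file
       }

       public static void main(String[] args) {
         List<String> out = new ArrayList<>();
         searchMap("a.txt", "hadoop is fun", "hadoop", out);
         searchMap("a.txt", "more hadoop here", "hadoop", out);
         searchMap("b.txt", "nothing relevant", "hadoop", out);
         System.out.println(combine(out));        // prints [a.txt]
       }
     }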
  46. Indexing: Inputs
     - A set of files containing lines of text
     - Mapper key is (file name, line number)
     - Mapper value is the contents of the line
  47. Inverted Index Algorithm
     - Mapper: for each word in (file, words), map to (word, file)
     - Reducer: identity function
  48. Index: MapReduce
     map(pageName, pageText):
       foreach word w in pageText:
         emitIntermediate(w, pageName)
       done

     reduce(word, values):
       foreach pageName in values:
         AddToOutputList(pageName)
       done
       emitFinal(FormattedPageListForWord)
  49. Index: Data Flow
  50. An Aside: Word Count
     - Word count was described in Module I
     - The mapper for Word Count emits (word, 1) for each word in the input line
       - Strikingly similar to the inverted index
       - Common theme: reuse/modify existing mappers
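     For reference, here is the canonical word-count mapper/reducer pair written against the newer org.apache.hadoop.mapreduce Java API (the 2008 course used Hadoop's older interfaces, and job setup is omitted, so treat this as a sketch rather than the course's lab code). Changing the mapper to emit (word, filename) instead of (word, 1) turns it into essentially the inverted-index job above.

     import java.io.IOException;
     import java.util.StringTokenizer;
     import org.apache.hadoop.io.IntWritable;
     import org.apache.hadoop.io.LongWritable;
     import org.apache.hadoop.io.Text;
     import org.apache.hadoop.mapreduce.Mapper;
     import org.apache.hadoop.mapreduce.Reducer;

     public class WordCount {
       // Mapper: for each token in the input line, emit (word, 1).
       public static class TokenizerMapper
           extends Mapper<LongWritable, Text, Text, IntWritable> {
         private static final IntWritable ONE = new IntWritable(1);
         private final Text word = new Text();

         @Override
         protected void map(LongWritable key, Text value, Context context)
             throws IOException, InterruptedException {
           StringTokenizer itr = new StringTokenizer(value.toString());
           while (itr.hasMoreTokens()) {
             word.set(itr.nextToken());
             context.write(word, ONE);
           }
         }
       }

       // Reducer: sum the 1's for each word.
       public static class IntSumReducer
           extends Reducer<Text, IntWritable, Text, IntWritable> {
         @Override
         protected void reduce(Text key, Iterable<IntWritable> values, Context context)
             throws IOException, InterruptedException {
           int sum = 0;
           for (IntWritable v : values) {
             sum += v.get();
           }
           context.write(key, new IntWritable(sum));
         }
       }
     }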
  51. Bayesian Classification
     - Files containing classification instances are sent to mappers
     - Map: (filename, instance) -> (instance, class)
     - Identity Reducer
  52. Bayesian Classification
     - Existing toolkits can perform Bayes classification on an instance
       - E.g., WEKA, already in Java!
     - Another example of discarding the input key
  53. TF-IDF
     - Term Frequency - Inverse Document Frequency
       - Relevant to text processing
       - Common web analysis algorithm
  54. The Algorithm, Formally
     - tf(i, j) = n(i, j) / Σ_k n(k, j): the frequency of term t_i in document d_j
     - idf(i) = log( |D| / |{d : t_i ∈ d}| )
     - |D|: total number of documents in the corpus
     - |{d : t_i ∈ d}|: number of documents where the term t_i appears (that is, n(i, j) ≠ 0)
     - tf-idf(i, j) = tf(i, j) × idf(i)
  55. Information We Need
     - Number of times term X appears in a given document
     - Number of terms in each document
     - Number of documents X appears in
     - Total number of documents
  56. Job 1: Word Frequency in Doc
     - Mapper
       - Input: (docname, contents)
       - Output: ((word, docname), 1)
     - Reducer
       - Sums the counts for each word in a document
       - Outputs ((word, docname), n)
     - Combiner is the same as the Reducer
  57. Job 2: Word Counts for Docs
     - Mapper
       - Input: ((word, docname), n)
       - Output: (docname, (word, n))
     - Reducer
       - Sums the individual n's for the same doc to get N, the number of terms in that doc
       - Feeds the original data through
       - Outputs ((word, docname), (n, N))
  58. Job 3: Word Frequency in Corpus
     - Mapper
       - Input: ((word, docname), (n, N))
       - Output: (word, (docname, n, N, 1))
     - Reducer
       - Sums the 1's to count how many documents in the corpus the word appears in
       - Outputs ((word, docname), (n, N, m))
  59. Job 4: Calculate TF-IDF
     - Mapper
       - Input: ((word, docname), (n, N, m))
       - Assume D is known (or an easy MR job can find it)
       - Output: ((word, docname), TF*IDF)
     - Reducer
       - Just the identity function
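     Once jobs 1-3 have produced n, N, and m, the job-4 arithmetic is a one-liner; a small self-contained Java sketch using the slides' variable names (illustrative, not the lab solution):

     // n = occurrences of the word in the document, N = total terms in the document,
     // m = number of documents containing the word, D = total documents in the corpus.
     public final class TfIdf {
       private TfIdf() {}

       public static double tfIdf(long n, long N, long m, long D) {
         double tf  = (double) n / N;            // term frequency within the document
         double idf = Math.log((double) D / m);  // inverse document frequency
         return tf * idf;
       }

       public static void main(String[] args) {
         // e.g., a word appearing 5 times in a 100-term page, in 10 of 1,000 documents
         System.out.println(tfIdf(5, 100, 10, 1_000));  // ~0.23
       }
     }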
  60. Working at Scale
     - Buffering (doc, n, N) counts while summing 1's into m may not fit in memory
       - How many documents does the word "the" occur in?
     - Possible solutions
       - Ignore very-high-frequency words
       - Write intermediate data out to a file
       - Use another MR pass
  61. Final Thoughts on TF-IDF
     - Several small jobs add up to the full algorithm
     - Lots of code reuse is possible
       - Stock classes exist for aggregation and identity
     - Jobs 3 and 4 can really be done at once in the same reducer, saving a write/read cycle
     - Very easy to handle medium-large scale, but care is needed to ensure flat memory usage at the largest scale
  62. PageRank: Random Walks Over the Web
     - If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page?
     - The PageRank of a page captures this notion
       - More "popular" or "worthwhile" pages get a higher rank
  63. PageRank: Visually
  64. PageRank: Formula
     - Given page A, and pages T1 through Tn linking to A, PageRank is defined as:
       PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
     - C(P) is the cardinality (out-degree) of page P
     - d is the damping ("random URL") factor
  65. PageRank: Intuition
     - The calculation is iterative: PR_(i+1) is based on PR_i
     - Each page distributes its PR_i to all pages it links to; linkees add up their awarded rank fragments to find their PR_(i+1)
     - d is a tunable parameter (usually 0.85) encapsulating the "random jump factor"
       - PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
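     A tiny worked example of the formula with made-up numbers: suppose page A has in-links from T1 (PR 1.0, 4 out-links) and T2 (PR 0.5, 2 out-links), and d = 0.85.

     public final class PageRankFormula {
       private PageRankFormula() {}

       static double pageRank(double d, double[] inlinkPR, int[] inlinkOutDegree) {
         double sum = 0.0;
         for (int i = 0; i < inlinkPR.length; i++) {
           sum += inlinkPR[i] / inlinkOutDegree[i];  // PR(T_i) / C(T_i)
         }
         return (1 - d) + d * sum;
       }

       public static void main(String[] args) {
         // PR(A) = 0.15 + 0.85 * (1.0/4 + 0.5/2) = 0.575
         System.out.println(pageRank(0.85, new double[] {1.0, 0.5}, new int[] {4, 2}));
       }
     }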
  66. Graph Representations
     - The most straightforward representation of a graph uses references from each node to its neighbors
  67. Direct References
     - Structure is inherent to the object
     - Iteration requires a linked list "threaded through" the graph
     - Requires a common view of shared memory (synchronization!)
     - Not easily serializable

     class GraphNode {
       Object data;
       Vector<GraphNode> out_edges;
       GraphNode iter_next;
     }
  68. Adjacency Matrices
     - Another classic graph representation: M[i][j] = 1 implies a link from node i to node j
     - Naturally encapsulates iteration over nodes

            1  2  3  4
        1   0  1  0  1
        2   1  0  1  1
        3   0  1  0  0
        4   1  0  1  0
  69. Adjacency Matrices: Sparse Representation
     - The adjacency matrix for most large graphs (e.g., the web) will be overwhelmingly full of zeros
     - Each row of the matrix is absurdly long
     - Sparse matrices only include the non-zero elements
  70. Sparse Matrix Representation
     1: (3, 1), (18, 1), (200, 1)
     2: (6, 1), (12, 1), (80, 1), (400, 1)
     3: (1, 1), (14, 1)
     ...
  71. Sparse Matrix Representation
     1: 3, 18, 200
     2: 6, 12, 80, 400
     3: 1, 14
     ...
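     In Java, this compact row form is just a map from node id to its out-link list; a minimal sketch using the rows shown above (class and variable names are mine):

     import java.util.List;
     import java.util.Map;

     public final class SparseGraph {
       private SparseGraph() {}

       public static void main(String[] args) {
         // Each stored entry is implicitly a 1, so only the column numbers are kept
         // (and only they need to be transmitted between MapReduce phases).
         Map<Integer, List<Integer>> outLinks = Map.of(
             1, List.of(3, 18, 200),
             2, List.of(6, 12, 80, 400),
             3, List.of(1, 14));

         // The out-degree C(P) used by PageRank falls out of the row length.
         System.out.println("C(2) = " + outLinks.get(2).size());  // prints C(2) = 4
       }
     }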
  72. PageRank: First Implementation
     - Create two tables, 'current' and 'next', holding the PageRank for each page; seed 'current' with initial PR values
     - Iterate over all pages in the graph, distributing PR from 'current' into the 'next' entries of linkees
     - current := next; next := fresh_table()
     - Go back to the iteration step, or end if converged
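     A sequential sketch of that two-table loop in plain Java (illustrative only: a fixed iteration count stands in for a real convergence test, and the graph is a tiny hard-coded out-link map):

     import java.util.HashMap;
     import java.util.List;
     import java.util.Map;

     public final class SequentialPageRank {
       private SequentialPageRank() {}

       static Map<Integer, Double> run(Map<Integer, List<Integer>> outLinks,
                                       double d, int iterations) {
         Map<Integer, Double> current = new HashMap<>();
         outLinks.keySet().forEach(page -> current.put(page, 1.0));  // seed 'current'

         for (int i = 0; i < iterations; i++) {
           Map<Integer, Double> next = new HashMap<>();
           outLinks.keySet().forEach(page -> next.put(page, 0.0));

           // Distribute each page's rank evenly over its out-links.
           for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
             double share = current.get(e.getKey()) / e.getValue().size();
             for (int target : e.getValue()) {
               next.merge(target, share, Double::sum);
             }
           }
           // Fix up with the damping factor: PR = (1 - d) + d * (sum of fragments).
           next.replaceAll((page, sum) -> (1 - d) + d * sum);

           current.clear();          // current := next; next := fresh_table()
           current.putAll(next);
         }
         return current;
       }

       public static void main(String[] args) {
         Map<Integer, List<Integer>> g = Map.of(
             1, List.of(2, 3),
             2, List.of(3),
             3, List.of(1));
         System.out.println(run(g, 0.85, 20));
       }
     }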
  73. Distribution of the Algorithm
     - Key insights allowing parallelization:
       - The 'next' table depends on 'current', but not on any other rows of 'next'
       - Individual rows of the adjacency matrix can be processed in parallel
       - Sparse matrix rows are relatively small
  74. Distribution of the Algorithm
     - Consequences of these insights:
       - We can map each row of 'current' to a list of PageRank "fragments" to assign to linkees
       - These fragments can be reduced into a single PageRank value for a page by summing
       - The graph representation can be even more compact: since each element is simply 0 or 1, only transmit the column numbers where it's 1
  76. Phase 1: Parse HTML
     - Map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls))
       - PR_init is the "seed" PageRank for the URL
       - list-of-urls contains all pages pointed to by the URL
     - Reduce task is just the identity function
  77. Phase 2: PageRank Distribution
     - Map task takes (URL, (cur_rank, url_list))
       - For each u in url_list, emit (u, cur_rank / |url_list|)
       - Emit (URL, url_list) to carry the points-to list along through iterations
       - PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  78. Phase 2: PageRank Distribution
     - Reduce task gets (URL, url_list) and many (URL, val) values
       - Sum the vals and fix up with d
       - Emit (URL, (new_rank, url_list))
       - PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
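     A plain-Java sketch of the Phase 2 map and reduce logic (no Hadoop dependency; the Emit callback and the string tags that distinguish rank fragments from the carried link list are my own illustration choices):

     import java.util.ArrayList;
     import java.util.List;

     public class PageRankPhase2 {

       interface Emit { void emit(String key, String value); }

       // Map: split cur_rank into fragments for the linkees, and carry the link list along.
       static void map(String url, double curRank, List<String> urlList, Emit out) {
         double share = curRank / urlList.size();
         for (String u : urlList) {
           out.emit(u, "RANK:" + share);                      // (u, cur_rank / |url_list|)
         }
         out.emit(url, "LINKS:" + String.join(",", urlList)); // carry the points-to list forward
       }

       // Reduce: sum the fragments, apply the damping factor, re-attach the link list.
       static void reduce(String url, List<String> values, double d, Emit out) {
         double sum = 0.0;
         List<String> links = new ArrayList<>();
         for (String v : values) {
           if (v.startsWith("RANK:")) {
             sum += Double.parseDouble(v.substring(5));
           } else {
             for (String u : v.substring(6).split(",")) {
               links.add(u);
             }
           }
         }
         double newRank = (1 - d) + d * sum;                  // PR = (1 - d) + d * sum
         out.emit(url, newRank + "|" + String.join(",", links));
       }

       public static void main(String[] args) {
         Emit printer = (k, v) -> System.out.println(k + " -> " + v);
         map("A", 1.0, List.of("B", "C"), printer);
         reduce("B", List.of("RANK:0.5", "LINKS:A,D"), 0.85, printer);  // B -> 0.575|A,D
       }
     }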
  79. Finishing Up...
     - A subsequent component determines whether convergence has been achieved (a fixed number of iterations? comparison of key values?)
     - If so, write out the PageRank lists; done!
     - Otherwise, feed the output of Phase 2 into another Phase 2 iteration
  80. PageRank Conclusions
     - MapReduce does the "heavy lifting" in the iterated computation
     - The key element in parallelization is that the PageRank computations within a given step are independent
     - Parallelization requires thinking about the minimum data partitions to transmit (e.g., compact representations of graph rows)
       - Even the implementation shown today doesn't actually scale to the whole Internet, but it works for intermediate-sized graphs
  81. Clustering
     - What is clustering?
  82. Google News
     - They didn't pick all 3,400,217 related articles by hand...
     - Neither did Amazon.com
     - Or Netflix...
  83. Other, Less Glamorous Things...
     - Hospital records
     - Scientific imaging
       - Related genes, related stars, related sequences
     - Market research
       - Segmenting markets, product positioning
     - Social network analysis
     - Data mining
     - Image segmentation...
  84. The Distance Measure
     - How the similarity of two elements in a set is determined, e.g.:
       - Euclidean distance
       - Manhattan distance
       - Inner product space
       - Maximum norm
       - Or any metric you define over the space...
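     Two of these measures as small Java functions over dense double vectors (a generic sketch, not tied to the lab code):

     public final class Distances {
       private Distances() {}

       // Euclidean distance: square root of the sum of squared coordinate differences.
       public static double euclidean(double[] a, double[] b) {
         double sum = 0.0;
         for (int i = 0; i < a.length; i++) {
           double diff = a[i] - b[i];
           sum += diff * diff;
         }
         return Math.sqrt(sum);
       }

       // Manhattan distance: sum of absolute coordinate differences.
       public static double manhattan(double[] a, double[] b) {
         double sum = 0.0;
         for (int i = 0; i < a.length; i++) {
           sum += Math.abs(a[i] - b[i]);
         }
         return sum;
       }

       public static void main(String[] args) {
         double[] p = {1.0, 2.0}, q = {4.0, 6.0};
         System.out.println(euclidean(p, q));  // 5.0
         System.out.println(manhattan(p, q));  // 7.0
       }
     }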
  85. Types of Algorithms
     - Hierarchical clustering vs. partitional clustering
  86. Hierarchical Clustering
     - Builds up or breaks down a hierarchy of clusters
  87. Partitional Clustering
     - Partitions the set into all clusters simultaneously
  89. K-Means Clustering
     - Simple partitional clustering
     - Choose the number of clusters, k
     - Choose k points to be the cluster centers
     - Then...
  90. K-Means Clustering
     iterate {
       compute the distance from all points to all k centers
       assign each point to the nearest center
       compute the average of all points assigned to each center
       replace the k centers with the new averages
     }
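     A single-machine Java sketch of one such iteration (assignment plus recomputation), reusing the hypothetical Distances.euclidean helper from the earlier sketch; in the MapReduce version, mappers perform the assignment step and reducers average each cluster:

     import java.util.Arrays;

     public final class KMeansStep {
       private KMeansStep() {}

       static double[][] iterate(double[][] points, double[][] centers) {
         int k = centers.length, dim = centers[0].length;
         double[][] sums = new double[k][dim];
         int[] counts = new int[k];

         for (double[] p : points) {
           // Assign the point to its nearest center.
           int best = 0;
           for (int c = 1; c < k; c++) {
             if (Distances.euclidean(p, centers[c]) < Distances.euclidean(p, centers[best])) {
               best = c;
             }
           }
           for (int d = 0; d < dim; d++) {
             sums[best][d] += p[d];
           }
           counts[best]++;
         }

         // Replace each center with the average of its assigned points.
         double[][] next = new double[k][dim];
         for (int c = 0; c < k; c++) {
           for (int d = 0; d < dim; d++) {
             next[c][d] = counts[c] == 0 ? centers[c][d] : sums[c][d] / counts[c];
           }
         }
         return next;
       }

       public static void main(String[] args) {
         double[][] pts = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
         double[][] centers = {{0, 0}, {10, 10}};
         System.out.println(Arrays.deepToString(iterate(pts, centers)));  // [[0.0, 0.5], [10.0, 10.5]]
       }
     }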
  91. But!
     - The complexity is pretty high: O(k * n * cost(distance metric) * number of iterations)
     - Moreover, it can be necessary to send tons of data to each mapper node; depending on the bandwidth and memory available, this could be impossible
  92. Furthermore
     - There are three big ways a data set can be large:
       - There are a large number of elements in the set
       - Each element can have many features
       - There can be many clusters to discover
     - Conclusion: clustering can be huge, even when you distribute it
  93. Canopy Clustering
     - A preliminary step to help parallelize the computation
     - Clusters the data into overlapping canopies using a super-cheap distance metric
     - Efficient
     - Accurate
  94. Canopy Clustering
     while there are unmarked points {
       pick a point which is not strongly marked
       call it a canopy center
       mark all points within some threshold of it as in its canopy
       strongly mark all points within some stronger threshold
     }
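     A compact single-machine sketch of that loop in Java, again reusing the hypothetical Distances helper as the cheap metric; the threshold in main is an arbitrary illustration value, and a fuller version would also record which points fall within the looser threshold of each center:

     import java.util.ArrayList;
     import java.util.List;

     public final class CanopySketch {
       private CanopySketch() {}

       static List<double[]> pickCanopyCenters(List<double[]> points, double strongThreshold) {
         List<double[]> centers = new ArrayList<>();
         boolean[] stronglyMarked = new boolean[points.size()];

         for (int i = 0; i < points.size(); i++) {
           if (stronglyMarked[i]) continue;   // only not-strongly-marked points become centers
           double[] center = points.get(i);
           centers.add(center);
           for (int j = 0; j < points.size(); j++) {
             // Strongly mark everything very close to the new center so it
             // cannot start another canopy of its own.
             if (Distances.euclidean(center, points.get(j)) < strongThreshold) {
               stronglyMarked[j] = true;
             }
           }
         }
         return centers;
       }

       public static void main(String[] args) {
         List<double[]> pts = List.of(
             new double[] {0, 0}, new double[] {0.5, 0.5},
             new double[] {9, 9}, new double[] {9.5, 9.0});
         System.out.println(pickCanopyCenters(pts, 1.0).size());  // prints 2
       }
     }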
  95. After the Canopy Clustering...
     - Resume hierarchical or partitional clustering as usual
     - Treat objects in separate canopies as being at infinite distances
  96. MapReduce Implementation
     - Problem: efficiently partition a large data set (say... movies with user ratings!) into a fixed number of clusters using canopy clustering, k-means clustering, and a Euclidean distance measure
  97. The Distance Metric
     - The Canopy metric ($)
     - The K-Means metric ($$$)
  98. Steps!
     - Get the data into a form you can use (MR)
     - Pick canopy centers (MR)
     - Assign data points to canopies (MR)
     - Pick K-Means cluster centers
     - Run the K-Means algorithm (MR)
       - Iterate!
  99. Selecting Canopy Centers
  115. Assigning Points to Canopies
  121. K-Means Map
  129. Elbow Criterion
     - Choose a number of clusters such that adding another cluster doesn't add interesting information
     - A rule of thumb for determining how many clusters to choose
     - The initial assignment of cluster seeds has a bearing on final model performance
     - It is often necessary to run the clustering several times to get the best result
  130. Clustering Conclusions
     - Clustering is slick
     - And it can be done super efficiently
     - And in lots of different ways
  131. Conclusions
     - Lots of high-level algorithms
     - Lots of deep connections to low-level systems
     - Discussion-based classes helped students think critically about real issues
     - Code labs made them real
