Distributed computing

1. Distributed computing
   Nov. 29th, 2010
2. Agenda
   Who am I?
   What am I talking about?
   Just a bit of history ... repeating
   Applications of distributed computing
   Enter Google
   Fallacies of designing a distributed computing system
   Map-Reduce going public: Hadoop
   HBase
   Mahout
   Q&A
3. Who am I?
   Computer Scientist with Adobe Systems Inc. for 5 years
   Worked on desktop apps
   Worked on scalable services
   Experimented with Hadoop; it eventually turned into a product
   Now doing research
   ivascucristian@twitter / http://facebook.com/ivascucristian
   Contact: civascu@adobe.com
4. Distributed computing?
   Run some code over lots of machines, across the network
   Without shared state
   Over a ton of data
   Shines only when the data to process far outweighs what the network can move
   (e.g. if shipping the data to one big machine takes longer than processing it in place, distribute)
   In reasonable time: no, one week is not OK
   It's not new, contrary to popular belief.
5. History time
   Local computing
   Parallel computing
   Grid computing
   Distributed computing
   The evolution tracks the growth in data size and computation complexity
6. Local computing
   Everything happens on a single machine
   No overhead (network, synchronization, etc.)
   Limited by how much hardware you can fit in one box
7. Parallel computing
   Everything still happens on a single machine
   Enter overhead: multiple computation units fighting over memory
   Limited by your budget and by physics (do you really need a Cray?)
8. Grid computing
   Moves the computation units away from the data
   More overhead: all data lives on a SAN and must cross the network to reach the computation units
   Limited by how much you can spend growing the SAN and by how much data you must process
9. Distributed computing
   Moves the computation units to the data, but away from each other
   Overhead galore: network, synchronization, heterogeneous machines, development time
   Limited only by how many machines you can afford to add
10. Why distributed computing?
    "But it's webscale!" ... really?
    Large data sets that need to be crunched, offline:
    Web indexing
    SVMs over huge training sets
    Predictions based on long histories (e.g. credit-card fraud patterns)
    MMORPGs
    Distributed databases
    ...
11. Adobe Media Player
    Clusters of users with similar interests
    6 GB of logs: one month, 700k AMP users subscribing to shows in 114 genres
    Processed with Mahout, over Hadoop, using canopy clustering
    7 testing servers
    5 hours of data crunching
    27 preference clusters
12. Game Constellations
    Processing Shockwave logs
13. Why so popular all of a sudden?
    Telecom did it for years, in the shadows
    Then came Google.
    The papers on GFS (2003) and Map-Reduce (2004), plus Chubby, plus Google's success = BOOM!
    They proved it works, and everyone felt its usefulness
    It was not new; just well thought out
14. Designing a distributed computing system
    The network is reliable
    Latency is zero
    Bandwidth is infinite
    The network is secure
    Topology doesn't change
    There is one administrator
    Transport cost is zero
    The network is homogeneous
    ... and every one of these eight assumptions is a FALLACY
15. Where does Hadoop fit in?
    Google's implementation is secret sauce, so no dice in using it
    But others needed it (Nutch), so they copied it (ish)
    Hadoop: an open-source implementation of Map-Reduce / GFS
16. Hadoop components
    Hadoop Distributed File System (HDFS)
      Distributes and stores data across a cluster (brief intro only)
    Hadoop Map Reduce (MR)
      Provides a parallel programming model
      Moves computation to where the data is
      Handles scheduling and fault tolerance
      Status reporting and monitoring
17. HDFS
    Mitigates failure through replication
    The placement algorithm tracks machine location: one copy on another machine in the same rack, one in another rack, one at random; never two copies on the same machine, even if it has multiple drives
    Tries to preserve data locality
    Computation running on the data uses the locations of the replicas
18. HDFS Architecture
    [Diagram: a Client sends metadata ops to the Namenode (master), which stores the FS metadata (namespace, block locations); the Client reads from and writes to Datanodes, which store the data blocks as Linux files and replicate blocks among themselves]
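To make the client's role concrete, here is a minimal sketch using Hadoop's Java FileSystem API; the path, file contents, and replication factor are illustrative:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsSketch {
          public static void main(String[] args) throws Exception {
              // Picks up fs.default.name (e.g. hdfs://namenode:9000) from core-site.xml
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);

              Path path = new Path("/demo/hello.txt");  // illustrative path

              // Write: the client streams block data to datanodes; the namenode
              // only hands out block locations and records the metadata
              FSDataOutputStream out = fs.create(path, true);
              out.writeUTF("hello hdfs");
              out.close();

              // Ask for three copies of each block; placement (same rack /
              // other rack / random) is the namenode's job, as on the slide above
              fs.setReplication(path, (short) 3);

              // Read: the client fetches block data straight from the datanodes
              FSDataInputStream in = fs.open(path);
              System.out.println(in.readUTF());
              in.close();
          }
      }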
19. MapReduce
    How do you scale large data-processing applications?
    Divide the data and process it on many nodes
    Each such application would have to handle:
      Communication between nodes
      Division and scheduling of work
      Fault tolerance
      Monitoring and reporting
    Map Reduce handles and hides all of these issues
    Provides a clean abstraction for the programmer
20. Map-Reduce Architecture
    [Diagram: an input job (mapper, reducer, input) goes to the Jobtracker, which assigns tasks to the Tasktrackers; data transfer happens between Tasktrackers]
    Each node is part of an HDFS cluster
    Input data is stored in HDFS, spread across nodes and replicated
    The programmer submits a job (mapper, reducer, input) to the Job tracker
    Job tracker (master):
      splits the input data
      schedules and monitors the various map and reduce tasks
    Task trackers (slaves):
      execute the map and reduce tasks
21. MapReduce Programming Model
    Inspired by functional-language primitives
    map f list: applies the function f to each element of list and returns a new list
      map square [1 2 3 4 5] = [1 4 9 16 25]
    reduce g list: combines the elements of list using the function g to produce a new value
      reduce sum [1 2 3 4 5] = 15
    Map and reduce do not modify their input data; they always create new data
    A Hadoop Map Reduce job consists of a mapper and a reducer
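The same two primitives in modern Java streams, purely as an illustration of the functional model (the 2010-era Hadoop API itself predates streams):

      import java.util.List;
      import java.util.stream.Collectors;

      public class FunctionalAnalogy {
          public static void main(String[] args) {
              List<Integer> input = List.of(1, 2, 3, 4, 5);

              // map: apply a function to every element, producing a new list
              List<Integer> squares = input.stream()
                      .map(x -> x * x)
                      .collect(Collectors.toList());       // [1, 4, 9, 16, 25]

              // reduce: fold all elements into a single value
              int sum = input.stream().reduce(0, Integer::sum);  // 15

              System.out.println(squares + " " + sum);
          }
      }

Note that neither operation mutates `input`, which is exactly the property the slide calls out.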
22. Map Reduce Programming Model
    Mapper
      Records (lines, database rows, etc.) come in as key/value pairs
      The mapper outputs one or more intermediate key/value pairs for each input
      map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
    Reducer
      After the map phase, all the intermediate values for a given output key are grouped into a list
      The reducer combines those intermediate values into one or more final key/value pairs
      reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
    Input and output key/value types can differ
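The canonical word-count example fits these exact signatures (the old org.apache.hadoop.mapred API): K1/V1 are the line offset and line text, K2/V2 and K3/V3 are word and count. The class names are illustrative, and the driver that sets input/output paths and submits a JobConf is omitted:

      import java.io.IOException;
      import java.util.Iterator;
      import java.util.StringTokenizer;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      // Mapper: one input line -> a (word, 1) pair per word
      public class WordCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
              StringTokenizer it = new StringTokenizer(value.toString());
              while (it.hasMoreTokens()) {
                  word.set(it.nextToken());
                  output.collect(word, ONE);
              }
          }
      }

      // Reducer: (word, [1, 1, ...]) -> (word, count)
      class WordCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
          }
      }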
23. Parallel execution
    [Diagram: map and reduce tasks executing in parallel across the cluster]
24. Map Reduce Advantages
    Locality
      The Job tracker assigns tasks based on the location of the data: it tries to schedule each map task on a machine that holds the physical data
    Parallelism
      Map tasks run in parallel, working on different input splits
      Reduce tasks run in parallel, working on different intermediate keys
      Reduce tasks wait until all map tasks have finished
    Fault tolerance
      The Job tracker maintains a heartbeat with the task trackers
      Failures are handled by re-execution
      If a task tracker node fails, all tasks scheduled on it (completed or incomplete) are re-executed on another node
25. HBase
    A distributed database on top of HDFS
    Map-Reduce enabled
    Fault-tolerant and scalable; relies on the core Hadoop values
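As a flavor of the client API from the era of this deck (circa HBase 0.20/0.90), a minimal put-then-get sketch; the table name, column family, row key, and value are all illustrative:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseSketch {
          public static void main(String[] args) throws Exception {
              // Reads the ZooKeeper quorum etc. from hbase-site.xml
              Configuration conf = HBaseConfiguration.create();
              // "users" table and "info" column family are illustrative names
              HTable table = new HTable(conf, "users");

              // Write one cell: row "row-1", column info:name
              Put put = new Put(Bytes.toBytes("row-1"));
              put.add(Bytes.toBytes("info"), Bytes.toBytes("name"),
                      Bytes.toBytes("Ada"));
              table.put(put);

              // Read it back
              Result result = table.get(new Get(Bytes.toBytes("row-1")));
              byte[] name = result.getValue(Bytes.toBytes("info"),
                                            Bytes.toBytes("name"));
              System.out.println(Bytes.toString(name));

              table.close();
          }
      }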
26. Mahout
    An Apache Software Foundation project to create scalable machine-learning libraries under the Apache Software License
    http://mahout.apache.org
    Why Mahout? Many open-source ML libraries either:
      Lack community
      Lack documentation and examples
      Lack scalability
      Lack the Apache License
      Or are research-oriented
27. Machine learning?
    [Screenshots: Amazon.com product recommendations and Google News story clustering, as everyday examples of ML]
28. Machine learning!
    "Machine Learning is programming computers to optimize a performance criterion using example data or past experience"
    (Introduction to Machine Learning, E. Alpaydin)
    A subset of Artificial Intelligence
    Lots of related fields:
      Information retrieval
      Statistics
      Biology
      Linear algebra
      Many more
29. ML Use-cases
    Recommend friends/dates/products
    Classify content into predefined groups
    Find similar content based on object properties
    Find associations/patterns in actions/behaviors
    Identify key topics in large collections of text
    Detect anomalies in machine output
    Rank search results
    Others?
30. Getting Started with ML
    Get your data
    Decide on your features, per your algorithm
    Prep the data
      Different approaches for different algorithms
    Run your algorithm(s)
      Lather, rinse, repeat
    Validate your results
      Smell test, A/B testing, more formal methods
31. Focus: Machine Learning
    [Diagram of the Mahout stack: applications and examples on top; algorithm families (Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic); Math (Vectors/Matrices/SVD); Utilities (Lucene vectorizer, primitive Collections); all on top of Apache Hadoop]
32. Focus: Scalable
    Goal: be as fast and efficient as possible given the intrinsic design of the algorithm
      Some algorithms won't scale to massive machine clusters
      Others fit logically on a Map Reduce framework like Apache Hadoop
      Still others will need alternative distributed programming models
      Be pragmatic
    Most Mahout implementations are Map Reduce enabled
33. Implemented Algorithms
    Classification
    Clustering
    Pattern mining
    Regression
    Dimension reduction
    Evolutionary algorithms
    Collaborative filtering
34. Recommendations
    An extensive framework for collaborative filtering
    Recommenders: user-based and item-based
    Online and offline support; offline can utilize Hadoop
    Many different similarity measures: cosine, LLR, Tanimoto, Pearson, others
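For instance, a user-based recommender assembled from Mahout's Taste framework; the preferences file (userID,itemID,value lines), the neighborhood size, and the user ID are illustrative:

      import java.io.File;
      import java.util.List;

      import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
      import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
      import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
      import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
      import org.apache.mahout.cf.taste.model.DataModel;
      import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
      import org.apache.mahout.cf.taste.recommender.RecommendedItem;
      import org.apache.mahout.cf.taste.recommender.Recommender;
      import org.apache.mahout.cf.taste.similarity.UserSimilarity;

      public class RecommenderSketch {
          public static void main(String[] args) throws Exception {
              // prefs.csv: userID,itemID,preference per line (illustrative file)
              DataModel model = new FileDataModel(new File("prefs.csv"));

              // Pearson correlation between users, 10-nearest-neighbor hood
              UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
              UserNeighborhood neighborhood =
                  new NearestNUserNeighborhood(10, similarity, model);
              Recommender recommender =
                  new GenericUserBasedRecommender(model, neighborhood, similarity);

              // Top 5 items for user 42
              List<RecommendedItem> recs = recommender.recommend(42, 5);
              for (RecommendedItem item : recs) {
                  System.out.println(item.getItemID() + " " + item.getValue());
              }
          }
      }

This is the "online" path; the offline, Hadoop-based path mentioned on the slide runs the same kind of computation as Map Reduce jobs.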
35. Clustering
    Document level
      Group documents based on a notion of similarity
      K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
      Distance measures: Manhattan, Euclidean, others
    Topic modeling
      Cluster words across documents to identify topics
      Latent Dirichlet Allocation
36. Categorization
    Place new items into predefined categories:
      Sports, politics, entertainment
    Mahout has several implementations:
      Naïve Bayes
      Complementary Naïve Bayes
      Decision Forests
      Logistic Regression (almost done)
37. Freq. Pattern Mining
    Identify frequently co-occurring items
    Useful for:
      Query recommendations (apple -> iPhone, orange, OS X)
      Related product placement ("beer and diapers")
      Spam detection
38. Evolutionary
    Map-Reduce-ready fitness functions for genetic programming
    Integration with Watchmaker: http://watchmaker.uncommons.org/index.php
    Problems solved: traveling salesman, class discovery, many others
39. Singular Value Decomposition
    Reduces a big matrix to a much smaller one by amplifying the important parts while removing or shrinking the less important ones
    Mahout has a fully distributed Lanczos implementation
    https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
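In standard notation (not Mahout-specific), the slide's "smaller matrix" is the rank-k truncation that keeps only the k largest singular values of an m-by-n matrix A:

      A \approx A_k = U_k \, \Sigma_k \, V_k^{\mathsf{T}},
      \qquad \Sigma_k = \operatorname{diag}(\sigma_1, \ldots, \sigma_k),
      \qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k > 0

Storing U_k (m by k), the k singular values, and V_k (n by k) costs roughly k(m + n) numbers instead of mn for the full matrix, which is the reduction the slide refers to.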
40. Resources
    http://mahout.apache.org
    http://cwiki.apache.org/MAHOUT
    {user|dev}@mahout.apache.org
    http://svn.apache.org/repos/asf/mahout/trunk
    Hadoop: http://hadoop.apache.org/
    Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html
    S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. http://labs.google.com/papers/gfs.html
    http://code.google.com/edu/parallel/index.html
    http://www.youtube.com/watch?v=yjPBkvYh-ss
    http://www.youtube.com/watch?v=-vD6PUdf3Js
41. Q&A
