2. Agenda Who am I? What am I talking about? Just a bit of history … repeating Applications of distributed computing Enter Google Designing a distributed computing system: fallacies MapReduce going public: Hadoop HBase Mahout Q&A
3. Who am I? Computer Scientist with Adobe Systems Inc. for 5 years Worked on desktop apps Worked on scalable services Experimented with Hadoop; it eventually turned into a product Now doing research @ivascucristian on Twitter / http://facebook.com/ivascucristian Contact: civascu@adobe.com
4. Distributed computing? Run some code over lots of machines, over the network, without shared state Run it over a ton of data; shines only when the data to process >> network capacity Run it in reasonable time (no, 1 week is not OK) It's not new, contrary to popular belief.
5. History time Local computing Parallel computing Grid computing Distributed computing Evolution proportional to the increase in data size & computation complexity
6. Local computing Everything happens on a single machine No overhead (network, sync, etc.) Limited by how much you can add in one box
7. Parallel computing Everything happens on a single machine Enter overhead: multiple computation units fighting for memory Limited by how much $$ you have and by physical limitations (do you really need a Cray?)
8. Grid computing Moved computation units away from the data More overhead: all data is stored on a SAN and must move over the network to the computation units Limited by how much $$ you have to grow the SAN and by how much data you must process
9. Distributed computing Moved computation units with the data, but away from each other Overhead galore: network, synchronization, different types of machines, development time Limited by how much $$ you have to add machines
10. Why distributed computing? But it's webscale! Really… Large data sets that need to be crunched offline Web indexing SVM over tons of data Predictions based on huge histories (e.g. credit-card fraud patterns) MMORPGs Distributed databases …
12. 6 GB of logs, one month, 700k AMP users subscribing to shows in 114 genres
13. Processed with Mahout, on Hadoop, using canopy clustering
17. Designing a distributed computing system The network is reliable Latency is zero Bandwidth is infinite The network is secure Topology doesn't change There is one administrator Transport cost is zero The network is homogeneous
18. Designing a distributed computing system The network is reliable Latency is zero Bandwidth is infinite The network is secure Topology doesn't change There is one administrator Transport cost is zero The network is homogeneous FALLACIES
19. Where does Hadoop fit in? Google's implementation is secret sauce, so no dice in using it But others needed it (Nutch), so they copied it (ish) Hadoop: an open-source implementation of MapReduce / GFS
20. Hadoop components Hadoop Distributed File System (HDFS) Distributes and stores data across a cluster (brief intro only) Hadoop MapReduce (MR) Provides a parallel programming model Moves computation to where the data is Handles scheduling, fault tolerance, status reporting, and monitoring
21. HDFS Mitigates failure through replication The placement algorithm keeps track of machine location: one copy on another machine in the same rack, one in another rack, one placed randomly; never two copies on the same machine, even if it has multiple drives Tries to preserve data locality: computation running on the data uses the locations of the replicas
22. HDFS Architecture NameNode (master): stores FS metadata (namespace, block locations), handles metadata ops and directs replication DataNodes: store the data blocks as Linux files Clients: ask the NameNode for metadata, then read and write blocks directly on the DataNodes
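That client flow (metadata from the NameNode, block traffic straight to the DataNodes) is hidden behind Hadoop's FileSystem API. A minimal Java sketch, assuming a configured Hadoop client; the /tmp/example.txt path is hypothetical and not from the original deck:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // talks to the NameNode for metadata ops

    Path file = new Path("/tmp/example.txt");  // hypothetical path
    FSDataOutputStream out = fs.create(file);  // blocks end up on DataNodes, replicated
    out.writeBytes("hello hdfs\n");
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());         // reads come straight from the DataNodes
    in.close();
  }
}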
23. MapReduce How do you scale large data processing applications? Divide the data and process it on many nodes Each such application has to handle: communication between nodes, division and scheduling of work, fault tolerance, monitoring and reporting MapReduce handles and hides all these issues Provides a clean abstraction for the programmer
25. Input data is stored in HDFS, spread across nodes and replicated
32. MapReduce Programming Model Mapper Records (lines, database rows, etc.) are input as key/value pairs The mapper outputs one or more intermediate key/value pairs for each input: map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) Reducer After the map phase, all the intermediate values for a given output key are combined together into a list The reducer combines those intermediate values into one or more final key/value pairs: reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) Input and output key/value types can be different
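To make those signatures concrete, here is a minimal word-count sketch written against the classic org.apache.hadoop.mapred API shown above; the class and field names are illustrative, not from the deck:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // K1 = byte offset of the line, V1 = the line; K2 = word, V2 = count of 1
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        output.collect(new Text(tok.nextToken()), ONE);  // one intermediate pair per word
      }
    }
  }

  // All counts for a given word arrive together; sum them into the final pair.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}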
35. MapReduce Advantages Locality The job tracker divides tasks based on the location of the data: it tries to schedule map tasks on the same machine that holds the physical data Parallelism Map tasks run in parallel, working on different input data splits Reduce tasks run in parallel, working on different intermediate keys Reduce tasks wait until all map tasks are finished Fault tolerance The job tracker maintains a heartbeat with the task trackers Failures are handled by re-execution If a task tracker node fails, all tasks scheduled on it (completed or incomplete) are re-executed on another node
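The job tracker only ever sees a configured job. A hedged sketch of how the word-count classes above would be wired up and submitted with the old JobConf/JobClient API; the input and output paths are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // data already in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The job tracker takes it from here: splits, locality-aware scheduling,
    // re-execution of failed tasks, progress reporting.
    JobClient.runJob(conf);
  }
}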
36. HBase Distributed database on top of HDFS MapReduce enabled Fault-tolerant and scalable: relies on the core Hadoop values
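For a taste of the client side, a minimal sketch of the HBase Java API (0.90-era client) writing and reading one cell; the "users" table and "info" column family are hypothetical, not from the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");        // hypothetical table

    Put put = new Put(Bytes.toBytes("user42"));      // row key
    put.add(Bytes.toBytes("info"), Bytes.toBytes("genre"), Bytes.toBytes("jazz"));
    table.put(put);                                  // persisted on HDFS underneath

    Result row = table.get(new Get(Bytes.toBytes("user42")));
    System.out.println(Bytes.toString(
        row.getValue(Bytes.toBytes("info"), Bytes.toBytes("genre"))));
    table.close();
  }
}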
37. Mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License http://mahout.apache.org Why Mahout? Many open-source ML libraries either: Lack community Lack documentation and examples Lack scalability Lack the Apache License Or are research-oriented
39. Machine learning! "Machine Learning is programming computers to optimize a performance criterion using example data or past experience" (Introduction to Machine Learning, E. Alpaydin) A subset of Artificial Intelligence Lots of related fields: information retrieval, statistics, biology, linear algebra, many more
40. ML Use-cases Recommend friends/dates/products Classify content into predefined groups Find similar content based on object properties Find associations/patterns in actions/behaviors Identify key topics in large collections of text Detect anomalies in machine output Rank search results Others?
41. Getting Started with ML Get your data Decide on your features per your algorithm Prep the data Different approaches for different algorithms Run your algorithm(s) Lather, rinse, repeat Validate your results Smell test, A/B testing, more formal methods
43. Focus: Scalable Goal: be as fast and efficient as possible given the intrinsic design of the algorithm Some algorithms won't scale to massive machine clusters Others fit logically on a MapReduce framework like Apache Hadoop Still others will need alternative distributed programming models Be pragmatic Most Mahout implementations are MapReduce enabled
45. Recommendations Extensive framework for collaborative filtering Recommenders: user-based, item-based Online and offline support; offline can utilize Hadoop Many different similarity measures: cosine, LLR, Tanimoto, Pearson, others
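A minimal sketch of the in-memory, user-based path through Mahout's Taste API (the offline, Hadoop-based recommenders are separate); the subscriptions.csv name is a placeholder for a userID,itemID,preference CSV:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("subscriptions.csv"));  // placeholder file
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model); // one of several measures
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}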
46. Clustering Document level Group documents based on a notion of similarity K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift Distance measures Manhattan, Euclidean, others Topic modeling Cluster words across documents to identify topics Latent Dirichlet Allocation
47. Categorization Place new items into predefined categories: sports, politics, entertainment Mahout has several implementations Naïve Bayes Complementary Naïve Bayes Decision Forests Logistic Regression (almost done)
48. Freq. Pattern Mining Identify frequently co-occurring items Useful for: Query recommendations Apple -> iPhone, orange, OS X Related product placement "Beer and diapers" Spam detection
49. Evolutionary MapReduce-ready fitness functions for genetic programming Integration with Watchmaker http://watchmaker.uncommons.org/index.php Problems solved: traveling salesman, class discovery, many others
50. Singular Value Decomposition Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts Mahout has a fully distributed Lanczos implementation https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
51. Resources http://mahout.apache.org http://cwiki.apache.org/MAHOUT {user|dev}@mahout.apache.org http://svn.apache.org/repos/asf/mahout/trunk Hadoop: http://hadoop.apache.org/ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html http://code.google.com/edu/parallel/index.html http://www.youtube.com/watch?v=yjPBkvYh-ss http://www.youtube.com/watch?v=-vD6PUdf3Js S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. http://labs.google.com/papers/gfs.html