2. Agenda Who am I? What am I talking about? Just a bit of history … repeating Applications of distributed computing Enter Google Designing a distributed computing system: fallacies MapReduce going public: Hadoop HBase Mahout Q&A
3. Who am I? Computer Scientist with Adobe Systems Inc. for 5 years Worked on desktop apps Worked on scalable services Experimented with Hadoop; it eventually turned into a product Now doing research @ivascucristian on Twitter / http://facebook.com/ivascucristian Contact: civascu@adobe.com
4. Distributed computing? Run some code over lots of machines, over the network, without shared state Run it over a ton of data; shines only when the data to process >> network capacity Run it in reasonable time (no, 1 week is not OK) It's not new, contrary to popular belief.
5. History time Local computing Parallel computing Grid computing Distributed computing Evolution proportional to the increase in data size & computation complexity
6. Local computing Everything happens on a single machine No overhead (network, sync, etc.) Limited by how much you can add in one box
7. Parallel computing Everything happens on a single machine Enter overhead: multiple computation units fighting for memory Limited by how much $$ you have and by physical limitations (do you really need a Cray?)
8. Grid computing Moved computation units away from the data More overhead: all data is stored on a SAN and must move over the network to the computation units Limited by how much $$ you have to grow the SAN and by how much data you must process
9. Distributed computing Moved computation units with the data, but away from each other Overhead galore: network, synchronization, different types of machines, development time Limited by how much $$ you have to add machines
10. Why distributed computing? But it's webscale! Really… Large data sets that need to be crunched offline Web indexing SVM over tons of data Predictions based on huge histories (e.g. credit-card fraud patterns) MMORPGs Distributed databases …
12. 6 GB of logs, one month, 700k AMP users subscribing to shows in 114 genres
13. Processed with Mahout, on Hadoop, using canopy clustering
17. Designing a distributed computing system The network is reliable Latency is zero Bandwidth is infinite The network is secure Topology doesn't change There is one administrator Transport cost is zero The network is homogeneous
18. Designing a distributed computing system The network is reliable Latency is zero Bandwidth is infinite The network is secure Topology doesn't change There is one administrator Transport cost is zero The network is homogeneous FALLACIES
19. Where does Hadoop fit in? Google's implementation is secret sauce, so no dice in using it But others needed it (Nutch), so they copied it (ish) Hadoop: an open-source implementation of MapReduce / GFS
20. Hadoop components Hadoop Distributed File System (HDFS) Distributes and stores data across a cluster (brief intro only) Hadoop MapReduce (MR) Provides a parallel programming model Moves computation to where the data is Handles scheduling, fault tolerance, status reporting, and monitoring
21. HDFS Mitigates failure through replication The placement algorithm keeps track of machine location: one copy on another machine in the same rack, one in another rack, one placed randomly; never two copies on the same machine, even if it has multiple drives Tries to preserve data locality: computation running on the data uses the locations of the replicas
22. HDFS Architecture NameNode (master): stores FS metadata (namespace, block locations), handles metadata ops and directs replication DataNodes: store the data blocks as Linux files Clients: ask the NameNode for metadata, then read and write blocks directly on the DataNodes
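That client flow (metadata from the NameNode, block traffic straight to the DataNodes) is hidden behind Hadoop's FileSystem API. A minimal Java sketch, assuming a configured Hadoop client; the /tmp/example.txt path is hypothetical and not from the original deck:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // talks to the NameNode for metadata ops

    Path file = new Path("/tmp/example.txt");  // hypothetical path
    FSDataOutputStream out = fs.create(file);  // blocks end up on DataNodes, replicated
    out.writeBytes("hello hdfs\n");
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    System.out.println(in.readLine());         // reads come straight from the DataNodes
    in.close();
  }
}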
23. MapReduce How do you scale large data processing applications? Divide the data and process it on many nodes Each such application has to handle: communication between nodes, division and scheduling of work, fault tolerance, monitoring and reporting MapReduce handles and hides all these issues Provides a clean abstraction for the programmer
25. Input data is stored in HDFS, spread across nodes and replicated
32. MapReduce Programming Model Mapper Records (lines, database rows, etc.) are input as key/value pairs The mapper outputs one or more intermediate key/value pairs for each input: map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) Reducer After the map phase, all the intermediate values for a given output key are combined together into a list The reducer combines those intermediate values into one or more final key/value pairs: reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) Input and output key/value types can be different
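To make those signatures concrete, here is a minimal word-count sketch written against the classic org.apache.hadoop.mapred API shown above; the class and field names are illustrative, not from the deck:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
  // K1 = byte offset of the line, V1 = the line; K2 = word, V2 = count of 1
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        output.collect(new Text(tok.nextToken()), ONE);  // one intermediate pair per word
      }
    }
  }

  // All counts for a given word arrive together; sum them into the final pair.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}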
35. MapReduce Advantages Locality The job tracker divides tasks based on the location of the data: it tries to schedule map tasks on the same machine that holds the physical data Parallelism Map tasks run in parallel, working on different input data splits Reduce tasks run in parallel, working on different intermediate keys Reduce tasks wait until all map tasks are finished Fault tolerance The job tracker maintains a heartbeat with the task trackers Failures are handled by re-execution If a task tracker node fails, all tasks scheduled on it (completed or incomplete) are re-executed on another node
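The job tracker only ever sees a configured job. A hedged sketch of how the word-count classes above would be wired up and submitted with the old JobConf/JobClient API; the input and output paths are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // data already in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The job tracker takes it from here: splits, locality-aware scheduling,
    // re-execution of failed tasks, progress reporting.
    JobClient.runJob(conf);
  }
}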
36. HBase Distributed database on top of HDFS MapReduce enabled Fault-tolerant and scalable: relies on the core Hadoop values
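For a taste of the client side, a minimal sketch of the HBase Java API (0.90-era client) writing and reading one cell; the "users" table and "info" column family are hypothetical, not from the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");        // hypothetical table

    Put put = new Put(Bytes.toBytes("user42"));      // row key
    put.add(Bytes.toBytes("info"), Bytes.toBytes("genre"), Bytes.toBytes("jazz"));
    table.put(put);                                  // persisted on HDFS underneath

    Result row = table.get(new Get(Bytes.toBytes("user42")));
    System.out.println(Bytes.toString(
        row.getValue(Bytes.toBytes("info"), Bytes.toBytes("genre"))));
    table.close();
  }
}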
37. Mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License http://mahout.apache.org Why Mahout? Many open-source ML libraries either: Lack community Lack documentation and examples Lack scalability Lack the Apache License Or are research-oriented
39. Machine learning! "Machine Learning is programming computers to optimize a performance criterion using example data or past experience" (Introduction to Machine Learning, E. Alpaydin) A subset of Artificial Intelligence Lots of related fields: information retrieval, statistics, biology, linear algebra, many more
40. ML Use-cases Recommend friends/dates/products Classify content into predefined groups Find similar content based on object properties Find associations/patterns in actions/behaviors Identify key topics in large collections of text Detect anomalies in machine output Rank search results Others?
41. Getting Started with ML Get your data Decide on your features per your algorithm Prep the data Different approaches for different algorithms Run your algorithm(s) Lather, rinse, repeat Validate your results Smell test, A/B testing, more formal methods
43. Focus: Scalable Goal: be as fast and efficient as possible given the intrinsic design of the algorithm Some algorithms won't scale to massive machine clusters Others fit logically on a MapReduce framework like Apache Hadoop Still others will need alternative distributed programming models Be pragmatic Most Mahout implementations are MapReduce enabled
45. Recommendations Extensive framework for collaborative filtering Recommenders: user-based, item-based Online and offline support; offline can utilize Hadoop Many different similarity measures: cosine, LLR, Tanimoto, Pearson, others
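A minimal sketch of the in-memory, user-based path through Mahout's Taste API (the offline, Hadoop-based recommenders are separate); the subscriptions.csv name is a placeholder for a userID,itemID,preference CSV:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("subscriptions.csv"));  // placeholder file
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model); // one of several measures
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}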
46. Clustering Document level Group documents based on a notion of similarity K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift Distance measures Manhattan, Euclidean, others Topic modeling Cluster words across documents to identify topics Latent Dirichlet Allocation
47. Categorization Place new items into predefined categories: sports, politics, entertainment Mahout has several implementations Naïve Bayes Complementary Naïve Bayes Decision Forests Logistic Regression (almost done)
48. Freq. Pattern Mining Identify frequently co-occurring items Useful for: Query recommendations Apple -> iPhone, orange, OS X Related product placement "Beer and diapers" Spam detection
49. Evolutionary MapReduce-ready fitness functions for genetic programming Integration with Watchmaker http://watchmaker.uncommons.org/index.php Problems solved: traveling salesman, class discovery, many others
50. Singular Value Decomposition Reduces a big matrix into a much smaller matrix by amplifying the important parts while removing/reducing the less important parts Mahout has a fully distributed Lanczos implementation https://cwiki.apache.org/confluence/display/MAHOUT/Dimensional+Reduction
51. Resources http://mahout.apache.org http://cwiki.apache.org/MAHOUT {user|dev}@mahout.apache.org http://svn.apache.org/repos/asf/mahout/trunk Hadoop: http://hadoop.apache.org/ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. http://labs.google.com/papers/mapreduce.html http://code.google.com/edu/parallel/index.html http://www.youtube.com/watch?v=yjPBkvYh-ss http://www.youtube.com/watch?v=-vD6PUdf3Js S. Ghemawat, H. Gobioff, and S. Leung. The Google File System. http://labs.google.com/papers/gfs.html