Distributed computing
Nov. 29th, 2010
Agenda
- Who am I?
- What am I talking about?
- Just a bit of history … repeating
- Applications of distributed computing
- Enter Google
- Designing a distributed computing system: fallacies
- Map-Reduce going public: Hadoop
- HBase
- Mahout
- Q&A
Who am I?
- Computer Scientist with Adobe Systems Inc. for 5 years
- Worked on desktop apps
- Worked on scalable services
- Experimented with Hadoop; it turned into a product eventually
- Now doing research
- @ivascucristian on Twitter / http://facebook.com/ivascucristian
- Contact: civascu@adobe.com
Distributed computing?
- Run some code over lots of machines, over the network
- Without shared state
- Run over a ton of data
  - Shines only when the data to process >> network capacity
- Run it in reasonable time
  - No, 1 week is not OK
- It's not new, contrary to popular belief
History time
- Local computing
- Parallel computing
- Grid computing
- Distributed computing
- The evolution is proportional to the increase in data size & computation complexity
Local computing
- Everything happens on a single machine
- No overhead (network, sync, etc.)
- Limited by how much you can add in a box
Parallel computing
- Everything still happens on a single machine
- Enter overhead: multiple computation units fighting for memory
- Limited by how much $$ you have and by physical limitations (do you really need a CRAY?)
Grid computing
- Moves the computation units away from the data
- More overhead: all data is stored on a SAN and must be moved over the network to the computation units
- Limited by how much $$ you have to grow the SAN and by how much data you must process
Distributed computing
- Moves the computation units to sit with the data, but away from each other
- Overhead galore: network, synchronization, different types of machines, development time
- Limited by how much $$ you have to add machines
Why distributed computing?
- But it's webscale! Really…
- Large data sets that need to be crunched, offline:
  - Web indexing
  - SVMs over tons of data
  - Predictions based on huge histories (e.g. credit-card fraud patterns)
  - MMORPGs
  - Distributed databases
  - …
Adobe Media Player
- Goal: cluster users with similar interests
- 6 GB of logs from one month: 700k AMP users subscribing to shows in 114 genres
- Processed in Mahout, over Hadoop, using canopy clustering
- 7 testing servers
- 5 hours of data crunching
- Result: 27 preference clusters

Game Constellations
- Processing Shockwave logs

Why so popular all of a sudden?
- Telecom did it for years, in the shadows
- Then came Google: its papers on Map-Reduce (2004), GFS and Chubby, plus Google's success = BOOM!
- They proved it works and everyone felt its use
- It was not new; just well thought out
Designing a distributed computing system
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous

… all eight of which are the classic FALLACIES of distributed computing: never design as if any of them were true.
Where does Hadoop fit in?
- Google's implementation is secret sauce, so no dice in using it
- But others needed it (Nutch), so they copied it (ish)
- Hadoop: an open-source implementation of Map-Reduce / GFS
Hadoop components
- Hadoop Distributed File System (HDFS)
  - Distributes and stores data across a cluster (brief intro only)
- Hadoop Map Reduce (MR)
  - Provides a parallel programming model
  - Moves computation to where the data is
  - Handles scheduling and fault tolerance
  - Status reporting and monitoring
HDFS
- Mitigates failure through replication
  - The placement algorithm is rack-aware: one copy on another machine in the same rack, one in another rack, one at random; never 2 copies on the same machine, even if it has multiple drives (see the sketch below)
- Tries to achieve data locality
  - Computation running over the data uses the locations of the replicas
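A toy sketch of that placement rule, with made-up Node and pick helpers rather than the real HDFS BlockPlacementPolicy:

```java
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical types for illustration only; not the actual HDFS code.
class Node {
    final String host, rack;
    Node(String host, String rack) { this.host = host; this.rack = rack; }
}

class ReplicaPlacement {
    private final Random rnd = new Random();

    // One copy on another machine in the same rack, one in another rack,
    // one at random; never two copies on the same machine.
    List<Node> choose(Node writer, List<Node> cluster) {
        Node sameRack  = pick(cluster, n -> n.rack.equals(writer.rack) && !n.host.equals(writer.host));
        Node otherRack = pick(cluster, n -> !n.rack.equals(writer.rack));
        Node anywhere  = pick(cluster, n -> !n.host.equals(writer.host)
                                         && !n.host.equals(sameRack.host)
                                         && !n.host.equals(otherRack.host));
        return List.of(sameRack, otherRack, anywhere);
    }

    private Node pick(List<Node> cluster, Predicate<Node> ok) {
        List<Node> candidates = cluster.stream().filter(ok).toList();
        return candidates.get(rnd.nextInt(candidates.size())); // assumes a candidate exists
    }
}
```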
HDFS Architecture
[Diagram: the Namenode (master) stores the FS metadata: the namespace and block locations. It serves metadata ops from the Client and directs replication. Datanodes store the data blocks as Linux files; the Client reads and writes block data directly to the Datanodes.]
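A minimal client-side sketch against the standard Hadoop FileSystem API; the namenode address and paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Write: the client streams block data to the datanodes;
        // the namenode only records metadata (namespace, block locations).
        try (FSDataOutputStream out = fs.create(new Path("/demo/hello.txt"))) {
            out.writeUTF("hello, HDFS");
        }

        // Read: the namenode tells the client which datanodes hold each block,
        // then the client reads from them directly.
        try (FSDataInputStream in = fs.open(new Path("/demo/hello.txt"))) {
            System.out.println(in.readUTF());
        }
    }
}
```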
MapReduce
- How do you scale large data-processing applications? Divide the data and process it on many nodes
- Each such application would have to handle:
  - Communication between nodes
  - Division and scheduling of work
  - Fault tolerance
  - Monitoring and reporting
- Map Reduce handles and hides all these issues
- It provides a clean abstraction for the programmer
Map-Reduce Architecture
[Diagram: the programmer submits a job (mapper, reducer, input) to the Jobtracker, which assigns tasks to several Tasktrackers; data transfer happens between the nodes.]
- Each node is part of an HDFS cluster
- Input data is stored in HDFS, spread across the nodes and replicated
- The programmer submits a job (mapper, reducer, input) to the job tracker
- Job tracker (master): splits the input data, then schedules and monitors the map and reduce tasks
- Task trackers (slaves): execute the map and reduce tasks

MapReduce Programming Model
- Inspired by functional language primitives (sketched with Java streams below)
- map f list: applies a given function f to each element of list and returns a new list
  map square [1 2 3 4 5] = [1 4 9 16 25]
- reduce g list: combines the elements of list using function g to generate a new value
  reduce sum [1 2 3 4 5] = 15
- Map and reduce do not modify their input data; they always create new data
- A Hadoop Map Reduce job consists of a mapper and a reducer
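The two primitives, mirrored with Java streams (squaring and summing, matching the examples above):

```java
import java.util.stream.IntStream;

public class MapReducePrimitives {
    public static void main(String[] args) {
        // map square [1 2 3 4 5] -> [1 4 9 16 25]
        int[] squares = IntStream.of(1, 2, 3, 4, 5)
                                 .map(x -> x * x)
                                 .toArray();

        // reduce sum [1 2 3 4 5] -> 15
        int sum = IntStream.of(1, 2, 3, 4, 5)
                           .reduce(0, Integer::sum);

        System.out.println(java.util.Arrays.toString(squares)); // [1, 4, 9, 16, 25]
        System.out.println(sum);                                // 15
    }
}
```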
Map Reduce Programming ModelMapperRecords (lines, database rows etc) are input as key/value pairs Mapper outputs one or more intermediate key/value pairs for each inputmap(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)ReducerAfter the map phase, all the intermediate values for a given output key are combined together into a listreducer combines those intermediate values into one or more final key/value pairsreduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)Input and output key/value types can be different23
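Putting both signatures together, a minimal word count against the classic org.apache.hadoop.mapred API shown above; the input/output paths are placeholders:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // K1=LongWritable (byte offset), V1=Text (line) -> K2=Text (word), V2=IntWritable (1)
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                output.collect(word, ONE); // one intermediate pair per word
            }
        }
    }

    // All counts for one word arrive together as a list; sum them into the final pair.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        // Placeholder HDFS paths.
        FileInputFormat.setInputPaths(conf, new Path("/in"));
        FileOutputFormat.setOutputPath(conf, new Path("/out"));
        JobClient.runJob(conf); // the job tracker schedules map/reduce tasks near the data
    }
}
```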
Parallel execution
[Diagram: map tasks running in parallel over the input splits, feeding reduce tasks that run in parallel.]
Map Reduce Advantages
- Locality
  - The job tracker divides tasks based on the location of the data: it tries to schedule map tasks on the same machine that holds the physical data
- Parallelism
  - Map tasks run in parallel, working on different input data splits
  - Reduce tasks run in parallel, working on different intermediate keys
  - Reduce tasks wait until all map tasks are finished
- Fault tolerance
  - The job tracker maintains a heartbeat with the task trackers
  - Failures are handled by re-execution: if a task tracker node fails, all tasks scheduled on it (completed or incomplete) are re-executed on another node (see the toy sketch below)
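A toy sketch of that heartbeat-and-re-execute idea, with entirely made-up types (nothing here is Hadoop's actual code):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration of heartbeat-based re-execution.
class JobTrackerSketch {
    static final long TIMEOUT_MS = 10_000;

    final Map<String, Long> lastHeartbeat = new HashMap<>();    // tracker -> last seen
    final Map<String, List<String>> assigned = new HashMap<>(); // tracker -> task ids
    final Queue<String> pending = new ArrayDeque<>();           // tasks to (re)schedule

    void heartbeat(String tracker, long now) {
        lastHeartbeat.put(tracker, now);
    }

    void checkFailures(long now) {
        lastHeartbeat.forEach((tracker, seen) -> {
            if (now - seen > TIMEOUT_MS) {
                // Node presumed dead: re-execute everything it ran, completed
                // or not, since its local map output is gone with it.
                pending.addAll(assigned.getOrDefault(tracker, List.of()));
                assigned.remove(tracker);
            }
        });
    }
}
```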
HBase
- Distributed database on top of HDFS
- Map-Reduce enabled
- Fault-tolerant and scalable: relies on the core Hadoop values
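For flavor, a minimal sketch with the classic HTable client API; the table name 'users' and column family 'info' are assumptions, and the table must already exist:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        // Assumes a running cluster reachable via the default configuration.
        HTable table = new HTable(HBaseConfiguration.create(), "users");

        // Write one cell: row "row-42", column info:name.
        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
        table.put(put);

        // Read it back.
        Result result = table.get(new Get(Bytes.toBytes("row-42")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

        table.close();
    }
}
```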
Mahout
- An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
- http://mahout.apache.org
- Why Mahout? Many open-source ML libraries either:
  - Lack community
  - Lack documentation and examples
  - Lack scalability
  - Lack the Apache License
  - Or are research-oriented
Machine learning?
[Screenshots: everyday examples of machine learning on Amazon.com and Google News]
Machine learning!
- “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Introduction to Machine Learning, E. Alpaydin)
- A subset of Artificial Intelligence
- Lots of related fields:
  - Information retrieval
  - Statistics
  - Biology
  - Linear algebra
  - Many more
ML Use-cases
- Recommend friends/dates/products
- Classify content into predefined groups
- Find similar content based on object properties
- Find associations/patterns in actions/behaviors
- Identify key topics in large collections of text
- Detect anomalies in machine output
- Rank search results
- Others?
Getting Started with ML
- Get your data
- Decide on your features, per your algorithm
- Prep the data
  - Different approaches for different algorithms
- Run your algorithm(s)
  - Lather, rinse, repeat
- Validate your results
  - Smell test, A/B testing, more formal methods
Focus: Machine Learning
[Diagram: the Mahout stack. Applications and Examples sit on top of the algorithm families (Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic); beneath them, Math (Vectors/Matrices/SVD) and Utilities (Lucene/Vectorizer, Collections of primitives); everything runs on Apache Hadoop.]
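To give the math layer some texture, a small sketch with Mahout's sparse vectors; the feature indices and values are made up:

```java
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class VectorDemo {
    public static void main(String[] args) {
        // A sparse vector with room for 10,000 features; only set entries use memory.
        Vector doc = new RandomAccessSparseVector(10_000);
        doc.set(42, 3.0);   // e.g. term 42 occurred 3 times (hypothetical indices)
        doc.set(107, 1.0);

        Vector other = new RandomAccessSparseVector(10_000);
        other.set(42, 1.0);

        System.out.println(doc.dot(other)); // 3.0: they only share feature 42
        System.out.println(doc.norm(2));    // Euclidean length of doc
    }
}
```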
Focus: Scalable
- Goal: be as fast and efficient as possible, given the intrinsic design of the algorithm
  - Some algorithms won't scale to massive machine clusters
  - Others fit logically on a Map Reduce framework like Apache Hadoop
  - Still others will need alternative distributed programming models
  - Be pragmatic
- Most Mahout implementations are Map Reduce enabled
Implemented Algorithms
- Classification
- Clustering
- Pattern mining
- Regression
- Dimension reduction
- Evolutionary algorithms
- Collaborative filtering (a small recommender example follows)
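As one concrete taste of the collaborative-filtering side, a minimal user-based recommender built on Mahout's Taste API; the ratings file name is a placeholder (CSV lines of userID,itemID,preference):

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder data file: one "userID,itemID,preference" line per rating.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 items for user 1, scored from similar users' preferences.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```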
