Big Data, MapReduce and beyond

  1. Big Data, MapReduce and beyond Iván de Prado Alonso // @ivanprado Pere Ferrera Bertran // @ferrerabertran @datasalt
  2. Outline Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Big Data: what and why 2. MapReduce & Hadoop 3. MapReduce Design Patterns 4. Real-life MapReduce (4.1 Data analytics, 4.2 Crawling, 4.3 Full-text indexing, 4.4 Reputation systems, 4.5 Data Mining) 5. Tuple MapReduce & Pangool
  3. Big Data: What and why
  4. In the past... ● Data and computation fit on one monolithic machine ● Monolithic databases: RDBMS ● Scalability: – Vertical: buy better hardware ● Distributed systems – Not very common – Logic-centric: data moves to where the logic is ● Distributed storage: SAN
  5. Distributed systems are hard ● Building distributed systems is hard – If you can scale vertically at a reasonable cost, why deal with the complexity of distributed systems? ● But circumstances are changing: – Big Data ● Big Data refers to massive amounts of data that are difficult to analyze and handle using common database management tools
  6. BIG DATA “MAC”
  7. Big Data ● Data is the new bottleneck – Web data ● Web pages ● Interaction logs – Social network data – Mobile devices – Data generated by sensors ● Old systems and techniques are not appropriate ● A new approach is needed
  8. Big Data project parts: Acquiring, Processing, Serving
  9. Acquiring ● Gathering, receiving and storing data from sources ● Many kinds of sources – Internet – Sensors – User behavior – Mobile devices – Health care data – Banking data – Social networks – …
  10. Processing ● Data is present in the system (acquired) ● This step is responsible for extracting value from data – Eliminate duplicates – Infer relations – Calculate statistics – Correlate information – Ensure quality – Generate recommendations – …
  11. Serving ● In most cases, some interface has to be provided to access the processed information ● Possibilities – Big Data / not Big Data – Real-time access to results / non-real-time access ● Some examples: – Search engine → inverted index – Banking data → relational database – Social network → NoSQL database
  12. Big Data system types ● Offline – Latency is not a problem ● Online – Response immediacy is important ● Mixed – Online behavior, but internally a mixture of two systems ● One online ● One offline ● Technologies listed under Offline / Online: MapReduce / NoSQL; Hadoop / Search engines; Distributed RDBMS
  13. Big Data system types (II) [Diagram: arrangement of the Acquiring (A), Processing (P) and Serving (S) parts in Offline, Online and Mixed systems]
  14. MapReduce & Hadoop
  15. “Swiss army knife of the 21st century” Media Guardian Innovation Awards http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  16. History ● 2004-2006 – GFS and MapReduce papers published by Google – Doug Cutting implements an open source version for Nutch ● 2006-2008 – Hadoop project becomes independent from Nutch – Web scale reached in 2008 ● 2008-now – Hadoop becomes popular and is commercially exploited Source: Hadoop: a brief history. Doug Cutting
  17. Hadoop “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” From Apache Hadoop page
  18. Main ideas ● Distributed – Distributed storage – Distributed computation platform ● Built to be fault tolerant ● Shared-nothing architecture ● Programmer isolation from distributed system difficulties – By providing simple programming primitives
  19. Hadoop Distributed File System (HDFS) ● Distributed – Aggregates the individual storage of each node ● Files formed by blocks – Typically 64 or 128 MB (configurable) – Stored in the OS filesystem: XFS, ext3, etc. ● Fault tolerant – Blocks replicated more than once
  20. How files are stored [Diagram] ● The NameNode maps each block of Data.txt to the DataNodes that hold its replicas – Block 1 → DN1, DN2 – Block 2 → DN1, DN4 – Block 3 → DN2, DN3 – Block 4 → DN4, DN3 ● DataNode 1 (DN1) holds blocks 1 and 2 ● DataNode 2 (DN2) holds blocks 1 and 3 ● DataNode 3 (DN3) holds blocks 3 and 4 ● DataNode 4 (DN4) holds blocks 2 and 4
  21. MapReduce ● MapReduce is the abstraction behind Hadoop ● The unit of execution is the Job ● A Job has – An input – An output – A map function – A reduce function ● Input and output are sequences of key/value pairs ● The map and reduce functions are provided by the developer – The execution is distributed and parallelized by Hadoop
  22. Job phases ● Two different phases: mapping and reducing ● Mapping phase – Map function is applied to input data ● Intermediate data is generated ● Reducing phase – Reduce function is applied to intermediate data ● Final output is generated
  23. MapReduce ● Two functions (Map & Reduce) – Map(u, v) → [w, x]* – Reduce(w, x*) → [y, z]* ● Example: word count – Map(document, null) → [word, 1]* – Reduce(word, 1*) → [word, total] ● MapReduce & SQL – SELECT word, count(*) GROUP BY word ● Distributed execution in a cluster – Horizontal scalability
  24. Word Count ● Input: two lines, “This is a line” and “Also this” ● Map – map(“This is a line”) = (this, 1), (is, 1), (a, 1), (line, 1) – map(“Also this”) = (also, 1), (this, 1) ● Reduce – reduce(a, {1}) = (a, 1) – reduce(also, {1}) = (also, 1) – reduce(is, {1}) = (is, 1) – reduce(line, {1}) = (line, 1) – reduce(this, {1, 1}) = (this, 2) ● Result: (a, 1), (also, 1), (is, 1), (line, 1), (this, 2)
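The word count slide can be simulated in a few lines of plain Python. This is only an in-memory sketch, not Hadoop code: `run_job` is a hypothetical helper that maps every record, groups the emitted pairs by key and calls the reducer in key order, and the lowercasing of words is an assumption made so that “This” and “this” share a key.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one job in memory: map every record, group by key, reduce in key order."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output

def word_count_map(line, _unused):
    # Assumption: lowercase the words so that "This" and "this" are the same key.
    for word in line.lower().split():
        yield word, 1

def word_count_reduce(word, counts):
    yield word, sum(counts)

lines = [("This is a line", None), ("Also this", None)]
word_counts = run_job(lines, word_count_map, word_count_reduce)
print(word_counts)
# [('a', 1), ('also', 1), ('is', 1), ('line', 1), ('this', 2)]
```

In a real cluster the grouping step is the shuffle, and each reduce task only sees a subset of the keys.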
  25. Map examples
      ● Swap Mapper – Swaps key and value
          map(key, value):
              emit(value, key)
      ● Split Key Mapper – Splits the key into words and emits a pair for each word
          map(key, value):
              words = key.split(" ")
              for each word in words:
                  emit(word, value)
  26. Map examples (II)
      ● Filter Mapper – Filters out some records
          map(key, value):
              if key != "the":
                  emit(key, value)
      ● Key/value concatenation mapper – Concatenates the key and the value in the key
          map(key, value):
              emit(key + " " + value, null)
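The four mappers above translate almost directly into Python generators, where `yield` plays the role of `emit`. The function names are illustrative, not part of any Hadoop API.

```python
def swap_mapper(key, value):
    # Swap key and value.
    yield value, key

def split_key_mapper(key, value):
    # Emit one pair per word of the key.
    for word in key.split(" "):
        yield word, value

def filter_mapper(key, value):
    # Drop records whose key is "the"; pass everything else through.
    if key != "the":
        yield key, value

def concat_mapper(key, value):
    # Move the value into the key; None stands in for the null value.
    yield f"{key} {value}", None

print(list(split_key_mapper("big data", 7)))  # [('big', 7), ('data', 7)]
```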
  27. Reduce examples
      ● Count reducer – Counts the number of elements for each key
          reduce(key, values):
              count = 0
              for each value in values:
                  count++
              emit(key, count)
      ● Average reducer – Computes the average value for each key
          reduce(key, values):
              count = 0
              total = 0
              for each value in values:
                  count++
                  total += value
              emit(key, total / count)
  28. Reduce examples (II)
      ● Keep first reducer – Keeps the first key/value input pair
          reduce(key, values):
              emit(key, first(values))
      ● Value concatenation reducer – Concatenates the values in one string
          reduce(key, values):
              result = ""
              for each value in values:
                  result += " " + value
              emit(key, result)
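A Python rendering of the four reducers, again with `yield` standing in for `emit`. The names are illustrative. One small difference from the pseudocode: the concatenation reducer uses `str.join`, so the result has no leading space.

```python
def count_reducer(key, values):
    yield key, sum(1 for _ in values)

def average_reducer(key, values):
    count = 0
    total = 0
    for value in values:
        count += 1
        total += value
    # A reducer is only called for keys that have at least one value.
    yield key, total / count

def keep_first_reducer(key, values):
    yield key, next(iter(values))

def concat_reducer(key, values):
    yield key, " ".join(str(value) for value in values)

print(list(average_reducer("k", [2, 4])))  # [('k', 3.0)]
```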
  29. Identity map and reduce
      ● The identity functions are those that keep the input unchanged
      – Map identity
          map(key, value):
              emit(key, value)
      – Reduce identity
          reduce(key, values):
              for each value in values:
                  emit(key, value)
  30. Putting it all together map(k, v) → [w, x]* reduce(w, [x]+) → [y, z]* ● Job flow: – The mapper generates key/value pairs – These pairs are grouped by key – Hadoop calls the reduce function for each group – The output of the reduce function is the final Job output ● Hadoop will distribute the work – Different nodes in the cluster will process the data in parallel
  31. Job Execution [Diagram: on each node (Node 1, Node 2), input splits (blocks) feed the map tasks, which produce intermediate data that the reduce tasks consume to write the output]
  32. Job Execution (II) ● Key/value pairs are sorted by key in the shuffle & sort phase – This is needed in order to group records by key when calling the reducer – It also means that calls to the reduce function are made in key order ● The reduce function for key “A” is always called before the reduce function for key “B” within the same reduce task ● Reducers start downloading data from the mappers as soon as possible – In order to reduce the shuffle & sort phase time – The number of reducers can be configured by the programmer
  33. Partial Sort Job ● A job formed by the identity map and the identity reducer – It just sorts data by key within each reducer ● Example – Input file: D B A B C D E A – Map 1 reads D B A B; Map 2 reads C D E A – Intermediate data: Map 1 sends D A to Reduce 1 and B B to Reduce 2; Map 2 sends D A to Reduce 1 and C E to Reduce 2 – Output files: Reduce 1 writes A A D D; Reduce 2 writes B B C E
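The partial sort can be reproduced in memory as follows. This is a sketch, not Hadoop code: `demo_partition` is a hypothetical partitioner chosen only so that keys A and D go to the first reducer and the remaining keys to the second, matching the slide. Hadoop's default partitioner hashes the key instead.

```python
from collections import defaultdict

def partial_sort(keys, num_reducers, partition):
    """Identity map + identity reduce: each reducer's output is sorted, the whole is not."""
    buckets = defaultdict(list)
    for key in keys:  # identity map: records pass through unchanged
        buckets[partition(key, num_reducers)].append(key)
    # Shuffle & sort: every reduce task receives its keys in order and re-emits them.
    return [sorted(buckets[r]) for r in range(num_reducers)]

def demo_partition(key, num_reducers):
    # Hypothetical partitioner that reproduces the slide (assumes 2 reducers).
    return 0 if key in ("A", "D") else 1

output_files = partial_sort(list("DBABCDEA"), 2, demo_partition)
print(output_files)  # [['A', 'A', 'D', 'D'], ['B', 'B', 'C', 'E']]
```

Each output file is sorted, but concatenating them does not give a globally sorted result, which is why this is only a partial sort.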
  34. Input Splits ● Each map task processes one input split – The map task starts processing at the first complete record, and finishes by processing the record crossed by the rightmost boundary [Diagram: a file of records divided into Input Splits 1 to 4, processed by Map Tasks 1 to 4]
  35. Combiner ● Intermediate data goes from the map tasks to the reduce tasks through the network – The network can be saturated ● Combiners can be used to reduce the amount of data sent to the reducers – When the operation is commutative and associative ● A combiner is a function similar to the reducer – But it is executed in the map task, just after all the mapping has been done ● Combiners must not have side effects – Because Hadoop can decide whether to execute them or not
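The effect of a combiner on word count can be sketched like this. It is an in-memory illustration with hypothetical function names: each call to `map_task` stands for one map task, and the combiner sums counts locally before anything is "sent" to the reducer.

```python
from collections import Counter

def map_task(lines, use_combiner):
    """Emit (word, 1) pairs for one input split, optionally combining them locally."""
    pairs = [(word, 1) for line in lines for word in line.split()]
    if use_combiner:
        local = Counter()
        for word, count in pairs:
            local[word] += count  # summing is commutative and associative
        pairs = sorted(local.items())
    return pairs

def reduce_all(map_outputs):
    totals = Counter()
    for pairs in map_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

splits = [["a b a", "a"], ["b a"]]
plain = [map_task(split, use_combiner=False) for split in splits]
combined = [map_task(split, use_combiner=True) for split in splits]
print(sum(len(p) for p in plain), sum(len(p) for p in combined))  # 6 4
```

Both variants produce the same totals; only the number of intermediate pairs crossing the network changes.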
  36. Design Patterns
  37. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation
  38. Filtering ● We process the input file in parallel with Hadoop and emit a smaller dataset at the end ● Input data → if (condition) { emit(); } → Output data
  40. Secondary sorting ● Receive reducer values in a specific order ● Moving averages: – Secondary sort by timestamp – Fill an in-memory window and compute the average. ● Top N items in a group: – Secondary sort by <X> – Emit the first N elements in a group ● Useful, yet quite difficult to implement in Hadoop. [Diagram labels: Key, Sort Comparator, Group Comparator, Partitioner]
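The "top N items in a group" case can be mimicked in memory to show the idea behind secondary sorting: sort by the full composite key, but group by the natural key only. This sketch does not use Hadoop's comparator or partitioner classes; the record layout `(group, score, item)` is an assumption made for the example.

```python
from itertools import groupby, islice

def top_n_per_group(records, n):
    """records are (group, score, item) tuples; returns the n best items per group."""
    # "Sort comparator": order by the composite key (group, descending score).
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    result = {}
    # "Group comparator": group by the natural key only, so values arrive pre-sorted.
    for group, rows in groupby(ordered, key=lambda r: r[0]):
        result[group] = [item for _, _, item in islice(rows, n)]
    return result

records = [("es", 5, "x"), ("es", 9, "y"), ("uk", 1, "z"), ("es", 7, "w")]
print(top_n_per_group(records, 2))  # {'es': ['y', 'w'], 'uk': ['z']}
```

Because the values reach the "reducer" already ordered, it can stop after N elements without buffering the whole group.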
  42. Distributed execution without Hadoop ● Distributed queue – A common queue is needed to coordinate and assign work ● Distributed workers – Consumers working on each node, getting work from the queue ● Problems: – Difficult to coordinate ● Failover ● Losing messages ● Load balancing – Queue must scale
  43. Distributed execution with Hadoop ● Map-only Jobs. ● Use Hadoop just for the sake of “parallelizing something”. ● Anything that doesn't involve a “group by” (no shuffle/reducer) ● Examples: – Text categorization – Filtering – Crawling – Updating a DB – Distributed grep ● [Diagram: Map 1, Map 2, …, Map n running in parallel] ● NLineInputFormat can be handy for that.
  44. Disadvantages ● Work is done in batches – And the distribution is probably not even ● Some resources are wasted ● There are some tricks to alleviate the problem – Task timeout + saving remaining work for the next execution
  46. Computing statistics (I) ● Count, Sum, Average, Std. Dev... ● Aggregated by something ● Recall SQL: select user, count(clicks) … group by user ● Flow: (user, click) records → Map: emit(user, click) → Reduce by user: count values → (user, count(click))
  47. Computing statistics (II) ● When computing sum(), avg(), etc., Combiners are often needed ● Imagine a user performed 3 million clicks – Then, a reducer will receive 3 million records – This reducer will be the bottleneck of the Job. Everyone needs to wait for it to count 3 million things. ● Solution: Perform partial counts in a Combiner ● The Combiner is executed after the Mapper and before shuffling.
  48. Computing statistics (III) ● Using a Combiner: Map: (user, click) → Combine: (user, count(click)) → Reduce: (user, sum(count(click))) ● For each group, the reducer aggregates N counts in the worst case! (N = #mappers)
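Counts and sums combine directly, but an average does not: averaging per-mapper averages gives the wrong answer. A common workaround, sketched here as an assumption rather than something the slides spell out, is to carry `(partial sum, partial count)` pairs through the combiner and divide only in the reducer. All names are illustrative.

```python
def to_partial(value):
    # Map side: turn each value into a (partial sum, partial count) pair.
    return (value, 1)

def merge(partials):
    # Usable as the combiner: adding sums and counts is commutative and associative.
    partials = list(partials)
    return (sum(s for s, _ in partials), sum(c for _, c in partials))

def average_reduce(user, partials):
    total, count = merge(partials)
    return user, total / count

# Two map tasks saw values for the same user.
task_1 = merge(to_partial(v) for v in (2, 4))  # combiner output of map task 1
task_2 = merge(to_partial(v) for v in (6,))    # combiner output of map task 2
user_average = average_reduce("u", [task_1, task_2])
print(task_1, task_2, user_average)  # (6, 2) (6, 1) ('u', 4.0)
```

Averaging the two per-task averages (3.0 and 6.0) would give 4.5 instead of the correct 4.0.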
  50. Distinct ● How to calculate count(distinct something) group by X? ● It is fairly easy with two MapReduce jobs: ● M/R 1 (eliminates duplicates): – emit ({X, something}, null) – so rows are grouped by ({X, something}) – In the reducer, just emit the first (ignore duplicates) ● M/R 2 (groups by X and counts): – For each input X, something → emit (X, 1) – group by (X) – The reducer counts incoming values
  51. Distinct: example [Diagram: M/R 1 groups the (X, something) pairs and drops duplicates, leaving (X1, s1), (X1, s2), (X2, s1), (X2, s3); M/R 2 groups by X and counts: X1 → 2, X2 → 2]
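The two-job distinct count can be simulated end to end. This is an in-memory sketch: `run_job` is a hypothetical helper (map, group by key, reduce in key order), and the input rows are made up so that the result matches the slide (X1 → 2, X2 → 2).

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one job in memory: map every record, group by key, reduce in key order."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output

# M/R 1: eliminate duplicates by grouping on the whole (X, something) pair.
def dedup_map(x, something):
    yield (x, something), None

def dedup_reduce(pair, _values):
    yield pair  # emitted once per distinct pair

# M/R 2: group by X and count.
def count_map(x, _something):
    yield x, 1

def count_reduce(x, ones):
    yield x, sum(ones)

rows = [("X1", "s1"), ("X1", "s2"), ("X1", "s1"),
        ("X2", "s1"), ("X2", "s3"), ("X2", "s1")]
unique_pairs = run_job(rows, dedup_map, dedup_reduce)
distinct_counts = run_job(unique_pairs, count_map, count_reduce)
print(distinct_counts)  # [('X1', 2), ('X2', 2)]
```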
  52. Distinct: Secondary sort ● We can calculate distinct in only one Job ● Using Secondary Sorting M/R: – emit ({X, something}, null) – group by (X), secondary sort by (something) – The reducer iterates, counts and emits “something, count” whenever “something” changes, resetting the counter at each change. ● A Combiner is needed to eliminate duplicates (otherwise the reducer would receive too many records). ● distinct count() is more parallelizable with 2 Jobs than with 1!
  54. Sorting ● We have seen how sorting is (partially) inherent in Hadoop. ● But if we want “pure” sorting: – Use one Reducer (not scalable) – Use an advanced partitioning strategy ● Yahoo! TeraSort (http://sortbenchmark.org/Yahoo2009.pdf) ● Use sampling to calculate data distribution ● Implement custom Partitioning according to distribution
  55. Sorting (II) ● Hash partitioning: consecutive keys are scattered across partitions (0 1 2 0 1 2 0 1 2 ...) ● Distribution-aware partitioning: each partition receives one contiguous key range (0 | 1 | 2)
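Distribution-aware partitioning can be sketched as: sample the keys, pick cut points from the sorted sample, and route each key to the range it falls in. This is an in-memory illustration with hypothetical function names, not the TeraSort implementation.

```python
import bisect
import random

def sample_split_points(keys, num_reducers, sample_size, seed=0):
    """Estimate num_reducers - 1 cut points from a random sample of the keys."""
    sample = sorted(random.Random(seed).sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]

def range_partition(key, split_points):
    # Keys below the first cut point go to reducer 0, the next range to reducer 1, etc.
    return bisect.bisect_right(split_points, key)

def total_sort(keys, num_reducers, sample_size=100):
    cuts = sample_split_points(keys, num_reducers, sample_size)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[range_partition(key, cuts)].append(key)
    # Each reducer sorts its own range; concatenating the outputs is globally sorted.
    return [sorted(bucket) for bucket in buckets]

keys = [5, 3, 9, 1, 7, 3, 8, 2, 6, 4]
parts = total_sort(keys, 3)
print(parts)
```

With a hash partitioner the per-reducer outputs would interleave; with range partitioning, reducer 0's output precedes reducer 1's, and so on.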
  57. Joins ● Joining two (or more) datasets is quite a common need. ● The difficulty is that both datasets may be “too big” – Otherwise, an in-memory join can be done quite easily just by reading one of the datasets into RAM. ● “Big joins” are commonly done “reduce-side”: – Map dataset 1: (K1, d11), (K2, d12) – Map dataset 2: (K1, d21), (K2, d22) – Reduce by common key (K1, K2, ...) and join: K1 → d11, d21; K2 → d12, d22 ● The so-called “map-side joins” are more complex and tricky.
  58. Joins: 1-N relation ● Use secondary sorting to receive the “one” side of the relation first – Otherwise you need to use memory to perform the join ● Does not scale ● Employee (E) – Sales (S) join – Values arrive as S S E S S S: you need to use memory – Values arrive as E S S S S S: memory is not needed
  59. Left – Right – Inner joins
      ● Join between Employee and Sales
          reducer(key, values):
              employee = null
              first = first(values)
              rest = rest(values)
              if isEmployee(first):
                  employee = first
              if employee = null:
                  // right join (values: SSSSS)
              else if size(rest) = 0:
                  // left join (values: E)
              else:
                  // inner join (values: ESSSSS)
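A streaming version of this join reducer might look as follows. It assumes, as in the 1-N slide, that secondary sorting delivers the employee record (tagged `"E"`) before the sales records (tagged `"S"`); the tags and the output tuple layout are hypothetical.

```python
def join_reducer(key, values):
    """values: iterable of (tag, payload), with the 'E' record first when it exists."""
    records = iter(values)
    tag, payload = next(records)
    if tag == "E":
        employee = payload
        matched = False
        for _tag, sale in records:  # streamed: sales are never buffered
            matched = True
            yield ("inner", key, employee, sale)
        if not matched:
            yield ("left", key, employee, None)  # employee without sales
    else:
        # No employee for this key: every value is a sale.
        yield ("right", key, None, payload)
        for _tag, sale in records:
            yield ("right", key, None, sale)

print(list(join_reducer(1, [("E", "Ann"), ("S", 10), ("S", 20)])))
# [('inner', 1, 'Ann', 10), ('inner', 1, 'Ann', 20)]
```

Which of the three kinds of rows is actually emitted depends on whether an inner, left or right join is wanted.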
  61. Reconciliation ● Hadoop can be used to “simulate a database”. ● For that, we need to: – Merge data state s(t) with past state s(t-1) using a Join. – Update rows with the same ID (performing whatever logic). – Store the result as the next full data state. – Rotate states: ● s(t-1) = s(t) ● s(t) = s(t + 1) [Diagram: an M/R job takes s(t-1) and s(t) and produces s(t+1)]
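The reconciliation step boils down to a join by ID followed by a merge. A minimal in-memory sketch, where the `merge` callback stands for "whatever logic" and the record layout is an assumption:

```python
def reconcile(previous_state, new_records, merge):
    """Join the previous full state with incoming records by ID and build the next state."""
    next_state = dict(previous_state)
    for record_id, record in new_records:
        if record_id in next_state:
            next_state[record_id] = merge(next_state[record_id], record)  # update row
        else:
            next_state[record_id] = record  # new row
    return next_state

state = {1: {"price": 10}, 2: {"price": 5}}
incoming = [(1, {"price": 12}), (3, {"price": 7})]
state = reconcile(state, incoming, lambda old, new: {**old, **new})  # rotate states
print(state)  # {1: {'price': 12}, 2: {'price': 5}, 3: {'price': 7}}
```

In Hadoop the same logic runs as a reduce-side join keyed by ID, and the output directory becomes the input state of the next run.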
  62. Real-life Hadoop projects ● 95% of real-life Hadoop projects are a mixture of the patterns we just saw. ● Example: A vertical search engine. – Distributed execution: Feed crawl / parse – Data reconciliation: Merge data by listing ID – Join: Augment listings with geographical info – … ● Example: Payments data stats. – Secondary sort: weekly, daily & monthly stats – Distributed execution: Random updates to a DB ● ...
  63. Real-life MapReduce
  64. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining
  65. Data analytics ● Obvious use case for MapReduce. ● Examples: – Unique visits per page. – Top products per month. – Unique clicks per banner. – Etc. ● Offline analytics (batch-oriented). – Online analytics is not a good fit for MapReduce.
  66. Data analytics: How it works ● A batch process that uses all historical data. – Recompute everything always. – Easier to manage and maintain than incremental computation. ● A chain of MapReduce steps produces the final output. ● There are tools that ease building / maintaining the MapReduce chain: – Hive, Pig, Cascading, Pangool for programming a MapReduce flow easily. – Oozie, Azkaban for connecting existing MapReduce jobs easily. ● Scheduling flows and such.
  67. Data analytics: Difficulties ● Some things are harder to calculate than others. ● Calculating unique visits per page. – A simple solution in two MapReduce steps or a more sophisticated one in a single MapReduce step. – Approximate methods can be used as well. ● Calculating the median. – Need to sort the whole dataset and iterate twice if we don’t know the number of elements.
  68. Data analytics: Examples ● Gather clicks on pages. – Save (click, page, timestamp) in HDFS. ● A MapReduce job groups by page and counts the number of clicks: ● map: emit(page, click). ● reduce: (page, list<click>) emits (page, totalClicks). ● We now have the total number of clicks per page.
  69. Data analytics: Examples (II) ● Another MapReduce job groups by day and page and counts the number of clicks: ● map: emit((page, day), click). ● reduce: Same as before. ● We now have the total number of clicks per page and day. ● These are simple examples, but data analytics can get as sophisticated as you want. – Example: calculate a 10-bar histogram of the distribution of clicks over the hours of the day for each page.
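The "clicks per page and day" job only differs from the per-page count in its key. A small in-memory sketch, assuming each click is a `(page, unix_timestamp)` pair and that days are computed in UTC (both are assumptions made for the example):

```python
from collections import Counter
from datetime import datetime, timezone

def clicks_per_page_and_day(clicks):
    counts = Counter()
    for page, timestamp in clicks:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
        counts[(page, day)] += 1  # map emits ((page, day), click); reduce counts
    return dict(counts)

clicks = [("/home", 0), ("/home", 3600), ("/home", 86400), ("/about", 10)]
per_day = clicks_per_page_and_day(clicks)
print(per_day)
# {('/home', '1970-01-01'): 2, ('/home', '1970-01-02'): 1, ('/about', '1970-01-01'): 1}
```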
  71. Crawling: Web Crawling ● Web Crawling: – “A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.” ● Applications: – Search engines. – NLP (Sentiment analysis). ● Examples:
  73. Crawling: Web Crawling (at scale) ● How to parallelize storage and bandwidth? ● How to deduplicate stored URLs? ● Other complexities: politeness, infinite loops, robots.txt, canonical URLs, pagination, parsing, … ● Relevancy: PageRank.
  74. Crawling: Nutch ● What is Nutch? – Open source web-search software project. – Apache project. – Hadoop, Tika, Lucene, SOLR. ● (Brief) history – Started in 2002/2003 – 2005: MapReduce – 2006: Hadoop – 2006/2007: Tika – 2010: top-level Apache project (TLP)
  75. Crawling: Nutch: How it works ● “Select, Crawl, Parse, Dedup by URL” loop. ● Lucene, SOLR for indexing. – We will see them later. ● CrawlDB: Pages are saved in HDFS. ● MapReduce makes storage and computing scalable. – Helps in deduplicating pages by URL. – Helps in identifying new pages to crawl.
  76. Crawling: Not-Only Web Crawling ● Custom crawlers: – Tweets. – XML feeds. ● Simpler, as we usually don’t need to traverse a tree. – Sometimes only crawling a fixed seed of resources is enough. ● Applications – Vertical search engines. – Reputation systems.
  77. Crawling: Example: Crawling tweets at scale ● Use a scalable computing engine for fetching tweets. – Storm is a good fit. – Hadoop can be used as well. ● Tricky usage of MapReduce: Create as many groups as crawlers and embed a Crawler in them. ● Save raw feed data (JSON) in HDFS. ● MapReduce: Parse JSON tweets. ● MapReduce: Deduplicate tweets. ● MapReduce: Analyze tweets and perform data analysis.
  78. Crawling: Example: Crawling tweets at scale [Diagram: HDFS → M/R Parse → M/R Dedup → M/R Analysis → Results]
  80. Full-text indexing: Definitions ● Search engine: – An information retrieval system designed to help find information stored on a computer system. ● Inverted index: – Index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. – B-Trees are not inverted indexes! ● Stemming. ● Relevancy in results.
  81. Full-text indexing: Applications ● Web search engines – Finding relevant pages for a topic ● Vertical search engines – Finding jobs by description ● Social networks – Finding messages by text ● e-Commerce – Finding articles by description ● In general, any service or application needing efficient text information retrieval
  82. Full-text indexing (at scale) ● Real-time indexing versus batch indexing: – The first is cool: it is real-time, but it is difficult. We will not address it now. – The second is not real-time, but it is simpler. ● How to batch-index a big corpus? – Need scalable storage (HDFS). ● How to deduplicate documents? – MapReduce to the rescue (like we saw before). ● How to generate multiple indexes? – MapReduce can help (we will see how).
  83. Full-text indexing: MapReduce ● MapReduce can be used to generate an inverted index. – Vertical partitioning vs. horizontal partitioning. ● Example: – Map: emit(word, docId) – Reduce: emit(word, list<docIds>) ● Quite simple. But what about stop words, stemming, etc.? ● How to store the index? ● Better not to reinvent the wheel.
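The map/reduce pair for the inverted index can be sketched in memory as below. It deliberately ignores stop words and stemming, which is exactly the gap the slide points out; the whitespace tokenizer and the lowercasing are assumptions.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents: {doc_id: text}. Returns {word: sorted list of doc ids}."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():  # map: emit(word, docId)
            postings[word].add(doc_id)
    # reduce: emit(word, list<docIds>)
    return {word: sorted(doc_ids) for word, doc_ids in sorted(postings.items())}

index = build_inverted_index({1: "big data", 2: "Big deal"})
print(index)  # {'big': [1, 2], 'data': [1], 'deal': [2]}
```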
  84. Full-text indexing: Lucene / SOLR Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Lucene: Doug Cutting’s – From Nutch. – Mainstream open-source implementation of an inverted index. – Efficient disk allocation, highly performant. ● SOLR: Mainstream open-source search server. – Provides stemming, analyzers, HTTP servlets, etc. – Lacks some other desirable properties: ● Elasticity, real-time indexing, horizontal partitioning (although work in progress). ● Still the reference technology for creating and serving inverted indexes. 84 / 112
  85. Full-text indexing: MapReduce meets SOLR Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● We can use MapReduce for scaling the indexing process. ● At the same time, we can use SOLR for creating the resulting index. – SOLR is used as-a-library. ● Generated indexes are later deployed to the search servers. 85 / 112
  86. Full-text indexing: Example Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● A vertical job search engine. ● Jobs are parsed from crawled feeds and saved in the HDFS. ● MapReduce for deduplicating job offers. – map: emit(jobId, job) – reduce (jobId, list<job>) -> emit (jobId, job) ● Retention policy: keep latest job. 86 / 112
  87. Full-text indexing: Example (II) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● MapReduce for augmenting job information (adding geographical information). – map1: emit(job.city, job) – map2: emit(city, geoInfo) – reduce (job.city, list<job>)(city, geoInfo) -> for all jobs emit(job, geoInfo) ● MapReduce for distributing the index process: – map: emit(job.country, job) – reduce: (job.country, list<job>) -> Create index for country “job.country” using SOLR. ● Deploy per-country indexes to search cluster. 87 / 112
  88. Full-text indexing: Example (diagram) XML feeds → M/R Parse → HDFS → M/R Dedup → M/R Geo info (joined with geo info) → M/R Index → Indexes → Deploy → Search Cluster
  89. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining
  90. Reputation: Definitions ● What is reputation? ● Reputation in social communities. – eBay, StackOverflow... ● Reputation in social media. – Twitter, Facebook... ● Why is it important?
  91. Reputation: Relationships ● Modelling relationships is needed for calculating reputation. ● Graph-like models arise. ● Usually stored as edges: – A interacts with B – or, A trusts B – or, A → B ● Internet-scale graphs can be stored in HDFS. – Each edge in a row. – Add needed metadata to edges: date, etc.
  92. Reputation: MapReduce analysis on edges ● All the people “A” interacted with. – map: emit(a, b) – reduce: emit(a, list<b>) ● Algorithms like PageRank can be implemented easily on top of this. ● PageRank measures page relevance from page inlinks. – But it can be extrapolated to any kind of authority or trust metric. – E.g. people's relevance in social networks.
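Grouping the (a, b) relationship rows by their first element yields one adjacency list per person. A small in-memory sketch follows; the function name and sample data are illustrative assumptions.

```python
from collections import defaultdict


def adjacency_lists(edges):
    """Map emits (a, b) per row; reduce collects (a, list<b>)."""
    groups = defaultdict(list)
    for a, b in edges:  # map: emit(a, b)
        groups[a].append(b)
    # reduce: emit(a, list<b>), sorted here only to make the output stable
    return {a: sorted(bs) for a, bs in groups.items()}


interactions = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
neighbours = adjacency_lists(interactions)
```

Here `neighbours["A"]` contains everyone "A" interacted with, which is the structure an iterative algorithm such as PageRank takes as its input.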
  93. Reputation: Going deeper on graphs ● Friends of friends of friends. – 1 MapReduce step: My friends. – 2 MapReduce steps: Friends of my friends. – 3 MapReduce steps: Friends of friends of my friends. ● Iterative MapReduce solves it. ● But there are better foundational models such as Google’s Pregel. – Exploiting data locality in graphs. – Apache Giraph. – Apache Hama.
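One way to picture the iterative approach: each additional MapReduce step joins the pairs reached so far with the edge list, extending every path by one hop. The sketch below is an in-memory approximation with hypothetical names. It follows paths of exactly the given length and does not remove people who are also reachable by a shorter path.

```python
from collections import defaultdict


def expand_one_step(frontier, edges):
    """One step: join (user, reached) pairs with the edges leaving `reached`."""
    by_source = defaultdict(set)
    for a, b in edges:
        by_source[a].add(b)
    expanded = set()
    for user, reached in frontier:
        for nxt in by_source.get(reached, ()):
            if nxt != user:  # do not report a user as their own contact
                expanded.add((user, nxt))
    return expanded


def reachable_in_steps(edges, steps):
    """steps=1: my friends; steps=2: friends of my friends; and so on."""
    frontier = set(edges)
    for _ in range(steps - 1):
        frontier = expand_one_step(frontier, edges)
    return frontier


friendships = [("A", "B"), ("B", "C"), ("C", "D")]
second_degree = reachable_in_steps(friendships, 2)
third_degree = reachable_in_steps(friendships, 3)
```

On a real cluster every call to `expand_one_step` would be a full job that rereads the edge list, which is one reason vertex-centric models such as Pregel tend to suit this kind of workload better.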
  94. Reputation: Difficulties ● Sometimes multiple MapReduce steps are needed for calculating a final metric. – Because data doesn’t fit in memory. ● Intermediate relations need to be calculated. – And later filtered out. ● “Polynomial effect”: calculating all pairwise relations in a set of N elements yields N*(N-1)/2 pairs. – Possible bottleneck.
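A quick illustration of how fast the number of pairwise relations grows with the size of a group. The helper names are hypothetical; the count is simply N*(N-1)/2.

```python
from itertools import combinations


def pairwise_relations(members):
    """Enumerate every unordered pair within one group."""
    return list(combinations(sorted(members), 2))


def pair_count(n):
    """Number of unordered pairs in a set of n elements: n * (n - 1) / 2."""
    return n * (n - 1) // 2


small_group_pairs = pairwise_relations(["ana", "bob", "eve", "joe"])
large_group_pair_count = pair_count(100_000)
```

Four members produce 6 pairs, while a single group of 100,000 members already produces about 5 billion pairs, all of which land on one reducer if the pairs are generated per group.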
  95. Reputation: Difficulties: Data imbalance ● When grouping by something, some groups may be much bigger than others. – Causing “data imbalance”. ● Data imbalance in MapReduce is a big problem. – Some machines will finish quickly while one will be busy for hours. ● Inefficient usage of resources. ● Data processing doesn’t scale linearly anymore. – The next MapReduce step can’t start until the current one has finished.
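The effect of imbalance can be estimated with a toy model, assuming reducer run time is proportional to the number of records it receives. The round-robin assignment below is a stand-in for hash partitioning, and both function names are hypothetical.

```python
def reducer_loads(group_sizes, num_reducers):
    """Assign each group to a reducer and return the records per reducer."""
    loads = [0] * num_reducers
    for position, key in enumerate(sorted(group_sizes)):
        # round-robin stands in for hash partitioning of the keys
        loads[position % num_reducers] += group_sizes[key]
    return loads


def effective_parallelism(loads):
    """Total work divided by the slowest reducer's work."""
    return sum(loads) / max(loads)


skewed = reducer_loads({"a": 1_000_000, "b": 10, "c": 10, "d": 10}, 4)
balanced = reducer_loads({"a": 250_000, "b": 250_000, "c": 250_000, "d": 250_000}, 4)
```

With the skewed input, four reducers give an effective parallelism barely above 1, because the job lasts as long as the reducer holding group "a"; with the balanced input the same four reducers reach 4.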
  96. Reputation: Example ● Input: Tweets. ● Distributed crawling for fetching the tweets. – Save them in HDFS. ● Parse the tweets. Define the graph of trust. – A trusts B if A follows B. ● Execute PageRank over the graph. – Spreads trust to all vertices.
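A compact sketch of PageRank over a "follows" graph, written as repeated map (spread rank along outgoing edges) and reduce (sum incoming contributions) rounds. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from accounts that follow nobody are common conventions assumed here, not details taken from the slides.

```python
from collections import defaultdict


def pagerank(follows, damping=0.85, iterations=50):
    """follows maps each account to the list of accounts it follows (trusts)."""
    nodes = set(follows) | {b for targets in follows.values() for b in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        contributions = defaultdict(float)
        dangling = 0.0
        for a in nodes:  # map: split a's rank among the accounts a follows
            targets = follows.get(a, [])
            if targets:
                for b in targets:
                    contributions[b] += rank[a] / len(targets)
            else:
                dangling += rank[a]  # accounts following nobody
        # reduce: sum what each account received, plus the damping term
        rank = {
            node: (1 - damping) / n
            + damping * (contributions[node] + dangling / n)
            for node in nodes
        }
    return rank


follows = {"A": ["B"], "C": ["B"], "D": ["B"], "B": ["A"]}
trust = pagerank(follows)
most_trusted = max(trust, key=trust.get)
```

In this toy graph "B" is followed by everyone else and ends up with the highest score, while "A" inherits much of B's trust because B follows only A.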
  97. Reputation: Example (diagram) HDFS → M/R Parse → Graph of trust (vertices A, B, C, D) → M/R PageRank → Results
  98. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining
  99. Data mining: Text classification ● Document classification – Documents are texts in this case. ● Assigns one or more categories to a text. – Binary classifiers versus multi-category classifiers. – Multi-category classifiers can be built from multiple binary classifiers. ● Two steps: generating the model and classifying.
  100. Data mining: Text classification: Steps ● Generating the model – The resulting model may or may not fit in memory. ● Let’s assume the final model fits in memory. – Use a large dataset for generating the model. ● MapReduce helps scale the model generation process. ● Example: Build multiple binary classifiers -> parallelize by classifier. ● Example: Calculate conditional probabilities of a Bayesian model. Parallelize by word (like in the WordCount example). ● Classifying – MapReduce also helps in classifying a large dataset. ● If the model fits in memory, parallelize the documents to classify and load the model in memory. – Batch-classifying: parallelize the documents to classify. Output is the set of documents with the assigned categories.
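The "parallelize by word" idea for a Bayesian model can be sketched as a WordCount variant: the mapper emits (word, category) and the reducer turns the counts for one word into estimates of P(word | category). This is a simplified illustration with hypothetical names. It uses plain relative frequencies with no smoothing, and it assumes the per-category word totals are small enough to hand to every reducer.

```python
from collections import Counter, defaultdict


def map_doc(category, text):
    """Map: emit (word, category) for every word of a labelled document."""
    for word in text.lower().split():
        yield word, category


def reduce_word(word, categories, words_per_category):
    """Reduce: estimate P(word | category) for each category seen with the word."""
    counts = Counter(categories)
    return word, {c: counts[c] / words_per_category[c] for c in counts}


def train(labelled_docs):
    groups = defaultdict(list)      # simulated shuffle keyed by word
    words_per_category = Counter()  # total words seen per category
    for category, text in labelled_docs:
        for word, emitted_category in map_doc(category, text):
            groups[word].append(emitted_category)
            words_per_category[emitted_category] += 1
    return dict(
        reduce_word(word, cats, words_per_category) for word, cats in groups.items()
    )


training_set = [("spam", "buy now buy"), ("ham", "see you now")]
model = train(training_set)
```

Each reducer call only needs the values for a single word, so the work spreads across the vocabulary in the same way as in WordCount.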
  101. Data mining: Others ● Mahout library for other data mining problems. – Clustering, logistic regression, etc. ● Recommendation algorithms. – Many recommendation algorithms are based on calculating correlations. – Calculating correlations in parallel with MapReduce is easy. ● Remember: always in the “batch” or “offline” domain. – Recommendations are reloaded after the batch process finishes.
  102. Tuple MapReduce ● Pere Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining (to appear; 10.7% acceptance rate)
  103. Common MapReduce problems ● Lack of compound records – By default, key & value are considered monolithic entities. ● In real life, this case is rare. – Alleviated by some serialization libraries (Thrift, Protocol Buffers). ● Sorting within a group – The MapReduce foundation does not support it. ● Although MapReduce implementations overcome this problem with “tricks”. ● Joins – Need compound records and sorting within a group to be implemented. – Not directly supported by MapReduce.
  104. Tuple MapReduce: rationale ● Compound records, sorting within a group and joins are design patterns that arise in most MapReduce applications... ● … but MapReduce does not make the implementation easy. ● An evolution of the MapReduce paradigm is needed to cover these design patterns: Tuple MapReduce.
  105. Tuple MapReduce ● We extended the classic (key, value) MapReduce model ● Use n-sized Tuples instead of (key, value) ● Define a Tuple-based M/R – Covering most common use cases
  106. Group by / Sort by ● You can think of an M/R job as a SELECT … GROUP BY … ● With Tuple MapReduce, you simply “group by” a subset of Tuple fields. – Easier and more intuitive than having objects for each kind of Key you want to group by. ● Alternatively, you may “sort by” a wider subset. – Hiding all the complex logic behind secondary sort.
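The "group by a subset, sort by a wider subset" idea can be imitated in memory with a sort followed by a grouping pass. This sketch is not Pangool's API: the function name, the dictionary-based tuples, and the field names are assumptions, and the group-by fields are required to be a prefix of the sort-by fields.

```python
from itertools import groupby
from operator import itemgetter


def group_and_sort(tuples, group_by, sort_by):
    """Group by `group_by` fields; order tuples inside each group by `sort_by`."""
    if list(sort_by[: len(group_by)]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    ordered = sorted(tuples, key=itemgetter(*sort_by))  # plays the secondary sort
    group_key = itemgetter(*group_by)
    return [(key, list(group)) for key, group in groupby(ordered, key=group_key)]


visits = [
    {"user": "b", "ts": 2, "url": "/x"},
    {"user": "a", "ts": 5, "url": "/y"},
    {"user": "a", "ts": 1, "url": "/z"},
]
grouped = group_and_sort(visits, group_by=["user"], sort_by=["user", "ts"])
```

Each reducer call then sees one user with that user's visits already ordered by timestamp, which is what a hand-written secondary sort is normally used for.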
  107. Tuple-Join MapReduce ● Extend the whole idea to allow for easier joins. – Tuple1: (a, b, c, d) – Tuple2: (a, b, f, g, h) – Join by (a, b) ● Formally speaking: (definition shown as a figure on the original slide)
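A join by the common fields (a, b) of two tuple schemas can be sketched as follows. Tuples are modelled as dictionaries and the function name is hypothetical; this is an in-memory inner join meant to illustrate the idea, not the formal definition from the paper.

```python
from collections import defaultdict


def tuple_join(left, right, join_fields):
    """Inner join of two tuple collections on the shared `join_fields`."""
    groups = defaultdict(lambda: ([], []))  # key -> (left tuples, right tuples)
    for t in left:
        groups[tuple(t[f] for f in join_fields)][0].append(t)
    for t in right:
        groups[tuple(t[f] for f in join_fields)][1].append(t)
    joined = []
    for key in sorted(groups):  # each key corresponds to one reducer group
        left_tuples, right_tuples = groups[key]
        for lt in left_tuples:
            for rt in right_tuples:
                joined.append({**lt, **rt})
    return joined


tuples1 = [{"a": 1, "b": 2, "c": "c1", "d": "d1"}]
tuples2 = [
    {"a": 1, "b": 2, "f": "f1", "g": "g1", "h": "h1"},
    {"a": 9, "b": 9, "f": "f2", "g": "g2", "h": "h2"},
]
joined = tuple_join(tuples1, tuples2, ("a", "b"))
```

Only the tuples that agree on both a and b are combined; the second tuple of `tuples2` has no partner and is dropped.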
  108. Pangool http://pangool.net
  109. Pangool: What? ● A better, simpler, more powerful replacement for Hadoop's API. ● What do we mean by API replacement? – APIs on top of Hadoop: Pig, Hive, Cascading. – Using them always comes with a tradeoff. – Paradigms other than MapReduce are not always the best choice. – Performance restrictions. ● Pangool is still MapReduce, low-level and high performing. – Yet a lot simpler!
  110. Pangool: Why? ● Hadoop has a steep learning curve. ● The default API is too low-level. ● Making things efficient is hard (binary comparisons, ser/de...). ● There are some common patterns (joins, secondary sorting...). ● How can we make them simpler without losing flexibility and power?
  111. Pangool API ● Schema, Tuple, … ● Reducers, Mappers, etc. are instances instead of static classes. – Easier to configure them: new MyReducer(5, 2.0); ● Still tied to Hadoop's particularities in some ways. – NullWritable, etc. ● Let's see an example.
  112. Thanks!! Iván de Prado Alonso Pere Ferrera Bertran

Editor's Notes

  1. - The Guardian Innovation Awards. You have to admit that Swiss Army knives are useful … Who hasn't needed a magnifying glass in an emergency! Hadoop is like a Swiss Army knife: very useful, but you sweat blood before you manage to pull out the tool you want.
  2. Distributed: harnesses the power of several machines in a cluster. Large datasets: Hadoop is not appropriate for small datasets. Simple Programming Model: Hadoop is not just a framework, it is a new paradigm for distributed programming. Hadoop rests mainly on two modules: a distributed file system, for storing large volumes of data, and a new programming paradigm, MapReduce. Let's look at them one by one.