Big Data, MapReduce and beyond

Speaker notes
  • The Guardian Innovation Awards. Swiss army knives are admittedly useful… who hasn't needed a magnifying glass in an emergency? Hadoop is like a Swiss army knife: very useful, but you sweat buckets before you manage to pull out the accessory you want.
  • Distributed: it harnesses the power of several machines in a cluster. Large data sets: Hadoop is not suited to small data sets. Simple programming model: Hadoop is not just a framework, it is a new distributed programming paradigm. Hadoop rests mainly on two modules: a distributed file system for storing large volumes of data, and a new programming paradigm, MapReduce. Let's look at each in turn.

    1. 1. Big Data, MapReduce and beyond Iván de Prado Alonso // @ivanprado Pere Ferrera Bertran // @ferrerabertran @datasalt
    2. 2. Outline 1. Big Data: what and why 2. MapReduce & Hadoop 3. MapReduce Design Patterns 4. Real-life MapReduce (data analytics, crawling, full-text indexing, reputation systems, data mining) 5. Tuple MapReduce & Pangool 2 / 112
    3. 3. Big Data: What and why
    4. 4. In the past... ● Data and computation fit on one monolithic machine ● Monolithic databases: RDBMS ● Scalability: vertical (buy better hardware) ● Distributed systems: not very common; logic-centric (data moves to where the logic is) ● Distributed storage: SAN 4 / 112
    5. 5. Distributed systems are hard ● Building distributed systems is hard: if you can scale vertically at a reasonable cost, why deal with the complexity of distributed systems? ● But circumstances are changing: Big Data ● Big Data refers to the massive amounts of data that are difficult to analyze and handle using common database management tools 5 / 112
    6. 6. BIG DATA “MAC” (image slide) 6 / 112
    7. 7. Big Data ● Data is the new bottleneck: web data (web pages, interaction logs), social network data, mobile devices, data generated by sensors ● Old systems and techniques are not appropriate ● A new approach is needed 7 / 112
    8. 8. Big Data project parts (diagram): Acquiring, Processing, Serving 8 / 112
    9. 9. Acquiring ● Gathering, receiving and storing data from sources ● Many kinds of sources: Internet, sensors, user behavior, mobile devices, health care data, banking data, social networks, … 9 / 112
    10. 10. Processing ● Data is already present in the system (acquired) ● This step is responsible for extracting value from the data: eliminate duplicates, infer relations, calculate statistics, correlate information, ensure quality, generate recommendations, … 10 / 112
    11. 11. Serving ● In most cases, some interface has to be provided to access the processed information ● Possibilities: Big Data / no Big Data; real-time access to results / non-real-time access ● Some examples: search engine → inverted index; banking data → relational database; social network → NoSQL database 11 / 112
    12. 12. Big Data system types ● Offline: latency is not a problem ● Online: response immediacy is important ● Mixed: online behavior, but internally a mixture of two systems, one online and one offline ● Examples: offline (MapReduce, Hadoop, distributed RDBMS) and online (NoSQL, search engines) 12 / 112
    13. 13. Big Data system types II (diagram: a mixed system combining an online path and an offline path, each with its own Acquiring, Processing and Serving stages) 13 / 112
    14. 14. MapReduce & Hadoop
    15. 15. “Swiss army knife of the 21st century” Media Guardian Innovation Awards http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop 15 / 112
    16. 16. History Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● 2004-2006 – GFS and MapReduce papers published by Google – Doug Cutting implements an open source version for Nutch● 2006-2008 – Hadoop project becomes independent from Nutch – Web scale reached in 2008● 2008-now – Hadoop becomes popular and is commercially exploited Source: Hadoop: a brief history. Doug Cutting 16 / 112
    17. 17. Hadoop “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” From Apache Hadoop page 17 / 112
    18. 18. Main ideas ● Distributed: distributed storage and a distributed computation platform ● Built to be fault tolerant ● Shared-nothing architecture ● The programmer is isolated from the difficulties of distributed systems by simple programming primitives 18 / 112
    19. 19. Hadoop Distributed File System (HDFS) ● Distributed: aggregates the individual storage of each node ● Files are formed by blocks, typically 64 or 128 MB (configurable), stored in the OS filesystem (XFS, ext3, etc.) ● Fault tolerant: blocks are replicated more than once 19 / 112
    20. 20. How files are stored (diagram: the NameNode keeps the metadata for Data.txt, i.e. its list of blocks 1-4 and the DataNodes holding each block's replicas; the blocks themselves are spread and replicated across DataNodes 1-4) 20 / 112
    21. 21. MapReduce ● MapReduce is the abstraction behind Hadoop ● The unit of execution is the Job ● A Job has an input, an output, a map function and a reduce function ● Input and output are sequences of key/value pairs ● The map and reduce functions are provided by the developer; the execution is distributed and parallelized by Hadoop 21 / 112
    22. 22. Job phases Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Two different phases: mapping and reducing● Mapping phase – Map function is applied to Input data ● Intermediate data is generated● Reducing phase – Reduce function is applied to intermediate data ● Final output is generated 22 / 112
    23. 23. MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Two functions (Map & Reduce) – Map(u, v) : [w,x]* – Reduce(w, x*) : [y, z]*● Example: word count – Map([document, null]) -> [word, 1]* – Reduce(word, 1*) -> [word, total]● MapReduce & SQL – SELECT word, count(*) GROUP BY word● Distributed execution in a cluster – Horizontal scalability 23 / 112
    24. 24. Word Count (worked example) Input: “This is a line”, “Also this”. Map: map(“This is a line”) → (this, 1) (is, 1) (a, 1) (line, 1); map(“Also this”) → (also, 1) (this, 1). Reduce: reduce(a, {1}) → (a, 1); reduce(also, {1}) → (also, 1); reduce(is, {1}) → (is, 1); reduce(line, {1}) → (line, 1); reduce(this, {1, 1}) → (this, 2). Result: a 1, also 1, is 1, line 1, this 2. 24 / 112
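The flow on the slide above can be reproduced outside Hadoop in a few lines of plain Python. This is only an illustrative, in-memory simulation of the word-count Job (the function names and the simulated shuffle are invented for the example; it is not the Hadoop API):

    from collections import defaultdict

    def map_fn(document):
        # Map: emit (word, 1) for every word in the input document.
        for word in document.lower().split():
            yield word, 1

    def reduce_fn(word, counts):
        # Reduce: sum all the counts received for one word.
        yield word, sum(counts)

    def run_job(documents):
        # Shuffle & sort: group the intermediate (key, value) pairs by key.
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        # Reduce each group and collect the final output.
        output = {}
        for key in sorted(groups):
            for k, v in reduce_fn(key, groups[key]):
                output[k] = v
        return output

    print(run_job(["This is a line", "Also this"]))
    # {'a': 1, 'also': 1, 'is': 1, 'line': 1, 'this': 2}, matching the slide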
    25. 25. Map examples ● Swap Mapper: swaps key and value
        map(key, value):
          emit (value, key)
    ● Split Key Mapper: splits the key into words and emits one pair per word
        map(key, value):
          words = key.split(“ “)
          for each word in words:
            emit (word, value)
    25 / 112
    26. 26. Map examples (II) ● Filter Mapper: filters out some records
        map(key, value):
          if (key <> “the”):
            emit (key, value)
    ● Key/value concatenation mapper: concatenates the key and the value into the key
        map(key, value):
          emit (key + “ “ + value, null)
    26 / 112
    27. 27. Reduce examples ● Count reducer: counts the number of elements per key
        reduce(key, values):
          count = 0
          for each value in values:
            count++
          emit(key, count)
    ● Average reducer: computes the average value for each key
        reduce(key, values):
          count = 0
          total = 0
          for each value in values:
            count++
            total += value
          emit(key, total / count)
    27 / 112
    28. 28. Reduce examples (II) ● Keep-first reducer: keeps the first key/value input pair
        reduce(key, values):
          emit(key, first(values))
    ● Value concatenation reducer: concatenates the values into one string
        reduce(key, values):
          result = “”
          for each value in values:
            result += “ “ + value
          emit(key, result)
    28 / 112
    29. 29. Identity map and reduce ● The identity functions are those that keep the input unchanged
    ● Map identity:
        map(key, value):
          emit (key, value)
    ● Reduce identity:
        reduce(key, values):
          for each value in values:
            emit (key, value)
    29 / 112
    30. 30. Putting it all together map(k, v) → [w, x]*; reduce(w, [x]+) → [y, z]* ● Job flow: the mapper generates key/value pairs; these pairs are grouped by key; Hadoop calls the reduce function for each group; the output of the reduce function is the final Job output ● Hadoop distributes the work: different nodes in the cluster process the data in parallel 30 / 112
    31. 31. Job Execution (diagram: input splits/blocks are read by map tasks on each node, the intermediate data they produce is shuffled to the reduce tasks, and the reduce tasks write the final output) 31 / 112
    32. 32. Job Execution (II) ● Key/value pairs are sorted by key in the shuffle & sort phase; this is needed to group records by key when calling the reducer ● It also means that calls to the reduce function are made in key order: within the same reduce task, the reduce call for key “A” always happens before the one for key “B” ● Reducers start downloading data from the mappers as soon as possible, in order to shorten the shuffle & sort phase ● The number of reducers can be configured by the programmer 32 / 112
    33. 33. Partial Sort Job ● A job formed by the identity map and the identity reducer: it simply sorts the data by key within each reducer (diagram: an input file with keys D B A B C D E A is split across two map tasks; after the shuffle, each of the two reducers receives its share of the keys and writes them to its output file in sorted order) 33 / 112
    34. 34. Input Splits ● Each map task processes one input split ● A map task starts processing at the first complete record in its split and finishes with the record crossed by the split's rightmost boundary (diagram: a file divided into four input splits, one map task per split, with some records straddling the split boundaries) 34 / 112
    35. 35. Combiner ● Intermediate data travels from the map tasks to the reduce tasks over the network, which can become saturated ● Combiners can be used to reduce the amount of data sent to the reducers, when the operation is commutative and associative ● A combiner is a function similar to the reducer, but it is executed in the map task, just after all the mapping has been done ● Combiners can't have side effects, because Hadoop may decide to execute them or not 35 / 112
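A minimal sketch of the combiner idea in plain Python rather than Hadoop code (all names and data are invented for the example): each simulated map task pre-aggregates its own output, so only one pair per word per task reaches the simulated shuffle, and the reducer only has to add up a handful of partial counts.

    from collections import Counter, defaultdict

    def map_fn(line):
        # Map: emit (word, 1) per word occurrence.
        for word in line.split():
            yield word, 1

    def combine(map_output):
        # Combiner: runs inside the map task, pre-aggregating before the shuffle.
        # Safe here because summation is commutative and associative.
        partial = Counter()
        for word, one in map_output:
            partial[word] += one
        return list(partial.items())

    def reduce_fn(word, partial_counts):
        # Reducer: adds up the partial counts coming from every map task.
        return word, sum(partial_counts)

    # Two "map tasks", each followed by its combiner.
    task_outputs = [combine(map_fn("to be or not to be")),
                    combine(map_fn("to do or not to do"))]

    shuffled = defaultdict(list)          # simulated shuffle: group by word
    for task in task_outputs:
        for word, count in task:
            shuffled[word].append(count)

    for word in sorted(shuffled):
        print(reduce_fn(word, shuffled[word]))
    # ('be', 2) ('do', 2) ('not', 2) ('or', 2) ('to', 4)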
    36. 36. Design Patterns
    37. 37. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 37 / 112
    38. 38. Filtering ● We process the input file in parallel with Hadoop and, for each record, emit it only when a condition holds: if (condition) { emit(); } ● The result is a smaller dataset (input data → output data) 38 / 112
    39. 39. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 39 / 112
    40. 40. Secondary sorting ● Receive the reducer values in a specific order ● Moving averages: secondary sort by timestamp, fill an in-memory window and compute the average ● Top N items in a group: secondary sort by <X> and emit the first N elements of the group ● Useful, yet quite difficult to implement in Hadoop: it involves the Sort Comparator, the Group Comparator and the Partitioner over the key 40 / 112
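A small sketch of what the Sort Comparator / Group Comparator split achieves, simulated in plain Python (the records and field names are invented): the intermediate data is sorted on a composite key (user, timestamp), but reducer groups are formed on the user only, so each group's values arrive already ordered by timestamp.

    from itertools import groupby

    records = [("user1", 30, "c"), ("user2", 10, "x"),
               ("user1", 10, "a"), ("user1", 20, "b")]

    # "Sort comparator": order the intermediate data by (user, timestamp).
    intermediate = sorted(records, key=lambda r: (r[0], r[1]))

    # "Group comparator": the reducer groups contain only the user.
    for user, group in groupby(intermediate, key=lambda r: r[0]):
        values = [(ts, payload) for _, ts, payload in group]   # timestamp-ordered
        print(user, values)
    # user1 [(10, 'a'), (20, 'b'), (30, 'c')]
    # user2 [(10, 'x')]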
    41. 41. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 41 / 112
    42. 42. Distributed execution without Hadoop ● Distributed queue: a common queue is needed to coordinate and assign work ● Distributed workers: consumers running on each node, taking work from the queue ● Problems: difficult to coordinate (failover, losing messages, load balancing), and the queue itself must scale 42 / 112
    43. 43. Distributed execution with Hadoop ● Map-only Jobs ● Use Hadoop just for the sake of “parallelizing something” ● Anything that doesn't involve a “group by” (no shuffle, no reducer) ● Examples: text categorization, filtering, crawling, updating a DB, distributed grep ● NLineInputFormat can be handy for this 43 / 112
    44. 44. Disadvantages ● Work is done in batches, and the distribution is probably not even, so some resources are wasted ● There are some tricks to alleviate the problem: task timeouts plus saving the remaining work for the next execution 44 / 112
    45. 45. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 45 / 112
    46. 46. Computing statistics (I) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Count, Sum, Average, Std. Dev...● Aggregated by something● Recall SQL: select user, count(clicks) … group by user user, click user, click Map: emit (user, click) Reduce by user: count values user, count(click) 46 / 112
    47. 47. Computing statistics (II) ● When computing sum(), avg(), etc., Combiners are often needed ● Imagine a user performed 3 million clicks: a single reducer will receive 3 million records and will be the bottleneck of the Job, since everyone has to wait for it to count 3 million things ● Solution: perform partial counts in a Combiner ● The Combiner is executed after the Mapper and before the shuffle 47 / 112
    48. 48. Computing statistics (III) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Using a Combiner: user, click user, click Map user, count(click) user, count(click) Combine user, sum(count(click)) Reduce● For each group, reducer aggregates N counts in the worst case! (N = #mappers) 48 / 112
    49. 49. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 49 / 112
    50. 50. Distinct ● How to calculate count(distinct something) group by X? ● It is fairly easy with 2 M/R jobs ● M/R 1 (eliminates duplicates): emit ({X, something}, null), so rows are grouped by {X, something}; in the reducer, just emit the first one and ignore the duplicates ● M/R 2 (groups by X and counts): for each input {X, something}, emit (X, 1); group by X; the reducer counts the incoming values 50 / 112
    51. 51. Distinct: example (worked example: M/R 1 removes the duplicate (X, something) pairs, M/R 2 counts the remaining pairs per X, yielding X1 → 2 and X2 → 2) 51 / 112
    52. 52. Distinct: Secondary sort ● We can calculate the distinct count in only one Job, using secondary sorting ● M/R: emit ({X, something}, null); group by X, secondary sort by something; the reducer iterates, counts, emits “something, count” when “something” changes, and resets the counter on each change ● A Combiner is needed to eliminate duplicates (otherwise the reducer would receive too many records) ● distinct count() is more parallelizable with 2 Jobs than with 1! 52 / 112
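A sketch of that single-Job reducer logic, simulated in plain Python (the data and names are invented): pairs are sorted by (X, something), grouped by X, and the distinct counter advances only when “something” changes.

    from itertools import groupby

    pairs = [("X1", "s1"), ("X1", "s2"), ("X1", "s1"),
             ("X2", "s1"), ("X2", "s3"), ("X2", "s1")]

    intermediate = sorted(pairs)                  # sort by (X, something)
    for x, group in groupby(intermediate, key=lambda p: p[0]):
        distinct = 0
        previous = object()                       # sentinel, equal to nothing
        for _, something in group:                # values arrive sorted by "something"
            if something != previous:
                distinct += 1
                previous = something
        print(x, distinct)
    # X1 2
    # X2 2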
    53. 53. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 53 / 112
    54. 54. Sorting Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● We have seen how sorting is (partially) inherent in Hadoop.● But if we want “pure” sorting: – Use one Reducer (not scalable) – Use an advanced partitioning strategy● Yahoo! TeraSort (http://sortbenchmark.org/Yahoo2009.pdf)● Use sampling to calculate data distribution● Implement custom Partitioning according to distribution 54 / 112
    55. 55. Sorting (II) (diagram: with hash partitioning, the keys of partitions 0, 1 and 2 are interleaved across the whole key space; with distribution-aware partitioning, each partition receives a contiguous key range, so the partition outputs concatenate into a totally sorted result) 55 / 112
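A rough sketch of the sampling-based partitioning idea in plain Python (in the spirit of TeraSort, not Hadoop's actual partitioner API; data and names are invented): sample the keys, derive split points, and route each key to the partition that owns its range, so that the sorted partitions concatenate into a total order.

    import bisect
    import random

    def build_split_points(sample, num_partitions):
        # Pick evenly spaced keys from a sorted sample as partition boundaries.
        sample = sorted(sample)
        step = len(sample) // num_partitions
        return [sample[i * step] for i in range(1, num_partitions)]

    def partition(key, split_points):
        # Send the key to the partition whose range contains it.
        return bisect.bisect_right(split_points, key)

    keys = [random.randint(0, 1000) for _ in range(10000)]
    splits = build_split_points(random.sample(keys, 200), num_partitions=4)

    buckets = [[] for _ in range(4)]
    for k in keys:
        buckets[partition(k, splits)].append(k)

    # Each "reducer" sorts its own bucket; concatenating buckets 0..3 yields a total order.
    assert sorted(keys) == [k for b in buckets for k in sorted(b)]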
    56. 56. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 56 / 112
    57. 57. Joins ● Joining two (or more) datasets is a quite common need ● The difficulty is that both datasets may be “too big”; otherwise an in-memory join can be done quite easily just by loading one of the datasets into RAM ● “Big joins” are commonly done “reduce-side”: map each dataset to (key, record), reduce by the common key, and join the records that share each key in the reducer ● The so-called “map-side joins” are more complex and tricky 57 / 112
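A minimal in-memory sketch of a reduce-side join in plain Python (the employee/sales data and field names are invented for the example): both datasets are “mapped” to the join key, and the “reducer” combines everything that shares that key.

    from collections import defaultdict

    employees = [("E1", "Alice"), ("E2", "Bob")]          # (employee_id, name)
    sales     = [("E1", 100), ("E1", 250), ("E2", 75)]    # (employee_id, amount)

    groups = defaultdict(lambda: {"employee": None, "sales": []})
    for emp_id, name in employees:                        # "map" dataset 1
        groups[emp_id]["employee"] = name
    for emp_id, amount in sales:                          # "map" dataset 2
        groups[emp_id]["sales"].append(amount)

    for emp_id, data in groups.items():                   # "reduce": join by key
        print(emp_id, data["employee"], data["sales"])
    # E1 Alice [100, 250]
    # E2 Bob [75]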
    58. 58. Joins: 1-N relation ● Use secondary sorting so that the “one” side of the relation arrives first; otherwise you need to buffer values in memory to perform the join, which does not scale ● Employee (E) / Sales (S) join: if the reducer receives S S E S S S it must buffer in memory; if it receives E S S S S S, no memory is needed 58 / 112
    59. 59. Left – Right – Inner joins ● Join between Employee and Sales:
        reducer(key, values):
          employee = null
          first = first(values)
          rest = rest(values)
          if isEmployee(first):
            employee = first
          if employee == null:
            // right join (S S S S S)
          else if size(rest) == 0:
            // left join (E)
          else:
            // inner join (E S S S S S)
    59 / 112
    60. 60. Design patterns 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 60 / 112
    61. 61. Reconciliation ● Hadoop can be used to “simulate a database” ● For that, we need to: merge the new data state s(t) with the past state s(t-1) using a Join; update rows with the same ID (performing whatever logic); store the result as the next full data state; and rotate states: s(t-1) = s(t), s(t) = s(t+1) 61 / 112
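A tiny sketch of the reconciliation step in plain Python (the record layout and the “newest timestamp wins” rule are only example assumptions): the previous full state and the new data are joined by ID and resolved into the next full state.

    previous_state = {"id1": {"value": 10, "ts": 1}, "id2": {"value": 20, "ts": 1}}
    new_data       = {"id2": {"value": 25, "ts": 2}, "id3": {"value": 30, "ts": 2}}

    def reconcile(old, new):
        merged = {}
        for record_id in old.keys() | new.keys():          # "join" by ID
            candidates = [r for r in (old.get(record_id), new.get(record_id)) if r]
            merged[record_id] = max(candidates, key=lambda r: r["ts"])  # newest wins
        return merged

    next_state = reconcile(previous_state, new_data)       # becomes s(t) for the next run
    print(next_state)
    # id1 is kept unchanged, id2 is updated to the newer version, id3 is added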
    62. 62. Real-life Hadoop projects ● 95% of real-life Hadoop projects are a mixture of the patterns we have just seen ● Example: a vertical search engine. Distributed execution: feed crawl / parse; data reconciliation: merge data by listing ID; join: augment listings with geographical info; … ● Example: payments data stats. Secondary sort: weekly, daily & monthly stats; distributed execution: random updates to a DB ● ... 62 / 112
    63. 63. Real-life MapReduce
    64. 64. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 64 / 112
    65. 65. Data analytics Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Obvious use case for MapReduce.● Examples: calculate unique visits per page. – Top products per month. – Unique clicks per banner. – Etc.● Offline analytics (Batch-oriented). – Online analytics not a good fit for MapReduce. 65 / 112
    66. 66. Data analytics: How it works Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● A batch process that uses all historical data. – Recompute everything always. – Easier to manage and maintain than incremental computation.● A chain of MapReduce steps produce the final output.● There are tools that ease building / maintaining the MapReduce chain: – Hive, Pig, Cascading, Pangool for programming a MapReduce flow easily. – Oozie, Azkaban for connecting existing MapReduce jobs easily. ● Scheduling flows and such. 66 / 112
    67. 67. Data analytics: Difficulties ● Some things are harder to calculate than others ● Calculating unique visits per page: a simple solution in two MapReduce steps, or a more sophisticated one in a single MapReduce step; approximate methods can be used as well ● Calculating the median: the whole dataset has to be sorted, and iterated twice if we don't know the number of elements 67 / 112
    68. 68. Data analytics: Examples Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Gather clicks on pages. – Save (click, page, timestamp) in the HDFS.● A MapReduce job groups by page and counts the number of clicks: ● map: emit(page, click). ● reduce: (page, list<click>) emits (page, totalClicks).● We now have the total number of clicks per page. 68 / 112
    69. 69. Data analytics: Examples (II) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Another MapReduce job groups by day and page and counts the number of clicks: ● map: emit((page, day), click). ● reduce: Same as before.● We now have the total number of clicks per page and day.● These are simple examples, but data analytics can get as sophisticated as you want. – Example: calculate a 10 bar histogram of the distribution of clicks over the hours of the day for each page. 69 / 112
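Both jobs from this example can be mimicked in a few lines of plain Python over an invented click log (illustrative only; a real deployment would run them as MapReduce jobs over data stored in HDFS):

    from collections import Counter

    clicks = [("/home", "2012-05-01"), ("/home", "2012-05-01"),
              ("/about", "2012-05-02"), ("/home", "2012-05-02")]

    clicks_per_page = Counter(page for page, _ in clicks)      # group by page
    clicks_per_page_day = Counter(clicks)                      # group by (page, day)

    print(clicks_per_page)        # Counter({'/home': 3, '/about': 1})
    print(clicks_per_page_day)    # Counter({('/home', '2012-05-01'): 2, ...})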
    70. 70. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 70 / 112
    71. 71. Crawling: Web Crawling ● Web crawling: “A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.” ● Applications: search engines, NLP (sentiment analysis) ● Examples: 71 / 112
    72. 72. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 72 / 112
    73. 73. Crawling: Web Crawling (at scale) ● How to parallelize storage and bandwidth? ● How to deduplicate stored URLs? ● Other complexities: politeness, infinite loops, robots.txt, canonical URLs, pagination, parsing, … ● Relevancy: PageRank 73 / 112
    74. 74. Crawling: Nutch ● What is Nutch? An open-source web-search software project; an Apache project; built on Hadoop, Tika, Lucene and SOLR ● (Brief) history: started in 2002/2003; 2005: MapReduce; 2006: Hadoop; 2006/2007: Tika; 2010: Apache top-level project (TLP) 74 / 112
    75. 75. Crawling: Nutch: How it works Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● “Select, Crawl, Parse, Dedup by URL” loop.● Lucene, SOLR for indexing. – We will see them later.● CrawlDB: Pages are saved in HDFS.● MapReduce makes storage and computing scalable. – Helps in deduplicating pages by URL. – Helps in identifying new pages to crawl. 75 / 112
    76. 76. Crawling: Not-Only Web Crawling Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Custom crawlers: – Tweets. – XML feeds.● Simpler, as we usually don’t need to traverse a tree. – Sometimes only crawling a fixed seed of resources is enough.● Applications – Vertical search engines. – Reputation systems. 76 / 112
    77. 77. Crawling: Example: Crawling tweets at scale Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Use a scalable computing engine for fetching tweets. – Storm is a good fit. – Hadoop can be used as well. ● Tricky usage of MapReduce: Create as many groups as crawlers and embed a Crawler in them.● Save raw feed data (JSON) in HDFS.● MapReduce: Parse JSON tweets.● MapReduce: Deduplicate tweets.● MapReduce: Analyze tweets and perform data analysis. 77 / 112
    78. 78. Crawling: Example: Crawling tweets at scale (diagram: crawled tweets land in HDFS, and successive MapReduce jobs parse, deduplicate and analyze them to produce the results) 78 / 112
    79. 79. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 79 / 112
    80. 80. Full-text indexing: Definitions Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Search engine: – An information retrieval system designed to help find information stored on a computer system.● Inverted index: – Index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. – B-Trees are not inverted indexes!● Stemming.● Relevancy in results. 80 / 112
    81. 81. Full-text indexing: Applications Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Web search engines – Finding relevant pages for a topic● Vertical search engines – Finding jobs by description● Social networks – Finding messages by text● e-Commerce – Finding articles by description● In general, any service or application needing efficient text information retrieval 81 / 112
    82. 82. Full-text indexing (at scale) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Real-time indexing versus batch-indexing: – The first is cool: it is real-time, but it is difficult. We will not address it now. – The second is not real-time, but it is simpler.● How to batch-index a big corpus dataset? – Need a scalable storage, (HDFS).● How to deduplicate documents? – MapReduce to the rescue (like we saw before).● How to generate multiple indexes? – MapReduce can help (we will see how). 82 / 112
    83. 83. Full-text indexing: MapReduce ● MapReduce can be used to generate an inverted index (vertical partitioning vs. horizontal partitioning) ● Example: Map: emit(word, docId); Reduce: emit(word, list<docIds>) ● Quite simple, but what about stop words, stemming, etc.? How do we store the index? ● Better not to reinvent the wheel 83 / 112
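A minimal sketch of that inverted-index job in plain Python (invented documents, no stop words or stemming): the “map” emits (word, docId) and the “reduce” collects the posting list per word.

    from collections import defaultdict

    documents = {"doc1": "hadoop scales batch processing",
                 "doc2": "lucene builds the inverted index",
                 "doc3": "hadoop feeds the index builder"}

    index = defaultdict(set)
    for doc_id, text in documents.items():        # map: emit (word, docId)
        for word in text.lower().split():
            index[word].add(doc_id)

    for word in sorted(index):                    # reduce: (word, list<docIds>)
        print(word, sorted(index[word]))
    # hadoop ['doc1', 'doc3']
    # index ['doc2', 'doc3']
    # ...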
    84. 84. Full-text indexing: Lucene / SOLR ● Lucene: Doug Cutting's project, originally from Nutch; the mainstream open-source implementation of an inverted index; efficient disk allocation, highly performant ● SOLR: the mainstream open-source search server; provides stemming, analyzers, HTTP servlets, etc.; lacks some other desirable properties such as elasticity, real-time indexing and horizontal partitioning (although work is in progress) ● Still the reference technology for creating and serving inverted indexes 84 / 112
    85. 85. Full-text indexing: MapReduce meets SOLR Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● We can use MapReduce for scaling the indexing process.● At the same time, we can use SOLR for creating the resulting index. – SOLR is used as-a-library.● Generated indexes are later deployed to the search servers. 85 / 112
    86. 86. Full-text indexing: Example Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● A vertical job search engine.● Jobs are parsed from crawled feeds and saved in the HDFS.● MapReduce for deduplicating job offers. – map: emit(jobId, job) – reduce (jobId, list<job>) -> emit (jobId, job) ● Retention policy: keep latest job. 86 / 112
    87. 87. Full-text indexing: Example (II) ● MapReduce for augmenting job information (adding geographical information): map1: emit(job.city, job); map2: emit(city, geoInfo); reduce(city, list<job> + geoInfo) → for every job, emit(job, geoInfo) ● MapReduce for distributing the indexing process: map: emit(job.country, job); reduce(job.country, list<job>) → create the index for country “job.country” using SOLR ● Deploy the per-country indexes to the search cluster 87 / 112
    88. 88. Full-text indexing: Example (diagram: XML feeds are parsed and deduplicated in HDFS with MapReduce, augmented with geo info, indexed in a final MapReduce step, and the resulting indexes are deployed to the search cluster) 88 / 112
    89. 89. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 89 / 112
    90. 90. Reputation: Definitions Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● What is reputation?● Reputation in social communities. – eBay, StackOverflow...● Reputation in social media. – Twitter, Facebook...● Why is it important? 90 / 112
    91. 91. Reputation: Relationships Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Modelling relationships is needed for calculating reputation● Graph-like models arise● Usually stored as vertices – A interacts with B – or, A trusts B – or, A → B● Internet-scale graphs can be stored in HDFS – Each vertex in a row – Add needed metadata to vertices: date, etc. 91 / 112
    92. 92. Reputation: MapReduce analysis on vertices ● All people whom “A” interacted with: map: (a, b); reduce: (a, list<b>) ● Essentially, things like PageRank can be implemented very easily ● PageRank is a measure of page relevancy derived from page inlinks, but it can be extrapolated to any kind of authority or trust metric, e.g. people's relevancy in social networks 92 / 112
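A compact sketch of one PageRank-style iteration expressed as a map/reduce pass over the trust edges, in plain Python (the edge list and the usual 0.85 damping factor are example assumptions): each vertex spreads its current score to its out-neighbours, and each vertex sums what it receives.

    from collections import defaultdict

    edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]   # "A trusts B", etc.
    rank = {v: 1.0 for v in {v for e in edges for v in e}}
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    def pagerank_step(rank, damping=0.85):
        contributions = defaultdict(float)
        for src, dst in edges:              # map: emit (dst, rank[src] / out_degree[src])
            contributions[dst] += rank[src] / out_degree[src]
        # reduce: sum the contributions received by each vertex
        return {v: (1 - damping) + damping * contributions[v] for v in rank}

    for _ in range(10):                     # iterative MapReduce: one job per iteration
        rank = pagerank_step(rank)
    print(rank)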
    93. 93. Reputation: Going deeper on graphs Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Friends of friends of friends. – 1 MapReduce step: My friends. – 2 MapReduce steps: Friends of my friends. – 3 MapReduce steps: Friends of friends of my friends.● Iterative MapReduce solves it.● But there are better foundational models such as Google’s Pregel. – Exploiting data locality in graphs. – Apache Giraph. – Apache Hama. 93 / 112
    94. 94. Reputation: Difficulties Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Sometimes multiple MapReduce steps are needed for calculating a final metric. – Because data doesn’t fit in memory.● Intermediate relations need to be calculated. – And later filtered out.● “Polynomial effect”: Calculate all pairwise relations in a set: N*(N-1)/2 – Possible bottleneck. 94 / 112
    95. 95. Reputation: Difficulties: Data imbalance ● When grouping by something, some groups may be much bigger than others, causing “data imbalance” ● Data imbalance is a big problem in MapReduce: some machines finish quickly while one stays busy for hours, so resources are used inefficiently and data processing no longer scales linearly ● The next MapReduce step can't start until the current one has finished 95 / 112
    96. 96. Reputation: Example ● Input: tweets ● Distributed crawling for fetching the tweets; save them in HDFS ● Parse the tweets and define the graph of trust: A trusts B if A follows B ● Execute PageRank over the graph, spreading trust to all vertices 96 / 112
    97. 97. Reputation: Example (diagram: tweets are parsed from HDFS into a graph of trust over users A, B, C and D, and a MapReduce PageRank job over that graph produces the results) 97 / 112
    98. 98. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 98 / 112
    99. 99. Data mining: Text classification Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Document classification – Documents are texts in this case.● Assigns a text one or more categories. – Binary classifiers versus multi-category classifiers – Multi-category classifiers can be built from multiple binary classifiers● Two steps: generating the model and classifying. 99 / 112
    100. 100. Data mining: Text classification: Steps ● Generating the model: the resulting model may or may not fit in memory (let's assume it does); use a large dataset for generating the model; MapReduce helps scale the model generation process (e.g. build multiple binary classifiers and parallelize by classifier, or calculate the conditional probabilities of a Bayesian model and parallelize by word, as in the WordCount example) ● Classifying: MapReduce also helps in classifying a large dataset; if the model fits in memory, parallelize over the documents to classify and load the model in memory; batch classifying outputs the set of documents with their assigned categories 100 / 112
    101. 101. Data mining: Others Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● Mahout library for other data mining problems. – Clustering, logistic regression, etc.● Recommendation algorithms. – Many recommendation algorithms are based on calculating correlations. – Calculating correlations in parallel with MapReduce is easy.● Remember: always in the “batch” or “offline” domain. – Recommendations are reloaded after batch process finishes. 101 / 112
    102. 102. Tuple MapReduce Pere Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining (to appear) (10.7% acceptance rate)
    103. 103. Common MapReduce problems ● Lack of compound records: by default, key and value are considered monolithic entities, which is rarely the case in real life; alleviated by some serialization libraries (Thrift, Protocol Buffers) ● Sorting within a group: the MapReduce foundation does not support it, although MapReduce implementations overcome the problem with “tricks” ● Joins: need compound records and sorting within a group to be implemented; not directly supported by MapReduce 103 / 112
    104. 104. Tuple MapReduce: rationale ● Compound records, sorting within a group and joins are design patterns that arise in most MapReduce applications... ● ...but MapReduce does not make their implementation easy ● An evolution of the MapReduce paradigm is needed to cover these design patterns: Tuple MapReduce 104 / 112
    105. 105. Tuple MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● We extended the classic (key, value) MapReduce model● Use n-sized Tuples instead of (key, value)● Define a Tuple-based M/R – Covering most common use cases 105 / 112
    106. 106. Group by / Sort by Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● You can think of a M/R as a SELECT … GROUP BY …● With Tuple MapReduce, you simply “group by” a subset of Tuple fields – Easier, more intuitive than having objects for each kind of Key you want to group by.● Alternatively, you may “sort by” a wider subset – Hiding all complex logic behind secondary sort 106 / 112
    107. 107. Tuple-Join MapReduce ● Extend the whole idea to allow for easier joins: Tuple1: (a, b, c, d); Tuple2: (a, b, f, g, h); join by (a, b) ● Formally speaking: (formula shown on the slide) 107 / 112
    108. 108. Pangool http://pangool.net108 / 112 Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
    109. 109. Pangool: What? ● A better, simpler, powerful API replacement for Hadoop's API ● What do we mean by API replacement? APIs on top of Hadoop (Pig, Hive, Cascading) always come with a tradeoff: paradigms other than MapReduce are not always the best choice, and they bring performance restrictions ● Pangool is still MapReduce, low-level and high performing, yet a lot simpler! 109 / 112
    110. 110. Pangool: Why? ● Hadoop has a steep learning curve ● The default API is too low-level ● Making things efficient is harsh (binary comparisons, ser/de...) ● There are some common patterns (joins, secondary sorting...): how can we make them simpler without losing flexibility and power? 110 / 112
    111. 111. Pangool API ● Schema, Tuple, … ● Reducers, Mappers, etc. are instances instead of static classes, which makes them easier to configure: new MyReducer(5, 2.0); ● Still tied to Hadoop's particularities in some ways (NullWritable, etc.) ● Let's see an example 111 / 112
    112. 112. Thanks!! Iván de Prado Alonso, Pere Ferrera Bertran
