Big data, map reduce and beyond
  • The Guardian Innovation Awards. One has to admit that Swiss army knives are useful … Who hasn't needed a magnifying glass in an emergency! Hadoop is like a Swiss army knife: very useful, but you sweat blood before you manage to pull out the accessory you want.
  • Distributed: it harnesses the power of several machines in a cluster. Large data sets: Hadoop is not appropriate for small data sets. Simple programming model: Hadoop is not just a framework, it is a new paradigm for distributed programming. Hadoop rests mainly on two modules: a distributed file system, for storing large volumes of data, and a new programming paradigm, MapReduce. Let's look at them one by one.
  • Transcript

    • 1. Big Data, MapReduce and beyond Iván de Prado Alonso // @ivanprado Pere Ferrera Bertran // @ferrerabertran @datasalt
    • 2. Outline
      Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
      1. Big Data. What and why
      2. MapReduce & Hadoop
      3. MapReduce Design Patterns
      4. Real-life MapReduce
         1. Data analytics
         2. Crawling
         3. Full-text indexing
         4. Reputation systems
         5. Data Mining
      5. Tuple MapReduce & Pangool
    • 3. Big Data: What and why
    • 4. In the past...
      ● Data and computation fit on one monolithic machine
      ● Monolithic databases: RDBMS
      ● Scalability:
        – Vertical: buy better hardware
      ● Distributed systems
        – Not very common
        – Logic centric: data moves to where the logic is
      ● Distributed storage: SAN
    • 5. Distributed systems are hard
      ● Building distributed systems is hard
        – If you can scale vertically at a reasonable cost, why deal with the complexity of distributed systems?
      ● But circumstances are changing:
        – Big Data
      ● Big data refers to massive amounts of data that are difficult to analyze and handle using common database management tools
    • 6. BIG DATA “MAC”
    • 7. Big Data
      ● Data is the new bottleneck
        – Web data
          ● Web pages
          ● Interaction logs
        – Social network data
        – Mobile devices
        – Data generated by sensors
      ● Old systems and techniques are not appropriate
      ● A new approach is needed
    • 8. Big Data project parts: Acquiring, Processing, Serving
    • 9. Acquiring
      ● Gathering/receiving/storing data from sources
      ● Many kinds of sources
        – Internet
        – Sensors
        – User behavior
        – Mobile devices
        – Health care data
        – Banking data
        – Social networks
        – …
    • 10. Processing
      ● Data is present in the system (acquired)
      ● This step is responsible for extracting value from data
        – Eliminate duplicates
        – Infer relations
        – Calculate statistics
        – Correlate information
        – Ensure quality
        – Generate recommendations
        – …
    • 11. Serving
      ● In most cases, some interface has to be provided to access the processed information
      ● Possibilities
        – Big Data / not Big Data
        – Real-time access to results / non-real-time access
      ● Some examples:
        – Search engine → inverted index
        – Banking data → relational database
        – Social network → NoSQL database
    • 12. Big Data system types
      ● Offline
        – Latency is not a problem
      ● Online
        – Response immediacy is important
      ● Mixed
        – Online behavior, but internally it is a mixture of two systems
          ● One online
          ● One offline
      Offline: MapReduce, Hadoop, Distributed RDBMS
      Online: NoSQL, Search engines
    • 13. Big Data system types II [Diagram: arrangement of the A, P and S parts in offline, online and mixed systems]
    • 14. MapReduce & Hadoop
    • 15. “Swiss army knife of the 21st century”, Media Guardian Innovation Awards. http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
    • 16. History
      ● 2004-2006
        – GFS and MapReduce papers published by Google
        – Doug Cutting implements an open source version for Nutch
      ● 2006-2008
        – Hadoop project becomes independent from Nutch
        – Web scale reached in 2008
      ● 2008-now
        – Hadoop becomes popular and is commercially exploited
      Source: Hadoop: a brief history. Doug Cutting
    • 17. Hadoop: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” (from the Apache Hadoop page)
    • 18. Main ideas
      ● Distributed
        – Distributed storage
        – Distributed computation platform
      ● Built to be fault tolerant
      ● Shared-nothing architecture
      ● The programmer is isolated from the difficulties of distributed systems
        – By providing simple programming primitives
    • 19. Hadoop Distributed File System (HDFS)
      ● Distributed
        – Aggregates the individual storage of each node
      ● Files formed by blocks
        – Typically 64 or 128 MB (configurable)
        – Stored in the OS filesystem: XFS, ext3, etc.
      ● Fault tolerant
        – Blocks replicated more than once
    • 20. How files are stored [Diagram: the NameNode records that Data.txt is made of blocks 1 to 4 and which DataNodes (DN1 to DN4) hold a copy of each block; each block appears on more than one DataNode]
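The block and replica bookkeeping described in slides 19 and 20 can be sketched as follows. This is only an illustration: `place_blocks` is a hypothetical helper, and the round-robin placement is an assumption made for brevity, not the policy HDFS actually uses.

```python
def place_blocks(file_size, block_size, datanodes, replication):
    """Return {block_number: [datanodes holding a replica]} for one file."""
    # Ceiling division: a final partial block still occupies a block.
    num_blocks = -(-file_size // block_size)
    placement = {}
    for block in range(num_blocks):
        # Assumption: replicas go to consecutive nodes, round-robin.
        placement[block + 1] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# A 200 MB file with 64 MB blocks, 4 DataNodes and 2 replicas per block.
placement = place_blocks(200, 64, ["DN1", "DN2", "DN3", "DN4"], 2)
print(placement)
```

Losing any single DataNode in this layout still leaves one copy of every block, which is the point of replicating blocks.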
    • 21. MapReduce
      ● MapReduce is the abstraction behind Hadoop
      ● The unit of execution is the Job
      ● A Job has
        – An input
        – An output
        – A map function
        – A reduce function
      ● Input and output are sequences of key/value pairs
      ● The map and reduce functions are provided by the developer
        – The execution is distributed and parallelized by Hadoop
    • 22. Job phases
      ● Two different phases: mapping and reducing
      ● Mapping phase
        – The map function is applied to the input data
          ● Intermediate data is generated
      ● Reducing phase
        – The reduce function is applied to the intermediate data
          ● Final output is generated
    • 23. MapReduce
      ● Two functions (Map & Reduce)
        – Map(u, v) : [w, x]*
        – Reduce(w, x*) : [y, z]*
      ● Example: word count
        – Map([document, null]) -> [word, 1]*
        – Reduce(word, 1*) -> [word, total]
      ● MapReduce & SQL
        – SELECT word, count(*) GROUP BY word
      ● Distributed execution in a cluster
        – Horizontal scalability
    • 24. Word Count
      Input lines: “This is a line”, “Also this”
      Map:
        map(“This is a line”) = (this, 1), (is, 1), (a, 1), (line, 1)
        map(“Also this”) = (also, 1), (this, 1)
      Reduce:
        reduce(a, {1}) = (a, 1)
        reduce(also, {1}) = (also, 1)
        reduce(is, {1}) = (is, 1)
        reduce(line, {1}) = (line, 1)
        reduce(this, {1, 1}) = (this, 2)
      Result: (a, 1), (also, 1), (is, 1), (line, 1), (this, 2)
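The word count of slides 23 and 24 can be reproduced in memory with a few lines of Python. This is a sketch of the programming model only: `run_job` is a hypothetical single-process stand-in for what Hadoop does across a cluster, and lowercasing the words is an assumption taken from the slide's output.

```python
from collections import defaultdict


def map_fn(_key, line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    # Sum the ones emitted for this word.
    yield word, sum(counts)


def run_job(records, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key, then reduce in key order.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = []
    for k in sorted(groups):
        output.extend(reduce_fn(k, groups[k]))
    return output


result = run_job([(0, "This is a line"), (1, "Also this")], map_fn, reduce_fn)
print(result)
```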
    • 25. Map examples
      ● Swap Mapper
        – Swaps key and value
          map(key, value):
            emit (value, key)
      ● Split Key Mapper
        – Splits the key into words and emits a pair for each word
          map(key, value):
            words = key.split(“ “)
            for each word in words:
              emit (word, value)
    • 26. Map examples (II)
      ● Filter Mapper
        – Filters out some records
          map(key, value):
            if (key <> “the”):
              emit (key, value)
      ● Key/value concatenation mapper
        – Concatenates the key and the value in the key
          map(key, value):
            emit (key + “ “ + value, null)
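The four mappers from slides 25 and 26 translate almost directly into Python generators, with `yield` playing the role of `emit`. The function names are illustrative, not part of any Hadoop API.

```python
def swap_mapper(key, value):
    # Swaps key and value.
    yield value, key


def split_key_mapper(key, value):
    # One output pair per word in the key.
    for word in key.split(" "):
        yield word, value


def filter_mapper(key, value):
    # Drops records whose key is "the".
    if key != "the":
        yield key, value


def concat_mapper(key, value):
    # Moves the value into the key; the output value is empty.
    yield f"{key} {value}", None


swapped = list(swap_mapper("a", 1))
split = list(split_key_mapper("big data", 7))
filtered = list(filter_mapper("the", 1)) + list(filter_mapper("cat", 2))
joined = list(concat_mapper("k", "v"))
```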
    • 27. Reduce examples
      ● Count reducer
        – Counts the number of elements for each key
          reduce(key, values):
            count = 0
            for each value in values:
              count++
            emit(key, count)
      ● Average reducer
        – Computes the average value for each key
          reduce(key, values):
            count = 0
            total = 0
            for each value in values:
              count++
              total += value
            emit(key, total / count)
    • 28. Reduce examples (II)
      ● Keep first reducer
        – Keeps the first key/value input pair
          reduce(key, values):
            emit(key, first(values))
      ● Value concatenation reducer
        – Concatenates the values in one string
          reduce(key, values):
            result = “”
            for each value in values:
              result += “ “ + value
            emit(key, result)
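The reducers of slides 27 and 28 can be sketched the same way. The names are illustrative; note that `concat_reducer` uses `" ".join`, so unlike the pseudocode it does not produce a leading space.

```python
def count_reducer(key, values):
    # Number of values seen for the key.
    yield key, sum(1 for _ in values)


def average_reducer(key, values):
    # A reducer is only called for keys that have at least one value.
    count = 0
    total = 0
    for value in values:
        count += 1
        total += value
    yield key, total / count


def keep_first_reducer(key, values):
    # Keeps only the first value of the group.
    yield key, next(iter(values))


def concat_reducer(key, values):
    # Joins all values into one space-separated string.
    yield key, " ".join(str(v) for v in values)


counted = list(count_reducer("k", [5, 5, 5]))
averaged = list(average_reducer("k", [1, 2, 6]))
first = list(keep_first_reducer("k", ["x", "y"]))
concatenated = list(concat_reducer("k", ["x", "y"]))
```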
    • 29. Identity map and reduce
      ● The identity functions are those that keep the input unchanged
        – Map identity
          map(key, value):
            emit (key, value)
        – Reduce identity
          reduce(key, values):
            for each value in values:
              emit (key, value)
    • 30. Putting it all together
      map(k, v) → [w, x]*
      reduce(w, [x]+) → [y, z]*
      ● Job flow:
        – The mapper generates key/value pairs
        – These pairs are grouped by key
        – Hadoop calls the reduce function for each group
        – The output of the reduce function is the final Job output
      ● Hadoop will distribute the work
        – Different nodes in the cluster will process the data in parallel
    • 31. Job Execution [Diagram: input splits (blocks) → map tasks → intermediate data → reduce tasks → output, spread over Node 1 and Node 2]
    • 32. Job Execution (II)
      ● Key/value pairs are sorted by key in the shuffle & sort phase
        – This is needed in order to group records by key when calling the reducer
        – It also means that calls to the reduce function are made in key order
          ● The reduce function for key “A” is always called before the reduce function for key “B” within the same reduce task
      ● Reducers start downloading data from the mappers as soon as possible
        – In order to reduce the time spent in the shuffle & sort phase
        – The number of reducers can be configured by the programmer
    • 33. Partial Sort Job
      ● A job formed with the identity map and the identity reducer
        – It just sorts the data by key within each reducer
      Input file: D B A B C D E A
      Map 1 intermediate data: D A (to Reduce 1), B B (to Reduce 2)
      Map 2 intermediate data: D A (to Reduce 1), C E (to Reduce 2)
      Output files: A A D D (Reduce 1), B B C E (Reduce 2)
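A partial sort can be simulated by partitioning keys across reducers and sorting each partition independently. In this sketch `partition` is a deterministic stand-in for a hash partitioner (an assumption), so the assignment of keys to reducers differs from the one drawn on the slide.

```python
def partition(key, num_reducers):
    # Assumption: a simple deterministic hash, not Hadoop's own partitioner.
    return sum(key.encode()) % num_reducers


def partial_sort(keys, num_reducers):
    # Identity map: every key is passed through to its reducer unchanged.
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    # The shuffle sorts each reducer's input; the identity reducer writes it out.
    return [sorted(bucket) for bucket in buckets]


outputs = partial_sort(list("DBABCDEA"), 2)
print(outputs)
```

Each output file is sorted, but concatenating the files does not give a globally sorted result, which is why the slide calls the sort partial.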
    • 34. Input Splits
      ● Each map task processes one input split
        – The map task starts processing at the first complete record, and finishes by processing the record crossed by the rightmost boundary
      [Diagram: a file of records divided into Input Splits 1 to 4, each read by Map Task 1 to 4]
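The boundary rule on slide 34 can be illustrated with newline-terminated records and byte offsets. This is a simplified sketch under the assumption that records are lines; `read_split` is a hypothetical function, not Hadoop's record reader.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the records owned by the split covering bytes [start, end)."""
    pos = start
    if start > 0:
        # Skip the record that began in the previous split, unless a record
        # starts exactly at `start` (the byte before it is a newline).
        newline = data.find(b"\n", start - 1)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # A record that begins inside the split is read to its end,
    # even when it crosses the right boundary.
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop].decode())
        pos = stop + 1
    return records


data = b"alpha\nbeta\ngamma\ndelta\n"
splits = [(0, 8), (8, 16), (16, 23)]
per_split = [read_split(data, s, e) for s, e in splits]
print(per_split)
```

With this rule every record is processed by exactly one map task, even though the split boundaries fall in the middle of records.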
    • 35. Combiner
      ● Intermediate data goes from the map tasks to the reduce tasks through the network
        – The network can be saturated
      ● Combiners can be used to reduce the amount of data sent to the reducers
        – When the operation is commutative and associative
      ● A combiner is a function similar to the reducer
        – But it is executed in the map task, just after all the mapping has been done
      ● Combiners cannot have side effects
        – Because Hadoop can decide whether to execute them or not
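The effect of a combiner can be seen by counting how many pairs would cross the network with and without it. The sketch below assumes a word-count job and treats each inner list of `splits` as the input of one map task; all names are illustrative.

```python
from collections import Counter


def map_task(lines):
    # Raw mapper output: one (word, 1) pair per word occurrence.
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    # Local pre-aggregation inside the map task. Addition is commutative
    # and associative, so the final result is unchanged.
    local = Counter()
    for word, n in pairs:
        local[word] += n
    return sorted(local.items())


def reduce_all(shuffled):
    # Final sums per word, as the reducers would compute them.
    totals = Counter()
    for word, n in shuffled:
        totals[word] += n
    return dict(totals)


splits = [["a b a", "a"], ["b a"]]
raw = [pair for split in splits for pair in map_task(split)]
combined = [pair for split in splits for pair in combine(map_task(split))]
print(len(raw), len(combined), reduce_all(combined))
```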
    • 36. Design Patterns
    • 37. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 38. Filtering
      ● We process the input file in parallel with Hadoop and emit a smaller dataset in the end
      [Diagram: input data → if(condition) { emit(); } → output data]
    • 39. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 40. Secondary sorting
      ● Receive reducer values in a specific order
      ● Moving averages:
        – Secondary sort by timestamp
        – Fill an in-memory window and compute the average
      ● Top N items in a group:
        – Secondary sort by <X>
        – Emit the first N elements in a group
      ● Useful, yet quite difficult to implement in Hadoop: it involves the Key, the Sort Comparator, the Group Comparator and the Partitioner
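The "top N items in a group" case can be sketched with a composite sort key. Here the sort over `(group, -score)` plays the role of the sort comparator and `groupby` on the group alone plays the role of the group comparator; the record layout `(group, score, item)` is an assumption for illustration.

```python
from itertools import groupby, islice


def top_n_per_group(records, n):
    # Sort comparator: group ascending, then score descending.
    ordered = sorted(records, key=lambda r: (r[0], -r[1]))
    result = {}
    # Group comparator: only the group part of the composite key.
    for group, rows in groupby(ordered, key=lambda r: r[0]):
        # Values arrive already ordered, so the first n are the top n.
        result[group] = [item for _, _, item in islice(rows, n)]
    return result


records = [("u1", 5, "a"), ("u1", 9, "b"), ("u1", 7, "c"), ("u2", 1, "d")]
top2 = top_n_per_group(records, 2)
print(top2)
```

Because the values reach the reducer in order, it never has to hold a whole group in memory.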
    • 41. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 42. Distributed execution without Hadoop
      ● Distributed queue
        – A common queue is needed to coordinate and assign work
      ● Distributed workers
        – Consumers working on each node, getting work from the queue
      ● Problems:
        – Difficult to coordinate
          ● Failover
          ● Losing messages
          ● Load balancing
        – The queue must scale
    • 43. Distributed execution with Hadoop
      ● Map-only Jobs (Map 1, Map 2, … Map n).
      ● Use Hadoop just for the sake of “parallelizing something”.
      ● Anything that doesn't involve a “group by” (no shuffle/reducer)
      ● Examples:
        – Text categorization
        – Filtering
        – Crawling
        – Updating a DB
        – Distributed grep
      ● NLineInputFormat can be handy for that.
    • 44. Disadvantages
      ● Work is done in batches
        – And the distribution is probably not even
          ● Some resources are wasted
      ● There are some tricks to alleviate the problem
        – Task timeout + saving the remaining work for the next execution
    • 45. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 46. Computing statistics (I)
      ● Count, Sum, Average, Std. Dev...
      ● Aggregated by something
      ● Recall SQL: select user, count(clicks) … group by user
      Input: (user, click), (user, click)
      Map: emit (user, click)
      Reduce by user: count values
      Output: (user, count(click))
    • 47. Computing statistics (II)
      ● With sum(), avg(), etc., Combiners are often needed
      ● Imagine a user performed 3 million clicks
        – Then one reducer will receive 3 million records
        – This reducer will be the bottleneck of the Job. Everyone needs to wait for it to count 3 million things.
      ● Solution: perform partial counts in a Combiner
      ● The Combiner is executed after the Mapper, before shuffling.
    • 48. Computing statistics (III)
      ● Using a Combiner:
        Map: (user, click), (user, click)
        Combine: (user, count(click)), (user, count(click))
        Reduce: (user, sum(count(click)))
      ● For each group, the reducer aggregates N counts in the worst case! (N = #mappers)
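Counts and sums can be combined as they are, but an average needs a little care: the combiner has to forward a partial `(sum, count)` pair rather than a partial average. The following sketch, with illustrative names and made-up numbers, shows the difference.

```python
def combine_partial(values):
    # One (sum, count) pair per map task instead of one record per value.
    return (sum(values), len(values))


def reduce_average(partials):
    # Merging partial sums and counts gives the exact overall average.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Values of one key as seen by two different map tasks.
task_values = [[1, 2, 3], [10]]
partials = [combine_partial(values) for values in task_values]
average = reduce_average(partials)

# Averaging the per-task averages would be wrong when task sizes differ.
naive = sum(sum(v) / len(v) for v in task_values) / len(task_values)
print(average, naive)
```

The reducer receives at most one pair per map task for each key, which matches the worst case of N partial values noted on slide 48.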
    • 49. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 50. Distinct
      ● How to calculate distinct count(something) group by X?
      ● It is somewhat easy (2 M/Rs):
        M/R 1 (eliminates duplicates):
        – emit ({X, something}, null)
        – so rows are grouped by ({X, something})
        – In the reducer, just emit the first (ignore duplicates)
        M/R 2 (groups by X and counts):
        – For each input X, something → emit (X, 1)
        – group by (X)
        – The reducer counts incoming values
    • 51. Distinct: example [Diagram: M/R 1 collapses repeated (X, something) pairs into (X1, s1), (X1, s2), (X2, s1), (X2, s3); M/R 2 counts them per X, giving X1 → 2 and X2 → 2]
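The two-job distinct count can be sketched as two small functions, one per job. The input list below is made up for illustration (the exact rows of the slide's diagram are not recoverable), but it yields the same result as the slide: X1 → 2 and X2 → 2.

```python
from collections import Counter


def job1_dedup(pairs):
    # M/R 1: group by the whole (x, something) key, keep one row per group.
    return sorted(set(pairs))


def job2_count(unique_pairs):
    # M/R 2: emit (x, 1) for each unique pair and sum by x.
    counts = Counter()
    for x, _something in unique_pairs:
        counts[x] += 1
    return dict(counts)


pairs = [
    ("X1", "s1"), ("X1", "s2"), ("X1", "s1"), ("X1", "s2"),
    ("X2", "s1"), ("X2", "s3"), ("X2", "s1"),
]
unique = job1_dedup(pairs)
distinct_counts = job2_count(unique)
print(distinct_counts)
```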
    • 52. Distinct: Secondary sort
      ● We can calculate distinct in only one Job
      ● Using Secondary Sorting M/R:
        – emit ({X, something}, null)
        – group by (X), secondary sort by (something)
        – The reducer: iterate, count & emit “something, count” when “something” changes. Reset the counter on each “something” change.
      ● Need to use a Combiner to eliminate duplicates (otherwise the reducer would receive too many records).
      ● distinct count() is more parallelizable with 2 Jobs than with 1!
    • 53. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 54. Sorting
      ● We have seen how sorting is (partially) inherent in Hadoop.
      ● But if we want “pure” sorting:
        – Use one Reducer (not scalable)
        – Use an advanced partitioning strategy
      ● Yahoo! TeraSort (http://sortbenchmark.org/Yahoo2009.pdf)
      ● Use sampling to calculate the data distribution
      ● Implement custom Partitioning according to the distribution
    • 55. Sorting (II)
      ● Hash partitioning: consecutive keys go to reducers 0 1 2 0 1 2 0 1 2 ...
      ● Distribution-aware partitioning: contiguous key ranges go to reducers 0, 1 and 2
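Distribution-aware partitioning can be sketched by sampling the keys, taking quantiles of the sample as cut points, and routing each key to the reducer that owns its range. This is a minimal illustration of the idea, not the TeraSort implementation; the function names, sample size and seed are assumptions.

```python
import bisect
import random


def make_range_partitioner(sample, num_reducers):
    # Cut points are quantiles of the sorted sample.
    ordered = sorted(sample)
    cuts = [ordered[len(ordered) * i // num_reducers] for i in range(1, num_reducers)]
    # Keys below cuts[0] go to reducer 0, the next range to reducer 1, etc.
    return lambda key: bisect.bisect_right(cuts, key)


def total_sort(keys, num_reducers, sample_size=100, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    partition = make_range_partitioner(sample, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key)].append(key)
    # Each reducer sorts its own range of keys.
    return [sorted(bucket) for bucket in buckets]


keys = list(range(1000))
random.Random(1).shuffle(keys)
outputs = total_sort(keys, 3)
flattened = [key for bucket in outputs for key in bucket]
```

Because the partitioner is monotonic in the key, concatenating the reducer outputs in order gives a globally sorted result.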
    • 56. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 57. Joins
      ● Joining two (or more) datasets is quite a common need.
      ● The difficulty is that both datasets may be “too big”
        – Otherwise, an in-memory join can be done quite easily just by reading one of the datasets into RAM.
      ● “Big joins” are commonly done “reduce-side”:
        Map dataset 1: (K1, d11), (K2, d12)
        Map dataset 2: (K1, d21), (K2, d22)
        Reduce by common key (K1, K2, ...)
        Reduce (join): K1 → d11, d21; K2 → d12, d22
      ● The so-called “map-side joins” are more complex and tricky.
    • 58. Joins: 1-N relation
      ● Use secondary sorting to get the “one” side of the relation first
        – Otherwise you need to use memory to perform the join
          ● Does not scale
      ● Employee (E) – Sales (S) join
        – SSESSS: you need to use memory
        – ESSSSS: memory not needed
    • 59. Left – Right – Inner joins
      ● Join between Employee and Sales
        reducer(key, values):
          employee = null
          first = first(values)
          rest = rest(values)
          if isEmployee(first):
            employee = first
          if employee = null:
            // right join (SSSSS)
          else if size(rest) = 0:
            // left join (E)
          else:
            // inner join (ESSSSS)
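A reduce-side join with secondary sort can be sketched by tagging each record with its source so that the employee sorts before its sales. The sketch assumes at most one employee per id and returns the rows of a full outer join; the tags, tuple layouts and the `join` function are illustrative.

```python
from itertools import groupby

EMPLOYEE, SALE = 0, 1  # the tag makes the employee sort before its sales


def join(employees, sales):
    # Map side: key every record by (employee id, source tag).
    tagged = [((emp_id, EMPLOYEE), name) for emp_id, name in employees]
    tagged += [((emp_id, SALE), amount) for emp_id, amount in sales]
    tagged.sort(key=lambda kv: kv[0])  # secondary sort by tag within each id
    rows = []
    # Reduce side: group by employee id only and stream through each group.
    for emp_id, group in groupby(tagged, key=lambda kv: kv[0][0]):
        (_, tag), first = next(group)
        if tag == EMPLOYEE:
            matched = False
            for _, amount in group:
                matched = True
                rows.append((emp_id, first, amount))  # inner join row
            if not matched:
                rows.append((emp_id, first, None))  # employee without sales
        else:
            # No employee arrived first, so these sales have no match.
            rows.append((emp_id, None, first))
            rows.extend((emp_id, None, amount) for _, amount in group)
    return rows


employees = [(1, "Ann"), (2, "Bob")]
sales = [(1, 100), (1, 50), (3, 70)]
joined_rows = join(employees, sales)
print(joined_rows)
```

Only the single employee record is held while the sales stream past, which is the "memory not needed" case of slide 58.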
    • 60. Design patterns: 1. Filtering, 2. Secondary sorting, 3. Distributed execution, 4. Computing statistics, 5. Count distinct, 6. Sorting, 7. Joins, 8. Reconciliation
    • 61. Reconciliation
      ● Hadoop can be used to “simulate a database”.
      ● For that, we need to:
        – Merge data state s(t) with past state s(t-1) using a Join.
        – Update rows with the same ID (performing whatever logic).
        – Store the result as the next full data state.
        – Rotate states:
          ● s(t-1) = s(t)
          ● s(t) = s(t + 1)
      [Diagram: s(t-1) and s(t) → M/R → s(t+1)]
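The reconciliation step amounts to a join by ID followed by a merge. The sketch below does it in memory with dictionaries; `reconcile` and the `merge` callback are hypothetical names, and the merge logic shown (new fields overwrite old ones) is just one possible choice.

```python
def reconcile(previous_state, new_rows, merge):
    # Start from the previous full state, keyed by row ID.
    next_state = dict(previous_state)
    for row_id, row in new_rows:
        old = next_state.get(row_id)
        # Rows with the same ID are merged; unseen IDs are inserted.
        next_state[row_id] = row if old is None else merge(old, row)
    return next_state


previous_state = {1: {"price": 10, "city": "Madrid"}, 2: {"price": 5}}
new_rows = [(2, {"price": 7}), (3, {"price": 1})]
next_state = reconcile(previous_state, new_rows, lambda old, new: {**old, **new})
print(next_state)
```

The returned dictionary is the next full state, which would be stored and used as the past state on the following run.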
    • 62. Real-life Hadoop projects
      ● 95% of real-life Hadoop projects are a mixture of the patterns we just saw.
      ● Example: a vertical search engine.
        – Distributed execution: feed crawl / parse
        – Data reconciliation: merge data by listing ID
        – Join: augment listings with geographical info
        – …
      ● Example: payments data stats.
        – Secondary sort: weekly, daily & monthly stats
        – Distributed execution: random updates to a DB
      ● ...
    • 63. Real-life MapReduce
    • 64. Real-life MapReduce: 1. Data analytics, 2. Crawling, 3. Full-text indexing, 4. Reputation systems, 5. Data Mining
    • 65. Data analytics
      ● Obvious use case for MapReduce.
      ● Examples: calculate unique visits per page.
        – Top products per month.
        – Unique clicks per banner.
        – Etc.
      ● Offline analytics (batch-oriented).
        – Online analytics are not a good fit for MapReduce.
    • 66. Data analytics: How it works
      ● A batch process that uses all historical data.
        – Always recompute everything.
        – Easier to manage and maintain than incremental computation.
      ● A chain of MapReduce steps produces the final output.
      ● There are tools that ease building / maintaining the MapReduce chain:
        – Hive, Pig, Cascading, Pangool for programming a MapReduce flow easily.
        – Oozie, Azkaban for connecting existing MapReduce jobs easily.
          ● Scheduling flows and such.
    • 67. Data analytics: Difficulties
      ● Some things are harder to calculate than others.
      ● Calculating unique visits per page.
        – A simple solution in two MapReduce steps or a more sophisticated one in a single MapReduce step.
        – Approximate methods can be used as well.
      ● Calculating the median.
        – Need to sort the whole dataset and iterate twice if we don’t know the number of elements.
    • 68. Data analytics: Examples
      ● Gather clicks on pages.
        – Save (click, page, timestamp) in HDFS.
      ● A MapReduce job groups by page and counts the number of clicks:
        ● map: emit(page, click).
        ● reduce: (page, list<click>) emits (page, totalClicks).
      ● We now have the total number of clicks per page.
    • 69. Data analytics: Examples (II)
      ● Another MapReduce job groups by day and page and counts the number of clicks:
        ● map: emit((page, day), click).
        ● reduce: Same as before.
      ● We now have the total number of clicks per page and day.
      ● These are simple examples, but data analytics can get as sophisticated as you want.
        – Example: calculate a 10-bar histogram of the distribution of clicks over the hours of the day for each page.
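The per-page-and-day count from slides 68 and 69 can be sketched with a composite `(page, day)` key. The sketch assumes clicks are stored as `(page, timestamp)` with timestamps in seconds since the Unix epoch and days taken in UTC; both are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def clicks_per_page_and_day(clicks):
    counts = Counter()
    for page, timestamp in clicks:
        # Map: derive the day and emit ((page, day), click).
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
        # Reduce: count the clicks of each (page, day) group.
        counts[(page, day)] += 1
    return dict(counts)


clicks = [("/home", 0), ("/home", 3600), ("/home", 86400), ("/about", 10)]
per_day = clicks_per_page_and_day(clicks)
print(per_day)
```

Dropping `day` from the key gives the total clicks per page of slide 68.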
    • 70. Real-life MapReduce: 1. Data analytics, 2. Crawling, 3. Full-text indexing, 4. Reputation systems, 5. Data Mining
    • 71. Crawling: Web Crawling
      ● Web Crawling:
        – “A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.”
      ● Applications:
        – Search engines.
        – NLP (sentiment analysis).
      ● Examples:
    • 72. Real-life MapReduce: 1. Data analytics, 2. Crawling, 3. Full-text indexing, 4. Reputation systems, 5. Data Mining
    • 73. Crawling: Web Crawling (at scale)
      ● How to parallelize storage and bandwidth?
      ● How to deduplicate stored URLs?
      ● Other complexities: politeness, infinite loops, robots.txt, canonical URLs, pagination, parsing, …
      ● Relevancy: PageRank.
    • 74. Crawling: Nutch
      ● What is Nutch?
        – Open source web-search software project.
        – Apache project.
        – Hadoop, Tika, Lucene, SOLR.
      ● (Brief) history
        – Started in 2002/2003
        – 2005: MapReduce
        – 2006: Hadoop
        – 2006/2007: Tika
        – 2010: TLP Apache project
    • 75. Crawling: Nutch: How it works Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.● “Select, Crawl, Parse, Dedup by URL” loop.● Lucene, SOLR for indexing. – We will see them later.● CrawlDB: Pages are saved in HDFS.● MapReduce makes storage and computing scalable. – Helps in deduplicating pages by URL. – Helps in identifying new pages to crawl. 75 / 112
    • 76. Crawling: Not-Only Web Crawling ● Custom crawlers: – Tweets. – XML feeds. ● Simpler, as we usually don’t need to traverse a tree. – Sometimes only crawling a fixed seed of resources is enough. ● Applications – Vertical search engines. – Reputation systems.
    • 77. Crawling: Example: Crawling tweets at scale ● Use a scalable computing engine for fetching tweets. – Storm is a good fit. – Hadoop can be used as well. ● Tricky usage of MapReduce: Create as many groups as crawlers and embed a Crawler in them. ● Save raw feed data (JSON) in HDFS. ● MapReduce: Parse JSON tweets. ● MapReduce: Deduplicate tweets. ● MapReduce: Analyze tweets and perform data analysis.
    • 78. Crawling: Example: Crawling tweets at scale [Diagram: HDFS, M/R Parse, M/R Dedup, M/R Analysis, Results]
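The parse and deduplicate steps of the tweet pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `run_mapreduce` helper stands in for Hadoop, the `id` field name is assumed, and parsing and deduplication are folded into a single pass for brevity, whereas the slides describe them as separate MapReduce jobs.

```python
import json
from collections import defaultdict


def map_parse(raw_line):
    """Map: parse one raw JSON line and key the tweet by its id."""
    try:
        tweet = json.loads(raw_line)
    except json.JSONDecodeError:
        return  # skip malformed records
    if isinstance(tweet, dict) and "id" in tweet:
        yield tweet["id"], tweet


def reduce_dedup(tweet_id, tweets):
    """Reduce: every copy of a tweet shares one key, so keep a single copy."""
    yield tweet_id, tweets[0]


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for a MapReduce job (not Hadoop)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


raw_feed = [
    '{"id": 1, "text": "hello"}',
    '{"id": 2, "text": "big data"}',
    '{"id": 1, "text": "hello"}',  # duplicate fetched twice
    'not json',                     # malformed line
]
unique_tweets = run_mapreduce(raw_feed, map_parse, reduce_dedup)
```

Grouping by tweet id is what makes deduplication trivial: the shuffle phase brings all copies together, so the reducer only has to pick one.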
    • 80. Full-text indexing: Definitions ● Search engine: – An information retrieval system designed to help find information stored on a computer system. ● Inverted index: – Index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. – B-Trees are not inverted indexes! ● Stemming. ● Relevancy in results.
    • 81. Full-text indexing: Applications ● Web search engines – Finding relevant pages for a topic ● Vertical search engines – Finding jobs by description ● Social networks – Finding messages by text ● e-Commerce – Finding articles by description ● In general, any service or application needing efficient text information retrieval
    • 82. Full-text indexing (at scale) ● Real-time indexing versus batch indexing: – The first is cool: it is real-time, but it is difficult. We will not address it now. – The second is not real-time, but it is simpler. ● How to batch-index a big corpus? – Needs scalable storage (HDFS). ● How to deduplicate documents? – MapReduce to the rescue (like we saw before). ● How to generate multiple indexes? – MapReduce can help (we will see how).
    • 83. Full-text indexing: MapReduce ● MapReduce can be used to generate an inverted index. – Vertical partitioning vs. horizontal partitioning. ● Example: – Map: emit(word, docId) – Reduce: emit(word, list<docIds>) ● Quite simple. But what about stop words, stemming, etc.? ● How to store the index? ● Better not to reinvent the wheel.
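A minimal sketch of the map and reduce functions for building an inverted index follows. The tokenizer, the tiny stop-word list and the in-memory `run_mapreduce` helper are assumptions made for illustration; a real deployment would rely on Hadoop and a proper analyzer.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and"}  # assumed toy stop-word list


def map_document(doc):
    """Map: emit (word, doc_id) once per distinct non-stop word."""
    doc_id, text = doc
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        if word not in STOP_WORDS:
            yield word, doc_id


def reduce_postings(word, doc_ids):
    """Reduce: emit (word, list<docIds>), sorted for stable output."""
    yield word, sorted(doc_ids)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for a MapReduce job (not Hadoop)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


docs = [
    (1, "The quick brown fox"),
    (2, "The lazy brown dog"),
    (3, "A fox and a dog"),
]
inverted_index = dict(run_mapreduce(docs, map_document, reduce_postings))
```

Even this toy version shows why the slide recommends not reinventing the wheel: stemming, relevancy scoring and on-disk storage are all missing.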
    • 84. Full-text indexing: Lucene / SOLR ● Lucene: Doug Cutting’s – From Nutch. – Mainstream open-source implementation of an inverted index. – Efficient disk allocation, highly performant. ● SOLR: Mainstream open-source search server. – Provides stemming, analyzers, HTTP servlets, etc. – Lacks some other desirable properties: ● Elasticity, real-time indexing, horizontal partitioning (although work in progress). ● Still the reference technology for creating and serving inverted indexes.
    • 85. Full-text indexing: MapReduce meets SOLR ● We can use MapReduce for scaling the indexing process. ● At the same time, we can use SOLR for creating the resulting index. – SOLR is used as a library. ● Generated indexes are later deployed to the search servers.
    • 86. Full-text indexing: Example ● A vertical job search engine. ● Jobs are parsed from crawled feeds and saved in HDFS. ● MapReduce for deduplicating job offers. – map: emit(jobId, job) – reduce: (jobId, list<job>) -> emit(jobId, job) ● Retention policy: keep latest job.
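The deduplication job with a "keep latest" retention policy could look like the sketch below. The record fields `job_id` and `fetched_at` and the in-memory `run_mapreduce` helper are hypothetical names chosen for the example.

```python
from collections import defaultdict


def map_job(job):
    """Map: emit(jobId, job)."""
    yield job["job_id"], job


def reduce_latest(job_id, jobs):
    """Reduce: keep only the most recently fetched version of the job."""
    yield job_id, max(jobs, key=lambda job: job["fetched_at"])


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for a MapReduce job (not Hadoop)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


jobs = [
    {"job_id": "j1", "title": "Data engineer", "fetched_at": "2012-05-01"},
    {"job_id": "j2", "title": "Hadoop admin", "fetched_at": "2012-05-02"},
    {"job_id": "j1", "title": "Senior data engineer", "fetched_at": "2012-05-03"},
]
# ISO-formatted dates compare correctly as plain strings.
deduplicated = dict(run_mapreduce(jobs, map_job, reduce_latest))
```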
    • 87. Full-text indexing: Example (II) ● MapReduce for augmenting job information (adding geographical information). – map1: emit(job.city, job) – map2: emit(city, geoInfo) – reduce: (city, list<job>, geoInfo) -> for all jobs emit(job, geoInfo) ● MapReduce for distributing the index process: – map: emit(job.country, job) – reduce: (job.country, list<job>) -> Create index for country “job.country” using SOLR. ● Deploy per-country indexes to search cluster.
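Both jobs on this slide can be sketched together: a reduce-side join that attaches geographical information to each job, followed by a grouping by country. All field names and helpers are assumptions, and `reduce_index` merely lists job ids as a stand-in for the real step, which would build a SOLR index per country.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_job(job):
    yield job["city"], ("job", job)    # map1: emit(job.city, job)


def map_geo(geo):
    yield geo["city"], ("geo", geo)    # map2: emit(city, geoInfo)


def reduce_join(city, tagged):
    """Reduce: attach the city's geo record to every job in that city."""
    geo = next((value for tag, value in tagged if tag == "geo"), None)
    for tag, value in tagged:
        if tag == "job":
            yield {**value, "geo": geo}


def map_country(job):
    yield job["country"], job          # map: emit(job.country, job)


def reduce_index(country, country_jobs):
    """Stand-in for index creation: just list the job ids per country."""
    yield country, sorted(job["job_id"] for job in country_jobs)


jobs = [
    {"job_id": "j1", "city": "Madrid", "country": "ES"},
    {"job_id": "j2", "city": "Paris", "country": "FR"},
    {"job_id": "j3", "city": "Madrid", "country": "ES"},
]
geo_info = [
    {"city": "Madrid", "lat": 40.4, "lon": -3.7},
    {"city": "Paris", "lat": 48.9, "lon": 2.4},
]

tagged_pairs = [p for job in jobs for p in map_job(job)]
tagged_pairs += [p for geo in geo_info for p in map_geo(geo)]
enriched = [out for city, values in shuffle(tagged_pairs).items()
            for out in reduce_join(city, values)]

country_pairs = [p for job in enriched for p in map_country(job)]
indexes = dict(out for country, values in shuffle(country_pairs).items()
               for out in reduce_index(country, values))
```

Tagging each value with its origin (`"job"` or `"geo"`) is what lets one reducer tell the two inputs apart.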
    • 88. Full-text indexing: Example [Diagram: XML feeds, Geo info, HDFS, M/R Parse, M/R Dedup, M/R Geo info, M/R Index, Indexes, Deploy, Search Cluster]
    • 90. Reputation: Definitions ● What is reputation? ● Reputation in social communities. – eBay, StackOverflow... ● Reputation in social media. – Twitter, Facebook... ● Why is it important?
    • 91. Reputation: Relationships ● Modelling relationships is needed for calculating reputation ● Graph-like models arise ● Usually stored as edges – A interacts with B – or, A trusts B – or, A → B ● Internet-scale graphs can be stored in HDFS – Each edge in a row – Add needed metadata to edges: date, etc.
    • 92. Reputation: MapReduce analysis on edges ● All people whom “A” interacted with. – map: emit(a, b) – reduce: emit(a, list<b>). ● Essentially, things like PageRank can be implemented very easily. ● PageRank as a measure of page relevancy from page inlinks. – But it can be extrapolated to any kind of authority and trustworthiness metric. – E.g. people relevancy from social networks.
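The following sketch first groups edges into adjacency lists (the map and reduce above) and then runs a simplified PageRank where each iteration is one map step and one reduce step. It assumes a damping factor of 0.85, a fixed number of iterations, and a toy graph in which every node has at least one outgoing edge; dangling nodes would need extra handling.

```python
from collections import defaultdict


def build_adjacency(edges):
    """map: emit(a, b); reduce: emit(a, list<b>)."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


def pagerank(adjacency, damping=0.85, iterations=50):
    """Simplified PageRank; assumes every node has outgoing links."""
    nodes = sorted(adjacency)
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        contributions = defaultdict(list)
        # Map: each node sends an equal share of its rank to its targets.
        for node in nodes:
            share = ranks[node] / len(adjacency[node])
            for target in adjacency[node]:
                contributions[target].append(share)
        # Reduce: each node sums the shares it received.
        ranks = {
            node: (1 - damping) / len(nodes) + damping * sum(contributions[node])
            for node in nodes
        }
    return ranks


# "A trusts C", "B trusts C", "C trusts A"
edges = [("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(build_adjacency(edges))
```

In this toy graph C collects trust from two nodes and ends up with the highest rank, while B, which nobody points to, keeps only the baseline share.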
    • 93. Reputation: Going deeper on graphs ● Friends of friends of friends. – 1 MapReduce step: My friends. – 2 MapReduce steps: Friends of my friends. – 3 MapReduce steps: Friends of friends of my friends. ● Iterative MapReduce solves it. ● But there are better foundational models such as Google’s Pregel. – Exploiting data locality in graphs. – Apache Giraph. – Apache Hama.
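One way to picture the iteration: each MapReduce step joins the current frontier of people with the edge list and emits their friends as the next frontier. The sketch below is an in-memory illustration with hypothetical names; it does not filter out people already visited, which a real job would probably want to do.

```python
from collections import defaultdict


def step(frontier, edges):
    """One MapReduce step: join the frontier with the edges on the source."""
    grouped = defaultdict(lambda: {"in_frontier": False, "friends": []})
    for person in frontier:                  # map over the frontier
        grouped[person]["in_frontier"] = True
    for source, target in edges:             # map over the edge list
        grouped[source]["friends"].append(target)
    next_frontier = set()
    for person, group in grouped.items():    # reduce per person
        if group["in_frontier"]:
            next_frontier.update(group["friends"])
    return next_frontier


edges = [
    ("me", "ann"), ("me", "bob"),
    ("ann", "cat"), ("bob", "cat"), ("bob", "dan"),
    ("cat", "eve"),
]
friends = step({"me"}, edges)                   # 1 step
friends_of_friends = step(friends, edges)       # 2 steps
third_degree = step(friends_of_friends, edges)  # 3 steps
```

Every step rereads the whole edge list, which is one reason graph-specific models such as Pregel can be a better fit.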
    • 94. Reputation: Difficulties ● Sometimes multiple MapReduce steps are needed for calculating a final metric. – Because data doesn’t fit in memory. ● Intermediate relations need to be calculated. – And later filtered out. ● “Polynomial effect”: Calculate all pairwise relations in a set: N*(N-1)/2 – Possible bottleneck.
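To get a feel for the pairwise count, here is a worked example with an assumed set size of one million elements (the figure is illustrative, not from the slides):

```latex
\binom{N}{2} = \frac{N(N-1)}{2},
\qquad
N = 10^{6} \;\Rightarrow\; \frac{10^{6}\,(10^{6}-1)}{2} = 499{,}999{,}500{,}000 \approx 5 \times 10^{11}
```

A single group of a million elements therefore produces roughly half a trillion intermediate pairs, which explains the bottleneck.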
    • 95. Reputation: Difficulties: Data imbalance ● When grouping by something, some groups may be much bigger than others. – Causing “data imbalance”. ● Data imbalance in MapReduce is a big problem. – Some machines will finish quickly while one will be busy for hours. ● Inefficient usage of resources. ● Data processing doesn’t scale linearly anymore. – Next MapReduce step can’t start as long as current one hasn’t yet finished.
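A rough way to quantify the imbalance is to compare the largest group with the average group size. The sketch assumes one reducer per group and a cost proportional to the number of records; the follower data is made up for the example.

```python
from collections import Counter

# Hypothetical "follower follows followed" edges, grouped by the followed user.
follows = [(f"fan{i}", "celebrity") for i in range(900)]
follows += [(f"friend{i}", "alice") for i in range(50)]
follows += [(f"friend{i}", "bob") for i in range(50)]

group_sizes = Counter(followed for _, followed in follows)

# With one reducer per group, the step finishes when the slowest reducer does.
makespan = max(group_sizes.values())
# A perfectly balanced job would give every reducer the average load.
ideal = sum(group_sizes.values()) / len(group_sizes)
imbalance = makespan / ideal
```

Here the step takes about 2.7 times longer than a balanced one would, and two of the three reducers sit idle for most of it.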
    • 96. Reputation: Example ● Input: Tweets. ● Distributed crawling for fetching the tweets. – Save them in HDFS. ● Parse the tweets. Define the graph of trust. – A trusts B if A follows B. ● Execute PageRank over the graph. – Spreads trust to all vertices.
    • 97. Reputation: Example [Diagram: HDFS, M/R Parse, graph of trustiness over vertices A, B, C, D, M/R PageRank, Results]
    • 99. Data mining: Text classification ● Document classification – Documents are texts in this case. ● Assigns a text one or more categories. – Binary classifiers versus multi-category classifiers – Multi-category classifiers can be built from multiple binary classifiers ● Two steps: generating the model and classifying.
    • 100. Data mining: Text classification: Steps ● Generating the model – The resultant model may or may not fit in memory. ● Let’s assume the final model fits in memory. – Use a large dataset for generating the model. ● MapReduce helps scale the model generation process. ● Example: Build multiple binary classifiers -> parallelize by classifier. ● Example: Calculate conditional probabilities of a Bayesian model. Parallelize by word (like in the WordCount example). ● Classifying – MapReduce also helps in classifying a large dataset. ● If the model fits in memory, parallelize documents to classify and load the model in memory. – Batch-classifying: parallelize documents to classify. Output is the set of documents with the assigned categories.
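The Bayesian example can be sketched as a WordCount keyed by (label, word), followed by an in-memory classifier. Everything here is an assumption made for illustration: the toy training set, the whitespace tokenizer, and the add-one (Laplace) smoothing, which the slides do not mention.

```python
import math
from collections import Counter


def map_words(example):
    """Map: emit ((label, word), 1), as in WordCount."""
    label, text = example
    for word in text.lower().split():
        yield (label, word), 1


def train(examples):
    """Reduce: sum the counts per (label, word) and derive totals."""
    word_counts = Counter()
    for example in examples:
        for key, one in map_words(example):
            word_counts[key] += one
    label_docs = Counter(label for label, _ in examples)
    label_totals = Counter()
    for (label, _), count in word_counts.items():
        label_totals[label] += count
    vocabulary = {word for _, word in word_counts}
    return word_counts, label_docs, label_totals, vocabulary


def classify(text, model):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_docs, label_totals, vocabulary = model
    total_docs = sum(label_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_docs):
        score = math.log(label_docs[label] / total_docs)
        for word in text.lower().split():
            smoothed = word_counts[(label, word)] + 1
            score += math.log(smoothed / (label_totals[label] + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("spam", "cheap pills cheap offer"),
    ("spam", "offer pills now"),
    ("ham", "meeting agenda now"),
    ("ham", "project meeting notes"),
]
model = train(training)
```

With the model in memory, batch classification is a map-only job: each mapper loads the model and calls `classify` on its share of the documents.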
    • 101. Data mining: Others ● Mahout library for other data mining problems. – Clustering, logistic regression, etc. ● Recommendation algorithms. – Many recommendation algorithms are based on calculating correlations. – Calculating correlations in parallel with MapReduce is easy. ● Remember: always in the “batch” or “offline” domain. – Recommendations are reloaded after batch process finishes.
    • 102. Tuple MapReduce. Pere Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining (To appear) (10.7% Acceptance rate)
    • 103. Common MapReduce problems ● Lack of compound records – By default, key & value are considered monolithic entities. ● In real life, this case is rare. – Alleviated by some serialization libraries (Thrift, Protocol Buffers) ● Sorting within a group – The MapReduce foundation does not support it ● Although MapReduce implementations overcome this problem with “tricks” ● Joins – Need compound records and sorting within a group to be implemented – Not directly supported by MapReduce
    • 104. Tuple MapReduce: rationale ● Compound records, sorting within a group and joins are design patterns that arise in most MapReduce applications... ● … but MapReduce does not make the implementation easy ● An evolution of the MapReduce paradigm is needed to cover these design patterns: Tuple MapReduce
    • 105. Tuple MapReduce ● We extended the classic (key, value) MapReduce model ● Use n-sized Tuples instead of (key, value) ● Define a Tuple-based M/R – Covering most common use cases
    • 106. Group by / Sort by ● You can think of an M/R as a SELECT … GROUP BY … ● With Tuple MapReduce, you simply “group by” a subset of Tuple fields – Easier, more intuitive than having objects for each kind of Key you want to group by. ● Alternatively, you may “sort by” a wider subset – Hiding all complex logic behind secondary sort
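The idea of grouping by a subset of fields while sorting by a wider one can be illustrated in plain Python. This is not the Pangool API; `Visit`, `group_and_sort` and the field names are hypothetical, and the sort is done in memory rather than by the shuffle phase.

```python
from collections import namedtuple
from itertools import groupby
from operator import attrgetter

Visit = namedtuple("Visit", ["user", "timestamp", "url"])


def group_and_sort(tuples, group_by, sort_by):
    """Group by `group_by` fields; order each group by the `sort_by` fields."""
    # The sort fields must start with the group fields (secondary sort).
    assert tuple(sort_by[:len(group_by)]) == tuple(group_by)
    ordered = sorted(tuples, key=attrgetter(*sort_by))
    for key, group in groupby(ordered, key=attrgetter(*group_by)):
        yield key, list(group)


visits = [
    Visit("ann", 30, "/cart"),
    Visit("bob", 10, "/home"),
    Visit("ann", 10, "/home"),
]
by_user = dict(group_and_sort(visits, ("user",), ("user", "timestamp")))
```

Each group arrives at the "reducer" already ordered by timestamp, which is the effect secondary sort achieves in Hadoop with custom partitioners and comparators.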
    • 107. Tuple-Join MapReduce ● Extend the whole idea to allow for easier joins: Tuple1: (a,b,c,d), Tuple2: (a,b,f,g,h), Join by (a,b) ● Formally speaking:
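As an informal illustration of joining the two tuple types by (a, b), the sketch below groups both inputs on the common fields and pairs them up inside each group. It shows an inner join in plain Python with hypothetical names; it is not the formal definition from the paper nor the Pangool API.

```python
from collections import defaultdict, namedtuple

Tuple1 = namedtuple("Tuple1", "a b c d")
Tuple2 = namedtuple("Tuple2", "a b f g h")


def tuple_join(left, right, fields):
    """Inner join of two tuple streams on the given common fields."""
    groups = defaultdict(lambda: ([], []))
    for item in left:
        groups[tuple(getattr(item, f) for f in fields)][0].append(item)
    for item in right:
        groups[tuple(getattr(item, f) for f in fields)][1].append(item)
    for key in sorted(groups):
        lefts, rights = groups[key]
        for left_item in lefts:
            for right_item in rights:
                yield key, left_item, right_item


left = [Tuple1(1, "x", 10, 11), Tuple1(2, "y", 20, 21)]
right = [
    Tuple2(1, "x", 5, 6, 7),
    Tuple2(1, "x", 8, 9, 0),
    Tuple2(3, "z", 0, 0, 0),
]
joined = list(tuple_join(left, right, ("a", "b")))
```

Only the key (1, "x") appears on both sides, so the result holds two rows, one per matching `Tuple2`.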
    • 108. Pangool http://pangool.net
    • 109. Pangool: What? ● Better, simpler, powerful API replacement for Hadoop's API ● What do we mean by API replacement? – APIs on top of Hadoop: Pig, Hive, Cascading. – Using them always comes with a tradeoff. – Paradigms other than MapReduce, not always the best choice. – Performance restrictions. ● Pangool is still MapReduce, low-level and high performing – Yet a lot simpler!
    • 110. Pangool: Why? ● Hadoop has a steep learning curve ● Default API is too low-level ● Making things efficient is hard (binary comparisons, ser/de...) ● There are some common patterns (joins, secondary sorting...) ● How can we make them simpler without losing flexibility and power?
    • 111. Pangool API ● Schema, Tuple, … ● Reducers, Mappers, etc. are instances instead of static classes – Easier to configure them: new MyReducer(5, 2.0); ● Still tied to Hadoop's particularities in some ways – NullWritable, etc. ● Let's see an example
    • 112. Thanks!! Iván de Prado Alonso, Pere Ferrera Bertran
