MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney


Published on

How to get started with MapReduce, how to learn the basics without the extremely steep learning curve

Published in: Education, Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

MapReduce frameworks and methods - Adam Horvath, Google Technology User Group, Sydney

  1. 1. MapReduce frameworks and methods Presenter: Adam Horvath ( May 2012, Sydney Adam Horvath 1
  2. 2. Count the flowers• Given a flower field• Given unlimited resources• How to count the different colours of flowers in minimal time?23 May 2012, Sydney Adam Horvath 2
  3. 3. Original problem• Building product recommendations – For a ‘web shop’• Looking for easy to use frameworks – Not too many generic ones• Data warehouse – Way too complex, way too expensive• Maximum 1 screen to write the whole processing – If it’s longer, you wouldn’t understand• Functional languages - steep learning curve – Neither you nor others would understand23 May 2012, Sydney Adam Horvath 3
  4. 4. MapReduce based solutions• Mapreduce data processing – Takes a while to get used to it• Hadoop – Way too hard to start with• Lightweight frameworks – A little limitation here and there• MapReduce.NET – Roll your own23 May 2012, Sydney Adam Horvath 4
  5. 5. How can I learn to fly/MapReduce?• Hadoop, petabytes of data – Get a Boeing 747 – Pass the required 1500 hours of flying – Ready to go!• Simple frameworks, gigabytes of data – Order a plane model online for $30 – Play with a simulator for 60 minutes – Ready to go!23 May 2012, Sydney Adam Horvath 5
  6. 6. What is MapReduce• Simple framework to parallel batch process data (even semi- or unstructured data)• The output is usually reasonable size (human readable or for other systems)• Working with key-value pairs• Two major steps: Map and Reduce23 May 2012, Sydney Adam Horvath 6
  7. 7. What is not MapReduce• Database – The input is usually plain text files• Online tool – The process starts in batches, might run for hours• A language – Any programming language can be used to create Map and Reduce steps (depending on the framework used)• Compressing cartographic data23 May 2012, Sydney Adam Horvath 7
  8. 8. Map• Transform the input data into something that can be aggregated (counted, grouped…)• The “SELECT T1,T2… FROM X…” part of the query• Map from Key1-Value1 to – Key1-Value2 OR – Key2-Value2• Example, log file line: – “20120501 /urlxxx” – Input key: might be null or file name – Input value: the line itself• Output – Key: “urlxxx” Value: “20120501”23 May 2012, Sydney Adam Horvath 8
  9. 9. Reduce• Aggregates the output of the Map step• The simplest case is just to output the Map result without processing• The “…SUM() … GROUP BY…” part of the query• Reduces from Key1-Value1 to – Key1-Value1 (ideally, value1 is a list or a sum) – Key1-Value2 (input and output are incompatible)• Input – Key: “urlxxx” Value: “20120501” – Key: “urlxxx” Value: “20120502”• Output – Key: “urlxxx” Values: “20120501, 20120502”23 May 2012, Sydney Adam Horvath 9
  10. 10. Taken from Google Code University 23 May 2012, Sydney Adam Horvath 10
  11. 11. “Embarrassingly parallel”• MapReduce only makes sense when the data can be processed parallel• Easy to slice the input into smaller chunks/problems• Nothing is too big/complex if can be taken apart• There is no overlapping/dependence between the chunks23 May 2012, Sydney Adam Horvath 11
  12. 12. Structured vs Unstructured data• MapReduce frequently works on unstructured data or semi structured data• Structured is great for answering requests – Normalised SQL tables – Typically quick answers – Reasonable size of data• Unstructured is great for finding hidden trends – Logs, documents, images… – Typically batch processed, slow answers – Any size of data23 May 2012, Sydney Adam Horvath 12
  13. 13. History of MapReduce• Google described it in 2004• Map and reduce exists in many languages, manly in functional ones• Input/output is stored on a distributed file system (GFS)• Google Inc. holds PN/7,650,331 - System and method for efficient large-scale data processing23 May 2012, Sydney Adam Horvath 13
  14. 14. Questions for MapReduce• Create a database with inverted index• Recommend content based on previous visits• Parse logs, create visitor analytics• Process text documents, categorise them• Process images• Find partners for life23 May 2012, Sydney Adam Horvath 14
  15. 15. When to choose SQL• Small structured data – Normalised tables – The database engine is capable of running the JOINs and GROUP BYs• Online requirement – If the answer needs to be served within a second23 May 2012, Sydney Adam Horvath 15
  16. 16. Typical framework limitations• Same reduce output as input – Reduce input: [K1, [V1]], [K1, [V2]], output: [K1,[V1,V2]] – No need to create mergers but limits the complexity• Memory – The hard disk is the new tape• Debugging – No framework is flawless but usually hard to debug• Typed/untyped, generics – Very hard to understand the process without types• Network usage• Faulty/dead machines slow the whole process down23 May 2012, Sydney Adam Horvath 16
  17. 17. Well known frameworks• Hadoop – Java based, data stored on HDFS – HDFS – open source distributed file system for large files (~Google File System) – HBase – built on top of HDFS, provides fast key- value lookup for tables (~Google Bigtable)• Pig – Map Reduce expressed in Pig Latin• Hive - SQL over Hadoop• Sqoop - data from SQL to Hadoop23 May 2012, Sydney Adam Horvath 17
  18. 18. MapReduce components• Input – Reads data storage (files, database…)• Slicer – Distributes the huge data into small chunks• Map – Transfers the data for easy processing• Partition/sort – Sorts the output of Map operation to transfer to Reducers• Reduce – Aggregates the data• Merger – Merges two or more outputs of Reduces• After process – Some cleanup• Output – Writes the output of the MapReduce to disk (files, database…)23 May 2012, Sydney Adam Horvath 18
  19. 19. MapReduce.NET• Open source• Written in c#• Made for couple of machines with several Gb of data• Can process TB with batches• In comparison – Facebook: Hadoop, 4800 cores, 5.5 PetaBytes, – 12 TB per node23 May 2012, Sydney Adam Horvath 19
  20. 20. Counting flowers• Make small areas on the field• Place Mappers on each areas• Have one or more reducer for some areas• Mapper yells color, reducer takes note• Another reducer/merger merges the output of reducers23 May 2012, Sydney Adam Horvath 20
  21. 21. Counting flowers - code// MAPpublic override void Map(string key, string value, IQueue<Color, int> result){ Flower[] flowers = Flower.ParseArr(value); foreach (var item in flowers) { result.Push(item.Color, 1); }}// REDUCEpublic override int Reduce(Color key, int value, int result){ return result + value;}23 May 2012, Sydney Adam Horvath 21
  22. 22. Inverted index• Mapping from content to location – “expression” -> file.txt, 44• Lucene is a great tool utilising inverted indices• Index types – DocId list - which documents contain the term – Positional list - where do the documents contain the terms – Schema-independent - unique position of the word in the whole data store• Stop words – Can crate noise, like the word “the” – Omitting it can cause problems • “The veronicas”23 May 2012, Sydney Adam Horvath 22
  23. 23. Demo Inverted index23 May 2012, Sydney Adam Horvath 23
  24. 24. Product recommendation• Recommend content based on user behaviour – Use visit log, but other factors like clicking previous recommendations and aging are important too• Process – MAP Log  Session-ProdId – REDUCE Session-ProdId  Session (ProdId 1, ProdId 2...) – MAP Session (ProdId 1, ProdId 2...)  Transition (ProdId 1-ProdId 2), Transition(ProdId 2 -ProdId 2) – REDUCE Transition(ProdId 1-ProdId 2)  ProdId 1 (ProdId 2, ProdId 3…)23 May 2012, Sydney Adam Horvath 24
  25. 25. Demo Product recommendation23 May 2012, Sydney Adam Horvath 25
  26. 26. How to get started• Download MapReduce.NET, write simple processors• Download Hadoop VM from Cloudera – Fix all the bugs in the VM… – Start creating distributed processing• Create theories, prove them using your logs!23 May 2012, Sydney Adam Horvath 26
  27. 27. Case studies• Works for me – Find churning customers/visitors – Detecting keywords and searching by keyword – Item recommendation• Yahoo – Search, search assist• eHarmony – Matchmaking• – Predictive Behavioral Targeting• Facebook – Recommendations, etc.23 May 2012, Sydney Adam Horvath 27
  28. 28. CS101 – why is it slow?• Use custom data structures – Built in one like LinkedList, Stack, Queue will blow up – create fixed size structures• Serialization is always slow, deserialization is even worse – Avro IDL, Apache Thrift IDL, ProtoBuf• Dictionary*“slow”+++ – is a find/read/modify/find/write operation – Write custom dictionary with direct indexing23 May 2012, Sydney Adam Horvath 28
  29. 29. CS101 – why is it slow?• Do not use synchronisation – Few locks and waits are okay, thousands are not• Use CPU native data types – String is not• Do not use in-built string operations – Split, replace, … creates extra String instances• Don’t use RegEx• Store data in the file system – Anything that resembles a database (even SQLite) is at least a magnitude slower23 May 2012, Sydney Adam Horvath 29
  30. 30. CS101 – why is it slow?• Allow thread scheduler to do his job – Use Thread.Sleep(0) and not Thread.Sleep(1)• Do not use sorted data structures – Sort the data when you are done• Reflection is fine to call a long running method – But slow to call a function many times• Choose hash function very carefully – Validate the choice, collision is not your friend• Never performance test under Virtual Machines• Win32 on 32 bit platform – 2Gb max process size – On some Linux distributions depending on Kernel this holds too23 May 2012, Sydney Adam Horvath 30
  31. 31. URLs• This presentation• MapReduce.NET• Apache Hadoop• Apache Pig• Apache Hive• Cloudera Hadoop demo VM mo+VM• A Comparison of Approaches to Large-Scale Data Analysis nchmarks-sigmod09.pdf• Information Retrieval: Implementing and Evaluating Search Engines May 2012, Sydney Adam Horvath 31
  32. 32. Questions Adam Horvath ( May 2012, Sydney Adam Horvath 32
  33. 33. After hour labs Adam Horvath ( May 2012, Sydney Adam Horvath 33