Mr bi


Map reduce intro for MBAs

Published in: Education

  1. 1. An Insight into Map Reduce and related technology Renjith Peediackal 09BM8040
  2. 2. Importance of understanding MR <ul><li>ET Brand Equity of 9th March explains the future of analytics </li></ul><ul><li>Some of us will be champions of analytics within our respective organizations </li></ul><ul><li>Some of us will be selling analytics products </li></ul><ul><li>Some of us will have to talk to analytics professionals and understand the latest jargon </li></ul><ul><li>Analytics is moving towards churning web data to give us more insights, so we move to MR and data in flight </li></ul><ul><li>We are IITians! </li></ul>
  3. 3. The case for Map Reduce
  4. 4. Recommendation System <ul><li>Customer Y buys product X5 from an e-commerce site after going through a number of products X1, X2, X3, X4 </li></ul><ul><li>Student Y goes through sites A1, A2, A3 and finally settles down and reads the content from A5 </li></ul><ul><li>Thousands of people behave in the same way. </li></ul><ul><li>Can we drive more traffic to our site, or design a new site, based on the insight derived from the above pattern? </li></ul>
  5. 5. A lot more questions <ul><li>Based on the ET interview of Avinash Kaushik, analytics expert: </li></ul>What pages are my customers reading?
  6. 6. A lot more questions contd.. <ul><li>What kind of content do I need to develop on my site to attract the right set of people? </li></ul><ul><li>On what kinds of sites should your URL appear so that you get the maximum number of referrals? </li></ul><ul><li>How many visitors quit after seeing the homepage? </li></ul><ul><li>What different designs could make them go further? </li></ul><ul><li>Are users clicking on the right links in the right fashion on your websites? (Site overlay) </li></ul><ul><li>What is the bounce rate? </li></ul><ul><li>How can we save money on PPC schemes? </li></ul>
  7. 7. And the typical problems with recommendation systems
  8. 8. Problems with popularity <ul><li>Customers need not be perpetually satisfied by the same products </li></ul><ul><li>A popularity-based system ruins the possibility of exploration! </li></ul><ul><li>Companies have to create niche products and up-sell and cross-sell them to customers </li></ul><ul><ul><li>to satisfy them </li></ul></ul><ul><ul><li>retain them </li></ul></ul><ul><ul><li>and thus to be successful in the market. Otherwise the opportunity to sell a product is lost! </li></ul></ul><ul><li>Lack of personalization leads to broken relationships </li></ul><ul><li>Think beyond POS data! </li></ul>
  9. 9. Mixing expert opinion <ul><li>To avoid popularity bias and get more meaningful recommendations, mix in expert opinion </li></ul><ul><li>This mixes art with science, and nobody knows the right blend </li></ul><ul><li>Think beyond POS data and experts' wisdom </li></ul>
  10. 10. Pearls of wisdom in the net
  11. 11. But internet data is unfriendly <ul><li>To statistical techniques and DBMS technology </li></ul><ul><ul><li>Dynamic </li></ul></ul><ul><ul><li>Sparse </li></ul></ul><ul><ul><li>Unstructured </li></ul></ul><ul><li>Growth of data </li></ul><ul><ul><li>Published content: 3-4 GB/day </li></ul></ul><ul><ul><li>Professional web content: 2 GB/day </li></ul></ul><ul><ul><li>User generated content: 5-10 GB/day </li></ul></ul><ul><ul><li>Private text content: ~2 TB/day (200x more) </li></ul></ul><ul><ul><li>(Ref: Raghu Ramakrishnan) </li></ul></ul><ul><li>Questions to ask of this data </li></ul><ul><ul><li>Can we do analytics over web data / user generated content? </li></ul></ul><ul><ul><li>Over TB of text data, with GB of new data each day? </li></ul></ul><ul><ul><li>With structured queries and search queries? </li></ul></ul><ul><ul><li>At “Google-Speed”? </li></ul></ul>
  12. 12. The case for a new technique <ul><li>That gives us a strong case for adopting the new technology of data in flight. </li></ul><ul><li>‘MapReduce’ is a technology developed by Google for similar purposes. </li></ul>
  13. 13. What is Data in flight? <ul><li>Earlier, data was at ‘rest’! </li></ul><ul><ul><li>In the normal DBMS model, data is at rest: queries hit the static data and fetch results </li></ul></ul><ul><li>Now data is just flying in! </li></ul><ul><ul><li>The new concept of ‘data in flight’ treats the already prepared query as static and applies it to dynamic data as and when it is produced and consumed. </li></ul></ul><ul><ul><li>Systems to handle </li></ul></ul>
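The contrast between data at rest and data in flight can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event fields (`user`, `page`, `action`), the `click_stream` generator and the `standing_query` helper are all hypothetical names invented here, not part of any real streaming system. The query is fixed up front, and the data flows past it.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator


def standing_query(events: Iterable[dict],
                   predicate: Callable[[dict], bool]) -> Iterator[Counter]:
    """Apply one fixed (static) query to events as they arrive.

    Yields a snapshot of the running per-page counts after every match.
    """
    counts: Counter = Counter()
    for event in events:
        if predicate(event):
            counts[event["page"]] += 1
            yield counts.copy()


def click_stream() -> Iterator[dict]:
    """Hypothetical click events, produced one at a time (the 'flying' data)."""
    yield {"user": "u1", "page": "A1", "action": "view"}
    yield {"user": "u2", "page": "A5", "action": "read"}
    yield {"user": "u1", "page": "A5", "action": "read"}
    yield {"user": "u3", "page": "A2", "action": "view"}


def is_read(event: dict) -> bool:
    """The prepared query: keep only events where content was actually read."""
    return event["action"] == "read"


snapshots = list(standing_query(click_stream(), is_read))
final_counts = snapshots[-1]
```

Each snapshot is an updated answer to the same question, which is the reverse of the DBMS model, where the data stays put and new queries arrive.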
  14. 14. Map and reduce <ul><li>A map operation translates the sparse information available in numerous formats into a form that can be processed easily by an analytical technique. </li></ul><ul><li>Once the information is in a simpler, structured form, it can be reduced to the required results. </li></ul>
  15. 15. Terminology explained.. <ul><li>A standard example: </li></ul><ul><ul><li>Word count! </li></ul></ul><ul><ul><ul><li>Given a document, how many of each word are there? </li></ul></ul></ul><ul><li>But in the real world it can be: </li></ul><ul><ul><li>Given our search logs, how many people click on result 1? </li></ul></ul><ul><ul><li>Given our Flickr photos, how many cat photos are there by users in each geographic region? </li></ul></ul><ul><ul><li>Given our web crawl, what are the 10 most popular words? </li></ul></ul>
  16. 16. How does a MapReduce programme work? The programmer has to specify two methods: Map and Reduce
  17. 17. map (k, v) -> <k', v'>* <ul><li>Specify a map function that takes a key(k)/value(v) pair. </li></ul><ul><ul><li>key = document URL, value = document contents </li></ul></ul><ul><li>“document1”, “to be or not to be” </li></ul><ul><li>Output of map is (potentially many) key/value pairs. <k', v'>* </li></ul><ul><li>In our case, output (word, “1”) once per word in the document </li></ul><ul><ul><li>“to”, “1” </li></ul></ul><ul><ul><li>“be”, “1” </li></ul></ul><ul><ul><li>“or”, “1” </li></ul></ul><ul><ul><li>“not”, “1” </li></ul></ul><ul><ul><li>“to”, “1” </li></ul></ul><ul><ul><li>“be”, “1” </li></ul></ul>
  18. 18. Shuffle or sort <ul><li>(shuffle/sort) </li></ul><ul><ul><li>“ to”, “1” </li></ul></ul><ul><ul><li>“ to”, “1” </li></ul></ul><ul><ul><li>“ be”, “1” </li></ul></ul><ul><ul><li>“ be”, “1” </li></ul></ul><ul><ul><li>“ not”, “1” </li></ul></ul><ul><ul><li>“ or”, “1”  </li></ul></ul>
  19. 19. reduce (k', <v'>*) -> <k', v'>* <ul><li>The reduce function combines the values for a key </li></ul><ul><ul><li>“be”, “2” </li></ul></ul><ul><ul><li>“not”, “1” </li></ul></ul><ul><ul><li>“or”, “1” </li></ul></ul><ul><ul><li>“to”, “2” </li></ul></ul><ul><li>For different use cases the functions inside map and reduce differ, but the architecture and the supporting platform remain the same </li></ul>
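The three steps on the slides (map, shuffle/sort, reduce) can be put together as a minimal single-process Python sketch. This is not how Google's or Hadoop's runtime is implemented: the helper names `map_fn`, `shuffle`, `reduce_fn` and `map_reduce` are chosen here for illustration, and the counts are integers rather than the string “1” used on the slides.

```python
from collections import defaultdict


def map_fn(key, value):
    """map (k, v) -> <k', v'>*: emit (word, 1) once per word in the document."""
    for word in value.split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key and sort the keys, as the shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(key, values):
    """reduce (k', <v'>*) -> <k', v'>*: combine the values for one key."""
    yield key, sum(values)


def map_reduce(documents):
    """Run map, shuffle and reduce in sequence on a dict of documents."""
    mapped = [pair for k, v in documents.items() for pair in map_fn(k, v)]
    grouped = shuffle(mapped)
    return [out for k, vs in grouped.items() for out in reduce_fn(k, vs)]


result = map_reduce({"document1": "to be or not to be"})
# result == [("be", 2), ("not", 1), ("or", 1), ("to", 2)]
```

Only `map_fn` and `reduce_fn` would change for a different use case; `shuffle` and `map_reduce` stand in for the platform that stays the same.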
  20. 20. How is this new way helpful for our recommendation system? <ul><li>Brute power </li></ul><ul><ul><li>It uses the brute power of many machines to map the huge chunk of sparse data into a small table of dense data </li></ul></ul><ul><ul><li>The complex and time-consuming part of the “task” is done on the new, small and dense data in the reduce part </li></ul></ul><ul><ul><li>In other words, it separates the huge data from the time-consuming part of the algorithm, although a lot of disk space is used. </li></ul></ul>
  21. 21. Maps into a denser smaller table
  22. 22. Fault tolerance, two different types: Database school of thought
  23. 23. Fault tolerance, two different types: MR school of thought
  24. 24. Hierarchy of Parallelism: Cycle of brute force fault tolerance
  25. 25. Criticisms <ul><li>A giant step backward in the programming paradigm for large-scale data intensive applications </li></ul><ul><li>A sub-optimal implementation, in that it uses brute force instead of indexing </li></ul><ul><li>Not novel at all: it represents a specific implementation of well known techniques developed 25 years ago </li></ul><ul><li>Missing most features in current DBMS </li></ul><ul><li>Incompatible with all of the tools DBMS users have come to depend on </li></ul>
  26. 26. Why is it still valuable? <ul><li>Permanent writing magically enables two wonderful features </li></ul><ul><ul><li>It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done. </li></ul></ul><ul><ul><li>It brings dynamism and load balancing, which are needed since we don’t know the nature of the data in advance. It helps programmers to manage the complexity of the data logically </li></ul></ul>
  27. 27. Why can’t parallel DB deliver the same? <ul><li>At large scales, super-fancy reliable hardware still fails, albeit less often. Brute-force fault tolerance is more practical. </li></ul><ul><li>Software still needs to be fault-tolerant </li></ul><ul><li>Commodity machines without fancy hardware give better perf/$ </li></ul><ul><li>Using more memory to speed up querying has its own implications for fault tolerance and cost </li></ul><ul><li>A system that follows a fixed execution plan does not work well with dynamic, sparse and unstructured data </li></ul>
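One way to read the “permanent writing” point is that each task's output is stored durably as soon as it finishes, so a machine failure costs only a re-run of the failed task. The Python sketch below simulates that reading in a single process. It is an interpretation, not the actual MapReduce scheduler: `run_with_retries`, `flaky_word_count`, the task names and the simulated failure are all hypothetical, and an in-memory dict stands in for disk.

```python
def run_with_retries(tasks, worker, max_attempts=3):
    """Run every task, retrying only the ones that fail.

    `completed` stands in for durable storage: once a task's result is
    written there, it is never recomputed, whatever happens to other tasks.
    """
    completed = {}
    attempts = {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                completed[task_id] = worker(task_id, payload, attempt)
                break  # result is "on disk"; move on to the next task
            except RuntimeError:
                continue  # brute-force fault tolerance: just run it again
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return completed, attempts


def flaky_word_count(task_id, text, attempt):
    """Hypothetical map task that fails once on one split, then succeeds."""
    if task_id == "split-1" and attempt == 1:
        raise RuntimeError("simulated machine failure")
    return len(text.split())


tasks = {"split-0": "to be or", "split-1": "not to be"}
completed, attempts = run_with_retries(tasks, flaky_word_count)
```

Here `split-0` runs once and `split-1` runs twice; the failure never forces the finished split to be redone, which is what makes cheap, unreliable machines usable.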
  28. 28. An example: an invitation to the complexity of a sequential web access-based recommendation system
  29. 29. Sequential web access-based recommendation system <ul><li>It goes through web server logs, mines the patterns in the access sequences and then creates a pattern tree. The pattern tree is continuously modified using data from different servers. [Zhou et al] </li></ul>
  30. 30. Recommendation <ul><li>When a suggestion has to be made to a particular user </li></ul><ul><ul><li>his access pattern is compared with the entire tree of patterns, </li></ul></ul><ul><ul><li>the portions of the tree that best match the user’s pattern are selected, and </li></ul></ul><ul><ul><li>their branches are suggested. </li></ul></ul>
  31. 31. Some details <ul><li>Let E be a set of unique access events, which represent web resources accessed by users, i.e. web pages, URLs, topics or categories </li></ul><ul><li>A web access sequence S = e1e2 ... is an ordered collection (sequence) of access events </li></ul><ul><li>Suppose we have a set of web access sequences over the set of events E = {a, b, c, d, e, f}. A sample database (session ID: web access sequence) will look like </li></ul><ul><ul><li>1: abdac </li></ul></ul><ul><ul><li>2: eaebcac </li></ul></ul><ul><ul><li>3: babfae </li></ul></ul><ul><ul><li>4: abbacfc </li></ul></ul>
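  32. 32. Details <ul><li>Access events can be classified as frequent or infrequent, depending on whether their frequency crosses a threshold level </li></ul><ul><li>A tree consisting of frequent access events can then be created. </li></ul><ul><li>Sequential web access patterns with support, by length of sequence: </li></ul><ul><ul><li>Length 1: a:4, b:4, c:3 </li></ul></ul><ul><ul><li>Length 2: aa:4, ab:4, ac:3, ba:4, bc:3 </li></ul></ul><ul><ul><li>Length 3: aac:3, aba:4, abc:3, bac:3 </li></ul></ul><ul><ul><li>Length 4: abac:3 </li></ul></ul>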
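The support values in this table can be reproduced from the four sample sessions with a short Python sketch. It assumes that a pattern is supported by a session when its events occur in the session in the same order, not necessarily next to each other, and that the threshold is a support of 3; both assumptions are inferred from the numbers on the slides rather than stated there. The function names are illustrative, and the level-by-level search is a simple stand-in for the mining algorithm of Zhou et al.

```python
def is_subsequence(pattern, sequence):
    """True if the events of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def support(pattern, sessions):
    """Number of sessions that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sessions)


def frequent_patterns(sessions, min_support, max_length):
    """Grow patterns one event at a time, keeping only the frequent ones.

    A pattern can only be frequent if its prefix is frequent, so each level
    extends just the survivors of the previous level.
    """
    events = sorted(set("".join(sessions)))
    frequent = {}
    level = [""]
    for _ in range(max_length):
        next_level = []
        for prefix in level:
            for event in events:
                candidate = prefix + event
                count = support(candidate, sessions)
                if count >= min_support:
                    frequent[candidate] = count
                    next_level.append(candidate)
        level = next_level
    return frequent


sessions = ["abdac", "eaebcac", "babfae", "abbacfc"]
patterns = frequent_patterns(sessions, min_support=3, max_length=4)
```

With these inputs, `patterns` holds 13 entries, from `a:4` up to `abac:3`; events d, e and f fall below the threshold and never enter the tree.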
  33. 34. The Map and reduce <ul><li>A map job can be designed to process the logs and create the pattern tree. </li></ul><ul><li>The task is divided among thousands of cheap machines using the MapReduce platform. </li></ul><ul><li>The dynamic data and static query model of data in flight is very helpful for modifying the main tree </li></ul><ul><li>The tree structure can be stored efficiently by altering the physical storage through sorting and partitioning. </li></ul><ul><li>Then, based on the user’s access pattern, we have to select a few parts of the tree. This can be designed as a reduce job which runs across the tree data. </li></ul>
  34. 35. DBMS for the same case? <ul><li>Map </li></ul><ul><ul><li>A huge database of access logs has to be uploaded to a DB, and then updated at regular intervals to reflect the changes in site usage. </li></ul></ul><ul><ul><li>Then a query has to be written to get a tree-like data structure out of this data behemoth, which changes shape continuously! </li></ul></ul><ul><ul><li>An execution plan, which is simplistic and non-dynamic in nature, has to be made. This is ineffective. </li></ul></ul><ul><ul><li>The work has to be divided among many parallel engines </li></ul></ul><ul><ul><li>And this requires expertise in parallel programming. </li></ul></ul><ul><li>Reduce </li></ul><ul><ul><li>During the reduce phase the entire tree has to be searched for resembling patterns. </li></ul></ul><ul><ul><li>This will also be ineffective in an execution plan driven model, as explained above. </li></ul></ul><ul><li>With the explosion of data and the growing need for personalization in recommendations, MapReduce becomes the most suitable pattern. </li></ul>
  35. 36. Parallel DB vs MapReduce <ul><li>RDBMS is good when </li></ul><ul><ul><li>the application is query-intensive, </li></ul></ul><ul><ul><li>whether semi-structured or rigidly structured </li></ul></ul><ul><li>MR is effective for </li></ul><ul><ul><li>ETL and “read once” data sets. </li></ul></ul><ul><ul><li>Complex analytics. </li></ul></ul><ul><ul><li>Semi-structured and unstructured data. </li></ul></ul><ul><ul><li>Quick-and-dirty analyses. </li></ul></ul><ul><ul><li>Limited-budget operations. </li></ul></ul>
  36. 37. Summary of advantages of MR <ul><li>Storage system independence </li></ul><ul><li>Automatic parallelization </li></ul><ul><li>Load balancing </li></ul><ul><li>Network and disk transfer optimization </li></ul><ul><li>Handling of machine failures </li></ul><ul><li>Robustness </li></ul><ul><li>Improvements to the core library benefit all users of the library! </li></ul><ul><li>Ease of use for programmers! </li></ul>
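The tree-building and tree-selection steps can be sketched on a single machine as follows. This is a simplified illustration, not the pattern-tree algorithm of Zhou et al. and not a distributed job: the frequent patterns are hard-coded from the slide 32 table, the tree is a plain nested dict, and `build_pattern_tree` and `recommend` are hypothetical helpers. The matching rule (use the longest suffix of the user's recent events that is a path from the root, then suggest its children by support) is an assumption made here.

```python
# Frequent patterns and their support, taken from the slide 32 table.
PATTERNS = {
    "a": 4, "b": 4, "c": 3,
    "aa": 4, "ab": 4, "ac": 3, "ba": 4, "bc": 3,
    "aac": 3, "aba": 4, "abc": 3, "bac": 3,
    "abac": 3,
}


def build_pattern_tree(patterns):
    """Store each pattern as a path from the root; nodes carry the support."""
    root = {"support": None, "children": {}}
    for pattern, count in patterns.items():
        node = root
        for event in pattern:
            node = node["children"].setdefault(
                event, {"support": 0, "children": {}})
        node["support"] = count
    return root


def recommend(tree, recent_events, top_n=2):
    """Suggest next events for a user from the branches below their pattern.

    Tries the longest suffix of `recent_events` first and falls back to
    shorter ones until a matching path with branches is found.
    """
    for start in range(len(recent_events)):
        node = tree
        for event in recent_events[start:]:
            node = node["children"].get(event)
            if node is None:
                break  # this suffix is not a path in the tree
        else:
            if node["children"]:
                ranked = sorted(node["children"].items(),
                                key=lambda item: (-item[1]["support"], item[0]))
                return [event for event, _ in ranked[:top_n]]
    return []


tree = build_pattern_tree(PATTERNS)
suggestions = recommend(tree, "dab")
```

For a user whose recent events are `d`, `a`, `b`, the infrequent `d` is skipped, the path `ab` matches, and its branches `a` (support 4) and `c` (support 3) are suggested in that order.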
  37. 38. Is MapReduce the final word?
  38. 39. What is Hadoop? <ul><li>Based on the MapReduce paradigm, the Apache foundation has started a programme for developing tools and techniques on an open source platform. </li></ul><ul><li>This programme and the resulting technology are termed Hadoop </li></ul>
  39. 40. Pig <ul><li>Can we use MR effectively for repetitive jobs? </li></ul><ul><li>How can one control the execution of a Hadoop program, just like creating an execution plan in a normal DB operation? </li></ul><ul><li>The answer leads to Pig. Pig allows one to control the flow of data by creating execution plans easily. </li></ul><ul><li>It is suitable when the tasks are repetitive and the plans can be envisaged early on. </li></ul>
  40. 41. What does Hive do? <ul><li>Users of databases are often not technology masters. </li></ul><ul><li>They may be familiar with the existing platforms, and these platforms tend to generate SQL-like queries. </li></ul><ul><li>We need a program to convert these traditional SQL queries into MapReduce jobs. </li></ul><ul><li>The one created by the Hadoop movement is Hive. </li></ul>
  41. 42. Hive architecture
  42. 43. Many more tools But