Map Reduce amrp presentation

Published in: Education, Technology

  1. 1. An Insight into Map Reduce and related technology Renjith Peediackal 09BM8040 Project Guide: Prof. Prithwis Mukherjee
  2. 2. Goals <ul><li>This project work was undertaken in the 4th semester. </li></ul><ul><li>Currently, the information available on Map Reduce technology is either technology oriented or marketing oriented. </li></ul><ul><li>The task was to understand the emerging technology, create a consulting document, and conduct a classroom session in the ‘IT for BI’ elective. </li></ul>
  3. 3. The case for Map Reduce
  4. 4. Recommendation System: Older questions <ul><li>Customer Y buys product X5 from an e-commerce site after going through a number of products X1, X2, X3, X4. </li></ul><ul><li>Student Y goes through sites A1, A2, A3 and finally settles down and reads the content from A5. </li></ul><ul><li>Thousands of people behave in the same way. </li></ul><ul><li>Can we drive more traffic to our site, or design a new site, based on the insight derived from the above pattern? </li></ul>
  5. 5. A lot more questions <ul><li>Based on an ET interview with Avinash Kaushik, analytics expert: </li></ul>What pages are my customers reading?
  6. 6. A lot more questions contd.. <ul><li>What kind of content do I need to develop on my site to attract the right set of people? </li></ul><ul><li>On what kind of sites should your URL be present so that you get the maximum number of referrals? </li></ul><ul><li>How many visitors quit after seeing the homepage? </li></ul><ul><li>What different designs could make them go further? </li></ul><ul><li>Are users clicking on the right links in the right fashion on your websites? (Site overlay) </li></ul><ul><li>What is the bounce rate? </li></ul><ul><li>How can we save money on PPC schemes? </li></ul>
  7. 7. And the typical problems with recommendation systems
  8. 8. Problems with popularity <ul><li>Customers need not be perpetually satisfied by the same products </li></ul><ul><li>A popularity-based system ruins the possibility of exploration! </li></ul><ul><li>Companies have to create niche products and up-sell and cross-sell them to customers </li></ul><ul><ul><li>to satisfy them </li></ul></ul><ul><ul><li>retain them </li></ul></ul><ul><ul><li>and thus be successful in the market. The opportunity to sell a product is lost! </li></ul></ul><ul><li>Lack of personalization leads to broken relationships </li></ul><ul><li>Think beyond POS data! </li></ul>
  9. 9. Mixing expert opinion <ul><li>To avoid popularity bias and get more meaningful recommendations, mix in expert opinion </li></ul><ul><li>A mix of art and science: nobody knows the right blend </li></ul><ul><li>Think beyond POS data and experts’ wisdom </li></ul>
  10. 10. Pearls of wisdom in the net
  11. 11. But internet data is unfriendly <ul><li>To statistical techniques and DBMS technology </li></ul><ul><ul><li>Dynamic </li></ul></ul><ul><ul><li>Sparse </li></ul></ul><ul><ul><li>Unstructured </li></ul></ul><ul><li>Growth of data </li></ul><ul><ul><li>Published content from traditional sources: 3-4 Gb/day </li></ul></ul><ul><ul><li>Professional web content: ~2 Gb/day </li></ul></ul><ul><ul><li>Private text content: ~3 Tb/day (200x more) </li></ul></ul><ul><ul><li>Upper bound on typed content: ~700 Tb/day </li></ul></ul><ul><ul><li>(Ref: Raghu Ramakrishnan http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/slides/Ramakrishnan_ngdm07.pdf) </li></ul></ul><ul><li>Questions to this data </li></ul><ul><ul><li>Can we do Analytics over Web Data / User Generated Content? </li></ul></ul><ul><ul><li>TB of text data / GB of new data each day? </li></ul></ul><ul><ul><li>Structured Queries, Search Queries? </li></ul></ul><ul><ul><li>At “Google-Speed”? </li></ul></ul>
  12. 12. The case for a new technique <ul><li>That gives us a strong case for adopting the new technology of data in flight. </li></ul><ul><li>‘Map Reduce’ is a technology developed by Google for similar purposes. </li></ul>
  13. 13. What is Data in flight? <ul><li>Earlier, data was at ‘rest’! </li></ul><ul><ul><li>In the normal DBMS concept, data is at rest; queries hit that static data and fetch results </li></ul></ul><ul><li>Now data is just flying in! </li></ul><ul><ul><li>The new concept of ‘data in flight’ treats the already prepared query as static and collects dynamic data as and when it is produced and consumed. </li></ul></ul><ul><ul><li>Systems to handle </li></ul></ul>
  14. 14. Map and reduce <ul><li>A map operation translates the sparse information available in numerous formats into a form that an analytical technique can process easily. </li></ul><ul><li>Once the information is in a simpler, structured form, it can be reduced to the required results. </li></ul>
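The contrast between data at rest and data in flight can be sketched in Python. This is only an illustration under stated assumptions: the `(user, page)` event format, the `standing_query` name and the sample stream are all hypothetical and do not come from the deck. The query is fixed ahead of time, and the data passes through it as it arrives.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

def standing_query(events: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """A fixed ('static') query: running page-view count per page.

    Each event is a hypothetical (user, page) pair. A result is emitted
    as each event arrives, instead of loading everything and querying later.
    """
    counts = Counter()
    for _user, page in events:
        counts[page] += 1
        yield page, counts[page]

# Hypothetical event stream; a real system would read from a live feed.
stream = [("u1", "home"), ("u2", "home"), ("u1", "pricing"), ("u3", "home")]
results = list(standing_query(stream))
```

Because `standing_query` is a generator, it never needs the whole data set in memory, which is the point of keeping the query static and letting the data move.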
  15. 15. Terminology explained.. <ul><li>A standard example: </li></ul><ul><ul><li>Word count! </li></ul></ul><ul><ul><ul><li>Given a document, how many of each word are there? </li></ul></ul></ul><ul><li>But in the real world it can be: </li></ul><ul><ul><li>Given our search logs, how many people click on result 1? </li></ul></ul><ul><ul><li>Given our Flickr photos, how many cat photos are there by users in each geographic region? </li></ul></ul><ul><ul><li>Given our web crawl, what are the 10 most popular words? </li></ul></ul>
  16. 16. Word count and Twitter <ul><li>Tweets can be used to get early warnings of epidemics like swine flu </li></ul><ul><li>Tweets can be used to understand the ‘mood’ of people in a region, which can serve different purposes, even subliminal marketing </li></ul><ul><ul><li>The software created by Dr Peter Dodds and Dr Chris Danforth of the University of Vermont collects sentences from blogs and ‘tweets’, zeroing in on the happiest and saddest days of the last few years. </li></ul></ul><ul><li>Can it prevent social crises? </li></ul>
  17. 17. How does a Map Reduce programme work? The programmer has to specify two methods: Map and Reduce
  18. 18. map (k, v) -> <k', v'>* <ul><li>Specify a map function that takes a key (k) / value (v) pair. </li></ul><ul><ul><li>key = document URL, value = document contents </li></ul></ul><ul><li>“document1”, “to be or not to be” </li></ul><ul><li>Output of map is (potentially many) key/value pairs. <k', v'>* </li></ul><ul><li>In our case, output (word, “1”) once per word in the document </li></ul><ul><ul><li>“to”, “1” </li></ul></ul><ul><ul><li>“be”, “1” </li></ul></ul><ul><ul><li>“or”, “1” </li></ul></ul><ul><ul><li>“not”, “1” </li></ul></ul><ul><ul><li>“to”, “1” </li></ul></ul><ul><ul><li>“be”, “1” </li></ul></ul>
  19. 19. Shuffle or sort <ul><li>(shuffle/sort) </li></ul><ul><ul><li>“to”, “1” </li></ul></ul><ul><ul><li>“to”, “1” </li></ul></ul><ul><ul><li>“be”, “1” </li></ul></ul><ul><ul><li>“be”, “1” </li></ul></ul><ul><ul><li>“not”, “1” </li></ul></ul><ul><ul><li>“or”, “1” </li></ul></ul>
  20. 20. reduce (k', <v'>*) -> <k', v'>* <ul><li>The reduce function combines the values for a key </li></ul><ul><ul><li>“be”, “2” </li></ul></ul><ul><ul><li>“not”, “1” </li></ul></ul><ul><ul><li>“or”, “1” </li></ul></ul><ul><ul><li>“to”, “2” </li></ul></ul><ul><li>For different use cases, the functions within map and reduce differ, but the architecture and the supporting platform remain the same </li></ul>
  21. 21. How is this new approach helpful for our recommendation system? <ul><li>Brute power </li></ul><ul><ul><li>Uses the brute power of many machines to map a huge chunk of sparse data into a small table of dense data </li></ul></ul><ul><ul><li>The complex and time-consuming part of the “task” is done on the new, small and dense data in the reduce part </li></ul></ul><ul><ul><li>This means it separates the huge data from the time-consuming part of the algorithm, albeit at the cost of a lot of disk space. </li></ul></ul>
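The three steps of the word-count walkthrough (map, shuffle/sort, reduce) can be put together as a single-process Python sketch. It is a minimal illustration, not a distributed implementation: the names `map_fn`, `shuffle` and `reduce_fn` are chosen here for readability, and the counts are integers rather than the string “1” used on the slides.

```python
from collections import defaultdict

def map_fn(key, value):
    """map(k, v) -> <k', v'>*: emit (word, 1) once per word in the document."""
    for word in value.split():
        yield word, 1

def shuffle(pairs):
    """Group all intermediate values by key, as the shuffle/sort step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_fn(key, values):
    """reduce(k', <v'>*) -> <k', v'>*: combine the values for one key."""
    yield key, sum(values)

# key = document name, value = document contents
documents = {"document1": "to be or not to be"}

intermediate = [pair for k, v in documents.items() for pair in map_fn(k, v)]
word_counts = [out for k, vs in shuffle(intermediate) for out in reduce_fn(k, vs)]
```

Only `map_fn` and `reduce_fn` are specific to word counting; `shuffle` stands in for the part that the platform provides for every job.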
  22. 22. Maps into a denser smaller table
  23. 23. Fault tolerance, two different types: Database school of thought
  24. 24. Fault tolerance, two different types: MR school of thought
  25. 25. Hierarchy of Parallelism: Cycle of brute force fault tolerance
  26. 26. Criticisms <ul><li>A giant step backward in the programming paradigm for large-scale data-intensive applications </li></ul><ul><li>A sub-optimal implementation, in that it uses brute force instead of indexing </li></ul><ul><li>Not novel at all: it represents a specific implementation of well-known techniques developed 25 years ago </li></ul><ul><li>Missing most of the features in current DBMS </li></ul><ul><li>Incompatible with all of the tools DBMS users have come to depend on </li></ul>
  27. 27. Why is it valuable? <ul><li>Permanent writing enables several wonderful features </li></ul><ul><ul><li>It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done. </li></ul></ul><ul><ul><li>It brings dynamism and load balancing, which are needed since we don’t know the nature of the data in advance. </li></ul></ul><ul><ul><li>And the biggest: it helps programmers to logically manage the complexity of the data. </li></ul></ul>
  28. 28. Why can’t parallel DB deliver the same? <ul><li>Who? </li></ul><ul><li>At large scales, super-fancy reliable hardware still fails, albeit less often. Brute-force fault tolerance is more practical. </li></ul><ul><li>Software still needs to be fault-tolerant </li></ul><ul><li>Commodity machines without fancy hardware give better perf/$ </li></ul><ul><li>Using more memory to speed up querying has its own implications for fault tolerance and cost </li></ul><ul><li>A system that follows an execution plan does not work with dynamic, sparse and unstructured data </li></ul>
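The link between permanent writing and fault tolerance can be illustrated with a toy Python sketch. It assumes that "permanent writing" refers to each map task persisting its output to disk; the file naming, the `crash_once` parameter and the simulated crash are hypothetical devices for the illustration, and a real platform would reschedule the failed task on another machine.

```python
import json
import tempfile
from pathlib import Path

def run_map_task(task_id, chunk, out_dir, fail=False):
    """Run one map task and persist its output; `fail` simulates a crash."""
    if fail:
        raise RuntimeError(f"task {task_id} crashed")
    pairs = [[word, 1] for word in chunk.split()]
    (out_dir / f"map-{task_id}.json").write_text(json.dumps(pairs))

def run_job(chunks, out_dir, crash_once=frozenset()):
    """Run every map task, re-running only those whose output is missing."""
    attempts = {}
    pending = set(range(len(chunks)))
    while pending:
        for task_id in sorted(pending):
            attempts[task_id] = attempts.get(task_id, 0) + 1
            first_try = attempts[task_id] == 1
            try:
                run_map_task(task_id, chunks[task_id], out_dir,
                             fail=first_try and task_id in crash_once)
            except RuntimeError:
                continue  # no output file was written, so the task stays pending
        # A task is done exactly when its persisted output exists.
        pending = {t for t in pending
                   if not (out_dir / f"map-{t}.json").exists()}
    return attempts

with tempfile.TemporaryDirectory() as tmp:
    attempts = run_job(["to be", "or not", "to be"], Path(tmp), crash_once={1})
```

Only the crashed task runs a second time; the two tasks whose output was already on disk are left alone, which is what makes large numbers of unreliable machines workable.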
  29. 29. An example: an invitation to the complexity of a sequential web access-based recommendation system
  30. 30. Sequential web access-based recommendation system <ul><li>It goes through web server logs, mines the patterns in the access sequences and creates a pattern tree. The pattern tree is continuously modified using data from different servers. [Zhou et al] </li></ul>
  31. 31. Recommendation <ul><li>When a particular user has to be served a suggestion, </li></ul><ul><ul><li>his access pattern tree is compared with the entire tree of patterns, </li></ul></ul><ul><ul><li>the portions of the tree that best match the user’s pattern are selected, and </li></ul></ul><ul><ul><li>their branches are suggested. </li></ul></ul>
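The matching step can be sketched in Python with a nested-dict pattern tree. This is a simplified illustration, not the method of Zhou et al: matching on the longest suffix of the user's recent accesses is an assumption made here, the function names are hypothetical, and the pattern list is the set of frequent patterns from the deck's own worked sample.

```python
def build_tree(patterns):
    """Build a nested-dict pattern tree (a trie) from frequent patterns."""
    root = {}
    for pattern in patterns:
        node = root
        for event in pattern:
            node = node.setdefault(event, {})
    return root

def recommend(tree, recent):
    """Match the longest suffix of the user's recent accesses and
    suggest the branches that continue it in the pattern tree."""
    for start in range(len(recent)):
        node = tree
        for event in recent[start:]:
            node = node.get(event)
            if node is None:
                break  # this suffix is not in the tree; try a shorter one
        else:
            if node:  # the matched node has branches to suggest
                return sorted(node)
    return []

# Frequent patterns taken from the deck's sample of four sessions.
patterns = ["a", "b", "c", "aa", "ab", "ac", "ba", "bc",
            "aac", "aba", "abc", "bac", "abac"]
tree = build_tree(patterns)
suggestions = recommend(tree, "dab")
```

Here the user's history `dab` has no match as a whole, so the sketch falls back to the suffix `ab` and suggests the events that follow it in the tree.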
  32. 32. Some details <ul><li>Let E be a set of unique access events, which represent web resources accessed by users, i.e. web pages, URLs, topics or categories. </li></ul><ul><li>A web access sequence S = e1e2 ... is an ordered collection (sequence) of access events. </li></ul><ul><li>Suppose we have a set of web access sequences over the set of events E = {a, b, c, d, e, f}. A sample database looks like this: </li></ul>
      Session ID   Web access sequence
      1            abdac
      2            eaebcac
      3            babfae
      4            abbacfc
  33. 33. Details Access events can be classified into frequent and infrequent based on frequency crossing a threshold level. And a tree consisting of frequent access events can be created.
      Length of sequence   Sequential web access patterns with support
      1                    a:4, b:4, c:3
      2                    aa:4, ab:4, ac:3, ba:4, bc:3
      3                    aac:3, aba:4, abc:3, bac:3
      4                    abac:3
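The support figures can be checked with a short Python sketch. Two assumptions are made here and are not stated on the slide: the support of a pattern is the number of sessions that contain its events in order (gaps allowed), and the threshold is 3 of the 4 sessions. The names `is_subsequence`, `support` and `MIN_SUPPORT` are illustrative.

```python
# The four sample web access sequences from the deck.
sessions = ["abdac", "eaebcac", "babfae", "abbacfc"]
MIN_SUPPORT = 3  # assumed threshold; the slide does not state it

def is_subsequence(pattern, sequence):
    """True if the pattern's events occur in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `event in remaining` consumes the iterator up to the first match,
    # so each later event must be found after the previous one.
    return all(event in remaining for event in pattern)

def support(pattern, sessions):
    """Number of sessions that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sessions)

events = sorted(set("".join(sessions)))
frequent = [e for e in events if support(e, sessions) >= MIN_SUPPORT]
infrequent = [e for e in events if support(e, sessions) < MIN_SUPPORT]
checked = {p: support(p, sessions) for p in ["ac", "aba", "abc", "abac"]}
```

Under these assumptions the events a, b and c are frequent, d, e and f are infrequent, and the supports of the checked patterns agree with the table.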
  34. 35. The Map and reduce <ul><li>So a map job can be designed to process the logs and create the pattern tree. </li></ul><ul><li>The task is divided among thousands of cheap machines using the Map Reduce platform. </li></ul><ul><li>The dynamic data and static query model of data in flight will be very helpful for modifying the main tree. </li></ul><ul><li>The tree structure can be stored efficiently by altering the physical storage through sorting and partitioning. </li></ul><ul><li>Then, based on the user’s access pattern, we have to select a few parts of the tree. This can be designed as a reduce job which runs across the tree data. </li></ul>
  35. 36. DBMS for the same case? <ul><li>Map </li></ul><ul><ul><li>A huge database of access logs has to be uploaded to a DB, and then updated at regular intervals to reflect the changes in site usage. </li></ul></ul><ul><ul><li>Then a query has to be written to get a tree-like data structure out of this data behemoth, which changes shape continuously! </li></ul></ul><ul><ul><li>An execution plan, which is simplistic and non-dynamic in nature, has to be made. This is ineffective. </li></ul></ul><ul><ul><li>It has to be divided among many parallel engines, </li></ul></ul><ul><ul><li>and this requires expertise in parallel programming. </li></ul></ul><ul><li>Reduce </li></ul><ul><ul><li>During the reduce phase, the entire tree has to be searched for resembling patterns. </li></ul></ul><ul><ul><li>This will also be ineffective in an execution-plan-driven model, as explained above. </li></ul></ul><ul><li>With the explosion of data and the growing need for personalization in recommendations, Map Reduce becomes the most suitable pattern. </li></ul>
  36. 37. Parallel DB vs MapReduce <ul><li>RDBMS is good when </li></ul><ul><ul><li>the application is query-intensive, </li></ul></ul><ul><ul><li>whether the data is semi-structured or rigidly structured </li></ul></ul><ul><li>MR is effective for </li></ul><ul><ul><li>ETL and “read once” data sets. </li></ul></ul><ul><ul><li>Complex analytics. </li></ul></ul><ul><ul><li>Semi-structured and unstructured data. </li></ul></ul><ul><ul><li>Quick-and-dirty analyses. </li></ul></ul><ul><ul><li>Limited-budget operations. </li></ul></ul>
  37. 38. Summary of advantages of MR <ul><li>Storage system independence </li></ul><ul><li>Automatic parallelization </li></ul><ul><li>Load balancing </li></ul><ul><li>Network and disk transfer optimization </li></ul><ul><li>Handling of machine failures </li></ul><ul><li>Robustness </li></ul><ul><li>Improvements to the core library benefit all users of the library! </li></ul><ul><li>Ease for programmers! </li></ul>
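One way to phrase the log-processing step as a map job and a reduce job is sketched below in Python. It is a brute-force illustration rather than the mining algorithm of Zhou et al: the map step simply enumerates every subsequence of a session up to an assumed maximum length, and the reduce step keeps the patterns that reach an assumed support threshold. `MIN_SUPPORT`, `MAX_LENGTH` and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

MIN_SUPPORT = 3  # assumed threshold, not stated in the deck
MAX_LENGTH = 4   # assumed cap on pattern length to bound the enumeration

def map_session(session_id, sequence):
    """Map: emit (pattern, 1) once for every distinct subsequence of a session."""
    patterns = set()
    for length in range(1, MAX_LENGTH + 1):
        # combinations() preserves order, so each tuple is a subsequence.
        for picked in combinations(sequence, length):
            patterns.add("".join(picked))
    for pattern in patterns:
        yield pattern, 1

def reduce_pattern(pattern, ones):
    """Reduce: sum the support and keep the pattern only if it is frequent."""
    total = sum(ones)
    if total >= MIN_SUPPORT:
        yield pattern, total

# Session ID -> web access sequence, from the deck's sample database.
logs = {1: "abdac", 2: "eaebcac", 3: "babfae", 4: "abbacfc"}

grouped = defaultdict(list)  # stands in for the platform's shuffle step
for session_id, sequence in logs.items():
    for pattern, one in map_session(session_id, sequence):
        grouped[pattern].append(one)

frequent = dict(out for p in sorted(grouped)
                for out in reduce_pattern(p, grouped[p]))
```

Each session is mapped independently of the others, which is what lets the work be divided among many machines; the resulting frequent patterns are the raw material for the pattern tree.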
  38. 39. Is mapReduce the final word?
  39. 40. What is Hadoop? <ul><li>Based on the Map Reduce paradigm, the Apache foundation has started a program for developing tools and techniques on an open source platform. </li></ul><ul><li>This program and the resulting technology are termed Hadoop. </li></ul>
  40. 41. Pig <ul><li>Can we use MR effectively for repetitive jobs? </li></ul><ul><li>How can one control the execution of a Hadoop program, just like creating an execution plan in a normal DB operation? </li></ul><ul><li>The answer leads to Pig. Pig allows one to control the flow of data by creating execution plans easily. </li></ul><ul><li>It is suitable when the tasks are repetitive and the plans can be envisaged early on. </li></ul>
  41. 42. What does Hive do? <ul><li>Users of databases are often not technology masters. </li></ul><ul><li>They might be familiar with the existing platforms, and these platforms tend to generate SQL-like queries. </li></ul><ul><li>We need a program to convert these traditional SQL queries into Map Reduce jobs. </li></ul><ul><li>The one created by the Hadoop movement is Hive. </li></ul>
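The idea of converting an SQL query into a Map Reduce job can be illustrated conceptually in Python. This is not how Hive's planner is implemented; it only shows how the clauses of one hypothetical query line up with the two functions. The `visits` table, its columns and the function names are assumptions made for the example.

```python
from collections import defaultdict

# The query being translated, shown only as a string for reference.
QUERY = "SELECT page, COUNT(*) FROM visits WHERE country = 'IN' GROUP BY page"

def map_row(row):
    """Map: apply the WHERE filter; the GROUP BY column becomes the key."""
    if row["country"] == "IN":
        yield row["page"], 1

def reduce_group(page, ones):
    """Reduce: evaluate the aggregate COUNT(*) once per group."""
    return page, sum(ones)

# Hypothetical rows of the `visits` table.
visits = [
    {"page": "home", "country": "IN"},
    {"page": "home", "country": "US"},
    {"page": "pricing", "country": "IN"},
    {"page": "home", "country": "IN"},
]

grouped = defaultdict(list)  # stands in for the platform's shuffle step
for row in visits:
    for page, one in map_row(row):
        grouped[page].append(one)

result = dict(reduce_group(p, v) for p, v in grouped.items())
```

The user only writes the query; filtering, grouping and aggregation are spread across the map and reduce phases without any parallel programming on the user's part.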
  42. 43. Hive architecture
  43. 44. New models: cloud Map Reduce for dummies! <ul><li>Many services are available on the cloud, like Amazon web services (Amazon elastic -http://aws.amazon.com/ec2/) </li></ul><ul><li>The user gets MR services by entering the input text or site name, the required output etc., without going into the technical details </li></ul><ul><li>Almost infinite scalability </li></ul><ul><li>New business models which are efficient </li></ul>
  44. 45. Concerns <ul><li>Excerpts from a Slashdot comment on Jan 19, 2010 </li></ul><ul><li> “ But the very public complaints didn't stop Google from demanding a patent for MapReduce; nor did it stop the USPTO from granting Google's request (after four rejections). On Tuesday, the USPTO issued U.S. Patent No. 7,650,331 to Google for inventing Efficient Large-Scale Data Processing .” </li></ul><ul><li>Will Google enforce the patent? </li></ul><ul><li>If it does, it will hamper the growth of the Hadoop community. </li></ul>
  45. 46. Research Paper 1 MapReduce and Parallel DBMSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin <ul><li>Salient points: </li></ul><ul><ul><li>The differences between MR and Parallel DB </li></ul></ul><ul><ul><ul><li>Use cases </li></ul></ul></ul><ul><ul><ul><li>Architectural </li></ul></ul></ul><ul><ul><li>Points of collaboration and learning from each other </li></ul></ul>
  46. 47. Research Paper 2 <ul><li>Web warehousing: Web technology meets data warehousing </li></ul><ul><ul><li>Xin Tan, David C. Yen, Xiang Fang </li></ul></ul><ul><li>Salient points of the paper are </li></ul><ul><ul><li>Describes how the Internet made it possible to apply Web technology to traditional data warehousing, which resulted in improved cost savings and productivity </li></ul></ul><ul><ul><li>The integrated data in Web warehousing creates a close tie between IT departments and other business functions. </li></ul></ul><ul><ul><li>Security is also a key issue in Web-based warehouses </li></ul></ul>
  47. 48. Research Paper 3 <ul><li>Clouds, big data, and smart assets: Ten tech-enabled business trends to watch </li></ul><ul><ul><li>McKinsey Quarterly </li></ul></ul><ul><li>Salient points: </li></ul><ul><li>Four out of the top 10 are important to the Data in Flight community </li></ul><ul><li>Trend 2: Making the network the organization </li></ul><ul><li>Trend 3: Collaboration at scale </li></ul><ul><li>Trend 4: The growing ‘Internet of Things’ </li></ul><ul><li>Trend 5: Experimentation and big data </li></ul>
  48. 49. Research Paper 4 <ul><li>What Are the Information Security Risks in Decision Support Systems and Data Warehousing? </li></ul><ul><li>Thomas Finne </li></ul><ul><li>Different aspects of security are </li></ul><ul><ul><li>Backup </li></ul></ul><ul><ul><li>Password, Biometrics </li></ul></ul><ul><ul><li>Administration </li></ul></ul><ul><ul><li>Viruses </li></ul></ul><ul><ul><li>Printing </li></ul></ul><ul><ul><li>Power disruption </li></ul></ul><ul><ul><li>Tempest, Hacking </li></ul></ul><ul><ul><li>Encryption </li></ul></ul><ul><ul><li>Copying files, Tapping over a network, Mobiles </li></ul></ul><ul><ul><li>Flood, fire and theft </li></ul></ul><ul><ul><li>Testing </li></ul></ul><ul><ul><li>Software version </li></ul></ul><ul><ul><li>Deleting data </li></ul></ul>
  49. 50. Research Paper 5 <ul><li>Parallel Collection of Live Data Using Hadoop </li></ul><ul><ul><li>Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George Stephanides </li></ul></ul><ul><li>Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece </li></ul><ul><li>3 different use cases </li></ul><ul><ul><li>Domain Appraisal Tool (DAT) </li></ul></ul><ul><ul><li>OpenBet - analyzing and presenting sport-related data </li></ul></ul><ul><ul><li>Brute Force Cryptanalysis </li></ul></ul>
  50. 51. Research Paper 6 Hive – A Petabyte Scale Data Warehouse Using Hadoop Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy <ul><li>Salient points of the paper are </li></ul><ul><ul><li>Describes the uses and the architecture of Hive </li></ul></ul><ul><ul><li>Authors are from the Facebook Hive team </li></ul></ul>
  51. 52. Research Paper 7 Massive Structured Data Management Solution Ullas Nambiar, Rajeev Gupta, Himanshu Gupta and Mukesh Mohania IBM Research - India <ul><li>Salient points of the paper are </li></ul><ul><ul><li>Comparison of the performance of Hive, JAQL, raw MR and DB systems across different kinds of queries </li></ul></ul><ul><ul><li>Overview of the working of the technologies </li></ul></ul>
  52. 53. Research Paper 8 Situational Business Intelligence Alexander Löser, Fabian Hueske, and Volker Markl TU Berlin Database System and Information Management Group <ul><li>Salient points of the paper are </li></ul><ul><ul><li>Describes the need for data in flight </li></ul></ul><ul><ul><li>Describes the theoretical solutions </li></ul></ul><ul><ul><li>Discusses the current technology </li></ul></ul>
  53. 54. Research Paper 9 Beyond Search - Web Scale Business Analytics Alexander Löser http://user.cs.tu-berlin.de/~aloeser <ul><li>Importance and methods of analyzing content on the internet </li></ul><ul><li>Growth of the content </li></ul><ul><li>Beneficiaries of this information out of content </li></ul><ul><li>Methods and technology </li></ul>
  54. 55. Research Paper 10 An Intelligent Recommender System using Sequential Web Access Patterns Zhou et al <ul><li>Importance and methods of analyzing content on the internet </li></ul><ul><li>Growth of the content </li></ul><ul><li>Beneficiaries of this information out of content </li></ul><ul><li>Methods and technology </li></ul>
  55. 56. Web site references <ul><li>http://hadoop.apache.org/ </li></ul><ul><li>http://en.wikipedia.org </li></ul><ul><li>http://cloudera.com </li></ul><ul><li>Slashdot.org </li></ul><ul><li>http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ </li></ul><ul><li>Amazon.com </li></ul>
