
A simple explanation of Map Reduce and related technologies

  1. Study of Map Reduce and Related Technologies. Applied Management Research Project (January 2011 - April 2011), Vinod Gupta School of Management, IIT Kharagpur. Submitted by Renjith Peediackal, Roll No. 09BM8040, MBA, Batch of 2009-11. Under the guidance of Prof. Prithwis Mukherjee, Project Guide, VGSOM, IIT Kharagpur.
  2. Certificate of Examination: This is to certify that we have examined the summer project report on "Study of Map Reduce and related technologies" and accept it in partial fulfillment of the MBA degree for which it has been submitted. This approval does not endorse or accept every statement made, opinion expressed or conclusion drawn, as recorded in this report. It is limited to the acceptance of the project report for the applicable purpose. Examiner (External) ____________ Additional Examiner ____________ Supervisor (Academic) ____________ Date: ____________
  3. Undertaking Regarding Plagiarism: This declaration is for the purpose of the Applied Management Research Project on Study of Map Reduce and related technologies, which is in partial fulfillment of the degree of Master of Business Administration. I hereby declare that the work in this report is authentic and genuine to the best of my knowledge, and I would be liable for appropriate punitive action if it is found otherwise. Renjith Peediackal, Roll No. 09BM8040, Vinod Gupta School of Management, IIT Kharagpur
  4. Acknowledgement: I am highly thankful to Professor Prithwis Mukherjee of Vinod Gupta School of Management, IIT Kharagpur, for giving me the chance to carry out this work on map reduce technology as a review paper for this project, and for the guidance and interactions he provided throughout its course. I also thank my classmates in the course 'IT for BI' for their support and feedback on my work. Renjith Peediackal, VGSOM, IIT Kharagpur
  5. Contents:
     Undertaking Regarding Plagiarism 3
     Acknowledgement 4
     Executive Summary 7
     Introduction 8
     The Case for Map Reduce 9
       The complexity of input data 9
       Information is sparse in the humungous data 9
       Appetite for personalization 9
     Example of a Programme Using Internet Data 9
       Recommendation systems 9
       Some of the problems with recommendation systems 10
         Popularity 11
         Integration of expert opinion 12
         Integration of internet data 12
     What Is Data in Flight? 14
       Map and reduce 14
     How Does a Map Reduce Program Work 15
       Map 16
       Shuffle or sort 17
       Reduce 17
       How is this new way helpful to us in accommodating complex internet data for our recommendation system? 18
       Brute power 18
       Fault tolerance 19
       Criticisms 21
       Why is it still valuable? 22
     An Example 23
       Sequential web access-based recommendation system 24
     Parallel DB vs MapReduce 29
  7. Executive Summary: The data explosion that has happened on the internet has forced technology wizards to redesign the tools and techniques used to handle that data. 'Map reduce', 'Hadoop' and 'data in flight' have been buzzwords over the last few years. Much of the information available about these technologies is either absolutely technology oriented or consists of marketing documents from the companies that produce tools based on the technology. While the technological language is difficult for a normal business user to understand, the marketing documents tend to be superficial. We therefore saw the need for an article that explains the technology in a lucid and simple way, so that business users can understand the importance and future prospects of this movement, and adopted this mission as part of the Applied Management Research Project in the 4th semester. Web data available today is mostly dynamic, sparse and unstructured, and processing it with existing techniques like parallel relational database management systems is nearly impossible. The programming effort needed to do this the older way is humungous, and there is no answer to the question of whether anybody could do it in practice. New technologies like 'Hadoop' provide a platform for doing parallel data processing in an easier and more cost-effective way. In this research paper, the author tries to explain 'Hadoop' and related technology in a simple way. In future the author can continue similar efforts to demystify many of the tools and techniques coming from the Hadoop community and make them friendly for the business community.
  8. Introduction: Web data available today is mostly dynamic, sparse and unstructured. Most of the people who generate or use this content are customers or prospects for business organizations. Therefore the management community, especially marketers, is eager to analyze the information coming out of this internet data, so as to personalize their products and services. Processing this data can be vital for governments and social organizations too. Earlier systems like parallel relational databases were not efficient at processing web data. 'Data in flight' refers to the technology of capturing and processing dynamic data as and when it is produced and consumed. Google introduced the map reduce paradigm for efficiently processing web data, and the Apache organization, following in its footsteps, gave birth to an open source movement, 'Hadoop', to develop programs and tools to process internet data. In this review paper, a humble attempt has been made to explain this technology in a simple and clear manner. I hope the readers can appreciate the effort.
  9. The case for Map Reduce. The complexity of input data: In today's content-rich world, the enormity and complexity of the data available to decision makers is huge. When we add the internet to the picture, the volume undergoes a big-bang explosion, and this data is unstructured, produced and consumed by random people in different parts of the globe. Information is sparse in the humungous data: the information is actually hidden in a sea of data produced and consumed. Appetite for personalization: the perpetual problem of marketers is to personalize their products to attract and retain customers. They look forward to information that can be derived from the internet to understand their customers better. Example of a programme using internet data. Recommendation systems: take the example of a recommendation system. Customer Y buys product X from an e-commerce site after going through a number of products X1, X2, X3, X4. And this behavior is repeated by 100 customers! And there are customers who dropped out of shopping after going through X1 to X3.
  10. Should we have suggested X to those people before they exited the site? But if we recommend the wrong product, how will it affect the buying decision? Questions like this also arise: "Why am I selling X number of figurines, and not Y [a higher number] number of figurines? What pages are the people who buy things reading? If I'm running promotions, which Web sites are sending me traffic with people who make purchases?" [Avinash Kaushik, analytics expert]. And also: What kind of content do I need to develop on my site to attract the right set of people? On what kinds of sites should your URL be present so that you get the maximum number of referrals? How many visitors quit after seeing the homepage? What different kind of design could make them go forward? Are users clicking on the right links in the right fashion on your website (site overlay)? What is the bounce rate? How can money be saved on PPC schemes? And how can we get data on competitors' marketing activities and how they influence customers? Many people are using the net to sell, or at least market, their products and services. How can we analyze user behavior in such cases to benefit our organization, for example using information from something like a Facebook shopping page? The most critical question for the analytics professional is: can we build robust algorithms to predict such behavior? How much data do we need to process? Is point-of-sale history enough? Some of the problems with recommendation systems:
  11. Popularity: A recommendation system has the basic logic of processing historical data of customer behaviour and predicting future behavior. So naturally the most "popular" products top the list of recommendations, where popular means previously favored the most by the same segment of customers, for a particular product category. This is dangerous to companies in many ways: 1. Customers need not be satisfied perpetually by the same products. 2. Companies have to create niche products and up-sell and cross-sell them to customers to satisfy and retain them, and thus be successful in the market; a popularity-based system ruins these possibilities of exploration! 3. The opportunity of selling a product is lost! 4. Lack of personalization leads to broken relations.
  12. Integration of expert opinion: How do we integrate the inferences given by experts in a particular business domain into the process of statistical analysis? This is a mix of art with science, and nobody knows the right blend. Will it undermine the basic concept of data-driven decision making? Integration of internet data: To attack the current deficiencies, one approach is to use more data to increase the scope and relevance of the analysis. The users of mail, social networking groups, or niche groups for specific purposes are all consumers and potential customers. Bloggers and information providers can play the role of opinion leaders. So if we are able to integrate the information scattered over the internet into our decision-making models, we can better understand different segments of customers and their current and future needs. But the fact is that the data available on the internet is not at all friendly to statistical analysis using current DBMS technology at a normal level of cost. It is dynamic, sparse and mostly unstructured. Also, the question of who could program such a system to handle web data using current technology is difficult to answer.
  13. Growth of data: mindboggling. Published content: 3-4 GB/day; professional web content: 2 GB/day; user generated content: 5-10 GB/day; private text content: ~2 TB/day (200x more). (Ref: Raghu Ramakrishnan.) Added to this data: very demanding business users. Can we do analytics over web data / user generated content? Can we handle terabytes of text data, with gigabytes of new data getting added to the internet each day?
  14. Structured queries, and search queries for unstructured data? Can we expect the tools to give answers at "Google speed"? That gives us a strong case for adopting the new technology of data in flight. 'Map Reduce' is a technology developed by Google for similar purposes. These technologies are explained one by one in the following paragraphs. What is data in flight? Earlier, data was at 'rest': in the normal concept of a DBMS, data is at rest and queries hit that static data and fetch results. Now data is just flying in: the new concept of 'data in flight' envisages the already prepared query as static, collecting dynamic data as and when it is produced and consumed. But this needs more powerful tools to handle the enormity, complexity and 'scarce' nature of the information contained in the data. Map and reduce: a map operation is needed to translate the scarce information available in numerous formats into a form which can be processed easily by an analytical tool. Once the information is in a simpler and structured form, it can be reduced to the required results.
  15. A standard example: word count! Given a document, how many of each word are there? But in the real world it can be: Given our search logs, how many people click on result 1? Given our Flickr photos, how many cat photos are there by users in each geographic region? Given our web crawl, what are the 10 most popular words? Word count and Twitter: tweets can be used to get early warnings of epidemics like swine flu, or to understand the 'mood' of people in a region, which can be used for different purposes, even subliminal marketing. The software created by Dr Peter Dodds and Dr Chris Danforth of the University of Vermont collects sentences from blogs and tweets, zeroing in on the happiest and saddest days of the last few years. Can it prevent social crises? How does a map reduce program work?
  16. The programmer has to specify two methods, Map and Reduce; the rest is taken care of by the platform. Map: map (k, v) -> <k, v>* 1. Specify a map function that takes a key(k)/value(v) pair, e.g. key = document URL, value = document contents: ("document1", "to be or not to be"). 2. The output of map is (potentially many) key/value pairs: <k, v>*. 3. In our case, output (word, "1") once per word in the document: ("to", "1"), ("be", "1"),
  17. ("or", "1"), ("to", "1"), ("not", "1"), ("be", "1"). Shuffle or sort: items which are logically near are brought physically near to each other: ("to", "1"), ("to", "1"), ("be", "1"), ("be", "1"), ("not", "1"), ("or", "1"). Reduce: reduce (k, <v>*) -> <k, v>*. The reduce function combines the values for a key:
  18. ("be", "2"), ("not", "1"), ("or", "1"), ("to", "2"). For different use cases the functions within map and reduce differ, but the architecture and the supporting platform remain the same. For example, our recommendation system could use a map function that turns any meaningful association between two product names X and Y in any online discussion into a (product combination, count) pair. Later, in the reduce step, based on the count of combinations arising, the information can be reduced to a suggestion of X given Y. This is a very simplified explanation of the algorithms; refer to papers on algorithms such as parallel affinity propagation to understand the complexity of such a process. This is a new way of writing algorithms. Let us see how it helps to accommodate humungous internet data. How is this new way helpful to us in accommodating complex internet data for our recommendation system? Brute power: one view could be that map reduce algorithms are effective because they use the brute power of many machines to map a huge chunk of sparse data into a small table of dense data, and the complex and time-consuming part of the "task" is done on the new, small and dense data in the reduce part.
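The map / shuffle / reduce pipeline for word count described above can be sketched in a few lines of plain Python. This is a toy, single-machine illustration of the programming model only, not Hadoop's actual API; the function and variable names are our own:

```python
from collections import defaultdict

def map_fn(key, value):
    # key = document URL, value = document contents
    for word in value.split():
        yield (word, 1)           # emit (word, "1") once per word

def reduce_fn(key, values):
    # combine all the values collected for one key
    yield (key, sum(values))

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: emit intermediate key/value pairs.
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # Shuffle/sort phase: bring logically near items physically together.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: combine the values for each key.
    result = {}
    for k in sorted(groups):
        for out_k, out_v in reduce_fn(k, groups[k]):
            result[out_k] = out_v
    return result

counts = map_reduce([("document1", "to be or not to be")], map_fn, reduce_fn)
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

On a real cluster the platform, not the programmer, runs the map tasks in parallel, performs the shuffle across machines, and schedules the reducers; only `map_fn` and `reduce_fn` are user code.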
  19. This means it separates the huge data from the time-consuming part of the algorithm, albeit at the cost of a lot of disk space. Fault tolerance: there are two aspects of fault tolerance. 1. Transactions should not be affected by machine failure. One step would be making the machine less prone to failure, and asking the user to restart the transaction from scratch in case of failure; saving transaction data permanently somewhere (to save some of the steps) would reduce the speed of the transaction. Normal DB operation gives emphasis to this kind of fault tolerance in its implementation. Fault tolerance: the RDBMS school of thought.
  20. 2. Complex query execution. Here the only point of fault tolerance is that the same query should be processed again from whichever state it can be restarted. Saving data in intermediate forms can be desirable. Map Reduce gives emphasis to this kind of tolerance in its implementation; the extra time taken to save data permanently is offset by parallelization. Fault tolerance: the MR school of thought. Hierarchy of parallelism:
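The MR school of thought can be illustrated with a small simulation: each map task persists its intermediate output to its own file, so when a worker dies only that one task is re-executed, and the already finished intermediate files are untouched. This is a hypothetical sketch of the idea, not Hadoop's actual scheduler:

```python
import os
import tempfile

def run_map_task(task_id, split, out_dir, fail=False):
    """Run one map task, persisting its intermediate output to its own file."""
    if fail:
        raise RuntimeError(f"worker running task {task_id} died")
    path = os.path.join(out_dir, f"map-{task_id}.txt")
    with open(path, "w") as f:
        for word in split.split():
            f.write(f"{word}\t1\n")
    return path

def run_job(splits):
    out_dir = tempfile.mkdtemp()
    done = {}
    # First attempt: simulate the worker for task 1 dying mid-job.
    for task_id, split in enumerate(splits):
        try:
            done[task_id] = run_map_task(task_id, split, out_dir,
                                         fail=(task_id == 1))
        except RuntimeError:
            pass  # the task stays pending; nothing else is lost
    # Recovery: re-run only the failed tasks, from the state that survives
    # on disk, instead of restarting the whole query from scratch.
    for task_id, split in enumerate(splits):
        if task_id not in done:
            done[task_id] = run_map_task(task_id, split, out_dir)
    return done

outputs = run_job(["to be", "or not", "to be"])
```

Because intermediate results are durable, the cost of one machine failing is one re-run task rather than one re-run query, which is what makes very large clusters of unreliable machines usable.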
  21. Effectively, the Map Reduce philosophy of fault tolerance can be summarized with the following diagram: the cycle of brute-force fault tolerance. Criticisms: 1. A giant step backward in the programming paradigm for large-scale data-intensive applications. 2. A sub-optimal implementation, in that it uses brute force instead of indexing. 3. Not novel at all; it represents a specific implementation of well-known techniques developed 25 years ago. 4. Missing most features of current DBMSs.
  22. 5. Incompatible with all of the tools DBMS users have come to depend on. Why is it still valuable? The biggest reason is that it helps programmers to logically manage the complexity of the data. The permanent writing of intermediate results also enables two wonderful features we need for our recommendation system. 1. It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done: searching entire social networks for any mention of our product by different segments of customers can be assigned to numerous computers and done fast enough. 2. It brings dynamism and load balancing: if one of the social networks has denser data and our set of machines takes a very long time to process it, we can reassign the task to a larger set of free machines to speed up the process. This is essential when we don't know the nature of the data we are going to process. So the parallelism achieved by a parallel DBMS (normally done by dividing an execution plan across a limited number of processors) pales in comparison with what is achieved by map reduce.
  23. At large scales, even super-fancy reliable hardware still fails, albeit less often, so software still needs to be fault-tolerant; and commodity machines without fancy hardware give better performance per dollar. Using more memory to speed up querying has its own implications for tolerance and cost. An example: this example invites you into the complex world of the sequential web access-based recommendation system. Its working is explained in the following diagram:
  24. Sequential web access-based recommendation system: it goes through web server logs, mines the patterns in the access sequences and then creates a pattern tree. The pattern tree is continuously modified using data from different servers [Zhou et al.]. System architecture of the sequential web access system [Zhou et al.]. The target is to create a pattern tree. And when a particular user has to be catered with a suggestion:
  25. The user's access pattern tree is compared with the entire tree of patterns; the portions of the tree most similar to the user's pattern are selected; and the branches of those nodes are suggested. Some details of the algorithm: Let E be a set of unique access events, which represent web resources accessed by users, i.e. web pages, URLs, topics or categories. A web access sequence S = e1e2... is an ordered collection (sequence) of access events. Suppose we have a set of web access sequences over the set of events E = (a, b, c, d, e, f); a sample database looks like:
     Session ID | Web access sequence
     1          | abdac
     2          | eaebcac
     3          | babfae
     4          | abbacfc
  26. Access events can be classified into frequent and infrequent based on whether their frequency crosses a threshold level (here, a support of 3), and a tree consisting of frequent access events can be created.
     Length of sequence | Sequential web access patterns with support
     1                  | a:4, b:4, c:3
     2                  | aa:4, ab:4, ac:3, ba:4, bc:3
     3                  | aac:3, aba:4, abc:3, bac:3
     4                  | abac:3
     Finally the tree is formed:
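The supports in the table above can be checked mechanically: a pattern's support is the number of sessions that contain it as an ordered (not necessarily contiguous) subsequence. A short sketch over the sample database from the text, with helper names of our own:

```python
def contains(session, pattern):
    # True if `pattern` occurs in `session` as an ordered subsequence.
    it = iter(session)
    return all(event in it for event in pattern)  # `in` consumes the iterator

def support(db, pattern):
    # Number of sessions in the database that contain the pattern.
    return sum(contains(s, pattern) for s in db)

db = ["abdac", "eaebcac", "babfae", "abbacfc"]   # the four sessions above

print(support(db, "a"), support(db, "b"), support(db, "c"))    # 4 4 3
print(support(db, "ab"), support(db, "ba"), support(db, "bc")) # 4 4 3
print(support(db, "abac"))                                     # 3
```

With the threshold of 3, events d (support 1), e and f (support 2) drop out, and abac:3 survives as the longest frequent pattern, matching the table.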
  27. The map and reduce: 1. A map job can be designed to process the logs and create the pattern tree. 2. The task is divided among thousands of cheap machines using the Map Reduce platform. 3. The dynamic-data, static-query model of data in flight will be very helpful in modifying the main tree. 4. The tree structure can be stored efficiently by altering the physical storage through sorting and partitioning.
  28. 5. Then, based on the user's access pattern, we have to select a few parts of the tree; this can be designed as a reduce job which runs across the tree data. A DBMS for the same case? Map: a huge database of access logs would have to be uploaded to a DB, and then updated at regular intervals to reflect changes in site usage. Then a query would have to be written to get a tree-like data structure out of this data behemoth, which changes shape continuously! An execution plan, simplistic and non-dynamic in nature, would have to be made, which is ineffective; it would have to be divided among many parallel engines, and this requires expertise in parallel programming. Reduce: during the reduce phase the entire tree would have to be searched for the existence of resembling patterns, which is also ineffective in an execution-plan-driven model, as explained above.
  29. With the explosion of data and the increased need for personalization in recommendations, map reduce becomes the most suitable pattern. Parallel DB vs MapReduce: an RDBMS is good when the application is query-intensive, whether the data is semi-structured or rigidly structured. MR is effective for ETL and "read once" data sets, complex analytics, semi-structured and unstructured data, quick-and-dirty analyses, and limited-budget operations. Summary of the advantages of MR: storage system independence, automatic parallelization, load balancing,
  30. network and disk transfer optimization, handling of machine failures, robustness, improvements to the core library benefiting all its users, and ease for programmers! What is 'Hadoop'? Based on the Map Reduce paradigm, the Apache foundation has given rise to a program for developing tools and techniques on an open source platform. This program and the resultant technology are referred to as 'Hadoop'. The icon of Hadoop. Hybrid of Map Reduce and relational database:
  31. As pointed out earlier, RDBMS technology is suitable when the data to be processed is structured, and many of the intermediate forms arising while processing web data can also be structured. For that reason, many researchers have come up with hybrid systems, where the unstructured data is taken care of by the map reduce platform and the structured part is handled by an RDBMS. HadoopDB is one such proposed system. Pig: if we have established a Hadoop infrastructure, can we use it for processing repetitive tasks? How can we use it as efficiently as a relational database handles repetitive tasks? The answer is a programming language called 'Pig'. Pig helps us write an execution plan, just like the ones we have in a relational database.
  32. Pig allows the programmer to control the flow of data with prewritten code. It is suitable when the tasks are repetitive and the plans can be envisaged early on. What does Hive do? Users of databases are often not technology masters; they might be familiar with existing platforms, and these platforms tend to generate SQL-like queries. We need a program to convert these traditional SQL queries into MapReduce jobs, and the one created by the Hadoop movement is Hive. So from the outside it looks like an SQL query (or a button click which generates an SQL query), but the underlying processing is done the map reduce way.
  33. New models: cloud. 1. Many services are available on the cloud, like Amazon Web Services (Amazon Elastic MapReduce). 2. The user gets MR services by entering the input text or site name, the required output, etc., without going into the technical details. 3. Almost infinite scalability. 4. New business models which are efficient. Patent concerns, excerpted from a Slashdot comment on Jan 19, 2011: "But the very public complaints didn't stop Google from demanding a patent for MapReduce; nor did it stop the USPTO from granting Google's request (after four rejections). On Tuesday, the USPTO issued U.S. Patent No. 7,650,331 to Google for inventing Efficient Large-Scale Data Processing."
  34. Will Google enforce the patent? If it does, it will hamper the growth of the Hadoop community. Scope of future work: Hadoop is an open source movement, so the author can contribute to the Hadoop user community by studying more about the tools and techniques and sharing them on the internet. Implementing Hadoop in a lab and understanding many detailed aspects of the map reduce environment can also be done in the future; this is essential to get the complete picture of these new technologies. New use cases can also be developed to utilize the power of Hadoop to solve business problems. References: 1. MapReduce and Parallel DBMSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. DeWitt, Samuel Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. Communications of the ACM, January 2010. 2. Web warehousing: Web technology meets data warehousing. Xin Tan, David C. Yen, Xiang Fang. 3. Clouds, big data, and smart assets: Ten tech-enabled business trends to watch. McKinsey Quarterly. 4. Large Scale of E-learning Resources Clustering with Parallel Affinity Propagation.
  35. Wenhua Wang, Hanwang Zhang, Fei Wu, and Yueting Zhuang. City University of Hong Kong, Hong Kong, August 2008. 5. Experiences with MapReduce, an Abstraction for Large-Scale Computation. Jeff Dean, Google, Inc. 6. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, August 2009. 7. Parallel Collection of Live Data Using Hadoop. Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George Stephanides. Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece. 8. Hive: A Petabyte Scale Data Warehouse Using Hadoop. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy. 9. Massive Structured Data Management Solution. Ullas Nambiar, Rajeev Gupta, Himanshu Gupta and Mukesh Mohania. IBM Research, India. 10. Situational Business Intelligence. Alexander Löser, Fabian Hueske, and Volker Markl. TU Berlin, Database Systems and Information Management Group. 11. An Intelligent Recommender System using Sequential Web Access Patterns [Zhou et al.]. 13. http://en.wikipedia.org 14. http://cloudera.com 15. Slashdot.org