Study of Map Reduce and related technologies

           Applied Management Research Project


                (January 2011 - April 2011)


            Vinod Gupta School of Management


                      IIT Kharagpur




                      Submitted By


                    Renjith Peediackal


                    Roll no. 09BM8040


                  MBA, Batch of 2009-11




                   Under the guidance of


                 Prof. Prithwis Mukherjee


                      Project Guide


                  VGSOM, IIT Kharagpur
Certificate of Examination

This is to certify that we have examined the summer project report on “Study of Map Reduce

and related technologies” and accept this in partial fulfillment for the MBA degree for which it

has been submitted. This approval does not endorse or accept every statement made, opinion

expressed or conclusion drawn, as recorded in this internship report. It is limited to the

acceptance of the project report for the applicable purpose.


Examiner (External) ____________________________________________________


Additional Examiner ____________________________________________________


Supervisor (Academic) __________________________________________________


Date: ________________________________________________________________
UNDERTAKING REGARDING PLAGIARISM


This declaration is for the purpose of the Applied Management Research Project on Study of

Map Reduce and related technologies, which is in partial fulfillment for the degree of Master

of Business Administration.


I hereby declare that the work in this report is authentic and genuine to the best of my knowledge

and I would be liable for appropriate punitive action if found otherwise.




Renjith Peediackal


Roll No. 09BM8040


Vinod Gupta School of Management


IIT Kharagpur
ACKNOWLEDGEMENT

       I am highly thankful to Professor Prithwis Mukherjee of the Vinod Gupta School of Management, IIT Kharagpur, for giving me the chance to carry out this work on map reduce technology as a review paper for this project, and for the regular interactions and guidance he provided over its course. I also thank my classmates in the course ‘IT for BI’ for their support and feedback on my work.


                                                                              Renjith Peediackal,


                                                                         VGSOM, IIT Kharagpur
UNDERTAKING REGARDING PLAGIARISM

ACKNOWLEDGEMENT

EXECUTIVE SUMMARY

INTRODUCTION

THE CASE FOR MAP REDUCE
   The complexity of input data
   Information is sparse in the humungous data
   Appetite for personalization

EXAMPLE OF A PROGRAMME USING INTERNET DATA
   Recommendation systems
   Some of the problems with recommendation systems
      Popularity
      Integration of expert opinion
      Integration of internet data

WHAT IS DATA IN FLIGHT?
   Map and reduce.

HOW DOES A MAP REDUCE PROGRAM WORK
   Map
   Shuffle or sort
   Reduce
   How is this new way helpful in accommodating complex internet data for our recommendation system?
   Brute power
   Fault tolerance
   Criticisms
   Why is it still valuable?
   An example
   Sequential web access-based recommendation system

PARALLEL DB VS MAPREDUCE

SUMMARY OF ADVANTAGES OF MR

WHAT IS ‘HADOOP’?

HYBRID OF MAP REDUCE AND RELATIONAL DATABASE

PIG

WHAT DOES HIVE DO?

NEW MODELS: CLOUD

PATENT CONCERNS

SCOPE OF FUTURE WORK

REFERENCES




Executive Summary

       The data explosion on the internet has forced technologists to redesign the tools and techniques used to handle that data. ‘Map reduce’, ‘Hadoop’ and ‘data in flight’ have been buzzwords over the last few years. Much of the information available about these technologies is either purely technical or comes from the marketing documents of the companies that build tools on the technology. Technical language is difficult for a normal business user to follow, while marketing documents tend to be superficial. There is therefore a need for an article that explains the technology in a lucid and simple way, so that business users can understand the importance and future prospects of this movement. The author adopted this mission as part of the Applied Management Research Project in the fourth semester.

       Web data available today is mostly dynamic, sparse and unstructured, and processing it with existing techniques such as parallel relational database management systems is nearly impossible. The programming effort needed to do this the older way is humungous, and it is doubtful whether anybody could do it in practice. New technologies like ‘Hadoop’ create a platform for doing parallel data processing in an easier and more cost-effective way. In this paper, the author tries to explain ‘Hadoop’ and related technologies in a simple way. In future, the author can continue this effort to demystify the tools and techniques coming from the Hadoop community and make them friendly for the business community.




Introduction

        Web data available today is mostly dynamic, sparse and unstructured. Most of the people who generate or use this content are customers or prospects for business organizations. The management community, and marketers in particular, is therefore eager to analyze the information coming out of this internet data so as to personalize products and services. Processing this data can also be vital for governments and social organizations.

        Earlier systems such as parallel relational databases were not efficient at processing web data. ‘Data in flight’ refers to the technique of capturing and processing dynamic data as and when it is produced and consumed. Google introduced the map reduce paradigm for processing web data efficiently, and the Apache Software Foundation, following in its footsteps, gave birth to ‘Hadoop’, an open source movement that develops programs and tools to process internet data.

        This review paper is a humble attempt to explain this technology in a simple and clear manner. I hope readers find the effort worthwhile.




The case for Map Reduce

The complexity of input data




       In today's content-rich world, the volume and complexity of data available to decision makers is huge. When we add the internet to the picture, the volume explodes. This data is unstructured, and it is produced and consumed by random people in different parts of the globe.


Information is sparse in the humungous data

       The information is actually hidden in the sea of data produced and consumed.


Appetite for personalization

       The perpetual challenge for the marketer is to personalize products in order to attract and retain customers. Marketers look to information that can be derived from the internet to understand their customers better.



Example of a programme using internet data

Recommendation systems

       Take the example of a recommendation system. Customer Y buys product X from an e-commerce site after going through a number of products X1, X2, X3, X4. This behavior is repeated by 100 customers! And there are customers who drop out of shopping after going through X1 to X3.

       Should we have suggested X to those people before they exited the site? And if we recommend the wrong product, how will it affect the buying decision?

       Questions like this also arise: "Why am I selling X number of figurines, and not Y [a higher number] number of figurines? What pages are the people who buy things reading? If I'm running promotions, which Web sites are sending me traffic with people who make purchases?" [Avinash Kaushik – analytics expert] And also: what kind of content do I need to develop on my site to attract the right set of people? On what kinds of sites should my URL be present so that I get the maximum number of referrals? How many visitors quit after seeing the homepage? What kind of design could make them go forward? Are users clicking on the right links in the right fashion on the website (site overlay)? What is the bounce rate? How can we save money on PPC schemes? And how can we get data on competitors' marketing activities and how they influence customers?

       Many people use the net to sell, or at least market, their products and services. How can we analyze user behavior in such cases to benefit our organization? An example would be using the information from something like a Facebook shopping page.

       The most critical question for the analytics professional is whether we can build robust algorithms to predict such behavior. How much data do we need to process? Is point-of-sale history enough?


Some of the problems with recommendation systems




Popularity

A recommendation system's basic logic is to process historical data on customer behaviour and predict future behavior. So, naturally, the most “popular” products top the list of recommendations. Here, popular means previously favored the most by the same segment of customers, for a particular product category. This is dangerous to companies in many ways:

   1. Customers need not remain satisfied with the same products forever.

   2. Companies have to create niche products and up-sell and cross-sell them to satisfy and retain customers, and thus succeed in the market. A popularity-based system ruins these possibilities of exploration!

   3. The opportunity to sell a product is lost!

   4. Lack of personalization leads to broken relationships.




Integration of expert opinion

       How do we integrate the inferences given by experts in a particular business domain into the process of statistical analysis? This is a mix of art and science, and nobody knows the right blend. Will it undermine the basic concept of data-driven decision making?


Integration of internet data

       One approach to attacking the current deficiencies is to use more data, increasing the scope and relevance of the analysis. The users of mail, social networking groups, or niche special-purpose groups are all consumers and potential customers, and bloggers and other information providers can play the role of opinion leaders. So if we are able to integrate the information scattered over the internet into our decision-making models, we can better understand different segments of customers and their current and future needs.

       But the fact is that the data available on the internet does not lend itself at all to statistical analysis using current DBMS technology at a normal level of cost. It is dynamic, sparse and mostly unstructured. It is also difficult to say who could program such a system to handle web data using current technology.




Growth of data: mind-boggling.

          – Published content: 3-4 Gb/day


          – Professional web content: 2 Gb/day


          – User generated content: 5-10 Gb/day


          – Private text content: ~2 Tb/day (200x more)


(Ref: Raghu Ramakrishnan, http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/slides/Ramakrishnan_ngdm07.pdf)


Questions asked of this data by very demanding business users:

          – Can we do Analytics over Web Data / User Generated Content?


          – Can we handle TB of text data, with GB of new data getting added to the internet each day?
– Structured Queries, Search Queries for unstructured data?


           – Can we expect the tools to give answers at “Google-Speed”?


           This gives us a strong case for adopting the new technology of data in flight. ‘Map Reduce’ is a technology developed by Google for similar purposes. These technologies are explained one by one in the following paragraphs.



What is Data in flight?



Earlier data was at ‘rest’!


       In the traditional concept of a DBMS, data is at rest; queries hit that static data and fetch results.


Now data is just flying in!


       The new concept of ‘data in flight’ treats the already prepared query as static, collecting dynamic data as and when it is produced and consumed. But this needs more powerful tools to handle the enormity and complexity of the data and the sparse nature of the information it contains.
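As a purely illustrative sketch, here is what a 'standing query' over data in flight might look like in plain Python. The event fields and function name are hypothetical, and a real system would use a stream-processing platform rather than a toy generator; the point is only that the query stays fixed while the data flows past it.

```python
from collections import Counter

def standing_query(events):
    """The query is fixed; the data streams past it and the answer is
    updated as each event arrives, instead of querying data at rest."""
    counts = Counter()
    for event in events:               # events arrive one at a time
        counts[event["page"]] += 1     # update the running answer immediately
        yield dict(counts)             # current answer after each event

if __name__ == "__main__":
    stream = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
    for snapshot in standing_query(stream):
        print(snapshot)
    # {'home': 1} -> {'home': 1, 'cart': 1} -> {'home': 2, 'cart': 1}
```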


Map and reduce.

       A map operation is needed to translate the sparse information, available in numerous formats, into a form that can be processed easily by an analytical tool. Once the information is in a simpler, structured form, it can be reduced to the required results.

A standard example:


Word count!


       Given a document, how many of each word are there?


But in the real world it can be:

           – Given our search logs, how many people click on result 1?

           – Given our Flickr photos, how many cat photos are there from users in each geographic region?

           – Given our web crawl, what are the 10 most popular words?


Word count and twitter


           – Tweets can be used to get early warnings of epidemics like swine flu.

           – Tweets can be used to understand the ‘mood’ of people in a region, for different purposes, even subliminal marketing.

           – Software created by Dr Peter Dodds and Dr Chris Danforth of the University of Vermont collects sentences from blogs and tweets, zeroing in on the happiest and saddest days of the last few years.

           – Can it prevent social crises?



How does a map reduce program work

       The programmer has to specify only two methods, Map and Reduce; the rest is taken care of by the platform.


Map

      map (k, v) -> <k', v'>*


   1. Specify a map function that takes a key (k) / value (v) pair.

          a. key = document URL, value = document contents

             e.g. “document1”, “to be or not to be”

   2. The output of map is (potentially many) key/value pairs: <k', v'>*

   3. In our case, output (word, “1”) once per word in the document:

          a. “to”, “1”

          b. “be”, “1”

          c. “or”, “1”

          d. “to”, “1”

          e. “not”, “1”

          f. “be”, “1”


Shuffle or sort




Items which are logically related (that is, which share the same key) are brought physically close to each other.


           – “to”, “1”


           – “to”, “1”


           – “be”, “1”


           – “be”, “1”


           – “not”, “1”


           – “or”, “1”


Reduce




reduce (k', <v'>*) -> <k', v'>*


   •   The reduce function combines the values for a key


– “be”, “2”


           – “not”, “1”


           – “or”, “1”


           – “to”, “2”
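Putting the three steps together, here is a minimal word-count sketch in plain Python. It is illustrative only: the helper names are hypothetical, and a real Hadoop job would supply the same two functions through the framework's own interfaces, with the shuffle/sort step in the middle handled automatically by the platform.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (document URL, document contents) -> [(word, 1), ...]"""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> (word, total count)"""
    return (key, sum(values))

def run_mapreduce(documents):
    # Map phase: emit intermediate key/value pairs from every document.
    intermediate = []
    for url, contents in documents.items():
        intermediate.extend(map_fn(url, contents))

    # Shuffle/sort phase: bring values with the same key together.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # Reduce phase: combine the values for each key.
    return dict(reduce_fn(word, counts) for word, counts in sorted(groups.items()))

if __name__ == "__main__":
    docs = {"document1": "to be or not to be"}
    print(run_mapreduce(docs))   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```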


       The functions within map and reduce differ from one use case to another, but the architecture and the supporting platform remain the same. For example, our recommendation system can use a map function to turn any meaningful association between two product names X and Y in an online discussion into a product-pair key with a count. Later, in the reduce step, based on how often each combination arises, the information can be reduced into a suggestion of X given Y. This is a very simplified explanation of the algorithms; refer to papers on algorithms such as parallel affinity propagation to understand the complexity of such a process. This is a new way of writing algorithms. Let us see how it helps to accommodate humungous internet data; a small sketch of the product-pair idea follows below.
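As an illustration of that idea only, and not the algorithm of the papers cited, the co-mention counting could be sketched as follows; the product names, catalogue and helper functions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

PRODUCTS = {"X", "Y", "Z"}          # hypothetical catalogue of product names

def map_fn(discussion_id, text):
    """Emit ((product_a, product_b), 1) for every pair of products co-mentioned."""
    mentioned = sorted({w for w in text.split() if w in PRODUCTS})
    return [(pair, 1) for pair in combinations(mentioned, 2)]

def reduce_fn(pair, counts):
    """Combine the counts for one product pair."""
    return pair, sum(counts)

def co_mention_counts(discussions):
    grouped = defaultdict(list)
    for did, text in discussions.items():
        for pair, one in map_fn(did, text):
            grouped[pair].append(one)
    return dict(reduce_fn(pair, counts) for pair, counts in grouped.items())

if __name__ == "__main__":
    talk = {
        "post1": "I bought X and also liked Y",
        "post2": "comparing X with Y before buying",
        "post3": "Z arrived late",
    }
    print(co_mention_counts(talk))   # {('X', 'Y'): 2}
```

The reduce output, ranked by count, is what would feed a "suggest Y given X" rule.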


How is this new way helpful in accommodating complex internet data for our recommendation system?


Brute power

       One view could be that map reduce algorithms are effective because they:

            – use the brute power of many machines to map a huge chunk of sparse data into a small table of dense data;

            – perform the complex and time-consuming part of the “task” on this new, small and dense data in the reduce step;

            – in other words, separate the huge data from the time-consuming part of the algorithm, albeit at the cost of a lot of disk space.


Fault tolerance

There are two aspects to fault tolerance:

   1. Transactions should not be affected by machine failure. One approach is to make the machine less prone to failure and to ask the user to restart the transaction from scratch in case of failure; saving transaction data permanently somewhere (to preserve some of the steps) would slow the transaction down. Normal DB operation emphasizes this kind of fault tolerance in its implementation.




           Fault tolerance: RDBMS school of thought




   2. Complex query execution. Here the only requirement of fault tolerance is that the query should be resumed from whichever state it can be restarted; saving data in intermediate forms is therefore desirable. Map Reduce emphasizes this kind of tolerance in its implementation, and the extra time taken to save intermediate results permanently is offset by parallelization.




           Fault tolerance: MR school of thought


Hierarchy of Parallelism:




In effect, the map reduce philosophy of fault tolerance can be summarized with the following diagram.




Cycle of brute force fault tolerance


Criticisms

               1. A giant step backward in the programming paradigm for large-scale data-intensive applications

               2. A sub-optimal implementation, in that it uses brute force instead of indexing

               3. Not novel at all: it represents a specific implementation of well-known techniques developed 25 years ago

               4. Missing most features found in current DBMSs

               5. Incompatible with all of the tools DBMS users have come to depend on


Why is it still valuable?

The biggest reason is that it helps programmers logically manage the complexity of the data.

Writing intermediate results permanently also enables two features we need for our recommendation system.


               1. It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done.

Searching entire social networks for any mention of our product by different segments of customers can be assigned to numerous computers and completed fast enough.

               2. It brings dynamism and load balancing. If one of the social networks has denser data and our set of machines is taking very long to process it, we can reassign the task to a larger set of free machines to speed up the process. This is essential when we do not know the nature of the data we are going to process.




So the parallelism achieved by a parallel DBMS (normally obtained by dividing an execution plan across a limited number of processors) is nothing compared with what is achieved by map reduce.



At large scales, even super-fancy reliable hardware still fails, albeit less often, so:

                   – software still needs to be fault-tolerant;

                   – commodity machines without fancy hardware give better performance per dollar;

                   – using more memory to speed up querying has its own implications for tolerance and cost.


An example

     This example takes you into the complex world of the sequential web access-based recommendation system. Its working is explained in the following diagram:




Sequential web access-based recommendation system

It goes through web server logs, mines the patterns in the access sequences and then creates a pattern tree. The pattern tree is continuously updated with data from different servers [Zhou et al].




System architecture of sequential web access system [Zhou et al]




The target is to create a pattern tree.


When a particular user has to be given a suggestion:

             – his access pattern is compared with the entire tree of patterns;

             – the portions of the tree that best match the user's pattern are selected; and

             – the branches of those nodes are suggested.


Some details of the algorithm


    •   Let E be a set of unique access events, which represent the web resources accessed by users, i.e. web pages, URLs, topics or categories.

    •   A web access sequence S = e1 e2 ... en is an ordered collection (sequence) of access events.


Suppose we have a set of web access sequences over the set of events E = {a, b, c, d, e, f}. A sample database would look like this:

Session ID          Web access sequence

1                   abdac

2                   eaebcac

3                   babfae

4                   abbacfc


Access events can be classified into frequent and infrequent, depending on whether their frequency crosses a threshold (the minimum support; 3 in this example).

A tree consisting of the frequent access patterns can then be created.


Length of sequence          Sequential web access patterns with support

1                           a:4, b:4, c:3

2                           aa:4, ab:4, ac:3, ba:4, bc:3

3                           aac:3, aba:4, abc:3, bac:3

4                           abac:3
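The counting behind this table can be reproduced with a short, illustrative Python sketch: a brute-force enumeration with a minimum support of 3, using hypothetical helper names rather than the efficient mining algorithm of Zhou et al.

```python
from itertools import combinations
from collections import Counter

SESSIONS = ["abdac", "eaebcac", "babfae", "abbacfc"]
MIN_SUPPORT = 3
MAX_LENGTH = 4

def subsequences(session, length):
    """All distinct ordered subsequences of the given length in one session."""
    return {"".join(sub) for sub in combinations(session, length)}

def frequent_patterns(sessions, min_support, max_length):
    frequent = {}
    for length in range(1, max_length + 1):
        counts = Counter()
        for s in sessions:
            counts.update(subsequences(s, length))   # each session counted once per pattern
        for pattern, support in counts.items():
            if support >= min_support:
                frequent[pattern] = support
    return frequent

if __name__ == "__main__":
    result = frequent_patterns(SESSIONS, MIN_SUPPORT, MAX_LENGTH)
    for pattern, support in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
        print(f"{pattern}:{support}")
    # Prints a:4, b:4, c:3, aa:4, ab:4, ac:3, ba:4, bc:3, aac:3, aba:4, abc:3, bac:3, abac:3
```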




Finally, the pattern tree is formed from these frequent sequences.




The Map and reduce


   1. A map job can be designed to process the logs and create the pattern tree.

   2. The task is divided among thousands of cheap machines using the Map Reduce platform.

   3. The dynamic-data, static-query model of data in flight is very helpful for keeping the main tree up to date.

   4. The tree structure can be stored efficiently by altering the physical storage through sorting and partitioning.

   5. Then, based on the user's access pattern, we have to select a few parts of the tree. This can be designed as a reduce job which runs across the tree data; a rough sketch of this suggestion step follows below.
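A rough sketch of that suggestion step, continuing the toy example above with hypothetical function names, and using a flat-pattern simplification of the tree matching described by Zhou et al.:

```python
from collections import defaultdict

def suggest_next(frequent, recent, top_n=2):
    """Suggest the next access events for a user whose recent accesses form `recent`.

    `frequent` maps pattern -> support (as produced by the mining sketch above);
    a suggestion is the event that follows `recent` in a frequent pattern,
    ranked by the support of that longer pattern."""
    scores = defaultdict(int)
    for pattern, support in frequent.items():
        if pattern.startswith(recent) and len(pattern) > len(recent):
            next_event = pattern[len(recent)]
            scores[next_event] = max(scores[next_event], support)
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [event for event, _ in ranked[:top_n]]

if __name__ == "__main__":
    frequent = {"a": 4, "b": 4, "c": 3, "aa": 4, "ab": 4, "ac": 3, "ba": 4, "bc": 3,
                "aac": 3, "aba": 4, "abc": 3, "bac": 3, "abac": 3}
    print(suggest_next(frequent, "ab"))   # ['a', 'c']
```

In a production system the matching would run against the pattern tree itself, distributed as a reduce job across the tree data, as described above.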


DBMS for the same case?


   •   Map

          – A huge database of access logs would have to be uploaded to a DB, and then updated at regular intervals to reflect changes in site usage.

          – A query would then have to be written to extract a tree-like data structure out of this data behemoth, which changes shape continuously!

          – An execution plan, simplistic and non-dynamic in nature, would have to be made: ineffective.

          – It would have to be divided among many parallel engines,

          – and this requires expertise in parallel programming.

   •   Reduce

          – During the reduce phase the entire tree has to be searched for resembling patterns.

          – This too would be ineffective in an execution-plan-driven model, as explained above.




   •   With the explosion of data and the growing need for personalization in recommendations, map reduce becomes the most suitable pattern.



Parallel DB vs MapReduce

   •   An RDBMS is good when

          – the application is query-intensive,

          – whether the data is semi-structured or rigidly structured.

   •   MR is effective for

          – ETL and “read once” data sets,

          – complex analytics,

          – semi-structured and unstructured data,

          – quick-and-dirty analyses,

          – limited-budget operations.



Summary of advantages of MR

   •   Storage system independence

   •   Automatic parallelization

   •   Load balancing

   •   Network and disk transfer optimization

   •   Handling of machine failures

   •   Robustness

   •   Improvements to the core library benefit all users of the library!

   •   Ease of use for programmers!



What is ‘Hadoop’?

Based on the Map Reduce paradigm, the Apache Software Foundation has given rise to a project for developing tools and techniques on an open source platform. This project and the resulting technology are referred to as ‘Hadoop’.




The icon of Hadoop



Hybrid of Map Reduce and Relational Database


       As pointed out earlier, RDBMS technology is suitable when the data to be processed is structured, and many of the intermediate forms produced while processing web data can also be structured. For that reason, researchers have come up with hybrid systems, in which the unstructured data is taken care of by the map reduce platform and the structured part is handled by an RDBMS. HadoopDB is one such proposed system.




Pig

If we have established a Hadoop infrastructure, can we use it for repetitive processing tasks? How can we use it as efficiently as a relational database handles repetitive tasks? The answer is a programming language called ‘Pig’. Pig helps us write an execution plan, just like the ones we have in a relational database.




   •   Pig allows the programmer to control the flow of data with prewritten code. It is suitable when the tasks are repetitive and the plans can be envisaged early on.



What does Hive do?

Database users are often not technology experts; they are, however, familiar with existing platforms, and those platforms tend to generate SQL-like queries. We therefore need a program to convert these traditional SQL queries into MapReduce jobs, and the one created by the Hadoop movement is Hive. From the outside it looks like an SQL query, or a button click that generates an SQL query, but the underlying processing is done the map reduce way.




New models: cloud

   1. Many services are available in the cloud, such as Amazon Web Services (Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/).

   2. The user gets MR services by entering the input text or site name, the required output, etc., without going into the technical details.

   3. Almost infinite scalability.

   4. New, efficient business models.



Patent Concerns




Excerpts from a Slashdot comment on Jan 19, 2011


       “But the very public complaints didn't stop Google from demanding a patent for MapReduce; nor did it stop the USPTO from granting Google's request (after four rejections). On Tuesday, the USPTO issued U.S. Patent No. 7,650,331 to Google for inventing Efficient Large-Scale Data Processing.”

•   Will Google enforce the patent?


   •   If it does, it will hamper the growth of the Hadoop community.




Scope of future work

       Hadoop is an open source movement, so the author can contribute to the Hadoop user community by studying the tools and techniques further and sharing the findings on the internet. Implementing Hadoop in a lab and understanding the detailed aspects of the map reduce environment can also be taken up in the future; this is essential for getting the complete picture of these new technologies. New use cases can also be developed to apply the power of Hadoop to business problems.



References


   1. MapReduce and Parallel DBMSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. Communications of the ACM, January 2010.
   2. Web warehousing: Web technology meets data warehousing. Xin Tan, David C. Yen, Xiang Fang.
   3. Clouds, big data, and smart assets: Ten tech-enabled business trends to watch. McKinsey Quarterly.
   4. Large Scale of E-learning Resources Clustering with Parallel Affinity Propagation. Wenhua Wang, Hanwang Zhang, Fei Wu, and Yueting Zhuang. City University of Hong Kong, Hong Kong, August 2008.
   5. Experiences with MapReduce, an Abstraction for Large-Scale Computation. Jeff Dean, Google, Inc.
   6. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski. August 2009.
   7. Parallel Collection of Live Data Using Hadoop. Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George Stephanides. Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece.
   8. Hive – A Petabyte Scale Data Warehouse Using Hadoop. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy.
   9. Massive Structured Data Management Solution. Ullas Nambiar, Rajeev Gupta, Himanshu Gupta and Mukesh Mohania. IBM Research – India.
   10. Situational Business Intelligence. Alexander Löser, Fabian Hueske, and Volker Markl. TU Berlin, Database System and Information Management Group.
   11. An Intelligent Recommender System using Sequential Web Access Patterns. Alexander Löser, http://user.cs.tu-berlin.de/~aloeser
   12. http://hadoop.apache.org/
   13. http://en.wikipedia.org
   14. http://cloudera.com
   15. Slashdot.org
   16. Amazon.com





More Related Content

Similar to Amr r enjith

Global eLEARNING Industry
Global eLEARNING IndustryGlobal eLEARNING Industry
Global eLEARNING Industry
ReportLinker.com
 
Metered IT - The Path to Utility Computing
Metered IT - The Path to Utility ComputingMetered IT - The Path to Utility Computing
Metered IT - The Path to Utility Computing
Valencell, Inc.
 
[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...
[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...
[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...
Qin Gao
 
Seo proposal rose_valley
Seo proposal rose_valleySeo proposal rose_valley
Seo proposal rose_valley
Tom Richard
 
Consumer Engagement in the Digital Age
Consumer Engagement in the Digital AgeConsumer Engagement in the Digital Age
Consumer Engagement in the Digital Age
Gregory Birgé
 
Oracle Days Romania 2011 Keynote
Oracle Days Romania 2011 Keynote Oracle Days Romania 2011 Keynote
Oracle Days Romania 2011 Keynote
Freelance PR
 

Similar to Amr r enjith (20)

Global eLEARNING Industry
Global eLEARNING IndustryGlobal eLEARNING Industry
Global eLEARNING Industry
 
Value Proposition for IBM PureFlex System Case for IBM PureFlex System for ...
Value Proposition for  IBM PureFlex System  Case for IBM PureFlex System for ...Value Proposition for  IBM PureFlex System  Case for IBM PureFlex System for ...
Value Proposition for IBM PureFlex System Case for IBM PureFlex System for ...
 
NETWORK PROCESSORS OF THE PAST, PRESENT AND FUTURE
NETWORK PROCESSORS OF THE PAST, PRESENT AND FUTURENETWORK PROCESSORS OF THE PAST, PRESENT AND FUTURE
NETWORK PROCESSORS OF THE PAST, PRESENT AND FUTURE
 
Metered IT - The Path to Utility Computing
Metered IT - The Path to Utility ComputingMetered IT - The Path to Utility Computing
Metered IT - The Path to Utility Computing
 
[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...
[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...
[HCII2011] Performance Visualization for Large Scale Computing System - A Lit...
 
Software Modernization and Legacy Migration Primer
Software Modernization and Legacy Migration PrimerSoftware Modernization and Legacy Migration Primer
Software Modernization and Legacy Migration Primer
 
Emulex and Enterprise Strategy Group Present Why I/O is Strategic for Virtual...
Emulex and Enterprise Strategy Group Present Why I/O is Strategic for Virtual...Emulex and Enterprise Strategy Group Present Why I/O is Strategic for Virtual...
Emulex and Enterprise Strategy Group Present Why I/O is Strategic for Virtual...
 
Big Data, IBM Power Event
Big Data, IBM Power EventBig Data, IBM Power Event
Big Data, IBM Power Event
 
Whitepaper Cloud Egovernance Imaginea
Whitepaper Cloud Egovernance ImagineaWhitepaper Cloud Egovernance Imaginea
Whitepaper Cloud Egovernance Imaginea
 
Building an Innovation Program
Building an Innovation ProgramBuilding an Innovation Program
Building an Innovation Program
 
Seo proposal rose_valley
Seo proposal rose_valleySeo proposal rose_valley
Seo proposal rose_valley
 
Consumer Engagement in the Digital Age
Consumer Engagement in the Digital AgeConsumer Engagement in the Digital Age
Consumer Engagement in the Digital Age
 
Oracle Days Romania 2011 Keynote
Oracle Days Romania 2011 Keynote Oracle Days Romania 2011 Keynote
Oracle Days Romania 2011 Keynote
 
Procure-to-Pay (P2P) Outsourcing: Unlocking Value from End-to-End Process Out...
Procure-to-Pay (P2P) Outsourcing: Unlocking Value from End-to-End Process Out...Procure-to-Pay (P2P) Outsourcing: Unlocking Value from End-to-End Process Out...
Procure-to-Pay (P2P) Outsourcing: Unlocking Value from End-to-End Process Out...
 
Rick Tilghman, PayPal - Arizona World Usability Day 2012 UX Keynote Presentation
Rick Tilghman, PayPal - Arizona World Usability Day 2012 UX Keynote PresentationRick Tilghman, PayPal - Arizona World Usability Day 2012 UX Keynote Presentation
Rick Tilghman, PayPal - Arizona World Usability Day 2012 UX Keynote Presentation
 
Barc - QlikTech in THE BI SURVEY 12
Barc - QlikTech in THE BI SURVEY 12Barc - QlikTech in THE BI SURVEY 12
Barc - QlikTech in THE BI SURVEY 12
 
Chevrolet & Performics OneSearch Case Study
Chevrolet & Performics OneSearch Case StudyChevrolet & Performics OneSearch Case Study
Chevrolet & Performics OneSearch Case Study
 
Intro ict
Intro ictIntro ict
Intro ict
 
Big Data LDN 2018: USING FAST-DATA TO MAKE SEMICONDUCTORS
Big Data LDN 2018: USING FAST-DATA TO MAKE SEMICONDUCTORSBig Data LDN 2018: USING FAST-DATA TO MAKE SEMICONDUCTORS
Big Data LDN 2018: USING FAST-DATA TO MAKE SEMICONDUCTORS
 
Devops for bank in indonesia
Devops for bank in indonesiaDevops for bank in indonesia
Devops for bank in indonesia
 

Recently uploaded

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 

Amr r enjith

  • 1. Study of Map Reduce and related technologies Applied management Research Project (January 2011 - April 2011) Vinod Gupta School of Management IIT Kharagpur Submitted By Renjith Peediackal Roll no. 09BM8040 MBA, Batch of 2009-11 Under the guidance of Prof. Prithwis Mukherjee Project Guide VGSOM, IIT Kharagpur
  • 2. Certificate of Examination This is to certify that we have examined the summer project report on “Study of Map Reduce and related technologies” and accept this in partial fulfillment for the MBA degree for which it has been submitted. This approval does not endorse or accept every statement made, opinion expressed or conclusion drawn, as recorded in this internship report. It is limited to the acceptance of the project report for the applicable purpose. Examiner (External) ____________________________________________________ Additional Examiner ____________________________________________________ Supervisor (Academic) __________________________________________________ Date: ________________________________________________________________
  • 3. UNDERTAKING REGARDING PLAGIARISM This declaration is for the purpose of the Applied Management Research Project on Study of Map Reduce and related technologies, which is in partial fulfillment for the degree of Master of Business Administration. I hereby declare that the work in this report is authentic and genuine to the best of my knowledge and I would be liable for appropriate punitive action if found otherwise. Renjith Peediackal Roll No. 09BM8040 Vinod Gupta School of Management IIT Kharagpur
  • 4. ACKNOWLEDGEMENT I am highly thankful to Professor Prithwis Mukherjee, esteemed professor at Vinod Gupta School of Management, IIT Kharagpur for providing me a chance to carry out my work on map reduce technology as a review paper in point for this project. I am also thankful to him for time to time interactions and guidance that he provided me in the course of this project. Also I thank my classmates in the course ‘IT for BI’ for giving me support and feedback for my work. Renjith Peediackal, VGSOM,IIT Kharagpur
  • 5. UNDERTAKING REGARDING PLAGIARISM 3 ACKNOWLEDGEMENT 4 EXECUTIVE SUMMARY 7 INTRODUCTION 8 THE CASE FOR MAP REDUCE 9 The complexity of input data 9 Information is sparse in the humungous data 9 Appetite for personalization 9 EXAMPLE OF A PROGRAMME USING INTERNET DATA 9 Recommendation systems 9 Some of the problems with recommendation systems 10 Popularity 11 Integration of expert opinion 12 Integration of internet data 12 WHAT IS DATA IN FLIGHT? 14 Map and reduce. 14 HOW DOES A MAP REDUCE PROGRAM WORK 15 Map 16 Shuffle or sort 17 Reduce 17 How this new way helpful to us in accommodating complex internet data for our recommendation system? 18 Brute power 18 Fault tolerance 19 Criticisms 21 Why it is valuable still? 22 An example 23 Sequential web access-based recommendation system 24 PARALLEL DB VS MAPREDUCE 29 Page | 5
  • 6. SUMMARY OF ADVANTAGES OF MR 29 WHAT IS ‘HADOOP’? 30 HYBRID OF MAP REDUCE AND RELATIONAL DATABASE 30 PIG 31 WHAT DOES HIVE DO? 32 NEW MODELS: CLOUD 33 PATENT CONCERNS 33 SCOPE OF FUTURE WORK 34 REFERENCES 34 Page | 6
  • 7. Executive Summary The data explosion happen in the internet has forced the technology wizards to redesign the tools and techniques used to handle that data. ‘Map reduce’, ‘Hadoop’ and ‘Data in flight’ has been buzz words in the last few years. Many of the information available regarding these technologies are, on one hand absolutely technology oriented or on the other hand marketing documents of the companies which produce tools based on the technology. While technological language is difficult for a normal business user to understand it, marketing documents seems to be very superficial. We see the importance of an article, which explains the technology in a lucid and simple way. Then the importance and future prospects of this movement can be understood by business users and adopted this mission as part of the advanced management research project in the 4th semester. Web data available today is dynamic, sparse and unstructured mostly. And processing the data with existing techniques like parallel relational database management systems is nearly impossible. And the programming effort need to do this in the older way is humungous and there is no answer to the question whether anybody will be able to do it practically. The new technologies like ‘hadoop’ create an easy platform to do parallel data processing in an easier and cost effective way. In this research paper, the author is trying to explain ‘Hadoop’ and related technology in simple way. In future the author can continue similar effort to demystify many of the tools and techniques coming from Hadoop community and make it friendly for the business community. Page | 7
  • 8. Introduction Web data available today is dynamic, sparse and unstructured mostly. And most of the people who generates or uses this content is customers or prospects for business organizations. Therefore the management community, especially marketers is eagerly looking forward to analyze the information coming out of this internet data, so as to personalize their product and services. Also processing the data can be vital for governments and social organizations too. The earlier systems like parallel relational databases were not efficient in processing the web data. Data in flight refers to the technology of capturing and processing the dynamic data as and when it is produced and consumed. Google introduced map reduce paradigm for efficiently processing the web data. Apache organization following the footsteps of map reduce, gave birth to an open source movement ‘hadoop’ to develop programs and tools to process internet data. In this review paper, a humble attempt to explain this technology in a simple and clear manner has been taken. I hope the readers can appreciate the efforts. Page | 8
  • 9. The case for Map Reduce The complexity of input data In today content rich world the enormity and complexity of data available to decision makers is huge. And when we add the internet to the picture, the volume takes a big bang explosion. And this data is unstructured and produced and consumed by random people at different parts of the globe. Information is sparse in the humungous data The information is actually hidden in the sea of data produced and consumed. Appetite for personalization The perpetual problem of the marketer is to personalize their products to attract and retain customers. They look forward to information that can be derived from internet to understand their customers better. Example of a programme using internet data Recommendation systems Take the example of a recommendation system. Customer Y buys product X from an e- commerce site after going through a number of products X1, X2, X3, X4. And this behavior is repeated by 100 customers! And there are customers, who dropped from shopping after going through X1 to X3. Page | 9
  • 10. Should we have suggested those people about X before exiting from the site? But if we are recommending the wrong product, how will it affect the buying decision? Questions like this also arise: "Why am I selling X number of figurines, and not Y [a higher number] number of figurines? What pages are the people who buy things reading? If I'm running promotions, which Web sites are sending me traffic with people who make purchases? [ Avinash Kaushik – Analytics expert] And also What kind of content I need to develop in my site so as to attract the right set of people. Your URL should be present in what kind of sites so that you get maximum number of referral? How many of them quit after seeing the homepage? What different kind of design can be possible to make them go forward? Are the users clicking on the right links in the right fashion in your websites? (Site overlay) What is the bounce rate? How to save money on PPC schemes? Also how can we get data regarding competitors marketing activities and how it influences the customers? Many of the people are using net to sell or at least market their products and services. How can we analyze the user behavior in such cases to benefit our organization? Example: using the information from something like face book shopping page. And most critical question to analytics professional is, can we make robust algorithms to predict such behavior. How much data we need to process? Is point of sales history enough? Some of the problems with recommendation systems Page | 10
  • 11. Popularity Recommendation system has a basic logic of processing historical data of customer behaviour and predicting their behavior. So naturally the most “popular” product top the list of recommendations. Here popular means previously favored the most by the same segment of customers, for a particular product category. This is dangerous to companies in many ways 1. Customer need not be satisfied perpetually by same products 2. Companies have to create niche products and up sell and cross sell it to customers to satisfy them retain them and thus to be successful in the market. Popularity based system ruins this possibilities of exploration! 3. Opportunity of selling a product is lost! 4. Lack of personalization leads to broken relations Page | 11
  • 12. Integration of expert opinion How do we integrate the inferences given by experts in the particular business domain to the process of statistical analysis? This is a mix of art with science and nobody knows the right blend. Will this undermine the basic concept of data driven decision making? Integration of internet data To attack the current deficiencies, one approach is the usage of more data to increase the scope and relevance of the analysis. The users of mail, social networking groups, or niche groups for specific purposes are all consumers and potential customers. Also bloggers, information providers can play the role of opinion leaders. So if we are able to integrate the information scattered over the internet to our decision making models, we can better understand different segments of customers, their current and future needs. But the fact is that the data available in the internet is not at all friendly for a statistical analysis using current DBMS technology on a normal level of cost. It is dynamic, sparse and mostly unstructured. Also the question of who can program such a system to handle web data using current technology is difficult to answer. Page | 12
  • 13. Growth of data: mindboggling. – Published content: 3-4 Gb/day – Professional web content: 2 Gb/day – User generated content: 5-10 Gb/day – Private text content: ~2 Tb/day (200x more) (Ref: Raghu Ramakrishnan http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/slides/Ramakrishnan_ngdm07.pdf) Questions to this data: very demanding business users. – Can we do Analytics over Web Data / User Generated Content? – Can we handle TB of text data / GB of new data getting added to internet each day? Page | 13
  • 14. – Structured Queries, Search Queries for unstructured data? – Can we expect the tools to give answers at “Google-Speed”? That gives us a strong case for adopting the new technology of data in flight. ‘Map Reduce’ is a technology developed by Google for the similar purposes. Those technologies are explained one by one in following paragraphs. What is Data in flight? Earlier data was at ‘rest’! The normal concepts of DBMS where data is at rest and the queries hit those static data and fetch results Now data is just flying in! The new concepts of ‘data in flight’ envisages the already prepared query as static, collecting dynamic data as and when it is produced and consumed. But this needs more powerful tools to handle the enormity, complexity and ‘scarce’ nature of information contained in the data. Map and reduce. A map operation is needed to translate the scarce information available in numerous formats to some forms which can be processed easily by an analytical tool. Once the information is in simpler and structured form, it can be reduced to the required results. Page | 14
  • 15. A standard example: Word count! Given a document, how many of each word are there? But in real world it can be: Given our search logs, how many people click on result 1 Given our flicker photos, how many cat photos are there by users in each geographic region Give our web crawl, what are the 10 most popular words? Word count and twitter Tweets can be used to get early warnings on epidemic like swine flue Tweets can be used to understand the ‘mood’ of people in a region and can be used for different purposes, even subliminal marketing The software created by Dr Peter Dodds and Dr Chris Danforth of the University of Vermont, collects sentences from blogs and 'tweets‘, zeroing in on the happiest and saddest days of the last few years. Can it prevent social crises? How does a map reduce program work Page | 15
The programmer has to specify only two methods, map and reduce; the rest is taken care of by the platform.

Map: map (k, v) -> <k', v'>*
1. Specify a map function that takes a key (k) / value (v) pair. For word count: key = document URL, value = document contents, e.g. ("document1", "to be or not to be").
2. The output of map is a set of (potentially many) key/value pairs, <k', v'>*.
3. In our case, map outputs (word, "1") once per word in the document:
– "to", "1"
– "be", "1"
– "or", "1"
– "to", "1"
– "not", "1"
– "be", "1"

Shuffle (sort)

Items that are logically near each other (i.e. share a key) are brought physically near each other:
– "to", "1"
– "to", "1"
– "be", "1"
– "be", "1"
– "not", "1"
– "or", "1"

Reduce: reduce (k', <v'>*) -> <k', v'>*

The reduce function combines all the values for a key.
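The whole flow can be simulated in a few lines of ordinary Python. This is only a single-machine illustration of the map-shuffle-reduce pattern, not Hadoop or Google API code:

```python
from itertools import groupby

def map_fn(url, contents):
    """map(k, v) -> <k', v'>*: emit (word, 1) once per word in the document."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """reduce(k', <v'>*) -> (k', v'): combine all the values for a key."""
    return (word, sum(counts))

# Map phase
mapped = list(map_fn("document1", "to be or not to be"))

# Shuffle/sort phase: bring pairs with the same key next to each other
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: one reduce call per distinct key
reduced = [reduce_fn(word, [count for _, count in group])
           for word, group in groupby(mapped, key=lambda kv: kv[0])]

print(reduced)   # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```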
Applied to our example, the reduce step yields:
– "be", "2"
– "not", "1"
– "or", "1"
– "to", "2"

For different use cases the functions inside map and reduce differ, but the architecture and the supporting platform remain the same. For example, our recommendation system could use a map function that turns every meaningful association between two product names X and Y in an online discussion into a (product pair, count) record; in the reduce step, based on how often each combination arises, this information is reduced to a suggestion of X given Y. This is a very simplified description; refer to papers on algorithms such as parallel affinity propagation to appreciate the real complexity of such a process.
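A toy sketch of that product-pair idea follows. The product names, posts and the 'mentioned together in one post' heuristic are all invented simplifications, not the actual recommendation algorithm: the map step emits a ((X, Y), 1) pair whenever two catalogue products appear in the same post, and the reduce step totals the co-mentions that a recommender could turn into "people who discuss X also discuss Y" suggestions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical catalogue of product names we care about.
PRODUCTS = {"phonex", "phoney", "tabletz"}

def map_fn(post_id, text):
    """Emit ((X, Y), 1) for every pair of catalogue products mentioned in one post."""
    mentioned = sorted({word for word in text.lower().split() if word in PRODUCTS})
    for x, y in combinations(mentioned, 2):
        yield ((x, y), 1)

def reduce_fn(pair, counts):
    return (pair, sum(counts))

# Invented posts standing in for online discussions.
posts = {
    "p1": "Comparing PhoneX and PhoneY battery life",
    "p2": "PhoneX works great with TabletZ",
    "p3": "PhoneX or PhoneY which one to buy",
}

# Map, then shuffle (group by key), then reduce.
shuffled = defaultdict(list)
for post_id, text in posts.items():
    for pair, one in map_fn(post_id, text):
        shuffled[pair].append(one)

co_mentions = [reduce_fn(pair, counts) for pair, counts in shuffled.items()]
print(sorted(co_mentions, key=lambda kv: kv[1], reverse=True))
# [(('phonex', 'phoney'), 2), (('phonex', 'tabletz'), 1)]
```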
This is a new way of writing algorithms. How does it help us accommodate complex internet data for our recommendation system?

Brute power

One view is that MapReduce algorithms are effective because they:
– use the brute power of many machines to map a huge chunk of sparse data into a small table of dense data;
– perform the complex, time-consuming part of the task on this new, small, dense data in the reduce step;
– in other words, separate the huge data from the time-consuming part of the algorithm, albeit at the cost of a lot of disk space.

Fault tolerance

There are two aspects of fault tolerance:
1. Transactions should not be affected by machine failure. One approach is to make the machine less prone to failure and, in case of failure, ask the user to restart the transaction from scratch. Saving transaction data permanently somewhere (to avoid repeating some of the steps) would reduce the speed of the transaction. Conventional database operation emphasizes this kind of fault tolerance in its implementation.

Figure: Fault tolerance – the RDBMS school of thought
2. Complex query execution. Here the only concern is that a failed query should be processed again from whichever state it can be restarted, so saving data in intermediate forms is desirable. MapReduce emphasizes this kind of tolerance in its implementation; the extra time taken to save intermediate results permanently is recovered through parallelization.

Figure: Fault tolerance – the MapReduce school of thought

Figure: Hierarchy of parallelism
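The checkpoint-and-rerun idea can be seen in a small simulation: each map task writes its intermediate output to disk before it is counted as complete, so after a crash only the unfinished task is recomputed. The file layout and task sizes here are invented; this sketches the principle, not Hadoop's actual task scheduling.

```python
import json
import os
import tempfile

CHECKPOINT_DIR = tempfile.mkdtemp()

def run_map_task(task_id, records):
    """Run one map task, persisting its intermediate output before reporting success."""
    path = os.path.join(CHECKPOINT_DIR, f"map-{task_id}.json")
    if os.path.exists(path):                      # finished earlier: just reload it
        with open(path) as f:
            return json.load(f)
    pairs = [[word, 1] for record in records for word in record.split()]
    with open(path, "w") as f:                    # checkpoint the intermediate result
        json.dump(pairs, f)
    return pairs

splits = {0: ["to be or not"], 1: ["to be"], 2: ["that is the question"]}

# First attempt: pretend the machine running task 2 dies before it checkpoints.
for task_id in (0, 1):
    run_map_task(task_id, splits[task_id])

# Recovery: tasks 0 and 1 are read back from disk; only task 2 is recomputed.
intermediate = [kv for task_id in splits for kv in run_map_task(task_id, splits[task_id])]
print(len(intermediate), "intermediate pairs ready for the reduce phase")   # 10
```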
Effectively, the MapReduce philosophy of fault tolerance can be summarized in the following diagram.

Figure: The cycle of brute-force fault tolerance

Criticisms

1. A giant step backward in the programming paradigm for large-scale data-intensive applications.
2. A sub-optimal implementation, in that it uses brute force instead of indexing.
3. Not novel at all: it represents a specific implementation of well-known techniques developed 25 years ago.
4. Missing most of the features of current DBMSs.
5. Incompatible with all of the tools DBMS users have come to depend on.

Why is it valuable nonetheless?

The biggest reason is that it helps programmers manage the complexity of the data logically. In addition, writing intermediate results permanently enables two features we need for our recommendation system:
1. It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done. Searching entire social networks for any mention of our product by different segments of customers can be assigned to numerous machines and completed fast enough.
2. It brings dynamism and load balancing. If one social network has denser data and our assigned set of machines is taking very long to process it, the work can be reassigned to many more free machines to speed up the process (see the sketch below). This is essential when we do not know in advance the nature of the data we are going to process.

So the parallelism achieved by a parallel DBMS, normally obtained by dividing an execution plan across a limited number of processors, is nothing compared with what MapReduce achieves.
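A minimal sketch of that load-balancing idea, under the assumption that the job is chopped into many small splits held in a shared queue: whichever machine is free pulls the next split, so a slow machine or a dense data source never stalls the whole job, and adding machines helps immediately. This illustrates the scheduling principle only, not MapReduce's real scheduler.

```python
import random
import time
from queue import Empty, Queue
from threading import Thread

tasks = Queue()
for split_id in range(12):                 # twelve input splits of uneven size
    tasks.put(split_id)

def worker(name):
    while True:
        try:
            split_id = tasks.get_nowait()  # pull the next split when free
        except Empty:
            return                         # nothing left: this worker retires
        time.sleep(random.uniform(0.01, 0.05))   # pretend to process the split
        print(f"{name} finished split {split_id}")

# Four 'machines' share the queue; more could join at any time to speed things up.
workers = [Thread(target=worker, args=(f"machine-{i}",)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```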
At large scale:
– super-fancy, reliable hardware still fails, albeit less often, so software still needs to be fault-tolerant;
– commodity machines without fancy hardware give better performance per dollar;
– using more memory to speed up querying has its own implications for tolerance and cost.

An example

This example takes us into the complex world of sequential web access-based recommendation systems. Its working is explained in the following diagram.
Figure: Sequential web access-based recommendation system

The system goes through web server logs, mines the patterns in the access sequences and builds a pattern tree, which is continuously updated with data from different servers [Zhou et al].

Figure: System architecture of the sequential web access system [Zhou et al]

The target is to create a pattern tree. When a particular user has to be served a suggestion:
– the user's access pattern is compared with the entire tree of patterns;
– the portions of the tree that best match the user's pattern are selected; and
– the branches of those nodes are suggested.

Some details of the algorithm

• Let E be a set of unique access events, representing the web resources accessed by users, i.e. web pages, URLs, topics or categories.
• A web access sequence S = e1e2... is an ordered collection (sequence) of access events.

Suppose we have a set of web access sequences over the event set E = (a, b, c, d, e, f). A sample database would look like this:

Session ID | Web access sequence
1          | abdac
2          | eaebcac
3          | babfae
4          | abbacfc
Access events and sequences can be classified as frequent or infrequent depending on whether their frequency crosses a support threshold (3 in this example), and a tree consisting of the frequent access sequences can be created.

Length of sequence | Sequential web access patterns with support
1                  | a:4, b:4, c:3
2                  | aa:4, ab:4, ac:3, ba:4, bc:3
3                  | aac:3, aba:4, abc:3, bac:3
4                  | abac:3

Finally, the pattern tree is formed from these frequent sequences.

Figure: The sequential web access pattern tree
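As a sanity check on these numbers, the following sketch recomputes the frequent sequences level by level from the four sample sessions by simple subsequence counting (an illustration only, not the pattern-tree mining algorithm of [Zhou et al]):

```python
sessions = ["abdac", "eaebcac", "babfae", "abbacfc"]
MIN_SUP = 3

def is_subsequence(pattern, session):
    """True if pattern occurs in session as an ordered (not necessarily contiguous) subsequence."""
    it = iter(session)
    return all(event in it for event in pattern)

def support(pattern):
    return sum(is_subsequence(pattern, s) for s in sessions)

events = sorted(set("".join(sessions)))                   # E = {a, b, c, d, e, f}
frequent_events = [e for e in events if support(e) >= MIN_SUP]

# Level-wise growth: extend each frequent pattern by one frequent event.
level = {e: support(e) for e in frequent_events}
while level:
    print(dict(sorted(level.items())))
    next_level = {}
    for pattern in level:
        for event in frequent_events:
            candidate = pattern + event
            sup = support(candidate)
            if sup >= MIN_SUP:
                next_level[candidate] = sup
    level = next_level

# Prints: {'a': 4, 'b': 4, 'c': 3}
#         {'aa': 4, 'ab': 4, 'ac': 3, 'ba': 4, 'bc': 3}
#         {'aac': 3, 'aba': 4, 'abc': 3, 'bac': 3}
#         {'abac': 3}
```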
The map and the reduce

1. A map job can be designed to process the logs and create the pattern tree.
2. The task is divided among thousands of cheap machines using the MapReduce platform.
3. The 'static query, dynamic data' model of data in flight is very helpful for keeping the main tree up to date.
4. The tree structure can be stored efficiently by altering the physical storage through sorting and partitioning.
5. Then, based on the user's access pattern, a few parts of the tree have to be selected. This can be designed as a reduce job that runs across the tree data (a minimal sketch of this matching step follows the DBMS comparison below).

How would a conventional DBMS handle the same case?

• Map
– A huge database of access logs would have to be uploaded to the DBMS and then updated at regular intervals to reflect changes in site usage.
– A query would have to be written to extract a tree-like data structure out of this data behemoth, which changes shape continuously.
– An execution plan, simplistic and non-dynamic in nature, would have to be made; this is ineffective.
– The plan would have to be divided among many parallel engines, which requires expertise in parallel programming.
• Reduce
– During the reduce phase, the entire tree would have to be searched for resembling patterns.
– This too is ineffective in an execution-plan-driven model, for the reasons explained above.
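A toy version of that matching step, assuming the frequent sequences and supports computed earlier are available: patterns that extend the user's current access sequence are located and their next events are ranked by support. The real system walks a pattern tree; flat pattern strings are used here only to keep the sketch short.

```python
# Frequent sequential web access patterns and their supports (from the table above).
patterns = {"a": 4, "b": 4, "c": 3, "aa": 4, "ab": 4, "ac": 3, "ba": 4, "bc": 3,
            "aac": 3, "aba": 4, "abc": 3, "bac": 3, "abac": 3}

def recommend(current, patterns, top_n=2):
    """Suggest next access events for a user whose session so far is `current`."""
    scores = {}
    for pattern, sup in patterns.items():
        if pattern.startswith(current) and len(pattern) > len(current):
            next_event = pattern[len(current)]         # the event that would follow
            scores[next_event] = max(scores.get(next_event, 0), sup)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("ab", patterns))   # ['a', 'c']: 'a' via aba (support 4), 'c' via abc (support 3)
```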
With the explosion of data and the growing need for personalization in recommendations, MapReduce becomes the most suitable pattern.

Parallel DBMS vs MapReduce

• An RDBMS is good when the application is query-intensive, whether the data is semi-structured or rigidly structured.
• MapReduce is effective for:
– ETL and "read once" data sets;
– complex analytics;
– semi-structured and unstructured data;
– quick-and-dirty analyses;
– limited-budget operations.

Summary of the advantages of MapReduce

• Storage system independence
• Automatic parallelization
• Load balancing
• Network and disk transfer optimization
• Handling of machine failures
• Robustness
• Improvements to the core library benefit all its users
• Ease for programmers

What is 'Hadoop'?

Based on the MapReduce paradigm, the Apache foundation runs a project for developing tools and techniques on an open-source platform. This project and the resulting technology are referred to as 'Hadoop'.

Hybrid of MapReduce and relational database
As pointed out earlier, RDBMS technology is suitable when the data to be processed is structured, and many of the intermediate forms produced while processing web data are themselves structured. For that reason, researchers have proposed hybrid systems in which the unstructured data is taken care of by the MapReduce platform while the structured part is handled by an RDBMS. HadoopDB is one such proposed system.

Pig

If we have established a Hadoop infrastructure, can we also use it for repetitive tasks, as efficiently as a relational database handles them? The answer is a programming language called 'Pig'. Pig lets us write an execution plan, much like the ones we have in a relational database.
Pig allows the programmer to control the flow of data through pre-written code. It is suitable when the tasks are repetitive and the plans can be envisaged early on.

What does Hive do?

Database users are often not technology experts; they are familiar with their existing platforms, and those platforms tend to generate SQL-like queries. We therefore need a program that converts traditional SQL queries into MapReduce jobs, and the one created by the Hadoop movement is Hive. From the outside it looks like an SQL query (or a button click that generates one), but the underlying processing is done the MapReduce way.
New models: the cloud

1. Many MapReduce services are available on the cloud, such as Amazon Web Services (Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/).
2. The user gets MapReduce services by supplying the input text or site name, the required output and so on, without going into the technical details.
3. Almost infinite scalability.
4. New, more efficient business models.

Patent concerns

Excerpt from a Slashdot post of 19 January 2010: "But the very public complaints didn't stop Google from demanding a patent for MapReduce; nor did it stop the USPTO from granting Google's request (after four rejections). On Tuesday, the USPTO issued U.S. Patent No. 7,650,331 to Google for inventing Efficient Large-Scale Data Processing."
Will Google enforce the patent? If it does, it will hamper the growth of the Hadoop community.

Scope for future work

Hadoop is an open-source movement, so the author can contribute to the Hadoop user community by studying the tools and techniques further and sharing the findings on the internet. Implementing Hadoop in a lab and understanding the finer aspects of the MapReduce environment is another possibility, and is essential for a complete picture of these new technologies. New use cases can also be developed to apply the power of Hadoop to business problems.

References

1. MapReduce and Parallel DBMSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. DeWitt, Samuel Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, Communications of the ACM, January 2010.
2. Web warehousing: Web technology meets data warehousing. Xin Tan, David C. Yen, and Xiang Fang.
3. Clouds, big data, and smart assets: Ten tech-enabled business trends to watch. McKinsey Quarterly.
4. Large Scale of E-learning Resources Clustering with Parallel Affinity Propagation.
Wenhua Wang, Hanwang Zhang, Fei Wu, and Yueting Zhuang, City University of Hong Kong, Hong Kong, August 2008.
5. Experiences with MapReduce, an Abstraction for Large-Scale Computation. Jeff Dean, Google, Inc.
6. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, et al., August 2009.
7. Parallel Collection of Live Data Using Hadoop. Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George Stephanides, Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece.
8. Hive – A Petabyte Scale Data Warehouse Using Hadoop. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy.
9. Massive Structured Data Management Solution. Ullas Nambiar, Rajeev Gupta, Himanshu Gupta, and Mukesh Mohania, IBM Research – India.
10. Situational Business Intelligence. Alexander Löser, Fabian Hueske, and Volker Markl, TU Berlin, Database System and Information Management Group, http://user.cs.tu-berlin.de/~aloeser.
11. An Intelligent Recommender System Using Sequential Web Access Patterns. Baoyao Zhou, Siu Cheung Hui, and Kuiyu Chang.
12. http://hadoop.apache.org/
13. http://en.wikipedia.org
14. http://cloudera.com
15. Slashdot.org
16. Amazon.com