An Insight into MapReduce and Related Technology
Renjith Peediackal, 09BM8040

Goals
- This project work was undertaken in the 4th semester.
- Currently the information available is either technology oriented or marketing oriented.
- The task was to understand the emerging technology, create a consulting document, and conduct a classroom session in the ‘IT for BI’ elective.

The case for MapReduce

Recommendation System
- Customer Y buys product X5 from an e-commerce site after going through products X1, X2, X3 and X4.
- Student Y goes through sites A1, A2 and A3 and finally settles down to read the content of A5.
- Thousands of people behave in the same way.
- Can we bring more traffic to our site, or design a new site, based on the insight derived from this pattern?

A lot more questions
Based on an ET interview with Avinash Kaushik, analytics expert:
- What pages are my customers reading?

A lot more questions (contd.)
- What kind of content do I need to develop on my site to attract the right set of people?
- On what kind of sites should your URL be present so that you get the maximum number of referrals?
- How many visitors quit after seeing the homepage?
- What different designs are possible to make them go forward?
- Are users clicking on the right links in the right fashion on your website? (Site overlay)
- What is the bounce rate?
- How can we save money on PPC schemes?

And the typical problems with recommendation systems

Problems with popularity
- Customers need not be perpetually satisfied by the same products.
- A popularity-based system ruins this possibility of exploration!
- Companies have to create niche products and up-sell and cross-sell them to satisfy and retain customers, and thus succeed in the market.
- The opportunity to sell a product is lost!
- Lack of personalization leads to broken relationships.
- Think beyond POS data!

Mixing expert opinion
- To avoid popularity bias and make recommendations more meaningful, mix in expert opinion.
- This mixes art with science, and nobody knows the right blend.
- Think beyond POS data and expert wisdom.

Pearls of wisdom in the net

But internet data is unfriendly
To statistical techniques and DBMS technology, because it is:
- Dynamic
- Sparse
- Unstructured
Growth of data:
- Published content: 3-4 Gb/day
- Professional web content: 2 Gb/day
- User generated content: 5-10 Gb/day
- Private text content: ~2 Tb/day (200x more)
(Ref: Raghu Ramakrishnan, http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/slides/Ramakrishnan_ngdm07.pdf)
Questions to this data:
- Can we do analytics over web data and user generated content?
- TB of text data, with GB of new data each day?
- Structured queries, search queries?
- At “Google speed”?

The case for a new technique
- This gives us a strong case for adopting the new technology of data in flight.
- ‘MapReduce’ is a technology developed by Google for similar purposes.

What is data in flight?
- Earlier, data was at ‘rest’: in the normal DBMS concept, data is stored and queries hit that static data to fetch results.
- Now data is just flying in! The new concept of ‘data in flight’ treats the already prepared query as static and collects dynamic data as and when it is produced and consumed.
- Systems to handle data in flight

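The contrast between data at rest and data in flight can be illustrated with a small sketch. It is only an assumption about how a "static query over dynamic data" might look: the event stream, the page names and the function names below are hypothetical and stand in for events arriving from live web servers.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def page_view_stream() -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for (user, page) events arriving from live servers."""
    yield from [
        ("u1", "/home"),
        ("u2", "/home"),
        ("u1", "/product/X5"),
        ("u3", "/home"),
        ("u2", "/product/X5"),
    ]


def standing_query(events: Iterable[Tuple[str, str]]) -> Iterator[Counter]:
    """The query (views per page) is fixed; its answer is refreshed per event."""
    views: Counter = Counter()
    for _user, page in events:
        views[page] += 1
        # Emit a snapshot of the current answer after every arriving event.
        yield views.copy()


snapshots = list(standing_query(page_view_stream()))
final_counts = dict(snapshots[-1])
```

Here the query never changes while the data keeps flowing through it, which is the reverse of loading data into a DBMS and then issuing ad hoc queries against it.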
Map and reduce
- A map operation translates the sparse information available in numerous formats into a form that an analytical technique can process easily.
- Once the information is in a simpler, structured form, it can be reduced to the required results.

Terminology explained
A standard example: word count! Given a document, how many times does each word occur?
In the real world it can be:
- Given our search logs, how many people click on result 1?
- Given our Flickr photos, how many cat photos are there by users in each geographic region?
- Given our web crawl, what are the 10 most popular words?

Word count and Twitter
- Tweets can be used to get early warnings of epidemics like swine flu.
- Tweets can be used to understand the ‘mood’ of people in a region, which can serve different purposes, even subliminal marketing.
- Software created by Dr Peter Dodds and Dr Chris Danforth of the University of Vermont collects sentences from blogs and tweets, zeroing in on the happiest and saddest days of the last few years.
- Can it prevent social crises?

How does a MapReduce programme work?
The programmer has to specify two methods: map and reduce.

map (k, v) -> <k', v'>*
- Specify a map function that takes a key (k) / value (v) pair.
- Here, key = document URL and value = document contents.
- Input: (“document1”, “to be or not to be”)
- The output of map is (potentially many) key/value pairs: <k', v'>*
- In our case, output (word, “1”) once per word in the document:
    (“to”, “1”)
    (“be”, “1”)
    (“or”, “1”)
    (“not”, “1”)
    (“to”, “1”)
    (“be”, “1”)

Shuffle or sort (shuffle/sort)
Pairs with the same key are brought together:
    (“to”, “1”)
    (“to”, “1”)
    (“be”, “1”)
    (“be”, “1”)
    (“not”, “1”)
    (“or”, “1”)

reduce (k', <v'>*) -> <k', v'>*
- The reduce function combines the values for a key:
    (“be”, “2”)
    (“not”, “1”)
    (“or”, “1”)
    (“to”, “2”)
- For different use cases the functions inside map and reduce differ, but the architecture and the supporting platform remain the same.

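The three steps of the word count example can be sketched in a few lines of Python. This is a single-machine illustration, not a distributed implementation: the function names are hypothetical, and the counts are kept as integers rather than the string “1” shown on the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """map (k, v) -> <k', v'>*: emit (word, 1) once per word in the document."""
    for word in value.split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle/sort: bring together all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """reduce (k', <v'>*) -> <k', v'>*: combine the values for one key."""
    return (key, sum(values))


documents = {"document1": "to be or not to be"}
mapped = [pair for url, text in documents.items() for pair in map_fn(url, text)]
grouped = shuffle(mapped)
word_counts = dict(reduce_fn(word, ones) for word, ones in grouped.items())
```

On a real platform the map calls and the reduce calls would run on different machines, with the shuffle step moving data between them; only `map_fn` and `reduce_fn` would be written by the programmer.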
How is this new way helpful for our recommendation system? Brute power
- It uses the brute power of many machines to map a huge chunk of sparse data into a small table of dense data.
- The complex and time-consuming part of the task is done on the new, small, dense data in the reduce part.
- That is, it separates the huge data from the time-consuming part of the algorithm, although a lot of disk space is used.

Maps into a denser, smaller table

Fault tolerance, two different types: database school of thought

Fault tolerance, two different types: MR school of thought

Hierarchy of parallelism: cycle of brute force fault tolerance

Criticisms
- A giant step backward in the programming paradigm for large-scale data-intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all: it represents a specific implementation of well-known techniques developed 25 years ago
- Missing most of the features of current DBMSs
- Incompatible with all of the tools DBMS users have come to depend on

Why is it still valuable?
Permanent writing of intermediate results enables two wonderful features:
- It raises fault tolerance to such a level that we can employ millions of cheap computers to get our work done.
- It brings dynamism and load balancing, which are needed since we do not know the nature of the data in advance.
And the biggest benefit: it helps programmers logically manage the complexity of the data.

Why can’t a parallel DB deliver the same?
- At large scale, even super-fancy reliable hardware still fails, albeit less often, so the software still needs to be fault tolerant. Brute force fault tolerance is more practical.
- Commodity machines without fancy hardware give better perf/$.
- Using more memory to speed up querying has its own implications for fault tolerance and cost.
- A system that follows a fixed execution plan does not work well with dynamic, sparse and unstructured data.

An example: an invitation to complexity, the sequential web access-based recommendation system

Sequential web access-based recommendation system
- It goes through web server logs, mines the patterns in the access sequences and then creates a pattern tree.
- The pattern tree is continuously modified using data from different servers. [Zhou et al]

Recommendation
- When a particular user has to be served a suggestion, his or her access pattern is compared with the entire tree of patterns.
- The portions of the tree that best match the user’s pattern are selected, and their branches are suggested.

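A minimal sketch of this matching step follows. It assumes one plausible rule, which is not necessarily the rule used by Zhou et al: find the longest suffix of the user's recent accesses that exists as a path in the pattern tree, then suggest the branches below it, ranked by support. The pattern supports are toy values for the events a, b and c, and all names are hypothetical.

```python
from typing import Any, Dict, List

# Toy frequent patterns with their support counts (illustrative values).
PATTERNS = {
    "a": 4, "b": 4, "c": 3,
    "aa": 4, "ab": 4, "ac": 3, "ba": 4, "bc": 3,
    "aac": 3, "aba": 4, "abc": 3, "bac": 3,
    "abac": 3,
}


def build_tree(patterns: Dict[str, int]) -> Dict[str, Any]:
    """Store every pattern as a root-to-node path; each node keeps its support."""
    root: Dict[str, Any] = {"support": 0, "children": {}}
    for pattern, support in patterns.items():
        node = root
        for event in pattern:
            node = node["children"].setdefault(event, {"support": 0, "children": {}})
        node["support"] = support
    return root


def recommend(tree: Dict[str, Any], history: str, top_n: int = 2) -> List[str]:
    """Match the longest suffix of the history in the tree and suggest its branches."""
    for start in range(len(history)):
        node = tree
        for event in history[start:]:
            node = node["children"].get(event)
            if node is None:
                break  # this suffix is not a path in the tree; try a shorter one
        else:
            if node["children"]:
                ranked = sorted(
                    node["children"].items(),
                    key=lambda item: (-item[1]["support"], item[0]),
                )
                return [event for event, _child in ranked[:top_n]]
    return []


tree = build_tree(PATTERNS)
suggestions = recommend(tree, "eab")
```

For a user whose recent accesses are e, a, b, the suffix "ab" is found in the tree and its branches are returned as candidate suggestions.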
Some details
- Let E be a set of unique access events, which represent web resources accessed by users, i.e. web pages, URLs, topics or categories.
- A web access sequence S = e1e2 ... is an ordered collection (sequence) of access events.
- Suppose we have a set of web access sequences over the set of events E = {a, b, c, d, e, f}. A sample database looks like this:

    Session ID   Web access sequence
    1            abdac
    2            eaebcac
    3            babfae
    4            abbacfc

Details
- Access events can be classified as frequent or infrequent, based on whether their frequency crosses a threshold level.
- A tree consisting of the frequent access events can then be created.

    Length of sequence   Sequential web access patterns with support
    1                    a:4, b:4, c:3
    2                    aa:4, ab:4, ac:3, ba:4, bc:3
    3                    aac:3, aba:4, abc:3, bac:3
    4                    abac:3

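The supports in the table can be recomputed from the four sample sessions. The sketch below assumes that a pattern is supported by a session when it occurs as an ordered, not necessarily contiguous, subsequence, and that the threshold is a support of 3 sessions; the slides do not state the threshold, so that value is an assumption chosen to agree with the table. Function names are hypothetical.

```python
from typing import Dict, List

SESSIONS = ["abdac", "eaebcac", "babfae", "abbacfc"]
MIN_SUPPORT = 3  # assumed threshold; not stated on the slides


def is_subsequence(pattern: str, sequence: str) -> bool:
    """True if the events of pattern appear in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(event in remaining for event in pattern)


def support(pattern: str, sessions: List[str]) -> int:
    """Number of sessions that contain the pattern."""
    return sum(is_subsequence(pattern, session) for session in sessions)


def frequent_patterns(sessions: List[str], min_support: int) -> Dict[str, int]:
    """Grow patterns one event at a time, keeping only the frequent ones."""
    events = sorted(set("".join(sessions)))
    frequent_events = [e for e in events if support(e, sessions) >= min_support]
    frequent: Dict[str, int] = {}
    level = list(frequent_events)
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern, sessions)
        # Every prefix of a frequent pattern is itself frequent, so it is
        # enough to extend the current frequent patterns by one frequent event.
        candidates = [p + e for p in level for e in frequent_events]
        level = [c for c in candidates if support(c, sessions) >= min_support]
    return frequent


table = frequent_patterns(SESSIONS, MIN_SUPPORT)
```

With these assumptions `table` contains exactly the thirteen patterns listed above, from `a` with support 4 up to `abac` with support 3.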
 
The map and reduce
- A map job can be designed to process the logs and create the pattern tree.
- The task is divided among thousands of cheap machines using the MapReduce platform.
- The dynamic data and static query model of data in flight is very helpful for modifying the main tree.
- The tree structure can be stored efficiently by altering the physical storage through sorting and partitioning.
- Then, based on the user’s access pattern, we have to select a few parts of the tree. This can be designed as a reduce job that runs across the tree data.

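One possible way to phrase the log-mining step as a map and a reduce is sketched below; the slides may intend a different split of the work. Here each map call handles one session and emits every distinct pattern of up to two events found in it, and each reduce call sums the sessions that contain a pattern. The session data are the four sample sequences; the support threshold of 3, the length limit and the function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterator, List, Tuple

LOGS = {1: "abdac", 2: "eaebcac", 3: "babfae", 4: "abbacfc"}


def map_session(session_id: int, sequence: str, max_len: int = 2) -> Iterator[Tuple[str, int]]:
    """Emit (pattern, 1) once for each distinct ordered pattern in one session."""
    seen = set()
    for length in range(1, max_len + 1):
        # combinations() keeps positional order, so each tuple is a subsequence.
        for combo in combinations(sequence, length):
            seen.add("".join(combo))
    for pattern in sorted(seen):
        yield (pattern, 1)


def reduce_pattern(pattern: str, counts: List[int]) -> Tuple[str, int]:
    """Sum the per-session ones to get the support of one pattern."""
    return (pattern, sum(counts))


def run(logs: Dict[int, str], min_support: int = 3) -> Dict[str, int]:
    """Single-machine stand-in for the platform: map, group by key, reduce, filter."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for session_id, sequence in logs.items():
        for pattern, one in map_session(session_id, sequence):
            grouped[pattern].append(one)
    reduced = dict(reduce_pattern(p, c) for p, c in grouped.items())
    return {p: s for p, s in sorted(reduced.items()) if s >= min_support}


frequent = run(LOGS)
```

Because each session is mapped independently, the map calls could be spread over many machines, and new log data only adds more map output for the reduce step to fold in.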
DBMS for the same case?
Map:
- A huge database of access logs has to be uploaded to a DB, and then updated at regular intervals to reflect changes in site usage.
- Then a query has to be written to get a tree-like data structure out of this data behemoth, which changes shape continuously!
- An execution plan, which is simplistic and non-dynamic in nature, has to be made. This is ineffective.
- The work has to be divided among many parallel engines, which requires expertise in parallel programming.
Reduce:
- During the reduce phase, the entire tree has to be searched for resembling patterns.
- This will also be ineffective in an execution plan driven model, as explained above.
With the explosion of data and the growing need for personalization in recommendations, MapReduce becomes the most suitable pattern.

Parallel DB vs MapReduce
An RDBMS is good when the application is query-intensive, whether semi-structured or rigidly structured.
MR is effective for:
- ETL and “read once” data sets
- Complex analytics
- Semi-structured and unstructured data
- Quick-and-dirty analyses
- Limited-budget operations

Summary of advantages of MR
- Storage system independence
- Automatic parallelization
- Load balancing
- Network and disk transfer optimization
- Handling of machine failures
- Robustness
- Improvements to the core library benefit all users of the library!
- Ease for programmers!

Is MapReduce the final word?

What is Hadoop?
- Based on the MapReduce paradigm, the Apache foundation has started a project for developing tools and techniques on an open source platform.
- This project and the resulting technology are called Hadoop.

Pig
- Can we use MR effectively for repetitive jobs?
- How can one control the execution of a Hadoop program, just like creating an execution plan in normal DB operation?
- The answer leads to Pig. Pig allows one to control the flow of data by creating execution plans easily.
- It is suitable when the tasks are repetitive and the plans can be envisaged early on.

What does Hive do?
- Users of databases are often not technology experts. They may be familiar with existing platforms, and these platforms tend to generate SQL-like queries.
- We need a program to convert these traditional SQL queries into MapReduce jobs.
- The one created by the Hadoop movement is Hive.

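To see why such a conversion is possible, consider how a simple SQL aggregation lines up with a map and a reduce. The sketch below is only a conceptual illustration in Python and does not show the plan Hive actually generates; the table `page_views`, its columns and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# Conceptual target (hypothetical table and columns):
#   SELECT page, COUNT(*) FROM page_views GROUP BY page;
PAGE_VIEWS = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/product/X5"},
]


def map_row(row: Dict[str, str]) -> Iterator[Tuple[str, int]]:
    """The GROUP BY column becomes the key; each row contributes a count of 1."""
    yield (row["page"], 1)


def reduce_group(page: str, ones: List[int]) -> Tuple[str, int]:
    """COUNT(*) becomes a sum over the values collected for one key."""
    return (page, sum(ones))


def run_group_by_count(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Single-machine stand-in for map, shuffle and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return dict(reduce_group(k, v) for k, v in sorted(groups.items()))


result = run_group_by_count(PAGE_VIEWS)
```

The user only writes the SQL-like query; a tool in the role of Hive is responsible for producing the map and reduce steps.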
Hive architecture

New models: cloud
- MapReduce for dummies!
- Many services are available on the cloud, like Amazon Web Services (Amazon elastic - http://aws.amazon.com/ec2/).
- The user gets MR services by entering the input text or site name, the required output, etc., without going into the technical details.
- Almost infinite scalability
- New business models which are efficient

Concerns
Excerpt from a Slashdot comment on Jan 19, 2011:
“But the very public complaints didn't stop Google from demanding a patent for MapReduce; nor did it stop the USPTO from granting Google's request (after four rejections). On Tuesday, the USPTO issued U.S. Patent No. 7,650,331 to Google for inventing Efficient Large-Scale Data Processing.”
- Will Google enforce the patent? If it does, it will hamper the growth of the Hadoop community.

Research Paper 1
MapReduce and Parallel DBMSs: Friends or Foes?
Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin
Salient points:
- The differences between MR and parallel DBs: use cases and architecture
- Points of collaboration and learning from each other

Research Paper 2
Web warehousing: Web technology meets data warehousing
Xin Tan, David C. Yen, Xiang Fang
Salient points:
- The Internet made it possible to apply Web technology to traditional data warehousing, which resulted in improved cost savings and productivity.
- The integrated data in Web warehousing creates a close tie between IT departments and other business functions.
- Security is also a key issue in Web-based warehouses.

Research Paper 3
Clouds, big data, and smart assets: Ten tech-enabled business trends to watch
McKinsey Quarterly
Salient points: four of the top 10 trends are important to the data in flight community.
- Trend 2: Making the network the organization
- Trend 3: Collaboration at scale
- Trend 4: The growing ‘Internet of Things’
- Trend 5: Experimentation and big data

Research Paper 4
What Are the Information Security Risks in Decision Support Systems and Data Warehousing?
Thomas Finne
Different aspects of security are:
- Backup
- Passwords, biometrics
- Administration
- Viruses
- Printing
- Power disruption
- Tempest, hacking
- Encryption
- Copying files, tapping over a network, mobiles
- Flood, fire and theft
- Testing
- Software version
- Deleting data

Research Paper 5
Parallel Collection of Live Data Using Hadoop
Kyriacos Talattinis, Aikaterini Sidiropoulou, Konstantinos Chalkias, and George Stephanides
Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
Three different use cases:
- Domain Appraisal Tool (DAT)
- OpenBet: analyzing and presenting sport related data
- Brute force cryptanalysis

Research Paper 6
Hive – A Petabyte Scale Data Warehouse Using Hadoop
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy
Salient points:
- Describes the uses and the architecture of Hive
- The authors are from the Facebook Hive team

Research Paper 7
Massive Structured Data Management Solution
Ullas Nambiar, Rajeev Gupta, Himanshu Gupta and Mukesh Mohania, IBM Research - India
Salient points:
- Compares the performance of Hive, JAQL, raw MR and DB systems across different kinds of queries
- Gives an overview of the working of the technologies

Research Paper 8
Situational Business Intelligence
Alexander Löser, Fabian Hueske, and Volker Markl, TU Berlin Database System and Information Management Group
Salient points:
- Describes the need for data in flight
- Describes the theoretical solutions
- Discusses the current technology

Research Paper 9
Beyond Search - Web Scale Business Analytics
Alexander Löser, http://user.cs.tu-berlin.de/~aloeser
Salient points:
- Importance and methods of analyzing content on the internet
- Growth of the content
- Beneficiaries of the information derived from this content
- Methods and technology

Research Paper 10
An Intelligent Recommender System using Sequential Web Access Patterns
Baoyao Zhou, Siu Cheung Hui, and Kuiyu Chang
Salient points:
- Mines web server logs for sequential web access patterns
- Stores the frequent patterns in a pattern tree
- Matches a user’s access pattern against the tree to generate recommendations

Web site references
- http://hadoop.apache.org/
- http://en.wikipedia.org
- http://cloudera.com
- Slashdot.org
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- Amazon.com
 
