Analytics on Hadoop

EMC World 2012: Hadoop has rapidly emerged as the preferred solution for big data analytics across unstructured data, and companies are seeking competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data. This session reviews the practices of performing analytics using unstructured data with Hadoop.

Published in: Technology
Transcript of "Analytics on Hadoop"

  1. ANALYTICS ON HADOOP
     Donald Miner, Solutions Architect, Advanced Technologies Group
     © Copyright 2012 EMC Corporation. All rights reserved.
  2. Large Retailer and Pregnancy
     “As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a ‘pregnancy prediction’ score. More important, he could also estimate her due date to within a small window, so they could send coupons timed to very specific stages of her pregnancy.”
  3. Hadoop Origins
     • Open source system based on papers written by Google
     • MapReduce was used by Google to parse and index web pages and calculate “PageRank”
     • Came from the need for a system that is:
       – Linearly and horizontally scalable
       – Able to store massive amounts of data
       – Fault tolerant
       – Ready to analyze HTML files
       – Cheap to build and maintain
  4. What is Hadoop?
     Two Core Components
     • HDFS: scalable storage in the Hadoop Distributed File System
     • MapReduce: compute via the MapReduce distributed processing platform
     Open source system developed by the Apache Software Foundation
     Storage and compute in one framework
     Massively scalable
  5. Why is Hadoop Important?
     • Business analytics require new approaches
       – Data size
       – Data growth
     • The new nature of data
       – Unstructured
       – Numerous sources
     • Hadoop makes analytics on large data sets more cost effective
  6. Structured and Unstructured Data
     Greenplum DB: Partitioning; SQL; Indexing; RDBMS; BI Tools; GP MapReduce; Tables and Schemas
     (Diagram axis: STRUCTURED to UNSTRUCTURED)
  7. Structured and Unstructured Data
     Hadoop: Schema on load; SequenceFile; MapReduce; Hive; Directories; Java; XML, JSON, …; Flat files; Pig; No ETL
     (Diagram axis: STRUCTURED to UNSTRUCTURED)
  8. Leverage Both in a Unified Platform
     Greenplum DB: Partitioning; SQL; Indexing; RDBMS; BI Tools; GP MapReduce; Tables and Schemas
     Hadoop: Schema on load; SequenceFile; MapReduce; Hive; Directories; Java; XML, JSON, …; Pig; No ETL; Flat files
     (Diagram axis: STRUCTURED to UNSTRUCTURED)
  9. Hadoop Use Case
     Launching our new product: The Marshmallow House
  10. Marshmallow House Release Analysis
      Greenplum Party
  11. Website Logs
      • 15 web servers, 5 application servers
      • Problem: cross-correlation
      • Problem: 500TB of data with 1TB/day
      • Problem: extracting insights from text
  12. Current System
      • SQL database
        – ETL process to collect and parse logs
        – Analyze transactions on the website
        – Can’t work with the text comfortably
      • Perl scripts parsing the logs
        – Doesn’t scale
        – Hard to correlate across systems
        – Hard to deploy
  13. Augmenting Capabilities with Hadoop
      • Hadoop helps us extract value in more ways
      • Particular analytics we have in mind:
        – Interest in product by location
        – Sessionizing our disparate data
        – Building behavior models of our customers
        – Analyzing customers’ sentiment of our products
      • Why? Target Marshmallow House purchasers
  14. Geographical Distribution
      • Problem: We don’t know how much interest there is, by location
      • Value: This will allow us to justify and scope additional marketing efforts
      • Why Hadoop: Searching through text, parsing logs, custom data structures
  15. Geographical Distribution
      • Solution: Find IP addresses interested in our product, then count them over their locations
      • MapReduce job:
        – map: extract IP addresses from all data, enrich with ipgeo information
        – reduce: group by geographical location, count the number of records
        – output: location, count
      • Result: Lots of interest in Virginia
  16. Sample MapReduce Java Code
      • A MapReduce job consists of a Mapper, a Reducer, and a Driver
      • The Mapper parses, filters, transforms, enriches, and extracts
      • The Reducer aggregates, counts, and outputs
      • The Driver sets up and submits the job for execution
  17. Mapper Code
  18. Reducer Code
  19. Driver Code
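
The code shown on slides 17 to 19 did not survive in this transcript. As a stand-in, here is a minimal sketch of the three roles from slide 16 applied to the geographical distribution job from slide 15. It is plain Java that runs in memory and does not use the Hadoop API; the class name `GeoInterestJob`, the `IP_GEO` lookup table, the `marshmallow-house` filter string, and the sample log lines are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GeoInterestJob {
    // Hypothetical IP-to-location table; a real job would load an IP geolocation data set.
    static final Map<String, String> IP_GEO = Map.of(
            "10.0.0.1", "Virginia",
            "10.0.0.2", "Virginia",
            "10.0.0.3", "Maryland");

    static final Pattern IPV4 = Pattern.compile("\\b(\\d{1,3}(?:\\.\\d{1,3}){3})\\b");

    // Mapper role: parse one log line, filter on product interest,
    // enrich the IP with a location, and emit (location, 1).
    static List<Map.Entry<String, Integer>> map(String logLine) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        if (!logLine.contains("marshmallow-house")) {
            return out; // not about our product, emit nothing
        }
        Matcher m = IPV4.matcher(logLine);
        if (m.find()) {
            String location = IP_GEO.getOrDefault(m.group(1), "Unknown");
            out.add(Map.entry(location, 1));
        }
        return out;
    }

    // Reducer role: aggregate all counts that arrived for one location.
    static int reduce(String location, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Driver role: wire mapper and reducer together. The grouping map
    // stands in for the shuffle that the framework would perform.
    static Map<String, Integer> run(List<String> logLines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : logLines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
                "10.0.0.1 GET /products/marshmallow-house",
                "10.0.0.2 GET /products/marshmallow-house",
                "10.0.0.3 GET /about");
        // Output format from slide 15: location, count
        run(logs).forEach((location, count) -> System.out.println(location + "\t" + count));
    }
}
```

In an actual Hadoop job the mapper and reducer would be classes handed to the framework, and the driver would configure input and output paths and submit the job to the cluster rather than looping locally.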
  20. Sessionizing
      • Problem: Data is scattered
      • Value: Analyze a user’s experience at a session level, which shows a bigger picture
      • Why Hadoop: Hadoop can deal with heterogeneous and hierarchical data well
  21. Sessionizing
      • Solution: Load the data sets and group by IP and temporal locality, then output as a hierarchical data structure
      • MapReduce job:
        – map: extract IP and date/time, keep the record
        – reduce: group by IP, then group into sessions; format into JSON documents and output
      • Result: 1 million sessions a day
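
The sessionizing steps on slide 21 can be sketched in plain Java, again in memory and without the Hadoop API. The slides do not say how "temporal locality" is defined, so this sketch assumes a new session starts after a 30-minute gap between hits from the same IP; the class `Sessionizer`, the `Hit` record type, and the JSON field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Sessionizer {
    // Assumed session boundary: 30 minutes of inactivity (not stated in the slides).
    static final long GAP_SECONDS = 30 * 60;

    // One parsed log record: who, when (epoch seconds), and what was requested.
    static final class Hit {
        final String ip;
        final long epochSeconds;
        final String url;

        Hit(String ip, long epochSeconds, String url) {
            this.ip = ip;
            this.epochSeconds = epochSeconds;
            this.url = url;
        }
    }

    // Reduce step, part 1: order one IP's hits by time and cut them into sessions.
    static List<List<Hit>> splitSessions(List<Hit> hitsForIp) {
        List<Hit> sorted = new ArrayList<>(hitsForIp);
        sorted.sort(Comparator.comparingLong(h -> h.epochSeconds));
        List<List<Hit>> sessions = new ArrayList<>();
        List<Hit> current = new ArrayList<>();
        for (Hit h : sorted) {
            if (!current.isEmpty()
                    && h.epochSeconds - current.get(current.size() - 1).epochSeconds > GAP_SECONDS) {
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(h);
        }
        if (!current.isEmpty()) {
            sessions.add(current);
        }
        return sessions;
    }

    // Reduce step, part 2: format one session as a JSON document.
    // No string escaping is done; the sketch assumes plain IPs and URLs.
    static String toJson(String ip, List<Hit> session) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"ip\":\"").append(ip).append("\",\"hits\":[");
        for (int i = 0; i < session.size(); i++) {
            Hit h = session.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"time\":").append(h.epochSeconds)
              .append(",\"url\":\"").append(h.url).append("\"}");
        }
        return sb.append("]}").toString();
    }

    // Map step (key each record by IP) plus the grouping the framework would do,
    // followed by the reduce steps above.
    static List<String> sessionize(List<Hit> hits) {
        Map<String, List<Hit>> byIp = new TreeMap<>();
        for (Hit h : hits) {
            byIp.computeIfAbsent(h.ip, k -> new ArrayList<>()).add(h);
        }
        List<String> documents = new ArrayList<>();
        for (Map.Entry<String, List<Hit>> e : byIp.entrySet()) {
            for (List<Hit> session : splitSessions(e.getValue())) {
                documents.add(toJson(e.getKey(), session));
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("10.0.0.1", 0L, "/"),
                new Hit("10.0.0.1", 60L, "/cart"),
                new Hit("10.0.0.1", 4000L, "/"));
        sessionize(hits).forEach(System.out::println);
    }
}
```

The nested JSON output is the "hierarchical data structure" the slide refers to: one document per session, with the individual hits kept inside it.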
  22. Unstructured and Semi-Structured Data
      • Unnatural to store in an RDBMS
      • Unstructured: text, documents, media, raw sensor data
      • Semi-structured: mixed structured/unstructured; hierarchical
      • Hadoop’s ability to leverage Java gives flexibility
      • “Schema on load”
      • Data stored as “rich documents”
  23. Behavioral Model
      • Problem: We don’t understand the typical ways our visitors behave
      • Value: Optimize our interface for usability; understand our customers
      • Why Hadoop: Advanced analytics and machine learning are possible because of the flexibility of the framework
  24. Behavioral Model
      • Solution: Run over the sessions and build a generic model from them
      • MapReduce job: Use clustering to group users into stereotypes, then use frequent item set analysis to build correlations between our users’ actions
      • Results: We have three major types of buyers; casual buyers usually visit the marshmallow house from the main page
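
To make the frequent item set step on slide 24 concrete, here is a small in-memory sketch that counts which pairs of user actions occur together in the same session and keeps the pairs seen in at least `minSupport` sessions. It covers only pair counting, not the clustering step, and it is not Mahout's implementation; the class `FrequentPairs`, the action names, and the threshold are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class FrequentPairs {
    // Count co-occurring action pairs across sessions and keep the frequent ones.
    // Each session is a list of action names; duplicates within a session count once.
    static Map<String, Integer> frequentPairs(List<List<String>> sessions, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> session : sessions) {
            // Distinct actions in sorted order, so each pair has one canonical key.
            List<String> actions = new ArrayList<>(new TreeSet<>(session));
            for (int i = 0; i < actions.size(); i++) {
                for (int j = i + 1; j < actions.size(); j++) {
                    counts.merge(actions.get(i) + " + " + actions.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop pairs that appear in fewer than minSupport sessions.
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sessions = List.of(
                List.of("main-page", "marshmallow-house", "cart"),
                List.of("main-page", "marshmallow-house"),
                List.of("search", "marshmallow-house"));
        frequentPairs(sessions, 2)
                .forEach((pair, count) -> System.out.println(pair + "\t" + count));
    }
}
```

In MapReduce terms, the inner loops correspond to a mapper emitting (pair, 1) per session and the `merge` call to a reducer summing the counts per pair.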
  25. Apache Mahout
      • Machine learning library built on Hadoop
      • Scalable machine learning
      • Open source project
      • Data mining, advanced analytics, predictive modeling
      • Main use cases: recommendation engines, clustering, classification, frequent itemset mining
  26. Hadoop Makes These Possible
      • Unstructured analysis is possible in Java and Hadoop
      • Advanced data mining and machine learning techniques are natural
      • Data analysis can be done on the data in its original form
      • Analyze large amounts of heterogeneous data
  27. Provide Feedback & Win!
      • 125 attendees will receive $100 iTunes gift cards. To enter the raffle, simply complete:
        – 5 session surveys
        – The conference survey
      • Download the EMC World Conference App to learn more: emcworld.com/app
  29. Thank You