Analytics on Hadoop
 


EMC World 2012: Hadoop has rapidly emerged as the preferred solution for big data analytics on unstructured data, and companies are seeking competitive advantage by finding effective ways to analyze new sources of unstructured and machine-generated data. This session reviews practices for performing analytics on unstructured data with Hadoop.

    Analytics on Hadoop: Presentation Transcript

    • ANALYTICS ON HADOOP
      Donald Miner, Solutions Architect, Advanced Technologies Group
      © Copyright 2012 EMC Corporation. All rights reserved.
    • Large Retailer and Pregnancy
      “As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so they could send coupons timed to very specific stages of her pregnancy.”
    • Hadoop Origins
      – Open source system based on papers published by Google
      – MapReduce was used by Google to parse and index web pages and calculate “PageRank”
      – Came from the need for a system that is:
        – Linearly and horizontally scalable
        – Able to store massive amounts of data
        – Fault tolerant
        – Ready to analyze HTML files
        – Cheap to build and maintain
    • What is Hadoop?
      – Two core components:
        – HDFS: scalable storage in the Hadoop Distributed File System
        – MapReduce: compute via the MapReduce distributed processing platform
      – Open source system developed by the Apache Software Foundation
      – Storage and compute in one framework
      – Massively scalable
    • Why is Hadoop Important?
      – Business analytics require new approaches
        – Data size
        – Data growth
      – The new nature of data
        – Unstructured
        – Numerous sources
      – Hadoop makes analytics on large data sets more cost effective
    • Structured and Unstructured Data
      [Diagram: axis from STRUCTURED to UNSTRUCTURED]
      – Greenplum DB: Partitioning, SQL, Indexing, RDBMS, BI Tools, GP MapReduce, Tables and Schemas
    • Structured and Unstructured Data
      [Diagram: axis from STRUCTURED to UNSTRUCTURED]
      – Hadoop: Schema on load, SequenceFile, MapReduce, Hive, Directories, Java, XML, JSON, …, Flat files, Pig, No ETL
    • Leverage Both in a Unified Platform
      [Diagram: axis from STRUCTURED to UNSTRUCTURED]
      – Greenplum DB: Partitioning, SQL, Indexing, RDBMS, BI Tools, GP MapReduce, Tables and Schemas
      – Hadoop: Schema on load, SequenceFile, MapReduce, Hive, Directories, Java, XML, JSON, …, Pig, No ETL, Flat files
    • Hadoop Use Case
      Launching our new product: The Marshmallow House
    • Marshmallow House Release Analysis Greenplum Party
    • Website Logs
      – 15 web servers, 5 application servers
      – Problem: cross-correlation
      – Problem: 500TB of data with 1TB/day
      – Problem: extracting insights from text
    • Current System
      – SQL database
        – ETL process to collect and parse logs
        – Analyze transactions on the website
        – Can’t work with the text comfortably
      – Perl scripts parsing the logs
        – Don’t scale
        – Hard to correlate across systems
        – Hard to deploy
    • Augmenting Capabilities with Hadoop
      – Hadoop helps us extract value in more ways
      – Particular analytics we have in mind:
        – Interest in product by location
        – Sessionizing our disparate data
        – Building behavior models of our customers
        – Analyzing customers’ sentiment toward our products
      – Why? Target Marshmallow House purchasers
    • Geographical Distribution Problem: We don’t know what the amount of interest is, by location Value: This will allow us to justify and scope additional marketing efforts Why Hadoop: Search through text, parsing log, custom data structures© Copyright 2012 EMC Corporation. All rights reserved. 14
    • Geographical Distribution Solution: Find IP addresses interested in our product, then count them over their locations MapReduce job: – map: extract ip addresses from all data, enrich with ipgeo information – reduce: group by geographical location, count the number of records – output: location, count Result: Lots of interest in Virginia© Copyright 2012 EMC Corporation. All rights reserved. 15
    • Sample MapReduce Java Code
      – A MapReduce job consists of a Mapper, a Reducer, and a Driver
      – The Mapper parses, filters, transforms, enriches, and extracts
      – The Reducer aggregates, counts, and outputs
      – The Driver sets up and submits the job for execution
    • Mapper Code
    • Reducer Code
    • Driver Code
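The code shown on the Mapper, Reducer, and Driver slides did not survive in this transcript. As a rough stand-in, the division of labor between the three roles might look like the following plain-Python sketch. It does not use the Hadoop API; the `mapper`, `reducer`, and `driver` functions and the `REQUESTS` sample are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Parse one request line, filter out unwanted ones, emit (path, 1)."""
    parts = line.split()
    if len(parts) < 2:
        return []  # filter: malformed line without a method and a path
    method, path = parts[0], parts[1]
    if method != "GET":
        return []  # filter: only page views are of interest here
    return [(path, 1)]


def reducer(key, values):
    """Aggregate every value emitted for one key."""
    return key, sum(values)


def driver(records, mapper, reducer):
    """Set up and run the job: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return dict(reducer(key, vals) for key, vals in sorted(groups.items()))


# Made-up input records (assumption).
REQUESTS = ["GET /home", "GET /house", "POST /cart", "GET /house", "garbage"]
RESULT = driver(REQUESTS, mapper, reducer)
print(RESULT)
```

In a real Hadoop job the driver would also configure input and output paths and submit the job to the cluster; here it simply runs the phases in order within one process.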
    • Sessionizing Problem: Data is scattered Value: Analyze a user’s experience at a session- level, which shows a bigger picture Why Hadoop: Hadoop can deal with heterogeneous and hierarchical data well© Copyright 2012 EMC Corporation. All rights reserved. 20
    • Sessionizing Solution: Load the data sets and group by IP and temporal locality, then output as a hierarchical data structure MapReduce job: – map: extract IP and date/time, keep the record – reduce: group by IP, then group into sessions; format into JSON documents and output Result: 1 million sessions a day© Copyright 2012 EMC Corporation. All rights reserved. 21
    • Unstructured and Semi-Structured Data
      – Unnatural to store in an RDBMS
      – Unstructured: text, documents, media, raw sensor data
      – Semi-structured: mixed structured/unstructured; hierarchical
      – Hadoop’s ability to leverage Java gives flexibility
      – “Schema on load”
      – Data stored as “rich documents”
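One way to read “schema on load” is that raw records are stored untouched and each job applies only the fields and types it needs when it reads them. The sketch below illustrates that reading under stated assumptions: `load_with_schema`, `TRAFFIC_SCHEMA`, and the sample lines are hypothetical, not taken from the talk.

```python
import json

# Raw records exactly as stored; nothing was cleaned or typed at write time.
RAW_LINES = [
    '{"ip": "203.0.113.7", "path": "/house", "bytes": "512"}',
    '{"ip": "198.51.100.4", "path": "/home"}',
    "not json at all",
]


def load_with_schema(lines, schema):
    """Apply field names and types while reading; the stored data is never rewritten."""
    for line in lines:
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records this particular reader cannot interpret
        yield {
            field: cast(raw[field]) if field in raw else None
            for field, cast in schema.items()
        }


# This job only cares about traffic volume, so its schema names two fields.
TRAFFIC_SCHEMA = {"ip": str, "bytes": int}
ROWS = list(load_with_schema(RAW_LINES, TRAFFIC_SCHEMA))
print(ROWS)
```

A different analysis could read the same raw lines with a different schema, for example `{"path": str}`, without any ETL step in between.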
    • Behavioral Model Problem: We don’t understand how our visitors behave stereotypically Value: Optimize our interface for usability; understand our customers Why Hadoop: Advanced analytics and machine learning is possible because of the flexibility of the framework© Copyright 2012 EMC Corporation. All rights reserved. 23
    • Behavioral Model Solution: Run over the sessions and build a generic model from those MapReduce job: Use clustering to bring users into stereotypes, then use frequent item set analysis to build correlations between our users’ actions Results: We have three major types of buyers; casual buyers usually visit the marshmallow house from the main page© Copyright 2012 EMC Corporation. All rights reserved. 24
    • Apache Mahout
      – Machine learning library built on Hadoop
      – Scalable machine learning
      – Open source project
      – Data mining, advanced analytics, predictive modeling
      – Main use cases: recommendation engines, clustering, classification, frequent itemset mining
    • Hadoop Makes These Possible
      – Unstructured analysis is possible in Java and Hadoop
      – Advanced data mining and machine learning techniques are natural
      – Data analysis can be done on the data in its original form
      – Analyze large amounts of heterogeneous data
    • Thank You