ANALYTICS ON HADOOP

Donald Miner
Solutions Architect
Advanced Technologies Group




© Copyright 2012 EMC Corporation. All rights reserved.                       1
Large Retailer and Pregnancy

“As Pole’s computers crawled through the data, he was able
to identify about 25 products that, when analyzed together,
allowed him to assign each shopper a “pregnancy
prediction” score. More important, he could also
estimate her due date to within a small window, so they
could send coupons timed to very specific stages of her
pregnancy.”



Hadoop Origins
 Open source system based on papers
  published by Google
 Google used MapReduce to parse and
  index web pages and calculate PageRank
 Grew out of the need for a system that is:
        –    Linearly and horizontally scalable
        –    Able to store massive amounts of data
        –    Fault tolerant
        –    Ready to analyze HTML files
        –    Cheap to build and maintain


What is Hadoop?
 Two core components:
        – HDFS: scalable storage in the Hadoop
          Distributed File System
        – MapReduce: compute via the MapReduce
          distributed processing platform
 Open source system developed by the Apache
  Software Foundation
 Storage and compute in one framework
 Massively scalable



Why is Hadoop Important?
 Business analytics require new approaches
        – Data size
        – Data growth
 The new nature of data
        – Unstructured
        – Numerous sources
 Hadoop makes analytics on large data sets
  more cost effective




Structured and Unstructured Data
[Diagram: Greenplum DB placed on a spectrum from STRUCTURED to UNSTRUCTURED. Labels: SQL, RDBMS, tables and schemas, partitioning, indexing, BI tools, GP MapReduce]




Structured and Unstructured Data
[Diagram: Hadoop placed on a spectrum from STRUCTURED to UNSTRUCTURED. Labels: Hive, Pig, SequenceFile, XML, JSON, …, directories, flat files, schema on load, no ETL, MapReduce, Java]




Leverage Both in a Unified Platform
[Diagram: Greenplum DB and Hadoop combined on one spectrum from STRUCTURED to UNSTRUCTURED. Greenplum DB labels: SQL, RDBMS, tables and schemas, partitioning, indexing, BI tools, GP MapReduce. Hadoop labels: Hive, Pig, SequenceFile, XML, JSON, …, directories, flat files, schema on load, no ETL, MapReduce, Java]




Hadoop Use Case
Launching our new product:
The Marshmallow House




Marshmallow House Release Analysis
            Greenplum Party




Website Logs
 15 web servers, 5 application servers
 Problem: cross-correlation
 Problem: 500 TB of data, growing by 1 TB/day
 Problem: extracting insights from text




Current System
 SQL database
        – ETL process to collect and parse logs
        – Analyze transactions on the website
        – Can’t work with the text comfortably
 Perl scripts parsing the logs
        – Don’t scale
        – Hard to correlate across systems
        – Hard to deploy




Augmenting Capabilities with Hadoop
 Hadoop helps us extract value in more ways
 Particular analytics we have in mind:
        –    Interest in product by location
        –    Sessionizing our disparate data
        –    Building behavior models of our customers
        –    Analyzing customer sentiment about our products
 Why? Target Marshmallow House purchasers




Geographical Distribution
 Problem: We don’t know how much interest
  there is in each location
 Value: This will allow us to justify and scope
  additional marketing efforts
 Why Hadoop: Searching text, parsing logs,
  custom data structures




Geographical Distribution
 Solution: Find the IP addresses that showed interest
  in our product, then count them by location
 MapReduce job:
        – map: extract IP addresses from all data, enrich with
          IP geolocation information
        – reduce: group by geographical location, count the
          number of records
        – output: location, count
 Result: Lots of interest in Virginia




Sample MapReduce Java Code
 A MapReduce job consists of a Mapper,
  Reducer, and a Driver
 The Mapper parses, filters, transforms,
  enriches, and extracts
 The Reducer aggregates, counts, and outputs
 The Driver sets up and submits the job for
  execution
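
The three roles can be sketched for the geographical distribution job in plain Java, without a cluster. This is only an illustration of the data flow, not the code from the original slides and not the Hadoop API: the class name GeoCountSketch, the log line layout, the "marshmallow-house" path, and the small prefix-to-location table are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java imitation of a MapReduce job; no Hadoop classes are used.
class GeoCountSketch {
    // Matches a dotted-quad IPv4 address in a log line (octet ranges are not checked).
    private static final Pattern IP =
        Pattern.compile("\\b(\\d{1,3}(?:\\.\\d{1,3}){3})\\b");

    // Hypothetical prefix-to-location table standing in for a real IP geolocation lookup.
    private static final Map<String, String> GEO = Map.of(
        "10.1.", "Virginia",
        "10.2.", "Maryland");

    // "Mapper": filter out lines that do not mention the product, extract the IP,
    // enrich it with a location, and emit (location, 1).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        if (!line.contains("marshmallow-house")) return out;
        Matcher m = IP.matcher(line);
        if (!m.find()) return out;
        String ip = m.group(1);
        String location = "unknown";
        for (Map.Entry<String, String> e : GEO.entrySet()) {
            if (ip.startsWith(e.getKey())) location = e.getValue();
        }
        out.add(Map.entry(location, 1));
        return out;
    }

    // "Reducer": sum the values emitted for one location.
    static int reduce(String location, List<Integer> ones) {
        int sum = 0;
        for (int v : ones) sum += v;
        return sum;
    }

    // "Driver": run the mapper over every line, group by key, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((loc, ones) -> result.put(loc, reduce(loc, ones)));
        return result;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "10.1.0.7 GET /products/marshmallow-house",
            "10.1.3.9 GET /products/marshmallow-house",
            "10.2.8.1 GET /products/marshmallow-house",
            "10.2.8.1 GET /index.html");
        System.out.println(run(logs)); // prints {Maryland=1, Virginia=2}
    }
}
```

In a real Hadoop job the grouping step inside run is done by the framework between the map and reduce phases; here it is written out so the flow is visible.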




Mapper Code




Reducer Code




Driver Code




Sessionizing
 Problem: Data about a single visit is scattered
  across servers and log files
 Value: Analyze a user’s experience at the session
  level, which shows a bigger picture
 Why Hadoop: Hadoop deals well with heterogeneous
  and hierarchical data




Sessionizing
 Solution: Load the data sets and group by IP and
  temporal locality, then output as a hierarchical data
  structure
 MapReduce job:
        – map: extract IP and date/time, keep the record
        – reduce: group by IP, then group into sessions; format
          into JSON documents and output
 Result: 1 million sessions a day
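
A minimal plain-Java sketch of the sessionizing flow, assuming a made-up log layout of "ip epochSeconds rest-of-line" and a 30-minute inactivity gap to close a session. Neither assumption comes from the slides, the class name SessionizeSketch is hypothetical, and the JSON is built by hand on the assumption that the raw lines need no escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the sessionizing job; no Hadoop classes are used.
class SessionizeSketch {
    // One parsed record: IP, time in epoch seconds, and the original line.
    static final class Event {
        final String ip;
        final long time;
        final String raw;
        Event(String ip, long time, String raw) {
            this.ip = ip;
            this.time = time;
            this.raw = raw;
        }
    }

    // Assumed inactivity gap that ends a session (30 minutes).
    static final long GAP_SECONDS = 30 * 60;

    // "map": extract IP and time, keep the whole record.
    static Event map(String line) {
        String[] parts = line.split(" ", 3);
        return new Event(parts[0], Long.parseLong(parts[1]), line);
    }

    // "reduce": for one IP, sort by time and start a new session after each long gap.
    static List<List<Event>> reduce(List<Event> events) {
        List<Event> sorted = new ArrayList<>(events);
        sorted.sort((a, b) -> Long.compare(a.time, b.time));
        List<List<Event>> sessions = new ArrayList<>();
        for (Event e : sorted) {
            boolean startNew = sessions.isEmpty();
            if (!startNew) {
                List<Event> last = sessions.get(sessions.size() - 1);
                startNew = e.time - last.get(last.size() - 1).time > GAP_SECONDS;
            }
            if (startNew) sessions.add(new ArrayList<>());
            sessions.get(sessions.size() - 1).add(e);
        }
        return sessions;
    }

    // Format one session as a small hierarchical JSON document.
    static String toJson(String ip, List<Event> session) {
        StringBuilder sb = new StringBuilder("{\"ip\":\"" + ip + "\",\"events\":[");
        for (int i = 0; i < session.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(session.get(i).raw).append('"');
        }
        return sb.append("]}").toString();
    }

    // Driver: group mapped events by IP, then emit one document per session.
    static List<String> run(List<String> lines) {
        Map<String, List<Event>> byIp = new TreeMap<>();
        for (String line : lines) {
            Event e = map(line);
            byIp.computeIfAbsent(e.ip, k -> new ArrayList<>()).add(e);
        }
        List<String> docs = new ArrayList<>();
        byIp.forEach((ip, events) -> {
            for (List<Event> s : reduce(events)) docs.add(toJson(ip, s));
        });
        return docs;
    }

    public static void main(String[] args) {
        run(List.of(
            "10.1.0.7 1000 GET /",
            "10.1.0.7 1060 GET /products/marshmallow-house",
            "10.1.0.7 9000 GET /checkout")).forEach(System.out::println);
    }
}
```

With the sample input, the first two requests are 60 seconds apart and land in one session; the third arrives more than 30 minutes later and starts a second one.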




Unstructured and Semi-Structured Data
 Unnatural to store in an RDBMS
 Unstructured: text, documents, media,
  raw sensor data
 Semi-structured: mixed structured/unstructured;
  hierarchical
 Hadoop’s ability to leverage Java gives flexibility
 “Schema on load”
 Data stored as “rich documents”




Behavioral Model
 Problem: We don’t understand our visitors’
  typical patterns of behavior
 Value: Optimize our interface for usability;
  understand our customers
 Why Hadoop: Advanced analytics and machine
  learning are possible because of the flexibility of the
  framework




Behavioral Model
 Solution: Run over the sessions and build a generic
  model from them
 MapReduce job: Use clustering to group users into
  stereotypes, then use frequent item set analysis to find
  correlations between our users’ actions
 Results: We have three major types of buyers; casual
  buyers usually visit the marshmallow house from the
  main page
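
The frequent item set step can be illustrated in its simplest form, counting pairs of actions that occur together in a session. The sketch below is plain Java run over a few invented sessions; it does not cover the clustering step, it is not Mahout, and the class name FrequentPairsSketch, the action names, and the support threshold are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Pair counting: the two-item case of frequent item set analysis.
class FrequentPairsSketch {
    // Count how many sessions contain each unordered pair of actions,
    // then keep only pairs seen in at least minSupport sessions.
    static Map<String, Integer> frequentPairs(List<List<String>> sessions, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> session : sessions) {
            // Deduplicate and sort so each pair is counted once per session
            // and always written in the same order.
            List<String> actions = new ArrayList<>(new TreeSet<>(session));
            for (int i = 0; i < actions.size(); i++) {
                for (int j = i + 1; j < actions.size(); j++) {
                    counts.merge(actions.get(i) + " + " + actions.get(j), 1, Integer::sum);
                }
            }
        }
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sessions = List.of(
            List.of("main-page", "marshmallow-house", "checkout"),
            List.of("main-page", "marshmallow-house"),
            List.of("search", "marshmallow-house"));
        System.out.println(frequentPairs(sessions, 2));
        // prints {main-page + marshmallow-house=2}
    }
}
```

On a cluster, the per-session pair generation would sit in the map phase and the counting and threshold in the reduce phase.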




Apache Mahout
 Machine learning library built on Hadoop
 Scalable machine learning
 Open source project
 Data mining, advanced analytics, predictive
  modeling
 Main use cases: recommendation engines,
  clustering, classification, frequent itemset
  mining



Hadoop Makes These Possible
 Analysis of unstructured data is possible with
  Java and Hadoop
 Advanced data mining and machine learning
  techniques fit naturally
 Data analysis can be done on the data in its
  original form
 Analyze large amounts of heterogeneous
  data



Thank You




Analytics on Hadoop

More Related Content

What's hot

Oracle Optimized Datacenter - Storage
Oracle Optimized Datacenter - StorageOracle Optimized Datacenter - Storage
Oracle Optimized Datacenter - StorageWalter Moriconi
 
Big Data launch Singapore Patrick Buddenbaum
Big Data launch Singapore Patrick BuddenbaumBig Data launch Singapore Patrick Buddenbaum
Big Data launch Singapore Patrick BuddenbaumIntelAPAC
 
The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012DATAVERSITY
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopJoey Jablonski
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSJane Man
 
Cómo construimos Oracle TimesTen
Cómo construimos Oracle TimesTenCómo construimos Oracle TimesTen
Cómo construimos Oracle TimesTenSoftware Guru
 
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts EMC
 
Jaspersoft Dashboards Webinar Feb 2013
Jaspersoft Dashboards Webinar  Feb 2013Jaspersoft Dashboards Webinar  Feb 2013
Jaspersoft Dashboards Webinar Feb 2013Mike Boyarski
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011iammutex
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projectsaf83
 
How Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBHow Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBJeremy Taylor
 
KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)
KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)
KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)AI4BD GmbH
 
Liquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANALiquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANASAP Technology
 
From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012Anand Deshpande
 
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Christian Gügi
 

What's hot (18)

Oracle Optimized Datacenter - Storage
Oracle Optimized Datacenter - StorageOracle Optimized Datacenter - Storage
Oracle Optimized Datacenter - Storage
 
Big Data launch Singapore Patrick Buddenbaum
Big Data launch Singapore Patrick BuddenbaumBig Data launch Singapore Patrick Buddenbaum
Big Data launch Singapore Patrick Buddenbaum
 
sigmod08
sigmod08sigmod08
sigmod08
 
The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012The CIOs Guide to NoSQL 2012
The CIOs Guide to NoSQL 2012
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Build a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OSBuild a Big Data solution using DB2 for z/OS
Build a Big Data solution using DB2 for z/OS
 
Cómo construimos Oracle TimesTen
Cómo construimos Oracle TimesTenCómo construimos Oracle TimesTen
Cómo construimos Oracle TimesTen
 
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts Whitepaper : Working with Greenplum Database using Toad for Data Analysts
Whitepaper : Working with Greenplum Database using Toad for Data Analysts
 
Jaspersoft Dashboards Webinar Feb 2013
Jaspersoft Dashboards Webinar  Feb 2013Jaspersoft Dashboards Webinar  Feb 2013
Jaspersoft Dashboards Webinar Feb 2013
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
Realtime hadoopsigmod2011
Realtime hadoopsigmod2011Realtime hadoopsigmod2011
Realtime hadoopsigmod2011
 
Sql no sql
Sql no sqlSql no sql
Sql no sql
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projects
 
How Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDBHow Apollo Group Evaluted MongoDB
How Apollo Group Evaluted MongoDB
 
KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)
KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)
KESW2012 Linked Data for Enterprises and Governments (5 Oct 2012)
 
Liquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANALiquidity Risk Management powered by SAP HANA
Liquidity Risk Management powered by SAP HANA
 
From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012
 
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
Using HBase Coprocessors to implement Prospective Search - Berlin Buzzwords -...
 

Similar to Analytics on Hadoop

Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetupRoby Chen
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Improving MySQL performance with Hadoop
Improving MySQL performance with HadoopImproving MySQL performance with Hadoop
Improving MySQL performance with HadoopSagar Jauhari
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoopHortonworks
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applicationsrussell_jurney
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangaloreTIB Academy
 
Hw09 Data Processing In The Enterprise
Hw09   Data Processing In The EnterpriseHw09   Data Processing In The Enterprise
Hw09 Data Processing In The EnterpriseCloudera, Inc.
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, TIB Academy
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 

Similar to Analytics on Hadoop (20)

Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview
 
Hadoop for shanghai dev meetup
Hadoop for shanghai dev meetupHadoop for shanghai dev meetup
Hadoop for shanghai dev meetup
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Hadoop Business Cases
Hadoop Business CasesHadoop Business Cases
Hadoop Business Cases
 
Improving MySQL performance with Hadoop
Improving MySQL performance with HadoopImproving MySQL performance with Hadoop
Improving MySQL performance with Hadoop
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Agile analytics applications on hadoop
Agile analytics applications on hadoopAgile analytics applications on hadoop
Agile analytics applications on hadoop
 
Hortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics ApplicationsHortonworks: Agile Analytics Applications
Hortonworks: Agile Analytics Applications
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Hw09 Data Processing In The Enterprise
Hw09   Data Processing In The EnterpriseHw09   Data Processing In The Enterprise
Hw09 Data Processing In The Enterprise
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers, Hadoop tutorial for Freshers,
Hadoop tutorial for Freshers,
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 

More from EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremioEMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereEMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewEMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeEMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Analytics on Hadoop

  • 1. ANALYTICS ON HADOOP. Donald Miner, Solutions Architect, Advanced Technologies Group. © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. Large Retailer and Pregnancy
      “As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so they could send coupons timed to very specific stages of her pregnancy.”
  • 3. Hadoop Origins
      • Open source system based on papers published by Google
      • MapReduce was used by Google to parse and index web pages and calculate “page rank”
      • Came from the need for a system that is:
          – Linearly and horizontally scalable
          – Able to store massive amounts of data
          – Fault tolerant
          – Ready to analyze HTML files
          – Cheap to build and maintain
  • 4. What is Hadoop?
      • Two Core Components
          – HDFS: scalable storage in the Hadoop Distributed File System
          – MapReduce: compute via the MapReduce distributed processing platform
      • Open source system developed by the Apache Software Foundation
      • Storage and compute in one framework
      • Massively scalable
  • 5. Why is Hadoop Important?
      • Business analytics require new approaches
          – Data size
          – Data growth
      • The new nature of data
          – Unstructured
          – Numerous sources
      • Hadoop makes analytics on large data sets more cost effective
  • 6. Structured and Unstructured Data
      • Diagram axis: STRUCTURED to UNSTRUCTURED
      • Greenplum DB: Partitioning, SQL, Indexing, RDBMS, BI Tools, GP MapReduce, Tables and Schemas
  • 7. Structured and Unstructured Data
      • Diagram axis: STRUCTURED to UNSTRUCTURED
      • Hadoop: Schema on load, SequenceFile, MapReduce, Hive, Directories, Java, XML, JSON, …, Flat files, Pig, No ETL
  • 8. Leverage Both in a Unified Platform
      • Diagram axis: STRUCTURED to UNSTRUCTURED
      • Greenplum DB: Partitioning, SQL, Indexing, RDBMS, BI Tools, GP MapReduce, Tables and Schemas
      • Hadoop: Schema on load, SequenceFile, MapReduce, Hive, Directories, Java, XML, JSON, …, Pig, No ETL, Flat files
  • 9. Hadoop Use Case
      • Launching our new product: The Marshmallow House
  • 10. Marshmallow House Release Analysis
      • Greenplum Party
  • 11. Website Logs
      • 15 web servers, 5 application servers
      • Problem: cross-correlation
      • Problem: 500 TB of data, with 1 TB/day added
      • Problem: extracting insights from text
  • 12. Current System
      • SQL database
          – ETL process to collect and parse logs
          – Analyze transactions on the website
          – Can’t work with the text comfortably
      • Perl scripts parsing the logs
          – Don’t scale
          – Hard to correlate across systems
          – Hard to deploy
  • 13. Augmenting Capabilities with Hadoop
      • Hadoop helps us extract value in more ways
      • Particular analytics we have in mind:
          – Interest in product by location
          – Sessionizing our disparate data
          – Building behavior models of our customers
          – Analyzing customers’ sentiment toward our products
      • Why? Target Marshmallow House purchasers
  • 14. Geographical Distribution
      • Problem: We don’t know how much interest there is by location
      • Value: This will allow us to justify and scope additional marketing efforts
      • Why Hadoop: Searching through text, parsing logs, custom data structures
  • 15. Geographical Distribution
      • Solution: Find IP addresses interested in our product, then count them by location
      • MapReduce job:
          – map: extract IP addresses from all data, enrich with IP geolocation (ipgeo) information
          – reduce: group by geographical location, count the number of records
          – output: location, count
      • Result: Lots of interest in Virginia
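
The map and reduce steps on slide 15 can be sketched in plain Java. This is a minimal in-memory illustration, not the Hadoop API and not the code from the deck's Mapper and Reducer slides. The class name `GeoInterest`, the prefix-to-location table `GEO`, and the sample addresses are all assumptions invented for the example; a real job would call an IP geolocation service or database instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** In-memory sketch of the map and reduce steps; not the Hadoop API. */
final class GeoInterest {
    // Matches a dotted-quad IPv4 address anywhere in a log line.
    private static final Pattern IPV4 =
            Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");

    // Hypothetical lookup table standing in for a real IP geolocation source.
    private static final Map<String, String> GEO = Map.of(
            "203.0.113.", "Virginia",
            "198.51.100.", "Maryland");

    /** map: extract the first IP address in a record and emit its location, if known. */
    static Optional<String> map(String logLine) {
        Matcher m = IPV4.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();            // record carries no IP address
        }
        String ip = m.group();
        String prefix = ip.substring(0, ip.lastIndexOf('.') + 1);
        return Optional.ofNullable(GEO.get(prefix));
    }

    /** reduce: group by location and count the records. */
    static Map<String, Integer> reduce(List<String> locations) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String location : locations) {
            counts.merge(location, 1, Integer::sum);
        }
        return counts;
    }

    /** Runs map over every record, then reduce over the emitted locations. */
    static Map<String, Integer> run(List<String> logLines) {
        List<String> locations = new ArrayList<>();
        for (String line : logLines) {
            map(line).ifPresent(locations::add);
        }
        return reduce(locations);
    }
}
```

Calling `GeoInterest.run` on a list of log lines returns a location-to-count map, which corresponds to the "location, count" output named on the slide. In Hadoop the grouping between the two steps is done by the framework's shuffle rather than by an explicit list.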
  • 16. Sample MapReduce Java Code
      • A MapReduce job consists of a Mapper, a Reducer, and a Driver
      • The Mapper parses, filters, transforms, enriches, and extracts
      • The Reducer aggregates, counts, and outputs
      • The Driver sets up and submits the job for execution
  • 17. Mapper Code
  • 18. Reducer Code
  • 19. Driver Code
  • 20. Sessionizing
      • Problem: Data is scattered
      • Value: Analyze a user’s experience at the session level, which shows a bigger picture
      • Why Hadoop: Hadoop deals well with heterogeneous and hierarchical data
  • 21. Sessionizing
      • Solution: Load the data sets and group by IP and temporal locality, then output as a hierarchical data structure
      • MapReduce job:
          – map: extract IP and date/time, keep the record
          – reduce: group by IP, then group into sessions; format into JSON documents and output
      • Result: 1 million sessions a day
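
The sessionizing job on slide 21 can be sketched the same way in plain Java. Again this is an in-memory illustration rather than Hadoop code. The `Event` fields, the 30-minute inactivity gap used to define "temporal locality", and the JSON layout are assumptions; the deck does not state which values or format were used.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of sessionizing by IP and time gap; not the Hadoop API. */
final class Sessionizer {
    /** One parsed log record; the field names are illustrative. */
    static final class Event {
        final String ip;
        final long epochSeconds;
        final String path;

        Event(String ip, long epochSeconds, String path) {
            this.ip = ip;
            this.epochSeconds = epochSeconds;
            this.path = path;
        }
    }

    // Assumed inactivity gap that closes a session (30 minutes).
    static final long GAP_SECONDS = 30 * 60;

    /** map plus shuffle: key every record by IP, keeping the whole record. */
    static Map<String, List<Event>> groupByIp(List<Event> events) {
        Map<String, List<Event>> byIp = new TreeMap<>();
        for (Event e : events) {
            byIp.computeIfAbsent(e.ip, k -> new ArrayList<>()).add(e);
        }
        return byIp;
    }

    /** reduce: sort one IP's events by time, cut sessions at large gaps, emit JSON. */
    static List<String> reduce(String ip, List<Event> events) {
        List<Event> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong((Event e) -> e.epochSeconds));
        List<String> sessions = new ArrayList<>();
        List<Event> current = new ArrayList<>();
        for (Event e : sorted) {
            boolean gapExceeded = !current.isEmpty()
                    && e.epochSeconds - current.get(current.size() - 1).epochSeconds > GAP_SECONDS;
            if (gapExceeded) {
                sessions.add(toJson(ip, current));
                current = new ArrayList<>();
            }
            current.add(e);
        }
        if (!current.isEmpty()) {
            sessions.add(toJson(ip, current));
        }
        return sessions;
    }

    /** Formats one session as a JSON document; assumes values need no escaping. */
    static String toJson(String ip, List<Event> session) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"ip\":\"").append(ip).append("\",\"events\":[");
        for (int i = 0; i < session.size(); i++) {
            Event e = session.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"t\":").append(e.epochSeconds)
              .append(",\"path\":\"").append(e.path).append("\"}");
        }
        return sb.append("]}").toString();
    }

    /** Runs the grouping step, then the reduce step for each IP in sorted order. */
    static List<String> run(List<Event> events) {
        List<String> out = new ArrayList<>();
        groupByIp(events).forEach((ip, evs) -> out.addAll(reduce(ip, evs)));
        return out;
    }
}
```

Each output string is one session document holding the events that belong to it, which is the kind of hierarchical record the slide describes. A production job would build the JSON with a proper library so that paths containing quotes or backslashes are escaped.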
  • 22. Unstructured and Semi-Structured Data
      • Unnatural to store in an RDBMS
      • Unstructured: text, documents, media, raw sensor data
      • Semi-structured: mixed structured/unstructured; hierarchical
      • Hadoop’s ability to leverage Java gives flexibility
      • “Schema on load”
      • Data stored as “rich documents”
  • 23. Behavioral Model
      • Problem: We don’t understand our visitors’ stereotypical behavior
      • Value: Optimize our interface for usability; understand our customers
      • Why Hadoop: Advanced analytics and machine learning are possible because of the flexibility of the framework
  • 24. Behavioral Model
      • Solution: Run over the sessions and build a generic model from them
      • MapReduce job: Use clustering to group users into stereotypes, then use frequent item set analysis to build correlations between our users’ actions
      • Results: We have three major types of buyers; casual buyers usually visit the Marshmallow House from the main page
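
The frequent item set step on slide 24 can be illustrated with a deliberately simplified sketch that only counts pairs of actions that occur together in a session. It is a stand-in for the idea, not the algorithm the deck used and not Apache Mahout's implementation, and it leaves out the clustering step entirely. The class name `FrequentPairs`, the action names, and the `minSupport` threshold are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory sketch of pair co-occurrence counting; not Mahout and not the Hadoop API. */
final class FrequentPairs {
    /** map: emit every unordered pair of distinct actions seen in one session. */
    static List<String> map(Set<String> sessionActions) {
        // Sorting makes "a + b" and "b + a" produce the same key.
        List<String> sorted = new ArrayList<>(new TreeSet<>(sessionActions));
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                pairs.add(sorted.get(i) + " + " + sorted.get(j));
            }
        }
        return pairs;
    }

    /** reduce: count each pair and keep only those seen at least minSupport times. */
    static Map<String, Integer> reduce(List<String> pairs, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String pair : pairs) {
            counts.merge(pair, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < minSupport);
        return counts;
    }

    /** Runs map over every session, then reduce over all emitted pairs. */
    static Map<String, Integer> run(List<Set<String>> sessions, int minSupport) {
        List<String> pairs = new ArrayList<>();
        for (Set<String> session : sessions) {
            pairs.addAll(map(session));
        }
        return reduce(pairs, minSupport);
    }
}
```

Pairs that survive the threshold suggest actions that tend to go together, such as visiting the main page and then the product page. Full frequent item set mining also considers larger sets and prunes candidates more carefully than this pairwise count does.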
  • 25. Apache Mahout
      • Machine learning library built on Hadoop
      • Scalable machine learning
      • Open source project
      • Data mining, advanced analytics, predictive modeling
      • Main use cases: recommendation engines, clustering, classification, frequent itemset mining
  • 26. Hadoop Makes These Possible
      • Unstructured analysis is possible in Java and Hadoop
      • Advanced data mining and machine learning techniques are natural
      • Data analysis can be done on the data in its original form
      • Analyze large amounts of heterogeneous data
  • 27. Provide Feedback & Win!
      • 125 attendees will receive $100 iTunes gift cards. To enter the raffle, simply complete:
          – 5 session surveys
          – The conference survey
      • Download the EMC World Conference App to learn more: emcworld.com/app
  • 29. Thank You