Hadoop
(Shanghai Developer Meetup – Sept 15, 2011)

余家昌 (Andrew Yu)
EMC Greenplum

© Copyright 2011 EMC Corporation. All rights reserved.
The Elephant Chase




Yahoo! Hadoop use cases
• Personalized Yahoo! Homepage
• Yahoo! Mail anti-spam
• Search and Ad pipelines
• Ad inventory prediction
• Data analytics
• etc




Enterprise Use Case: “Big ETL”
Challenge: Transform Massive Data Flows Containing Data Needed for Complex Analysis
Solution: Hadoop/MapReduce as ETL fabric to load to Analytic Database
• Examples:
         – Web Traffic Reduction
         – Network Traffic & Performance Analysis
         – Location Analytics for People and Goods
         – Smart Electric Power Grid
         – Genome Analysis
         – Clinical Outcome Research & Analysis
• Data Sources:
         – Web server & app server logs
         – CDR / xDRs
         – Router & Switching Subsystem Logs
         – Sensor networks
• Components:
         – Hadoop: Massively-parallel ingest, storage and analysis
         – MapReduce: Runs multiple cascaded custom analysis / extraction on capture data
         – Connectors move structured data to Analytics DB
• Hadoop’s Roles:
         – Capture TBs/day of machine-generated data
         – Quality: Run data quality tasks in MapReduce
         – Execute MapReduce flows
         – Extract/Combine data/metadata
         – Move processed data to analytic DB
• Limitations & Cautions:
         – Software development, More parts (Cascading/Flow), Maintainability
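
The "Quality: Run data quality tasks in MapReduce" role can be pictured with a small in-process simulation of the map and reduce steps. This is only a sketch: the three-field log format, the sample lines, and the names map_quality, reduce_counts and run_quality_job are assumptions made for illustration, and a real job would run on a Hadoop cluster rather than in a single Python process.

```python
from collections import defaultdict
from itertools import chain

# Assumed log format for this sketch: "timestamp host status"
EXPECTED_FIELDS = 3


def map_quality(line):
    """Map step: classify one raw log line and emit a (category, 1) pair."""
    fields = line.split()
    if len(fields) != EXPECTED_FIELDS:
        yield ("malformed", 1)
    elif not fields[2].isdigit():
        yield ("bad_status", 1)
    else:
        yield ("valid", 1)


def reduce_counts(pairs):
    """Shuffle and reduce step: group pairs by key and sum the counts."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_quality_job(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    return reduce_counts(chain.from_iterable(map_quality(l) for l in lines))


# Hypothetical sample input, not taken from the slides.
SAMPLE_LOG = [
    "2011-09-15T10:00:00 web01 200",
    "2011-09-15T10:00:01 web02 404",
    "truncated line",
    "2011-09-15T10:00:02 web01 OK",
]
QUALITY_REPORT = run_quality_job(SAMPLE_LOG)
```

The per-category counts in QUALITY_REPORT are the kind of summary that could gate whether a batch is moved on to the analytic database.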



Enterprise Use Case: Fraud Detection
Challenge: Identify & alert fraudulent activity patterns
Solution: Hadoop/MapReduce to filter & correlate communications
• Examples:
         – ESP’s - Email Fraud
         – Finance/Banking - Bank Fraud
         – Advertising - Click Fraud
         – Telecom - Network fraud
• Data Sources:
         – Web & app server logs
         – IP/Call Records
         – Email Traffic
         – Customer Transaction Data
         – Banking/Credit Data
• Components:
         – Hadoop: Massively-parallel ingest, storage and analysis
         – Mahout: Machine learning tool for building fraud algorithms
         – MapReduce: Rapid analysis & algorithm deployment
• Hadoop’s Role(s):
         – Massive ingest of historical/real-time data
         – Build/Validate model for fraud detection manually or using Mahout
         – Parallel MapReduce jobs for near real-time fraud detection
• Limitations & Cautions:
         – Software development, Partial Solution (not Real-time, not Interactive)


Enterprise Use Case: Cluster Analysis
Challenge: Grouping a collection of data according to common similarities
Solution: Process and Refine in Hadoop and load into Analytical DB
• Examples:
         – Customer segmentation
         – Financial cost/risk analysis
         – Patient-centric healthcare
         – Financial stock classification
         – Social network analysis
• Data Sources:
         – Health records
         – Sales data
         – Human genome sequences
         – Financial trading data
         – Facebook/Twitter/LinkedIn
• Components:
         – Hadoop: Flexible data storage as volume increases and structures vary
         – MapReduce: Cascading allows data processing with minimal adjustments
         – Optional: Connectors to move results to Analytic DB
• Hadoop’s Role(s):
         – Flexible: Allow agile implementation and unit testing of algorithms
         – Large scale analysis in Hadoop creates more accurate groupings
         – Rapid, parallel processing in MapReduce
• Limitations & Cautions:
         – Software development, Complex Integration with Sources



Greenplum HD: Community Edition Stack
100% Apache
[Stack diagram, top to bottom: Hive, Pig, HBase, Zookeeper; MapReduce Framework (MapRed); Hadoop Distributed File System (HDFS). Legend: Currently supported]
Future releases may include support for Oozie and Mahout
Greenplum HD: Enterprise Edition Stack
100% Apache interface
[Stack diagram, top to bottom: Enhanced Monitoring; Hive, Pig, HBase, Zookeeper; MapReduce Framework (MapRed); Hadoop Distributed File System (HDFS). Legend: Currently supported]
Future releases may include support for Oozie and Mahout
Greenplum HD: Enterprise Edition
Enterprise-Ready Hadoop Platform for Unstructured Data
• Faster
         – 2-5x faster than Apache Hadoop
• Reliable
         – High Availability
         – Mirroring
• Easier to Use
         – NFS mountable
         – System Management




Greenplum Enterprise HD is Faster than Other Distributions
[Charts: DFSIO read and write throughput in MB/sec (higher is better); Terasort elapsed time in minutes for 3.5 TB (lower is better)]
Test setup: 10 node cluster, 2x Quad-Core, 24G DRAM, 12 x 1TB SATA Drives @ 7200 rpm, Quad NICs




Greenplum Enterprise HD
Distributed Name Node
• Fully distributed service running on all Hadoop nodes
• Automatic and transparent failover
• Persistent metadata
• Highly scalable in number of files
[Diagram: a grid of Hadoop nodes, each running a name node (NN) service]




Greenplum Enterprise HD
Job Tracker High Availability
• Assures business continuity
• Designed for mission critical use
         – Automatic stateful restart
         – Task Tracker reconnects without task loss
         – Persistent completed task state
[Diagram: Greenplum Enterprise HD Distribution for Apache Hadoop. Enterprise HD MapReduce (Job Tracker HA, Distributed Name Node) sits on Enterprise HD Lockless Storage Services]




Greenplum Enterprise HD
Snapshots
• Intelligent Snapshots
         – Automatic data deduplication
         – Block sharing for space savings
• Fast and flexible
         – Zero performance loss when writing to the original
• Easy to manage
         – Scheduled or on-demand
         – Drag and drop recovery
[Diagram: Hadoop / HBase applications and NFS applications read and write through Enterprise HD Lockless Storage Services. Blocks A, B, C, C’, D illustrate redirect on write for snapshot; Snapshots 1, 2 and 3 reference the blocks]
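
The redirect-on-write idea in the snapshot diagram (a write to C lands in a new block C’ while earlier snapshots keep pointing at C) can be sketched with a toy block map. This is an illustration under assumptions, not the product's implementation: the Volume class, its methods, and the in-memory dictionaries are all hypothetical, and details such as reclaiming unreferenced blocks are left out.

```python
class Volume:
    """Toy volume: a live block map plus snapshots that share data blocks."""

    def __init__(self):
        self.store = {}       # physical block id -> data
        self.live = {}        # logical block name -> physical block id
        self.snapshots = []   # each snapshot is a frozen copy of a block map
        self._next_id = 0

    def write(self, name, data):
        # Redirect on write: new data always goes to a fresh physical block,
        # so blocks referenced by a snapshot are never overwritten in place.
        block_id = self._next_id
        self._next_id += 1
        self.store[block_id] = data
        self.live[name] = block_id

    def snapshot(self):
        # Only the block map is copied; the data blocks themselves are shared.
        self.snapshots.append(dict(self.live))
        return len(self.snapshots) - 1

    def read(self, name, snapshot=None):
        # Read from the live map, or from a snapshot's map when one is given.
        block_map = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[block_map[name]]


vol = Volume()
for name in "ABCD":
    vol.write(name, f"{name}-v1")
snap = vol.snapshot()
vol.write("C", "C-v2")  # plays the role of C' in the diagram
```

After the last write, the live volume sees the new contents of C, the snapshot still sees the old contents, and blocks A, B and D are stored once and shared by both views.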




Greenplum Enterprise HD
Mirroring
• Business Continuity
         – Efficient design
         – Differential deltas are updated
         – Data is compressed and check-summed
• Easy to manage
         – Scheduled or on-demand
         – Consistent point-in-time
[Diagram: Production (Datacenter 1) mirrored over WAN to Research (Datacenter 2); Production mirrored over WAN to Cloud]




Greenplum Enterprise HD
Direct Access Using NFS
• Simple application integration
         – Leverage NFS for random read/write access
• Direct access for standard Hadoop tools
         – Command line utilities
         – File browsers
         – Desktop utilities
[Diagram: Greenplum Enterprise HD Distribution for Apache Hadoop. Enterprise HD MapReduce (Job Tracker HA, Distributed Name Node) sits on Enterprise HD Lockless Storage Services]


Greenplum Enterprise HD
 Simple Management

• Intuitive
• Insightful
• Complete
• One node or thousands




Greenplum HD: Software Distributions

Features                      Community Edition           Enterprise Edition
Apache Compatibility          100% Apache Open Source     100% API Compatible
Name Node High Availability   Reference Implementation    Distributed and High Availability
Job Tracker HA                Reference Implementation    JT High Availability
Name Node Scalability         NN Metadata in Memory       Distributed Name Node
Premium Support               Yes                         Yes
Performance                                               2-5x faster than Community Edition
Snapshots                     No                          Yes
Mirrors                       No                          Yes
NFS Mounts                    No                          Yes
System Management             No                          Yes
Available for Ordering        May 9th 2011                Q3
Pricing                       Per Node Pricing            Per Node Pricing




Greenplum HD on Data Computing Appliance
• Introducing the world’s first:
         – High-performance
         – Purpose-built
         – Data co-processing Hadoop appliance
• Combining Greenplum Database and Greenplum Hadoop in one appliance




GPDB and GPHD Interoperability
[Diagram: GPDB External Tables move GPHD data in and out within a GPDB query; on the GPHD side the data is a file on HD]




Greenplum Database
External Tables for Hadoop
• Bring GPDB relational expressive power to HDFS
         – HDFS data presented as external tables
         – HDFS data supporting full SQL syntax
• Have ALL, PART or NONE of your data in HDFS
• Leverage full parallelism of both Hadoop and GPDB
         – GPDB can read from/write to HDFS

Example:
    SELECT count(*)
    FROM HDFS_data h, GPDB_data g
    WHERE h.key = g.key;

    INSERT INTO HDFS_data
    SELECT * FROM GPDB_data;




Greenplum Enterprise HD
HDFS Integration – Parallelized Flow
• Reading:
         – Each GPDB segment reads a portion of the file
                   • Segment i of n reads the i/n-th portion
         – Access offset from HDFS namenode
         – Read data directly from HDFS datanode
• Writing:
         – Each GPDB segment writes a file
         – HDFS balancing distributes the load evenly
           across datanodes
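
The "segment i of n reads the i/n-th portion" rule for reading can be sketched as a byte-range calculation. This is a minimal sketch under assumptions: the function name segment_byte_range is hypothetical, segments are numbered from zero here, and a real reader would also have to align its range to record boundaries, which this example ignores.

```python
def segment_byte_range(file_size, segment_index, segment_count):
    """Return the half-open byte range [start, end) read by one segment.

    segment_index is zero-based in this sketch. Integer arithmetic keeps
    the ranges contiguous and non-overlapping, so together they cover the
    whole file exactly once.
    """
    if segment_count <= 0:
        raise ValueError("segment_count must be positive")
    if not 0 <= segment_index < segment_count:
        raise ValueError("segment_index out of range")
    start = file_size * segment_index // segment_count
    end = file_size * (segment_index + 1) // segment_count
    return start, end


# Example: a 1000-byte file split across 3 segments.
RANGES = [segment_byte_range(1000, i, 3) for i in range(3)]
```

Each segment can then request only its own range from the datanodes, which is what lets all segments read the same file in parallel.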




Big Data Analytics “Stack”
Layers, top to bottom:
• Analytic Toolsets (Business Analytics, BI, Statistics, etc.)
• Greenplum Chorus: Enterprise Collaboration Platform for Data
• Greenplum Database: World’s Most Scalable MPP Database Platform
• Greenplum HD: Enterprise Analytics Platform for Unstructured Data
• Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics




THANK YOU



