EMC Isilon Big Data Storage and Hadoop Analytics

Jemish Patel




© Copyright 2012 EMC Corporation. All rights reserved.            1
Today’s Agenda
 • The Big Data Opportunity
 • Big Data Analytics with Hadoop
 • Technology Challenges of Hadoop
 • EMC’s Hadoop Solutions for the Enterprise
 • Q+A




The Big Data Opportunity




“Big Data Is Less About Size, And More About Freedom” ― Techcrunch

“Findings: ‘Big Data’ Is More Extreme Than Volume” ― Gartner

“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World” ― IDC

“Total data: ‘bigger’ than big data” ― 451 Group

THE ERA OF BIG DATA IS HERE

BIG DATA IS TRANSFORMING BUSINESS


Big Data in Action
• Healthcare

       – Leverage historical data to discover better
         treatments

• Financial Services

       – Data-driven banking stress tests & risk
         analysis

• Utilities

       – Machine-learning to predict service outages
         & prevent energy theft



Hadoop & Big Data




The Promise of Big Data Analytics
 • Leverage data assets to identify key
   trends and new business opportunities

 • Analyze new sources of information to
   gain competitive advantages

 • Take an agile approach to analytics that
   can adapt at the speed of business

 • Scale your storage and analysis
   platform to handle Big Data’s volume,
   velocity and variety




The Emergence of Hadoop
• Created 5-6 years ago (circa 2006) by Doug Cutting,
  a former Yahoo! engineer
• Software platform designed to analyze
  massive amounts of unstructured data
• Two core components:
       – Hadoop Distributed File System (HDFS) (storage)

       – MapReduce (compute)

• Now a top-level Apache project backed by
  a large open source development community



MapReduce
• "Map" step: The master node takes the input, divides it into
smaller sub-problems, and distributes them to worker nodes. A
worker node may do this again in turn, leading to a multi-level
tree structure. Each worker node processes its smaller problem
and passes the answer back to its master node.


• "Reduce" step: The master node then collects the answers to all
the sub-problems and combines them to form the output, which is
the answer to the problem it was originally trying to solve.
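
The two steps above can be illustrated with a word count, the usual introductory MapReduce example. This is a minimal single-process sketch, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the explicit `shuffle` step, which groups intermediate values by key between the two phases, is something the framework does but the slide does not mention.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key (done by the framework between phases)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values collected for one key into a single answer."""
    return key, sum(values)


def word_count(lines: List[str]) -> Dict[str, int]:
    """Run map over every line, group by word, then reduce each group."""
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data analytics"]))
```

In a real cluster each call to `map_phase` and `reduce_phase` would run as a separate task on a worker node; here they simply run in sequence.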




MapReduce




Services for MapReduce
• JobTracker – The service on the master node that manages job submission,
scheduling, and reprocessing when tasks fail. A job consists of a mapper, a
reducer and a list of inputs.
• TaskTracker – Each slave node in the cluster runs a TaskTracker process.
The JobTracker instructs the TaskTrackers to run and monitor tasks. A
task consists of a map or a reduce over a piece of data.
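
The division of labour between the two services can be sketched as follows. This is a toy model under stated assumptions, not Hadoop code: the classes reuse the service names only for readability, the round-robin assignment and the `max_attempts` retry limit are illustrative choices, and the `fail_on` set exists purely to simulate a failing worker.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Set


@dataclass
class TaskTracker:
    """Toy worker: runs one task at a time; fails on the task ids in fail_on."""
    name: str
    fail_on: Set[int] = field(default_factory=set)  # simulated failures (demo only)

    def run(self, task_id: int, func: Callable[[Any], Any], chunk: Any) -> Any:
        if task_id in self.fail_on:
            raise RuntimeError(f"{self.name} failed task {task_id}")
        return func(chunk)


class JobTracker:
    """Toy master: splits a job into tasks and reschedules tasks that fail."""

    def __init__(self, trackers: List[TaskTracker], max_attempts: int = 3) -> None:
        self.trackers = trackers
        self.max_attempts = max_attempts

    def submit(self, func: Callable[[Any], Any], chunks: List[Any]) -> List[Any]:
        """One task per input chunk; results come back in input order."""
        return [self._run_with_retry(i, func, chunk) for i, chunk in enumerate(chunks)]

    def _run_with_retry(self, task_id: int, func: Callable[[Any], Any], chunk: Any) -> Any:
        for attempt in range(self.max_attempts):
            # Each retry moves to the next tracker in round-robin order.
            tracker = self.trackers[(task_id + attempt) % len(self.trackers)]
            try:
                return tracker.run(task_id, func, chunk)
            except RuntimeError:
                continue
        raise RuntimeError(f"task {task_id} failed after {self.max_attempts} attempts")


if __name__ == "__main__":
    trackers = [TaskTracker("tt1", fail_on={0}), TaskTracker("tt2")]
    print(JobTracker(trackers).submit(sum, [[1, 2], [3, 4]]))
```

Task 0 fails on `tt1` and is rerun on `tt2`, which is the "reprocessing" behaviour the slide attributes to the JobTracker.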




HDFS – Hadoop Distributed Filesystem
• HDFS is a filesystem designed for storing very large files with
streaming data access patterns, running on clusters of
commodity hardware.
• HDFS has a permissions model for files and directories that is
much like POSIX.
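
To make "much like POSIX" concrete, here is a minimal sketch of an owner/group/other permission check on a mode such as `0o640`. The `FileStatus` class and `can_access` function are hypothetical names for illustration, and the sketch deliberately ignores superusers and any other special cases.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class FileStatus:
    """Hypothetical stand-in for a file's ownership and mode bits."""
    owner: str
    group: str
    mode: int  # octal mode, e.g. 0o640 = rw- r-- ---


def can_access(status: FileStatus, user: str, groups: Set[str], want: str) -> bool:
    """Check 'r', 'w' or 'x' access using exactly one permission class.

    The owner class applies if the user owns the file, otherwise the group
    class if the user belongs to the file's group, otherwise the other class.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == status.owner:
        shift = 6  # owner bits
    elif status.group in groups:
        shift = 3  # group bits
    else:
        shift = 0  # other bits
    return bool((status.mode >> shift) & bit)


if __name__ == "__main__":
    report = FileStatus(owner="alice", group="stats", mode=0o640)
    print(can_access(report, "alice", set(), "w"))
    print(can_access(report, "bob", {"stats"}, "w"))
```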




Services for HDFS
• Namenode – Manages the filesystem namespace. It maintains the
filesystem tree and the metadata for all the files and directories in the
tree. This information is stored persistently on the local disk in the form
of two files: the namespace image and the edit log.
• Datanode – The workhorses of the filesystem. Datanodes store and retrieve
blocks when they are told to (by clients or the namenode), and they
report back to the namenode periodically with lists of the blocks that they
are storing.
• Secondary Namenode – Its main role is to periodically merge the
namespace image with the edit log to prevent the edit log from
becoming too large. The secondary namenode usually runs on a
separate physical machine.
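
The relationship between the namespace image, the edit log and the periodic merge can be sketched as below. This is a simplified in-memory model with hypothetical names (`ToyNamenode`, `_replay`); it does not reflect the real on-disk formats, and it replays the log on every read, which a real namenode would not do.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str, List[int]]  # (operation, path, block ids)


def _replay(image: Dict[str, List[int]], edits: List[Edit]) -> Dict[str, List[int]]:
    """Apply logged edits, in order, to a copy of the namespace image."""
    merged = {path: list(blocks) for path, blocks in image.items()}
    for op, path, blocks in edits:
        if op == "create":
            merged[path] = list(blocks)
        elif op == "delete":
            merged.pop(path, None)
    return merged


class ToyNamenode:
    def __init__(self) -> None:
        self.image: Dict[str, List[int]] = {}  # namespace as of the last checkpoint
        self.edit_log: List[Edit] = []         # every change since that checkpoint

    def create(self, path: str, blocks: List[int]) -> None:
        self.edit_log.append(("create", path, list(blocks)))

    def delete(self, path: str) -> None:
        self.edit_log.append(("delete", path, []))

    def namespace(self) -> Dict[str, List[int]]:
        """Current view of the filesystem: the image with the edit log replayed."""
        return _replay(self.image, self.edit_log)

    def checkpoint(self) -> None:
        """The secondary namenode's job: fold the edit log into a new image."""
        self.image = _replay(self.image, self.edit_log)
        self.edit_log = []


if __name__ == "__main__":
    nn = ToyNamenode()
    nn.create("/logs/a", [1, 2])
    nn.create("/logs/b", [3])
    nn.delete("/logs/a")
    nn.checkpoint()
    print(nn.image, len(nn.edit_log))
```

Without the checkpoint the edit log grows with every change, which is the problem the slide says the secondary namenode exists to prevent.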



Hadoop Eco-System Components
 • Pig – A high-level data-flow language and execution framework for parallel computation
 • Mahout – A scalable machine learning and data mining library
 • Hive – A data warehouse infrastructure that provides data summarization and ad hoc
   querying (SQL-like)
 • HBase – A scalable, distributed database that supports structured data storage for large
   tables
 • R (RHIPE) – Combines Hadoop with the R analytics language

 Ecosystem:   R (RHIPE) | Pig | Mahout | Hive | HBase
 Core:        MapReduce – Compute Layer (Job Scheduling / Execution)
              HDFS – Storage Layer (Hadoop Distributed Filesystem)


Why Hadoop is Important
 • Pragmatic approach to analytics on a very large scale
        – Opens up new ways of gaining insights and identifying
          opportunities for businesses

 • Designed to address the rise of unstructured data
        – Enterprise data to grow by 650% over the next 5 years

        – More than 80% of this growth will be unstructured data




Evolution of the Hadoop Market




[Figure: technology adoption curve. Innovators / Early Adopters → Early Majority → Late Majority → Laggards.
 Segments marked: Hadoop Early Adopters, Hadoop Early Majority.]



Evolution of the Hadoop Market
    HADOOP PROFILE (TO DATE)                    HADOOP PROFILE (EMERGING)
    (Hadoop Early Adopters)                     (Hadoop Early Majority)

    Pioneers and academics                      IT Manager & CIO
    Application Architect                       Data Scientist
    Visionary                                   Line-of-business

    Open source / community driven              Commercial distribution
    Build-your-own server, application &        Turnkey solution
      storage infrastructure
    Commodity components                        End-to-End Data protection

    Web 2.0                                     Fortune 1000
    Universities                                Financial Services
    Life Sciences                               Retail



Technology Challenges of Hadoop




Hadoop Architecture
   1. Data is ingested into the Hadoop Distributed File System (HDFS)
   2. Computation occurs inside Hadoop (MapReduce)
   3. Results are exported from HDFS for use




   [Diagram: six Hadoop Data Nodes and one Hadoop Name Node connected over Ethernet]




Writing Data into Hadoop




Reading Data from HDFS




Technology Challenges of Hadoop
    Hadoop DAS Environment (Name node)

    1   Dedicated Storage Infrastructure
          – One-off for Hadoop only

    2   Single Point of Failure
          – Namenode

    3   Lacking Enterprise Data Protection
          – No Snapshots, replication, backup

    4   Poor Storage Efficiency
          – 3X mirroring

    5   Fixed Scalability
          – Rigid compute to storage ratio

    6   Manual Import/Export
          – No protocol support




Technology Challenges of Hadoop
    [Diagram: Hadoop DAS Environment. A Namenode and data nodes, with node labels 1x, 2x and 3x.]

EMC Addresses the Hadoop Challenge
        Technology Challenges of Hadoop               The EMC Isilon Advantage

    1   Dedicated Storage Infrastructure              Scale-Out Storage Platform
          – One-off for Hadoop only                     – Multiple applications & workflows

    2   Single Point of Failure                       No Single Point of Failure
          – Namenode                                    – Distributed Namenode

    3   Lacking Enterprise Data Protection            End-to-End Data Protection
          – No Snapshots, replication, backup           – SnapshotIQ, SyncIQ, NDMP Backup

    4   Poor Storage Efficiency                       Industry-Leading Storage Efficiency
          – 3X mirroring                                – >80% Storage Utilization

    5   Fixed Scalability                             Independent Scalability
          – Rigid compute to storage ratio              – Add compute & storage separately

    6   Manual Import/Export                          Multi-Protocol
          – No protocol support                         – Industry standard protocols
                                                        – NFS, CIFS, FTP, HTTP, HDFS
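
The storage-efficiency row can be checked with simple arithmetic. The sketch below compares the raw capacity needed to hold the same data under 3X mirroring and at 80% utilization, the two figures given on the slide. The function names and the 100 TB data size are illustrative assumptions, and 80% is used as the lower bound of the ">80%" claim.

```python
def raw_capacity_mirrored(data_tb: float, copies: int = 3) -> float:
    """Raw capacity needed when every block is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return data_tb * copies


def raw_capacity_at_utilization(data_tb: float, utilization: float = 0.80) -> float:
    """Raw capacity needed when `utilization` of raw space holds user data."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return data_tb / utilization


if __name__ == "__main__":
    data_tb = 100.0  # illustrative data set size
    print(f"3X mirroring:    {raw_capacity_mirrored(data_tb):.0f} TB raw")
    print(f"80% utilization: {raw_capacity_at_utilization(data_tb):.0f} TB raw")
```

For 100 TB of data this works out to 300 TB of raw disk with 3X mirroring (about 33% utilization) against 125 TB at 80% utilization.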




The EMC Isilon Advantage for Hadoop
    1   Scale-Out Storage Platform
          – Multiple applications & workflows

    2   No Single Point of Failure
          – Distributed Namenode

    3   End-to-End Data Protection
          – SnapshotIQ, SyncIQ, NDMP Backup

    4   Industry-Leading Storage Efficiency
          – >80% Storage Utilization

    5   Independent Scalability
          – Add compute & storage separately

    6   Multi-Protocol
          – Industry standard protocols
          – NFS, CIFS, FTP, HTTP, HDFS




Writing into Hadoop with Isilon




• Isilon becomes the namenode as well as the datanode
• Provides scalability and protection of the data
• The Hadoop cluster no longer has a single point of failure and no longer writes
multiple 64 MB-128 MB chunks of data to datanodes
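
For a sense of scale, the sketch below counts how many chunks a conventional HDFS write produces for one file. It assumes the 64 MB and 128 MB chunk sizes quoted above and a replication factor of 3, taken from the "3X mirroring" figure earlier in the deck. The function name is hypothetical, and nothing here models how Isilon itself lays out data.

```python
from typing import Tuple

MB = 1024 * 1024


def block_writes(file_size_bytes: int,
                 block_size_bytes: int = 64 * MB,
                 replication: int = 3) -> Tuple[int, int]:
    """Return (blocks, total block copies written) for one file."""
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be positive and replication at least 1")
    # Ceiling division: a partial final block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    return blocks, blocks * replication


if __name__ == "__main__":
    one_gb = 1024 * MB
    print(block_writes(one_gb))                 # 64 MB blocks
    print(block_writes(one_gb, 128 * MB))       # 128 MB blocks
```

A 1 GB file splits into 16 blocks of 64 MB, so three-way replication means 48 block copies land on datanodes; with 128 MB blocks the figures are 8 and 24.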


Reading Hadoop Data with Isilon




• Data is read off the Isilon cluster back to the compute nodes.
• The datanodes are now just compute nodes and are
independent of the data in the Hadoop cluster.
        – Benefit: the Hadoop hardware can be upgraded without
        migrating the data



Industry’s First and Only Scale-Out Storage
Solution with Native Hadoop Integration


                                                         Accelerating the Benefits of
                                                         Hadoop for the Enterprise

                                                         Reducing Risk

                                                         End-to-End Data Protection

                                                         Organizational
                                                         Knowledge/Experience
EMC’s Enterprise Hadoop Solution
EMC Greenplum HD and EMC Isilon Scale-Out Storage

 • Apache Hadoop certified by Greenplum
 • Simple platform management and control
 • Parallel analytics access with Greenplum Database

[Diagram labels: Compute, Storage]




Greenplum: Not Just About Technology
                                                 • Data Science teams will become the
                                                   driving force for success with big data
                                                   analytics
                                                 • Greenplum is committed to the future
                                                   of data science
                                                         – University data science program collaboration
                                                           with Stanford and UC Berkeley
                                                         – Community investment including the
                                                           Greenplum Analytic Workbench, Community
                                                           edition software, and Data Science Summits

                                                 • Greenplum built its own Data Science
                                                   practice
                                                         – Leading PhDs with analytic tools expertise



Hadoop in Action




Customer Case Study
   Purdue University

   Sections: Background, Challenge, Solution

   Leading Big Ten university renowned worldwide for its research and
   academic excellence.



Customer Case Study
   Purdue University

   • Large Hadoop environment for researchers in the Statistics Department

   • No central storage infrastructure, leading to many disparate islands of
     data without consistent protection or performance

   • Small IT staff managing large amounts of data and hundreds of
     data-intensive users



Customer Case Study
   Purdue University

   • Deployed Isilon with HDFS, which plugged seamlessly into their Hadoop
     environment

   • Created a single, shared storage resource for data computing and analytics

   • Delivered a highly reliable and flexible storage infrastructure that
     protected data from loss or corruption

   • Eliminated need to migrate data between storage silos, delivering
     immediate accessibility and significantly higher performance




Customer Case Study
   Purdue University

   “We tested EMC Isilon with Hadoop in our statistics department, which must often
   analyze huge data sets. EMC Isilon's multi-protocol capabilities provided fast and
   reliable delivery of data to our statisticians, demonstrating the potential to
   increase the time spent on actually doing the science, while reducing
   management costs.”

   Alex Younts,
   Purdue University



Customer Case Study
   Global Shipping & Transportation Co.

   Sections: Background, Challenge, Solution

   Leading Global Shipping and Transportation company.



Customer Case Study
   Global Shipping & Transportation Co.

   • Large amounts of data in different formats from various business units.
     Focused on an e-commerce self-service site with semi-structured (XML) and
     unstructured log data

   • Looking to optimize their current ways of analyzing this data regardless of format

   • They wanted to understand what devices were accessing their self-service site in
     order to measure usage patterns and enhance the user experience on their
     e-commerce site




Customer Case Study
   Global Shipping & Transportation Co.

   • Using Isilon with HDFS as the central storage for their Hadoop environment,
     they eliminated any ETL steps, as data could simply be copied over standard
     protocols

   • Created a single, shared storage resource for analytics, supporting queries
     across their entire data set regardless of whether the data is structured,
     semi-structured or unstructured

   • Delivered a highly reliable and flexible storage infrastructure that enabled
     mechanisms such as backup and archive to be part of their analytics workflow




Questions?




Thank You!




Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

Hadoop Analytics + Enterprise Class Storage: One-Stop Solution From EMC for High Impact Business Insight

  • 1.
    EMC Isilon BigData Storage and Hadoop Analytics Jemish Patel © Copyright 2012 EMC Corporation. All rights reserved. 1
  • 2.
    Today’s Agenda •The Big Data Opportunity • Big Data Analytics with Hadoop • Technology Challenges of Hadoop • EMC’s Hadoop Solutions for the Enterprise • Q+A © Copyright 2012 EMC Corporation. All rights reserved. 2
  • 3.
    The Big Data Opportunity © Copyright 2012 EMC Corporation. All rights reserved. 3
  • 4.
    !!! !!! “Big Data Is Less About Size, And More About Freedom” ―Techcrunch !!! !!! !!! “Findings: ‘Big Data’ Is More Extreme Than Volume” “Big Data! It’s Real, It’s ― Gartner Real-time, and It’s Already Changing Your World” “Total data: ―IDC !!! ‘bigger’ than big data” !!! ― 451 Group !!! © Copyright 2012 EMC Corporation. All rights reserved. 4
  • 5.
    !!! !!! “Big Data Is Less About Size, And More About Freedom” ―Techcrunch THE ERA OF !!! !!! BIG DATA “Findings: ‘Big Data’ Is !!! More Extreme Than Volume” “Big Data! It’s Real, It’s ― Gartner Real-time, and It’s Already Changing Your IS HERE World” “Total data: ―IDC !!! !!! ‘bigger’ than big data” !!! ― 451 Group © Copyright 2012 EMC Corporation. All rights reserved. 5
  • 6.
    BIG DATA IS TRANSFORMING BUSINESS © Copyright 2012 EMC Corporation. All rights reserved. 6
  • 7.
    Big Data inAction • Healthcare – Leverage historical data to discover better treatments • Financial Services – Data-driven banking stress tests & risk analysis • Utilities – Machine-learning to predict service outages & prevent energy theft © Copyright 2012 EMC Corporation. All rights reserved. 7
  • 8.
    Hadoop & BigData © Copyright 2012 EMC Corporation. All rights reserved. 8
  • 9.
    The Promise ofBig Data Analytics  Leverage data assets to identify key trends and new business opportunities  Analyze new sources of information to gain competitive advantages  Take an agile approach to analytics that can adapt at the speed of business  Scale your storage and analysis platform to handle Big Data’s volume, velocity and variety © Copyright 2012 EMC Corporation. All rights reserved. 9
  • 10.
    The Emergence ofHadoop • Created 5-6 years ago by former Yahoo! Engineer, Doug Cutting • Software platform designed to analyze massive amounts of unstructured data • Two core components: – Hadoop Distributed File System (HDFS) (storage) – MapReduce (compute) • Now a top-level Apache project backed by large, open source development community © Copyright 2012 EMC Corporation. All rights reserved. 10
  • 11.
    MapReduce •"Map" step: Themaster node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. •"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. © Copyright 2012 EMC Corporation. All rights reserved. 11
  • 12.
    MapReduce © Copyright 2012EMC Corporation. All rights reserved. 12
  • 13.
    Services for MapReduce •JobTracker– A master node that manages job submissions, scheduling and reprocessing in case of job failures. Jobs consist of a mapper, a reducer and a list of inputs. •TaskTracker- Each slave node in the cluster runs a TaskTracker process. The JobTracker instructs the TaskTrackers to run and monitor a task. A task consists of a map or a reduce over a piece of data. © Copyright 2012 EMC Corporation. All rights reserved. 13
  • 14.
    HDFS – Hadoop Distributed Filesystem
    • HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
    • HDFS has a permissions model for files and directories that is much like POSIX.
  • 15.
    Services for HDFS
    • Namenode – Manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.
    • Datanode – Workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
    • Secondary Namenode – Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine.
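The relationship between the namespace image and the edit log can be sketched with plain dictionaries. The data layout here (a dict mapping paths to block ids, and edits as tuples) is an assumption for illustration and is not the on-disk format HDFS uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> block ids)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = list(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image, as the secondary namenode does.

    Returns the new image and an empty edit log; the inputs are left untouched.
    """
    merged = {path: list(blocks) for path, blocks in image.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


# Hypothetical state: one file in the image, two operations waiting in the log.
image = {"/logs/a.log": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.log", ["blk_2", "blk_3"]),
    ("delete", "/logs/a.log"),
]
new_image, new_log = checkpoint(image, edit_log)
print(new_image, new_log)
```

After a checkpoint, replaying the (now short) log on top of the new image is enough to rebuild the namespace, which is why the merge keeps restart times bounded.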
  • 16.
    Hadoop Eco-System Components
    • Pig – A high-level data-flow language and execution framework for parallel computation
    • Mahout – A scalable machine learning and data mining library
    • Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying (SQL)
    • HBase – A scalable, distributed database that supports structured data storage for large tables
    • R (RHIPE) – Combines Hadoop with the R analytics language
    [Diagram: Ecosystem layer (Pig, Mahout, Hive, HBase, R (RHIPE)) above the Core layers: MapReduce – Compute Layer (Job Scheduling / Execution) and HDFS – Storage Layer (Hadoop Distributed Filesystem)]
  • 17.
    Why Hadoop is Important
    • Pragmatic approach to analytics on a very large scale
      – Opens up new ways of gaining insights and identifying opportunities for businesses
    • Designed to address the rise of unstructured data
      – Enterprise data to grow by 650% over next 5 years
      – More than 80% of this growth will be unstructured data
  • 18.
    Evolution of the Hadoop Market
    [Diagram: adoption curve with segments Innovators / Early Adopters, Early Majority, Late Majority, Laggards; markers: Hadoop Early Adopters, Hadoop Early Majority]
  • 19.
    Evolution of the Hadoop Market
    HADOOP PROFILE (TO DATE)
    • Pioneers and academics
    • Application Architect
    • Visionary
    • Open source / community driven
    • Build-your-own server, application & storage infrastructure
    • Commodity components
    • Web 2.0
    • Universities
    • Life Sciences
    [Diagram markers: Hadoop Early Adopters, Hadoop Early Majority]
  • 20.
    Evolution of the Hadoop Market
    HADOOP PROFILE (TO DATE)
    • Pioneers and academics
    • Application Architect
    • Visionary
    • Open source / community driven
    • Build-your-own server, application & storage infrastructure
    • Commodity components
    • Web 2.0
    • Universities
    • Life Sciences
    HADOOP PROFILE (EMERGING)
    • IT Manager & CIO
    • Data Scientist
    • Line-of-business
    • Commercial distribution
    • Turnkey solution
    • End-to-End Data protection
    • Fortune 1000
    • Financial Services
    • Retail
    [Diagram markers: Hadoop Early Adopters, Hadoop Early Majority]
  • 21.
    Technology Challenges of Hadoop
  • 22.
    Hadoop Architecture
    1. Data is ingested into the Hadoop File System (HDFS)
    2. Computation occurs inside Hadoop (MapReduce)
    3. Results are exported from HDFS for use
    [Diagram: one Hadoop Name Node and six Hadoop Data Nodes connected over Ethernet]
  • 23.
    Writing Data into Hadoop
  • 24.
    Reading Data from HDFS
  • 25.
    Technology Challenges of Hadoop
    1 Dedicated Storage Infrastructure – One-off for Hadoop only
    2 Single Point of Failure – Namenode
    3 Lacking Enterprise Data Protection – No snapshots, replication, backup
    4 Poor Storage Efficiency – 3X mirroring
    5 Fixed Scalability – Rigid compute to storage ratio
    6 Manual Import/Export – No protocol support
    [Diagram: Hadoop DAS Environment with Namenode]
  • 26.
    Technology Challenges of Hadoop
    1 Dedicated Storage Infrastructure – One-off for Hadoop only
    2 Single Point of Failure – Namenode
    3 Lacking Enterprise Data Protection – No snapshots, replication, backup
    4 Poor Storage Efficiency – 3X mirroring
    5 Fixed Scalability – Rigid compute to storage ratio
    6 Manual Import/Export – No protocol support
    [Diagram: Hadoop DAS Environment with Namenode; blocks on the data nodes labeled 1x, 2x, 3x]
  • 27.
    EMC Addresses the Hadoop Challenge
    1 Dedicated Storage Infrastructure – One-off for Hadoop only
      → Scale-Out Storage Platform – Multiple applications & workflows
    2 Single Point of Failure – Namenode
      → No Single Point of Failure – Distributed Namenode
    3 Lacking Enterprise Data Protection – No snapshots, replication, backup
      → End-to-End Data Protection – SnapshotIQ, SyncIQ, NDMP Backup
    4 Poor Storage Efficiency – 3X mirroring
      → Industry-Leading Storage Efficiency – >80% Storage Utilization
    5 Fixed Scalability – Rigid compute to storage ratio
      → Independent Scalability – Add compute & storage separately
    6 Manual Import/Export – No protocol support
      → Multi-Protocol – Industry standard protocols: NFS, CIFS, FTP, HTTP, HDFS
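The storage-efficiency comparison in item 4 comes down to simple arithmetic, sketched below. The 300 TB raw capacity is a made-up figure, and 0.80 is used as the lower bound of the ">80%" claim on the slide; neither number comes from a measurement.

```python
def usable_with_mirroring(raw_tb, copies=3):
    """Usable capacity when every block is stored `copies` times (3X mirroring)."""
    return raw_tb / copies


def usable_at_utilization(raw_tb, utilization):
    """Usable capacity for a platform that exposes `utilization` of its raw space."""
    return raw_tb * utilization


RAW_TB = 300  # hypothetical raw capacity, chosen only for illustration

mirrored_tb = usable_with_mirroring(RAW_TB)          # about one third of raw
scale_out_tb = usable_at_utilization(RAW_TB, 0.80)   # lower bound of ">80%"

print(f"3X mirroring: {mirrored_tb:.0f} TB usable of {RAW_TB} TB raw")
print(f"80% utilization: {scale_out_tb:.0f} TB usable of {RAW_TB} TB raw")
```

Under these assumptions the same raw capacity holds roughly 2.4 times as much data at 80% utilization as it does with three full copies.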
  • 28.
    The EMC Isilon Advantage for Hadoop
    1 Scale-Out Storage Platform – Multiple applications & workflows
    2 No Single Point of Failure – Distributed Namenode
    3 End-to-End Data Protection – SnapshotIQ, SyncIQ, NDMP Backup
    4 Industry-Leading Storage Efficiency – >80% Storage Utilization
    5 Independent Scalability – Add compute & storage separately
    6 Multi-Protocol – Industry standard protocols: NFS, CIFS, FTP, HTTP, HDFS
  • 29.
    Writing into Hadoop with Isilon
    • Isilon becomes the namenode as well as the datanode
    • Provides scalability and protection of the data
    • The Hadoop cluster no longer has a single point of failure and no longer writes multiple 64 MB to 128 MB chunks of data to datanodes
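For contrast, the conventional write path that this slide says Isilon replaces can be sketched as follows: a file is cut into fixed-size blocks and each block is copied to several datanodes. The round-robin placement and the node names are simplifying assumptions; real HDFS placement is rack-aware and is not modeled here.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of this size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, datanodes, replicas=3):
    """Assign each block to `replicas` distinct datanodes (simplified round-robin)."""
    if replicas > len(datanodes):
        raise ValueError("not enough datanodes for the requested replica count")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# Hypothetical 200 MB file on a four-node cluster.
blocks = split_into_blocks(200)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
total_block_writes = sum(len(nodes) for nodes in placement.values())
print(blocks, placement, total_block_writes)
```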
  • 30.
    Reading Hadoop Data with Isilon
    • Data is read off the cluster back to the compute nodes.
    • The datanodes are now just compute nodes and are independent of the data in the Hadoop cluster.
      – The benefit is that the Hadoop hardware can be upgraded without migrating the data
  • 31.
    Industry’s First and Only Scale-Out Storage Solution with Native Hadoop Integration
    • Accelerating the Benefits of Hadoop for the Enterprise
    • Reducing Risk
    • End-to-End Data Protection
    • Organizational Knowledge/Experience
  • 32.
    EMC’s Enterprise Hadoop Solution
    EMC Greenplum HD and EMC Isilon Scale-Out Storage
    • Apache Hadoop certified by Greenplum
    • Simple platform management and control
    • Parallel analytics access with Greenplum Database
    [Diagram labels: Compute, Storage]
  • 33.
    Greenplum: Not Just About Technology
    • Data Science teams will become the driving force for success with big data analytics
    • Greenplum is committed to the future of data science
      – University data science program collaboration with Stanford and UC Berkeley
      – Community investment including the Greenplum Analytic Workbench, Community edition software, and Data Science Summits
    • Greenplum built its own Data Science practice
      – Leading PhDs with analytic tools expertise
  • 34.
    Hadoop in Action
  • 35.
    Customer Case Study
    Purdue University
    Leading Big Ten university renowned worldwide for its research and academic excellence.
    [Slide sections: Background, Challenge, Solution]
  • 36.
    Customer Case Study
    Purdue University
    • Large Hadoop environment for researchers in Statistics Department
    • No central storage infrastructure, leading to many different, disparate islands of data without consistent protection or performance
    • Small IT staff managing large amounts of data and hundreds of data-intensive users
  • 37.
    Customer Case Study
    Purdue University
    • Deployed Isilon with HDFS, which plugged seamlessly into their Hadoop environment
    • Created a single, shared storage resource for data computing and analytics
    • Delivered a highly reliable and flexible storage infrastructure that protected data from loss or corruption
    • Eliminated need to migrate data between storage silos, delivering immediate accessibility and significantly higher performance
  • 38.
    Customer Case Study
    Purdue University
    “We tested EMC Isilon with Hadoop in our statistics department, which must often analyze huge data sets. EMC Isilon's multi-protocol capabilities provided fast and reliable delivery of data to our statisticians, demonstrating the potential to increase the time spent on actually doing the science, while reducing management costs.”
    – Alex Younts, Purdue University
  • 39.
    Customer Case Study
    Global Shipping & Transportation Co.
    Leading global shipping and transportation company.
    [Slide sections: Background, Challenge, Solution]
  • 40.
    Customer Case Study
    Global Shipping & Transportation Co.
    • Large amounts of data in different formats from various business units. Focused on their E-commerce self-service site, with semi-structured (XML) and unstructured log data
    • Looking to optimize their current ways of analyzing this data regardless of format
    • They wanted to understand what devices were accessing their self-service site in order to measure usage patterns and enhance the user experience on their E-commerce site
  • 41.
    Customer Case Study
    Global Shipping & Transportation Co.
    • Using Isilon with HDFS as the central storage for their Hadoop environment, they eliminated any ETL steps, as data could simply be copied over standard protocols
    • Created a single, shared storage resource for data analytics, supporting queries across their entire data set whether the data is structured, semi-structured or unstructured
    • Delivered a highly reliable and flexible storage infrastructure that enabled mechanisms such as backup and archive to be part of their analytics workflow
  • 42.
    Questions?
  • 43.
    Thank You!
  • 44.
    Provide Feedback & Win!
    • 125 attendees will receive $100 iTunes gift cards. To enter the raffle, simply complete:
      – 5 session surveys
      – The conference survey
    • Download the EMC World Conference App to learn more: emcworld.com/app