v7.0 – 09/07/2012




Accelerating Decisions Through
Enterprise Hadoop
Evolving Hadoop to support Enterprise Computing




                                                             Joey Jablonski
                                                             Practice Director, Analytic Services




           ©2012 DataDirect Networks. All Rights Reserved.                                       ddn.com
Agenda for The Data Challenge

►   Overview of DataDirect Networks

►   What is Storage Fusion Processing™,
                      its advantages & applications

►   Overview of Analytics

►   Introduction to Apache Hadoop

►   An overview of the DDN hScaler solution

►   Conclusion


DDN | We Accelerate Information Insight

     DDN provides a competitive advantage by maximizing your
     datacenter investment while mitigating growth challenges
     over your discovery process.
 ►   Established: 1998
 ►   Revenue: $226M (2011) – Profitable, Fast Growth
 ►   Main Office: Sunnyvale, California, USA
 ►   Employees: 600+ Worldwide
 ►   Worldwide Presence: 16 Countries
 ►   Installed Base: 1,000+ End Customers; 50+ Countries
 ►   Go To Market: Global Partners, Resellers, Direct




 World-Renowned & Award-Winning



DDN | 15 Years in HPC
  Investment In Scale & Innovation

  [Figure: timeline, 1998 to 2012. Milestones shown: DDN founded and incorporated; first HPC customer (NASA); SFA project inception; WOS project inception; largest private storage company (IDC); 500+ employees. Products shown: S2A3000, S2A6000, S2A8000, S2A9550, S2A9900, 6620, 10K, 12K. Awards.]
Storage Fusion Processing™

  [Figure: DDN’s Storage Fusion Architecture. Labeled components: Storage Media, SAS Interface, RAID Controller, Storage Server, GRIDScaler™, Network Interface (two), Compute Resource, Applications.]

      • Driving Imperatives = Improved OPEX
             Massive bandwidth and low latency to storage media
             Multi-core processors + Big DRAMs
             Virtualization / Hypervisor

DDN | Appliance Portfolio

  Appliances: GRIDScaler™, EXAScaler™

                 SFA12K-E                SFA10K-E                SFA10K-M
  Bandwidth      40GB/s                  15GB/s                  2GB/s
  Flash IOPS     1.4M                    840K                    840K
  Scales to      1680 drives             1200 drives             120 drives
  Processing     In-Storage Processing   In-Storage Processing   In-Storage Processing

  WOS6000: 4U, 60-drive system; 8 x GbE per node; 2PB/rack, 23PB/cluster; 25B objects/rack


                 Maximize Value: Best-In-Class Performance to Accelerate Applications

              Minimize OPEX: >2x More Data Center Efficient Than Competing Systems

               Minimize Overhead: Autonomous System Fault Management & Recovery

Storage Fusion Processing™
A Unique DDN Vision

Embedded Data-Intensive Applications
Within Storage Infrastructure

► Reduce complexity, infrastructure,
  administration, TCO
► Reduce infrastructure & OPEX
► Increase performance for
  latency-sensitive applications
► Success today with: File-Systems,
  iRODS, Hadoop, BWA, FASTA/SAM/BAM
► Work with your research teams to:
  • Identify application candidates
  • Port to our VMs/Hypervisor and benchmark
  • Deploy to your community

Candidate applications: Gap aligners? Molecular dynamics? Deep and wide search? Query engine?

Why Is Data Analytics So Hard?

  [Figure: two diagrams. Technical side, labeled: Hacking Skills, Math & Statistics Knowledge, Substantive Expertise, Data Science, Traditional Research. Business side, labeled: Business Acumen, Poor Communications, Curiosity, Analytics, Decisioning.]
Analytics | Looking for Actionable Data



Billions of Data Points to Consider



•   Consumer purchasing trends
•   Product perception
•   Drug Discovery
•   Genomics
•   Surveillance
•   Financial Analysis

How do I leverage Analytics?




  [Figure labels: Insight, Modify Behavior, Improved Results.]
Data Gravity
Warps the Application Space

  [Figure labels: Applications, DATA, Services.]
Today's Enterprise Picture

  [Figure labels: Empowered Users, Enabled Users, Aware Users, The Cloud.]
The Tools of the Trade

  [Figure: the Hadoop Ecosystem and Core Apache Hadoop. Data blocks 1 to 6 appear both in order (1 2 3 4 5 6) and distributed (4 3 5 / 2 6 1), next to Map and Reduce stages.]
Hadoop & HPC Compared

  [Figure: HPC and Hadoop compared on Data Locality and Inter-process Communication. In both rows the Job Input is divided into Slice 1 through Slice n. The HPC row shows data blocks 1 2 3 4 5 6; the Hadoop row shows blocks 1 2 3 4 5 6 and the distributed arrangement 4 3 5 / 2 6 1.]
Organizational Scalability
Higher is Better

  [Figure: chart with axes Adoption and Capacity, annotated “Goal for Human Costs”.]
Hadoop Cluster Lifecycle


  [Figure: lifecycle stages Deploy, Manage, Monitor, Respond, Upgrade, shown across the Software Platform and the Hardware Platform.]
Infrastructure Chargeback




  • Visibility into Trends
  • Actionable Reporting
  • Limits & Enforcement

  [Figure: Site Overview]
Analytics Services Portfolio




  Architect
  • Data Transformation
  • Data & Analytics Strategy
  • Security Strategy in Shared-Data Environments
  • DR&BC
  • Data Curation
  • Solution Sizing
  • Data Center Preparation
  • Process Integration
  • ETL Planning
  • Compliance Planning

  Deploy
  • hScaler Installation
  • hScaler Upgrade
  • Environment Integration
  • Performance Testing
  • Operational Validation
  • Factory Build

  Manage
  • Data Curation
  • hScaler Administration
  • System Tuning
  • Health Checks

  Customize
  • Data Migration
  • DR&BC
  • Application Integration
  • Data Curation
  • Application Development
  • Data Cleansing

  Support
  • Phone/Email
  • Phone Home Monitoring
  • Patches & Upgrades
  • Remote Diagnostics
Apache Hadoop
Genomics Application Examples

 ►    For efficient computing on Apache Hadoop™ MapReduce™:
      • Algorithm performance should scale with CPU count
      • The algorithm should be embarrassingly parallel
      • There should be no dependence on how the data is distributed
      • The data should be static

 ►    Example genomics applications that work well within Hadoop:
      • Crossbow. Whole genome re-sequencing & SNP genotyping (short reads)
      • Contrail. De novo assembly from short sequencing reads.
      • Myrna. Fast short-read & differential gene expression aligner (RNA-seq)
      • PeakRanger. Cloud-enabled peak caller for ChIP-seq data.
      • Quake. Quality-aware detection and sequencing error correction tool.
      • BlastReduce. High-performance short read mapping.
      • CloudBLAST. Hadoop implementation of NCBI’s Blast.
      • MrsRF. Algorithm for analyzing large evolutionary trees.
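
The MapReduce criteria on this slide (embarrassingly parallel work, no dependence on how the data is distributed, static data) can be illustrated with a minimal Python sketch. It is not taken from any of the tools listed above: the function names, the round-robin partitioning, and the sample reads are all hypothetical, and the work runs in a single process rather than on a Hadoop cluster.

```python
from collections import Counter
from functools import reduce


def map_chunk(reads):
    """Map step: count bases in one chunk, independently of every other chunk."""
    counts = Counter()
    for read in reads:
        counts.update(read)  # a string is counted character by character
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts; the merge order does not matter."""
    return left + right


def split(reads, n_chunks):
    """Deal reads round-robin into n_chunks lists (hypothetical partitioning)."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def base_counts(reads, n_chunks):
    """Run the map step on every chunk, then fold the partial results together."""
    partials = [map_chunk(chunk) for chunk in split(reads, n_chunks)]
    return reduce(reduce_counts, partials, Counter())


# Made-up sample reads, used only for illustration.
READS = ["ACGT", "GGCA", "TTAC", "ACGG", "CATG"]

if __name__ == "__main__":
    # The totals are the same however the reads are partitioned.
    print(base_counts(READS, 1) == base_counts(READS, 3))
```

Because each chunk is processed without reference to any other, the map calls could run on separate CPUs, and the result does not change with the number of chunks.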
CloudBLAST Application Example

     CloudBLAST is a MapReduce version of the commonly used
     bioinformatics application NCBI BLAST.

     [Figure: StreamInputFormat feeds sequence sets S = {s1, s2, … sk} to CPU 0 through CPU N; a Data Merger follows.]

     1. Data in StreamInputFormat is split into “960 long chunks” based on
        newlines.
     2. The data “chunks” are split into sequences, which serve as keys for
        the MapReduce job.
     3. BLAST output is written to a local file.
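
The three steps above can be sketched roughly as follows. This is an illustrative Python outline, not CloudBLAST's code: parse_fasta, chunked, fake_search, run, and the seqs_per_chunk parameter are hypothetical names, the input is assumed to be FASTA text, and fake_search stands in for the real BLAST run by reporting only each sequence's length.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA text, splitting on newlines."""
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunked(records, seqs_per_chunk):
    """Group records into lists of at most seqs_per_chunk items."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == seqs_per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def fake_search(chunk):
    """Placeholder for the per-chunk BLAST run: report each sequence's length."""
    return [(header, len(seq)) for header, seq in chunk]


def run(text, seqs_per_chunk=2):
    """Split the input, process each chunk independently, then merge the output."""
    chunks = chunked(parse_fasta(text), seqs_per_chunk)
    partial_outputs = [fake_search(chunk) for chunk in chunks]  # map
    return [hit for output in partial_outputs for hit in output]  # merge


# Made-up input, used only for illustration.
FASTA = """>s1
ACGT
ACG
>s2
GG
>s3
TTTTT
"""
```

Each chunk is handled without reference to the others, so the chunk size changes only how the work is divided, not the merged result.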

Based on work by Andréa Matsunaga, Maurício Tsugawa and José Fortes - University of Florida

How DDN can
    Accelerate Your Analytics
►   Lower Total Cost of Ownership and Improved OPEX:
    • Scale – Dynamically add capacity to match your complex workloads
    • Value – Grow storage capacity economically: Access, Solve, Archive
    • High Availability - Always running with world-class 24/7 service & support

►   Drive Innovation:
    • Performance at Scale – A homogeneous platform that performs at scale
    • Eloquent - Leverage virtualization to deliver an analytics platform that provides
      the quickest answers to your most complex questions
    • Collaboration – Centralize & share discoveries across the globe, securely

►   Deliver Experience:
    • Fifteen Years of HPC – Government Labs, DoE, and Universities trust DDN
    • The HPC community relies on DDN – 60% of the Top 500 supercomputers & growing
    • Single vendor solution - OEMs provide DDN with their datacenter solutions.



Thank you – Questions?



DataDirect Networks, Information in Motion, Silicon Storage Appliance, S2A, Storage Fusion Architecture, SFA, Storage Fusion Fabric, Web Object Scaler, WOS, EXAScaler, GRIDScaler,
       xSTREAMScaler, NAS Scaler, ReAct, ObjectAssure, In-Storage Processing and SATAssure are all trademarks of DataDirect Networks. Any unauthorized use is prohibited.

