MapR's Hadoop Distribution on Google Compute Engine
Who am I?

•  Keys Botzum
•  kbotzum@maprtech.com
•  Senior Principal Technologist, MapR Technologies
•  MapR Federal and Eastern Region

http://www.mapr.com/company/events/speaking/devfest-dc-9-28-12
MapR’s Experience with Google Compute Engine

•  Fast
   –  Virtualized public cloud
   –  Rivals on-premise physical hardware
•  Easy
   –  Provision 1,000s of servers in minutes
•  Cost effective
   –  Pay only for what you use
gcutil is your friend

•  Command-line tool that runs on your client machines to manage the
   instances in your cloud
•  Remarkably easy to use
   –  New server/instance: gcutil addinstance
   –  Connect to a server/instance: gcutil ssh
•  Can create your own custom images using Google’s tools
   –  Using custom images is as easy as addinstance --image <image name>
      (see the sketch below)
   –  MapR is creating custom images for MapR clusters
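As a concrete illustration, a minimal gcutil session might look like the
following. This is a sketch: the project, zone, instance, and image names
are made up, and the --machine_type/--zone/--image flag spellings are
assumptions based on gcutil's conventions of the time.

    # Create an instance from a custom image, then connect to it.
    # (Hypothetical names; flag spellings are assumptions.)
    gcutil --project=my-project addinstance mapr-node-1 \
        --machine_type=n1-standard-4-d \
        --zone=us-central1-a \
        --image=my-custom-mapr-image
    gcutil --project=my-project ssh mapr-node-1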
MapReduce: A Paradigm Shift
•    Distributed computing platform
      •  Large clusters
      •  Commodity hardware
•    Pioneered at Google
      •  BigTable, MapReduce and Google File System
•    Commercially available as Hadoop
MapR Technologies

•  Open, enterprise-grade distribution for Hadoop
   –  Easy, dependable and fast
   –  Open source with standards-based extensions

•  Hadoop
   –  Big data analytics
   –  Inspired by the MapReduce paper published by Google scientists
      Jeffrey Dean and Sanjay Ghemawat in 2004

•  MapR is recognized as a technology leader

•  MapR Hadoop Cloud Service now available on Google Compute Engine
MapR Partners
MapR’s Complete Distribution for Apache Hadoop

•  Integrated, tested, hardened and supported
•  Integrated with Accumulo
•  Runs on commodity hardware
•  Open source with standards-based extensions for:
   –  Security
   –  File-based access
   –  Most SQL-based access
   –  Easiest integration
•  High availability
•  Best performance

[Architecture diagram: the MapR Control System (Heatmap™, LDAP/NIS
integration, quotas, CLI and REST API, alerts and alarms) sits above
the Hadoop ecosystem components (Hive, Pig, Oozie, Sqoop, HBase, Whirr,
Accumulo, Mahout, Cascading, Nagios and Ganglia integration, Flume,
ZooKeeper), all running on MapR’s Storage Services™: Direct Access NFS,
real-time streaming, volumes, mirrors, snapshots, data placement, a
no-NameNode architecture, high-performance direct shuffle, and stateful
failover and self healing.]
Overview of Starting a Cluster

•  Google’s gcutil is your friend
   –  Very easy tool for spinning up instances
•  MapR is creating a tool and infrastructure to spin up a fully functional
   MapR cluster composed of many nodes (a representative session is
   sketched below):
   –  ./mapr-start-cluster.sh --machine-type <…> -masters <#> -slaves <#>
   –  …wait a few minutes
   –  gcutil ssh <node running admin server> and set the admin password
   –  gcutil listinstances (to find your cluster’s IP addresses)
   –  …use the cluster, it’s fully functional
   –  ./mapr-stop-cluster.sh
   –  …billing for the cluster stops

* Note that this is not the final interface, but rather is representative
of what will be released. Some details omitted for clarity.
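Putting those steps together, a representative end-to-end session might
look like this (again, not the final interface; the admin-node name is
hypothetical):

    # Spin up a MapR cluster, use it, then shut it down.
    ./mapr-start-cluster.sh --machine-type n1-standard-4-d -masters 3 -slaves 12
    gcutil listinstances             # find the cluster's IP addresses
    gcutil ssh mapr-admin-node       # hypothetical name; set the admin password
    # ...run jobs against the fully functional cluster...
    ./mapr-stop-cluster.sh           # billing for the cluster stops here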
Demo

Let’s run a large sort: TeraSort on a 1250-node MapR Hadoop cluster on
Google Compute Engine (10 billion records, 1 TB of data).
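For reference, a TeraSort run of this shape is normally driven with the
stock Hadoop examples jar. A sketch, assuming the standard teragen and
terasort entry points (the jar name and the paths vary by distribution
and are illustrative here):

    # 10 billion 100-byte records = 1 TB of input.
    hadoop jar hadoop-examples.jar teragen 10000000000 /benchmarks/tera-in
    hadoop jar hadoop-examples.jar terasort /benchmarks/tera-in /benchmarks/tera-out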
How does this Compare to TeraSort Records?

              MapR on Google      Record on physical
              Compute Engine      hardware
  Hardware    Virtual/Cloud       Physical
  Cores       5,024               11,680
  Disks       1,256               5,840
  Servers     1,256               1,460
  Time        1:20 (min:sec)      1:02 (min:sec)
Deployment Comparison

Current record: 1460 physical servers
  –  Prepare datacenter
  –  Rack and stack servers
  –  Maintain hardware
  Time to deploy: months

MapR on Google Compute Engine: 1256 instances
  –  Invoke gcutil command
  Time to deploy: minutes
Cost Comparison

Current record: 1460 1U servers x $4K/server = $5,840,000

MapR on Google Compute Engine: 1256 n1-standard-4-d instances x
$0.58/instance-hour x 80 seconds = $16 ($728/hour)
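Working the numbers: 1256 instances x $0.58/instance-hour is about $728
per hour, and an 80-second run is 80/3600 ≈ 0.022 hours, so the sort
costs roughly 0.022 x $728 ≈ $16.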
Easy Management at Scale

•  Health monitoring
•  Cluster administration
•  Application resource provisioning
Direct Access NFS™

[Diagram: file browsers access cluster files directly via “drag & drop”;
standard Linux commands and tools (grep, sed, sort, tar) operate on them
directly; and applications log directly to the cluster, with random read
and random write support throughout.]
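To make “access directly” concrete: a client mounts the cluster like any
NFS export and points ordinary tools at it. A sketch, with the host name,
cluster name, and file paths made up (MapR exposes cluster files under
/mapr/<cluster name> on the export):

    # Mount the cluster's NFS export and use standard Linux tools on it.
    sudo mount -t nfs mapr-node-1:/mapr /mapr
    grep ERROR /mapr/my.cluster.com/logs/app.log | sort | uniq -c
    tar czf /mapr/my.cluster.com/backups/logs.tgz /var/log/myapp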
Multi-tenancy

§  Consider a large cluster with lots of storage and numerous jobs
   supporting multiple organizations
§  Volumes (see the CLI sketch below)
   §  Control storage usage
      §  quotas on volumes
      §  quotas on cluster storage by user or group
   §  Control data placement
      §  ensure that data is stored in the locations you want
   §  Control mirroring and snapshotting
§  Job management
   §  Control where jobs run
      §  ensure that jobs run where you want
   §  Historical view of metrics collected from jobs
      §  ease troubleshooting of job issues
§  Security/Protection
   §  Fine-grained permissions on volume and cluster management,
      including delegation
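A hedged sketch of the volume controls above, using MapR’s maprcli
(option names follow the MapR documentation of this era; the volume name,
path, user, and quotas are illustrative):

    # Create a volume for one organization with its own storage quota.
    maprcli volume create -name project-a -path /projects/a -quota 500G
    # Cap one user's total storage across the cluster (type 0 = user).
    maprcli entity modify -name alice -type 0 -quota 100G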
MapR: Lights Out Data Center Ready

Reliable Compute
•  Automated stateful failover
•  Automated re-replication
•  Self-healing from HW and SW failures
•  Load balancing
•  Rolling upgrades
•  No lost jobs or data
•  99999’s of uptime

Dependable Storage
§  Business continuity with snapshots and mirrors
§  Recover to a point in time
§  End-to-end checksumming
§  Strong consistency
§  Built-in compression
§  Mirror across sites to meet Recovery Time Objectives
MapR Mirroring/COOP Requirements

[Diagram: a Production datacenter mirrors over the WAN to a Research
datacenter, and a Production datacenter mirrors over the WAN to the
cloud (Google Compute Engine).]

Business Continuity and Efficiency

Efficient design
§  Differential deltas are updated
§  Compressed and check-summed

Easy to manage
§  Scheduled or on-demand
§  WAN, remote seeding
§  Consistent point-in-time
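A hedged sketch of driving a mirror with maprcli (per MapR documentation
of the era, mirror volumes are created with a -source volume and a mirror
type; all names here are illustrative):

    # Create a mirror of a production volume, then sync it on demand.
    maprcli volume create -name production-mirror -path /mirrors/production \
        -source production@datacenter1 -type mirror
    maprcli volume mirror start -name production-mirror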
MapR Drives Hardware Performance

[Chart: % performance vs. Apache/CDH (0–450%) rising across hardware
tiers, from typical Hadoop commodity hardware up: 400 MB/s (<6 drives,
1 NIC, 6 cores, 24 GB DRAM); 1200 MB/s (12 x 5400 RPM drives, >1 NIC or
10GbE, 8 cores, 32 GB DRAM); 1800 MB/s (12 x 7200 RPM drives, >1 NIC or
10GbE, 12 cores, 48 GB DRAM); SSD (2 x 10GbE, 12+ cores, 64 GB DRAM).]

Why is MapR faster and more efficient?
§  No redundant layers (not a file system over a file system)
§  C/C++ vs. Java (higher performance and no garbage collection freezes)
§  Distributed metadata
§  Native compression
§  Optimized shuffle
§  Advanced cache manager
§  Port scaling (multi-NIC support) and high-speed RPC
Designed for Performance and Scale

                       MapR                 Apache/CDH
  TeraSort w/ 1x replication (no compression)
  Total                24 min 34 sec        49 min 33 sec
  Map                  9 min 54 sec         28 min 12 sec
  Shuffle              9 min 8 sec          27 min 0 sec
  TeraSort w/ 3x replication (no compression)
  Total                47 min 4 sec         73 min 42 sec
  Map                  11 min 2 sec         30 min 8 sec
  Shuffle              9 min 17 sec         28 min 40 sec
  DFSIO/local write
  Throughput/node      870 MB/s             240 MB/s
  YCSB (HBase benchmark, 50% read, 50% update)
  Throughput           33,102 ops/sec       7,904 ops/sec
  Latency (r/u)        2.9-4 ms / 0.4 ms    7-30 ms / 0-5 ms
  YCSB (HBase benchmark, 95% read, 5% update)
  Throughput           18,000 ops/sec       8,500 ops/sec
  Latency (r/u)        5.5-5.7 ms / 0.6 ms  12-30 ms / 1 ms

  HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2 TB disks, 32 GB RAM
Customer Support

•  24x7x365 “Follow-The-Sun” coverage
   –  Critical customer issues are worked on around the clock
•  Dedicated team of Hadoop engineering experts
•  Contacting MapR support
   –  Email: support@mapr.com (automatically opens a case)
   –  Phone: 1.855.669.6277
   –  Self-service options:
      §  http://answers.mapr.com/
      §  Web portal: http://mapr.com/support
Two MapR Editions – M3 and M5

M3
§  Control System
§  NFS Access
§  Performance
§  Unlimited Nodes
§  Free

M5
§  Control System
§  NFS Access
§  Performance
§  High Availability
§  Snapshots & Mirroring
§  24 x 7 Support
§  Annual Subscription

Also available through Google Compute Engine.
Try MapR on Google Compute Engine
www.mapr.com/google
Apache Drill
 Interactive Analysis of Large-Scale Datasets
Latency Matters

•    Ad-hoc analysis with interactive tools

•    Real-time dashboards

•    Event/trend detection and analysis
      •  Network intrusion analysis on the fly
      •  Fraud
      •  Failure detection and analysis
Big Data Processing

                        Batch processing    Interactive analysis      Stream processing
  Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
  Data volume           TBs to PBs          GBs to PBs                Continuous stream
  Programming model     MapReduce           Queries                   DAG
  Users                 Developers          Analysts and developers   Developers
  Google project        MapReduce           Dremel                    –
  Open source project   Hadoop MapReduce    –                         Storm and S4




          Introducing Apache Drill…
Innovations
•  MapReduce
    •    Scalable IO and compute trump per-node efficiency on today's commodity hardware
    •    With large datasets, schemas and indexes are too limiting
    •    Flexibility is more important than efficiency
    •    An easy-to-use, scalable, fault-tolerant execution framework is key for
         large clusters
•  Dremel
    •    Columnar storage provides significant performance benefits at scale
    •    Columnar storage with nesting preserves structure and can be very efficient
    •    Avoiding final record assembly as long as possible improves efficiency
    •    Optimizing for the query use case can avoid the full generality of MR and thus
         significantly reduce latency. No need to start JVMs, just push compact queries to
         running agents.
•  Apache Drill
    •  Open source project based upon Dremel’s ideas
    •  More flexibility and openness
More Reading on Apache Drill
•    MapR and Apache Drill
      •  http://www.mapr.com/drill
•    Apache Drill project page
      •  http://incubator.apache.org/projects/drill.html
•    Google’s Dremel
      •  http://research.google.com/pubs/pub36632.html
•    Google’s BigQuery
      •  https://developers.google.com/bigquery/docs/query-reference
•    MIT’s C-Store – a columnar database
      •  http://db.csail.mit.edu/projects/cstore/
•    Microsoft’s Dryad
      •  Distributed execution engine
      •  http://research.microsoft.com/en-us/projects/dryad/
•    Google’s Protobufs
      •  https://developers.google.com/protocol-buffers/docs/proto
