Google Compute and MapR



  1. MapR's Hadoop Distribution on Google Compute Engine
  2. Who am I?
     speaking/devfest-dc-9-28-12
     •  Keys Botzum
     •  Senior Principal Technologist, MapR Technologies
     •  MapR Federal and Eastern Region
  3. MapR's Experience with Google Compute Engine
     •  Fast
        –  Virtualized public cloud
        –  Rivals on-premise physical
     •  Easy
        –  Provision 1,000s of servers in minutes
     •  Cost effective
        –  Pay only for what you use
  4. gcutil is your friend
     •  Command-line tool that runs on your client machine to manage the instances in your cloud
     •  Remarkably easy to use
        –  New server/instance: gcutil addinstance
        –  Connect to a server/instance: gcutil ssh
     •  Can create your own custom images using Google's tools
        –  Using custom images is as easy as addinstance --image <image name>
        –  MapR is creating custom images for MapR clusters
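The two gcutil subcommands above can be scripted. A minimal sketch of building those command lines from Python; the wrapper functions and instance/image names are illustrative, not part of gcutil or MapR's tooling:

```python
# Hypothetical helpers that assemble gcutil argv lists for the two
# subcommands shown on the slide: addinstance and ssh.

def gcutil_add_instance(name, image=None):
    """Build the argv for creating a new instance, optionally from a custom image."""
    cmd = ["gcutil", "addinstance", name]
    if image is not None:
        cmd += ["--image", image]  # custom images are as easy as --image <name>
    return cmd

def gcutil_ssh(name):
    """Build the argv for connecting to a running instance."""
    return ["gcutil", "ssh", name]

print(gcutil_add_instance("mapr-node-1", image="mapr-m3-image"))
```

The lists could then be handed to `subprocess.run` on a machine where gcutil is installed.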
  5. MapReduce: A Paradigm Shift
     •  Distributed computing platform
        •  Large clusters
        •  Commodity hardware
     •  Pioneered at Google
        •  BigTable, MapReduce and Google File System
     •  Commercially available as Hadoop
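The programming model behind the paradigm can be shown in a few lines. A toy sketch of MapReduce in plain Python: map emits (key, value) pairs, the framework groups by key, reduce aggregates. The function names are illustrative; a real framework distributes each phase across the cluster.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply the user's map function to every input record
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle(pairs):
    # Group all values by key (the framework's job in real MapReduce)
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reduce_fn):
    # Apply the user's reduce function to each key's values
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Word count, the canonical example
lines = ["big data", "big clusters"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(pairs), lambda k, vs: sum(vs))
print(counts["big"])  # 2
```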
  6. MapR Technologies
     •  Open, enterprise-grade distribution for Hadoop
        –  Easy, dependable and fast
        –  Open source with standards-based extensions
     •  Hadoop
        –  Big data analytics
        –  Inspired by the MapReduce paper published by Google scientists Jeffrey Dean and Sanjay Ghemawat in 2004
     •  MapR is recognized as a technology leader
     •  MapR Hadoop Cloud Service now available on Google Compute Engine
  7. MapR Partners
  8. MapR's Complete Distribution for Apache Hadoop
     •  Integrated, tested, hardened and supported
     •  Integrated with Hive, Pig, Oozie, Sqoop, HBase, Whirr, Accumulo, Mahout, Cascading, Flume, ZooKeeper, Nagios and Ganglia
     •  Runs on commodity hardware
     •  Open source with standards-based extensions for:
        •  Security (LDAP, NIS integration)
        •  File-based access (Direct Access NFS)
        •  Most SQL-based access
        •  Easiest integration (CLI, REST API)
     •  High availability (no-NameNode architecture, stateful failover and self-healing)
     •  Best performance (high-performance direct shuffle)
     •  MapR Control System: Heatmap™, quotas, alerts, alarms
     •  MapR's Storage Services™: volumes, mirrors, snapshots, data placement, real-time NFS streaming
  9. Overview of Starting a Cluster
     •  Google's gcutil is your friend
        •  Very easy tool for spinning up instances
     •  MapR is creating a tool and infrastructure to spin up a fully functional MapR cluster composed of many nodes
        •  ./ -machine-type <…> -masters <#> -slaves <#>
        •  …wait a few minutes
        •  gcutil ssh <node running admin server> and set the admin password
        •  gcutil listinstances (to find your cluster's IP addresses)
        •  …use the cluster; it's fully functional
        •  ./
        •  …billing for the cluster stops
     *  Note that this is not the final interface, but rather is representative of what will be released. Some details omitted for clarity.
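The listinstances step above yields the cluster's addresses. A hypothetical helper for pulling instance names and IPs out of a gcutil listinstances-style table; the column layout used here is assumed for illustration and the real output format may differ:

```python
def parse_instance_table(text):
    """Return {instance_name: ip} from a whitespace-separated table (assumed layout)."""
    result = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            result[fields[0]] = fields[1]
    return result

# Hypothetical sample output, including example instance names
sample = """name external-ip
mapr-master-0 203.0.113.10
mapr-slave-0 203.0.113.11
"""
print(parse_instance_table(sample)["mapr-master-0"])  # 203.0.113.10
```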
 10. Demo
     Let's run a large sort
     Run TeraSort on a 1250-node MapR Hadoop cluster on Google Compute Engine (10 billion records, 1 TB of data)
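The two demo figures are consistent with each other: TeraSort's standard record size is 100 bytes, so 10 billion records is exactly 1 TB.

```python
# Sanity check on the demo numbers
records = 10_000_000_000
bytes_per_record = 100                      # standard TeraSort record size
total_tb = records * bytes_per_record / 10**12
print(total_tb)  # 1.0
```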
 11. How Does This Compare to TeraSort Records?

                 MapR on Google      Record on
                 Compute Engine      physical hardware
     Hardware    Virtual/Cloud       Physical
     Cores       5,024               11,680
     Disks       1,256               5,840
     Servers     1,256               1,460
     Time        1:20 min            1:02 min
 12. Deployment Comparison

     Current Record               MapR on Google Compute Engine
     1460 physical servers        1256 instances
     Prepare datacenter           Invoke gcutil command
     Rack and stack servers
     Maintain hardware
     Months                       Minutes
 13. Cost Comparison

     Current Record                    MapR on Google Compute Engine
     1460 1U servers x $4K/server      1256 n1-standard-4-d x $.58/instance-hour
     = $5,840,000                      x 80 seconds = $16 ($728/hour)
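The slide's figures can be reproduced directly from the quantities shown: the physical side is an up-front hardware cost, the cloud side is metered by the 80-second run.

```python
# Physical: up-front hardware purchase
servers, cost_per_server = 1460, 4000
print(servers * cost_per_server)           # 5840000

# Cloud: pay only for what you use
instances, rate = 1256, 0.58               # n1-standard-4-d, $/instance-hour
hourly = instances * rate
run_seconds = 80                           # the 1:20 sort
print(round(hourly))                       # 728  ($/hour)
print(round(hourly * run_seconds / 3600))  # 16   ($ for the run)
```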
 14. Easy Management at Scale
     •  Health Monitoring
     •  Cluster Administration
     •  Application Resource Provisioning
 15. Direct Access NFS™
     •  File browsers: "Drag & Drop"
     •  Standard Linux commands & tools: grep, sed, sort, tar
     •  Random read, random write
     •  Applications log directly
 16. Multi-tenancy
     §  Consider a large cluster with lots of storage and numerous jobs supporting multiple organizations
     §  Volumes
        §  Control storage usage
           §  quotas on volumes
           §  quotas on cluster storage by user or group
        §  Control data placement
           §  ensure that data is stored in the locations you want
        §  Control mirroring and snapshotting
     §  Job management
        §  Control where jobs run
           §  ensure that jobs run where you want
        §  Historical view of metrics collected from jobs
           §  ease troubleshooting of job issues
     §  Security/Protection
        §  Fine-grained permissions on volume and cluster management, including delegation
 17. MapR: Lights Out Data Center Ready

     Reliable Compute
     •  Automated stateful failover
     •  Automated re-replication
     •  Self-healing from HW and SW failures
     •  Load balancing
     •  Rolling upgrades
     •  No lost jobs or data
     •  99999's of uptime

     Dependable Storage
     §  Business continuity with snapshots and mirrors
     §  Recover to a point in time
     §  End-to-end checksumming
     §  Strong consistency
     §  Built-in compression
     §  Mirror across sites to meet Recovery Time Objectives
 18. MapR Mirroring/COOP Requirements
     Business continuity and efficiency (mirror Production to Research within a datacenter, or Production to Cloud/Compute Engine over the WAN)
     Efficient design
     §  Differential deltas are updated
     §  Compressed and check-summed
     Easy to manage
     §  Scheduled or on-demand
     §  WAN, remote seeding
     §  Consistent point-in-time
 19. MapR Drives Hardware Performance

     [Chart: % performance vs. Apache/CDH (0-450%) on commodity hardware, for configurations ranging from typical Hadoop hardware (<6 drives, 1 NIC, 6 cores, 24 GB DRAM, ~400 MB/s) through 12 x 5400 RPM drives (>1 NIC or 10GbE, 8 cores, 32 GB, ~1200 MB/s) and 12 x 7200 RPM drives (>1 NIC or 10GbE, 12 cores, 48 GB, ~1800 MB/s) to SSD (2 x 10GbE, 12+ cores, 64 GB DRAM)]

     Why is MapR faster and more efficient?
     §  No redundant layers (not a file system over a file system)
     §  C/C++ vs. Java (higher performance and no garbage collection freezes)
     §  Distributed metadata
     §  Native compression
     §  Optimized shuffle
     §  Advanced cache manager
     §  Port scaling (multi-NIC support) and high-speed RPC
 20. Designed for Performance and Scale

                                                   MapR              Apache/CDH
     Terasort w/ 1x replication (no compression)
       Total                                       24 min 34 sec     49 min 33 sec
       Map                                         9 min 54 sec      28 min 12 sec
       Shuffle                                     9 min 8 sec       27 min 0 sec
     Terasort w/ 3x replication (no compression)
       Total                                       47 min 4 sec      73 min 42 sec
       Map                                         11 min 2 sec      30 min 8 sec
       Shuffle                                     9 min 17 sec      28 min 40 sec
     DFSIO/local write
       Throughput/node                             870 MB/s          240 MB/s
     YCSB (HBase benchmark, 50% read, 50% update)
       Throughput                                  33,102 ops/sec    7,904 ops/sec
       Latency (r/u)                               2.9-4 ms / 0.4 ms   7-30 ms / 0-5 ms
     YCSB (HBase benchmark, 95% read, 5% update)
       Throughput                                  18K ops/sec       8,500 ops/sec
       Latency (r/u)                               5.5-5.7 ms / 0.6 ms   12-30 ms / 1 ms

     HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2TB, 32 GB
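The Terasort rows above imply roughly a 2x speedup at 1x replication, which a quick conversion to seconds confirms:

```python
# Deriving the speedup from the 1x-replication Terasort totals
def seconds(m, s):
    return 60 * m + s

mapr_total = seconds(24, 34)   # MapR total
cdh_total = seconds(49, 33)    # Apache/CDH total
print(round(cdh_total / mapr_total, 2))  # 2.02
```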
 21. Customer Support
     •  24x7x365 "Follow-The-Sun" coverage
        •  Critical customer issues are worked on around the clock
     •  Dedicated team of Hadoop engineering experts
     •  Contacting MapR support
        •  Email: (automatically opens a case)
        •  Phone: 1.855.669.6277
        •  Self-service options:
           §  Web Portal: support
 22. Two MapR Editions – M3 and M5

     M3                          M5
     §  Control System           §  Control System
     §  NFS Access               §  NFS Access
     §  Performance              §  Performance
     §  Unlimited Nodes          §  High Availability
     §  Free                     §  Snapshots & Mirroring
                                 §  24x7 Support
                                 §  Annual Subscription

     Also available through Google Compute Engine
 23. Try MapR on Google Compute
 24. Apache Drill
     Interactive Analysis of Large-Scale Datasets
 25. Latency Matters
     •  Ad-hoc analysis with interactive tools
     •  Real-time dashboards
     •  Event/trend detection and analysis
        •  Network intrusion analysis on the fly
        •  Fraud
        •  Failure detection and analysis
 26. Big Data Processing

                           Batch processing    Interactive analysis       Stream processing
     Query runtime         Minutes to hours    Milliseconds to minutes    Never-ending
     Data volume           TBs to PBs          GBs to PBs                 Continuous stream
     Programming model     MapReduce           Queries                    DAG
     Users                 Developers          Analysts and developers    Developers
     Google project        MapReduce           Dremel
     Open source project   Hadoop MapReduce                               Storm and S4

     Introducing Apache Drill…
 27. Innovations
     •  MapReduce
        •  Scalable IO and compute trumps efficiency with today's commodity hardware
        •  With large datasets, schemas and indexes are too limiting
        •  Flexibility is more important than efficiency
        •  An easy-to-use, scalable, fault-tolerant execution framework is key for large clusters
     •  Dremel
        •  Columnar storage provides significant performance benefits at scale
        •  Columnar storage with nesting preserves structure and can be very efficient
        •  Avoiding final record assembly as long as possible improves efficiency
        •  Optimizing for the query use case can avoid the full generality of MR and thus significantly reduce latency. No need to start JVMs; just push compact queries to running agents.
     •  Apache Drill
        •  Open source project based upon Dremel's ideas
        •  More flexibility and openness
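The columnar-storage idea behind Dremel and Drill can be shown in miniature. A sketch with hypothetical field names: a column-oriented layout lets a query touch only the columns it needs, instead of deserializing whole records.

```python
# Example rows in a row-oriented layout (field names are illustrative)
rows = [
    {"user": "a", "bytes": 120, "country": "US"},
    {"user": "b", "bytes": 300, "country": "DE"},
    {"user": "c", "bytes": 50,  "country": "US"},
]

# Row store -> column store: one list per field
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query like SELECT sum(bytes) reads a single contiguous column
print(sum(columns["bytes"]))  # 470
```

In a real system the per-column layout also compresses far better, since each column holds values of one type.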
 28. More Reading on Apache Drill
     •  MapR and Apache Drill
     •  Apache Drill project page
     •  Google's Dremel
     •  Google's BigQuery
     •  MIT's C-Store (a columnar database)
     •  Microsoft's Dryad (distributed execution engine)
     •  Google's Protobufs