Philly DB MapR Overview



  1. MapR's Hadoop Distribution
  2. Who am I? speaking/pdb-10-16-12
     • Keys Botzum
     • Senior Principal Technologist, MapR Technologies
     • MapR Federal and Eastern Region
  3. Agenda
     • What’s a Hadoop?
     • What’s MapR?
     • Enterprise Grade Hadoop
     • Making Hadoop More Open
  4. Hadoop in 15 minutes
  5. How to Scale? Big Data has Big Problems
     • Petabytes of data
     • MTBF on 1000s of nodes is < 1 day
     • Something is always broken
     • There are limits to scaling Big Iron
     • Sequential and random access just don’t scale
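The MTBF bullet on slide 5 follows from simple arithmetic. The sketch below assumes independent node failures at a constant rate; the 1000-day per-node MTBF is a hypothetical figure chosen for illustration, not a number from the talk.

```python
def cluster_mtbf_days(node_mtbf_days, nodes):
    """Mean time between failures anywhere in the cluster, in days.

    Assumes independent failures at a constant rate, so the cluster
    failure rate is the sum of the per-node rates.
    """
    return node_mtbf_days / nodes


def expected_failures_per_day(node_mtbf_days, nodes):
    """Average number of node failures per day across the cluster."""
    return nodes / node_mtbf_days


if __name__ == "__main__":
    # Hypothetical: each node fails about once every 1000 days.
    print(cluster_mtbf_days(1000, 2000))          # cluster-wide MTBF in days
    print(expected_failures_per_day(1000, 2000))  # failures per day
```

Under these assumptions a 2000-node cluster sees a failure about every half day, which is why "something is always broken" at this scale.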
  6. Example: Update 1% of 1TB
     • Data consists of 10^10 records, each 100 bytes
     • Task: Update 1% of these records
  7. Approach 1: Just Do It
     • Each update involves read, modify and write
        • t = 1 seek + 2 disk rotations = 20 ms
        • 1% x 10^10 x 20 ms = 2 mega-seconds = 23 days
     • Total time dominated by seek and rotation times
  8. Approach 2: The “Hard” Way
     • Copy the entire database 1GB at a time
     • Update records on the fly
        • t = 2 x 1GB / 100MB/s + 20 ms ≈ 20 s
        • 10^3 x 20 s = 20,000 s = 5.6 hours
     • 100x faster to do 100x more work!
     • Moral: Read data sequentially even if you only want 1% of it
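The arithmetic on slides 6 to 8 can be checked with a short script. This is a sketch using only the figures quoted on the slides (20 ms per random update, 100 MB/s sequential throughput, 1 GB chunks); the constant and function names are illustrative.

```python
RECORDS = 10**10                  # 10^10 records
RECORD_BYTES = 100                # 100 bytes each, 1 TB in total
UPDATED_RECORDS = RECORDS // 100  # 1% of the records
SEEK_AND_ROTATE_S = 0.020         # 1 seek + 2 disk rotations
CHUNK_BYTES = 10**9               # copy 1 GB at a time
THROUGHPUT_BPS = 100 * 10**6      # 100 MB/s sequential transfer


def random_update_seconds():
    """Approach 1: one seek-and-rotate cycle per updated record."""
    return UPDATED_RECORDS * SEEK_AND_ROTATE_S


def sequential_copy_seconds():
    """Approach 2: read and rewrite the whole data set chunk by chunk."""
    chunks = (RECORDS * RECORD_BYTES) // CHUNK_BYTES
    per_chunk = 2 * CHUNK_BYTES / THROUGHPUT_BPS + SEEK_AND_ROTATE_S
    return chunks * per_chunk


if __name__ == "__main__":
    slow = random_update_seconds()
    fast = sequential_copy_seconds()
    print(f"random updates:  {slow / 86400:.1f} days")
    print(f"sequential copy: {fast / 3600:.1f} hours")
    print(f"speedup:         {slow / fast:.0f}x")
```

The sequential pass touches 100 times more data yet finishes roughly 100 times sooner, because it pays the seek cost once per gigabyte rather than once per record.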
  9. MapReduce: A Paradigm Shift
     • Distributed computing platform
        • Large clusters
        • Commodity hardware
     • Pioneered at Google
        • BigTable, MapReduce and Google File System
     • Commercially available as Hadoop
  10. Hadoop
      • Commodity hardware – thousands of nodes
      • Handles Big Data – petabytes and more
      • Sequential file access – each spindle provides data as fast as possible
      • Sharding
         • Data distributed evenly across cluster
         • More spindles and CPUs working on different parts of data set
      • Reliability – self-healing (mostly), self-balancing
      • MapReduce
         • Parallel computing framework
         • Function shipping
            § Moves the computation to the data rather than the typical reverse
            § Takes into account sharding
         • Hides most of complexity from developers
  11. Inside Map-Reduce
      [Diagram: word count flowing through Input → Map → Shuffle and sort → Reduce → Output]
      • Input: "The time has come," the Walrus said, / "To talk of many things: / Of shoes - and ships - and sealing-wax
      • Map: the, 1; time, 1; come, 1; has, 1; come, 1; …
      • Shuffle and sort: come, [3,2,1]; has, [1,5,2]; the, [1,2,1]; time, [10,1,3]; …
      • Reduce / Output: come, 6; has, 8; the, 4; time, 14; …
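The word-count flow on slide 11 can be sketched in a few lines of single-process Python. This is an illustration of the map, shuffle and reduce phases only, not Hadoop's API; the function names are made up for the example, and a real job would run many mappers and reducers in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        word = word.strip('.,:;"\'!?-')
        if word:
            yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    text = ['"The time has come," the Walrus said,',
            '"To talk of many things:']
    for word, total in word_count(text).items():
        print(word, total)
```

Because each map call sees only one line and each reduce call sees only one key, both phases can be spread over many machines without coordination beyond the shuffle.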
  12. Agenda
      • What’s a Hadoop?
      • What’s MapR?
      • Enterprise Grade Hadoop
      • Making Hadoop More Open
  13. The MapR Distribution for Apache Hadoop
      • Commercial Hadoop Distribution
      • Open, enterprise-grade distribution
         • Primarily leveraging open source components
         • Carefully targeted enhancements to make Hadoop more open and enterprise-grade
      • Growing fast and a recognized leader
  14. MapR in the Cloud
      • Available as a service with Amazon Elastic MapReduce (EMR)
      • Available as a service with Google Compute Engine
  15. MapR Partners
  16. Agenda
      • What’s a Hadoop?
      • What’s MapR?
      • Enterprise Grade Hadoop
      • Making Hadoop More Open
  17. MapR’s Complete Distribution for Apache Hadoop
      • Integrated, tested, hardened and supported
      • Integrated with Accumulo
      • Runs on commodity hardware
      • Open source with standards-based extensions for:
         • Security
         • File-based access
         • Most SQL-based access
         • Easiest integration
      • High availability
      • Best performance
      [Diagram labels]
      • MapR Control System: MapR Heatmap™; LDAP, NIS Integration; Quotas, CLI, REST API; Alerts, Alarms
      • Components: Hive, Pig, Oozie, Sqoop, HBase, Whirr, Accumulo, Mahout, Cascading, Nagios Integration, Ganglia Integration, Flume, ZooKeeper
      • MapR’s Storage Services™: Direct Access NFS, Real-Time Streaming, Volumes, Snapshots, Mirrors, Data Placement
      • Architecture: No NameNode Architecture, High Performance Direct Shuffle, Stateful Failover and Self Healing
      • 2.7
  18. Easy Management at Scale
      • Health Monitoring
      • Cluster Administration
      • Application Resource Provisioning
      Same information and tasks available via command line and REST
  19. MapR: Lights Out Data Center Ready
      • Reliable Compute
         § Automated stateful failover
         § Automated re-replication
         § Self-healing from HW and SW failures
         § Load balancing
         § Rolling upgrades
         § No lost jobs or data
         § 99999’s of uptime
      • Dependable Storage
         § Business continuity with snapshots and mirrors
         § Recover to a point in time
         § End-to-end check summing
         § Strong consistency
         § Built in compression
         § Mirror across sites to meet Recovery Time Objectives
  20. Storage Architecture
      § How does MapR manage storage and how is this different from generic Hadoop?
  21. What is a Volume?
      § Like a sub-directory
      § related dirs/files together
      § Contains file metadata for this volume
      § Mounted to form global name-space
      § Logical unit of policy
      Volumes help you manage data
      ©MapR Technologies - Confidential
  22. Typical Volume Layout
      [Diagram: directory tree of volumes]
      • /
         • /binaries
         • /hbase
         • /projects: /build, /test
         • /users: /mjones, /jsmith
         • /var/mapr: local...
      Create lots of volumes, 100K volumes OK!
  23. Volumes Let You Manage Data
      § Replication factor
      § Quotas
      § Load balancing
      § Snapshots
      § Mirrors
      § Data placement
      § Made of containers
      § Container is sharding unit
      § 16–32 GB
  24. Storage Architecture
      § Nodes
      § Disks
      § Storage Pools
      § Containers
         – Distributed across cluster
         – 16–32 GB
      § Volumes
  25. No NameNode Architecture
      [Diagram: other distributions keep metadata on NameNodes backed by a NAS appliance, with blocks A–F spread over DataNodes; MapR distributes blocks A–F and their metadata across all nodes]
      • Other Hadoop Distributions
         § HA requires specialized hardware and/or software
         § File scalability hampered by NameNode bottleneck
         § Metadata must fit in memory
      • MapR
         § HA w/ automatic failover and re-replication
         § Up to 1T files (> 5000x advantage)
         § Higher performance
         § 100% commodity hardware
         § Metadata is persisted to disk
  26. MapR Snapshots
      [Diagram: Hadoop/HBase and NFS applications READ/WRITE through MapR Storage Services; data blocks A, B, C, C’, D; REDIRECT ON WRITE FOR SNAPSHOT; Snapshot 1, Snapshot 2, Snapshot 3]
      § Snapshots without data duplication
      § Saves space by sharing blocks
      § Lightning fast
      § Zero performance loss on writing to original
      § Scheduled, or on-demand
      § Easy recovery by user
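The redirect-on-write idea on slide 26 (blocks A, B, C, C’, D) can be illustrated with a toy model. This is a sketch of the general technique under simplifying assumptions, not MapR's implementation; the `Volume` class and its methods are hypothetical.

```python
class Volume:
    """Toy volume: a block map pointing at immutable stored blocks."""

    def __init__(self):
        self._store = {}      # block id -> bytes, shared by live data and snapshots
        self._next_id = 0
        self.block_map = {}   # logical block name -> block id (live view)
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def write(self, name, data):
        # Redirect on write: new data always lands in a fresh block,
        # so blocks referenced by a snapshot are never overwritten.
        block_id = self._next_id
        self._next_id += 1
        self._store[block_id] = data
        self.block_map[name] = block_id

    def snapshot(self, name):
        # Only the small block map is copied; no data blocks are duplicated.
        self.snapshots[name] = dict(self.block_map)

    def read(self, name, snapshot=None):
        mapping = self.snapshots[snapshot] if snapshot else self.block_map
        return self._store[mapping[name]]

    def stored_blocks(self):
        return len(self._store)


if __name__ == "__main__":
    vol = Volume()
    for block in "ABCD":
        vol.write(block, block.encode())
    vol.snapshot("snapshot1")
    vol.write("C", b"C'")                # live view now points at C'
    print(vol.read("C"))                 # current data
    print(vol.read("C", "snapshot1"))    # data as of the snapshot
    print(vol.stored_blocks())           # A, B, C, C', D
```

Unchanged blocks A, B and D are shared between the live volume and the snapshot, which is where the space saving comes from. This toy never frees old blocks; a real system would reclaim blocks that no snapshot references.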
  27. MapR Mirroring/COOP Requirements
      [Diagram labels: Production, Research, WAN, Datacenter 1, Cloud, Compute Engine]
      • Business Continuity and Efficiency
      • Efficient design
         § Differential deltas are updated
         § Compressed and check-summed
      • Easy to manage
         § Scheduled or on-demand
         § WAN, Remote Seeding
         § Consistent point-in-time
  28. Thought Questions
      • Consider a cluster with
         • Petabytes of data
         • Hundreds or thousands of jobs running each day, creating new data
         • Many users and teams all using this cluster
      • How do I back this up?
         • User “oops” protection
      • How do I replicate data from one cluster to another in support of disaster recovery?
         • Protection from power outages, floods, fire, etc.
  29. Designed for Performance and Scale
      Benchmarks, MapR vs. Apache/CDH (HW: 10 servers, 2 x 4 cores (2.4 GHz), 11 x 2TB, 32 GB)
      • Terasort w/ 1x replication (no compression)
         § Total: MapR 24 min 34 sec; Apache/CDH 49 min 33 sec
         § Map: MapR 9 min 54 sec; Apache/CDH 28 min 12 sec
         § Shuffle: MapR 9 min 8 sec; Apache/CDH 27 min 0 sec
      • Terasort w/ 3x replication (no compression)
         § Total: MapR 47 min 4 sec; Apache/CDH 73 min 42 sec
         § Map: MapR 11 min 2 sec; Apache/CDH 30 min 8 sec
         § Shuffle: MapR 9 min 17 sec; Apache/CDH 28 min 40 sec
      • DFSIO/local write
         § Throughput/node: MapR 870 MB/s; Apache/CDH 240 MB/s
      • YCSB (HBase benchmark, 50% read, 50% update)
         § Throughput: MapR 33102 ops/sec; Apache/CDH 7904 ops/sec
         § Latency (r/u): MapR 2.9-4 ms/0.4 ms; Apache/CDH 7-30 ms/0-5 ms
      • YCSB (HBase benchmark, 95% read, 5% update)
         § Throughput: MapR 18K ops/sec; Apache/CDH 8500 ops/sec
         § Latency (r/u): MapR 5.5-5.7 ms/0.6 ms; Apache/CDH 12-30 ms/1 ms
      Customer deployments
      • (first deployment, name not given in the transcript)
         § 1.4 PB user data
         § 900-1200 MapReduce jobs per day
         § 16 TB/day average IO through each server
         § 85-90% storage utilization (with snapshots)
         § Very low-end hardware (consumer drives)
      • Large Web 2.0 company
         § 6B files on a single cluster (+ 3x replication)
         § 2000 servers targeted
         § No degradation during hardware failures
         § Heavy read/write/delete workload
         § 1.7K creates/sec/node
         § Response time (write/read/delete): atomic workload 7.8/4.5/8.7 ms; mixed workload 6.6/4.9/9.1 ms
  30. Customer Support
      • 24x7x365 “Follow-The-Sun” coverage
         • Critical customer issues are worked on around the clock
      • Dedicated team of Hadoop engineering experts
      • Contacting MapR support
         • Email: (automatically opens a case)
         • Phone: 1.855.669.6277
         • Self Service options:
            § Web Portal: support
  31. Two MapR Editions – M3 and M5
      • M3
         § Control System
         § NFS Access
         § Performance
         § Unlimited Nodes
         § Free
      • M5
         § Control System
         § NFS Access
         § Performance
         § High Availability
         § Snapshots & Mirroring
         § 24 X 7 Support
         § Annual Subscription
      Also Available through: Compute Engine
  32. Agenda
      • What’s a Hadoop?
      • What’s MapR?
      • Enterprise Grade Hadoop
      • Making Hadoop More Open
  33. Not All Applications Use the Hadoop APIs
      • Applications and libraries that use files and/or SQL
         • 30 years
         • 100,000s applications
         • 10,000s libraries
         • 10s programming languages
         • These are not legacy applications, they are valuable applications
      • Applications and libraries that use the Hadoop APIs
  34. Hadoop Needs Industry-Standard Interfaces
      • Hadoop API
         • MapReduce and HBase applications
         • Mostly custom-built
      • NFS
         • File-based applications
         • Supported by most operating systems
      • ODBC
         • SQL-based tools
         • Supported by most BI applications and query builders
  35. NFS
  36. Your Data is Important
      § HDFS-based Hadoop distributions do not (cannot) properly support NFS
      § Your data is important, it drives your business – make sure you can access it
         – Why store your data in a system which cannot be accessed by 95% of the world’s applications and libraries?
  37. Direct Access NFS™
      • File Browsers: Access Directly, “Drag & Drop”
      • Standard Linux Commands & Tools: grep, sed, sort, tar
      • Applications: Random Read, Random Write, Log directly
  38. The NFS Protocol
      § RFC 1813
      § Very simple protocol
      § Random reads/writes
         – Read count bytes from offset offset of file file
         – Write buffer data to offset offset of a file file
      § HDFS does not support random writes so it cannot support NFS

          WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;
          struct WRITE3args {
              nfs_fh3     file;
              offset3     offset;
              count3      count;
              stable_how  stable;
              opaque      data<>;
          };

          READ3res NFSPROC3_READ(READ3args) = 6;
          struct READ3args {
              nfs_fh3  file;
              offset3  offset;
              count3   count;
          };
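The random read and write semantics described on slide 38 look like this from an application's point of view. This is a sketch using ordinary Python file calls on a temporary local file; on an NFS mount the operating system would translate the same calls into READ and WRITE requests carrying an offset and count. The helper names `read_at` and `write_at` are illustrative.

```python
import os
import tempfile


def write_at(path, offset, data):
    """Overwrite bytes starting at `offset` without rewriting the rest of the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        return f.write(data)


def read_at(path, offset, count):
    """Read `count` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(count)


def demo():
    """Update the middle 100-byte record of a 3-record file in place."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "records.dat")
        with open(path, "wb") as f:
            f.write(b"A" * 100 + b"B" * 100 + b"C" * 100)
        write_at(path, 100, b"X" * 100)   # random write into the middle
        return read_at(path, 95, 10), os.path.getsize(path)


if __name__ == "__main__":
    print(demo())
```

The in-place write in the middle of the file is the operation the slide says HDFS cannot offer, which is why a file system needs random-write support to sit behind an NFS interface.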
  39. Hadoop Was Designed to Support Multiple Storage Layers
      [Diagram: MapReduce sits on the Hadoop FileSystem API, which has several implementations]
      • Hadoop FileSystem API: o.a.h.fs.FileSystem Interface
      • S3: o.a.h.fs.s3native.NativeS3FileSystem
      • HDFS: o.a.h.hdfs.DistributedFileSystem
      • Local File System: o.a.h.fs.LocalFileSystem
      • FTP
      • MapR storage layer: com.mapr.fs.MapRFileSystem (also exposes an NFS interface)
  40. One NFS Gateway
      What about scalability and high availability?
  41. Multiple NFS Gateways
  42. Multiple NFS Gateways with Load Balancing
  43. Multiple NFS Gateways with NFS HA (VIPs)
  44. Customer Examples: Import/Export Data
      • Network security vendor
         • Network packet captures from switches are streamed into the cluster
         • New pattern definitions are loaded into online IPS via NFS
      • Online measurement company
         • Clickstreams from application servers are streamed into the cluster
      • SaaS company
         • Exporting a database to Hadoop over NFS
      • Ad exchange
         • Bids and transactions are streamed into the cluster
  45. Customer Examples: Productivity and Operations
      • Retailer
         • Operational scripts are easier with NFS than HDFS + MapReduce
            § chmod/chown, file system searches/greps, perl, awk, tab-complete
         • Consolidate object store with analytics
      • Credit card company
         • User and project home directories on Linux gateways
            § Local files, scripts, source code, …
            § Administrators manage quotas, snapshots/backups, …
      • Large Internet company recommendation system
         • Web servers serve MapReduce results (item relationships) directly from cluster
      • Email marketing company
         • Object store with HBase and NFS
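The "operational scripts are easier with NFS" point on slide 45 can be illustrated with an ordinary file-system search. The sketch below assumes the cluster is NFS-mounted somewhere on the local machine; the mount path in the usage note is hypothetical, and the demo runs against a temporary local directory instead. The function names are illustrative.

```python
import os
import re
import tempfile


def grep_tree(root, pattern):
    """Return (path, line_number, line) for every matching line under `root`."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "r", errors="replace") as f:
                for number, line in enumerate(f, start=1):
                    if regex.search(line):
                        hits.append((path, number, line.rstrip("\n")))
    return hits


def demo():
    """Search a small temporary tree; returns (file name, line number, line) tuples."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "logs"))
        with open(os.path.join(root, "logs", "job.log"), "w") as f:
            f.write("job started\nERROR disk full\njob finished\n")
        return [(os.path.basename(p), n, text)
                for p, n, text in grep_tree(root, r"ERROR")]


if __name__ == "__main__":
    print(demo())
```

With the cluster mounted over NFS, the same call could be pointed at a path such as `/mnt/cluster/projects` (a made-up mount point) with no Hadoop client libraries involved.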
  46. Apache Drill: Interactive Analysis of Large-Scale Datasets
  47. Latency Matters
      • Ad-hoc analysis with interactive tools
      • Real-time dashboards
      • Event/trend detection and analysis
         • Network intrusion analysis on the fly
         • Fraud
         • Failure detection and analysis
  48. Big Data Processing
      • Batch processing
         § Query runtime: Minutes to hours
         § Data volume: TBs to PBs
         § Programming model: MapReduce
         § Users: Developers
         § Google project: MapReduce
         § Open source project: Hadoop MapReduce
      • Interactive analysis
         § Query runtime: Milliseconds to minutes
         § Data volume: GBs to PBs
         § Programming model: Queries
         § Users: Analysts and developers
         § Google project: Dremel
         § Open source project: Introducing Apache Drill…
      • Stream processing
         § Query runtime: Never-ending
         § Data volume: Continuous stream
         § Programming model: DAG
         § Users: Developers
         § Open source project: Storm and S4
  49. Innovations
      • MapReduce
         • Scalable IO and compute trumps efficiency with today’s commodity hardware
         • With large datasets, schemas and indexes are too limiting
         • Flexibility is more important than efficiency
         • An easy-to-use, scalable, fault-tolerant execution framework is key for large clusters
      • Dremel
         • Columnar storage provides significant performance benefits at scale
         • Columnar storage with nesting preserves structure and can be very efficient
         • Avoiding final record assembly as long as possible improves efficiency
         • Optimizing for the query use case can avoid the full generality of MR and thus significantly reduce latency. No need to start JVMs, just push compact queries to running agents.
      • Apache Drill
         • Open source project based upon Dremel’s ideas
         • More flexibility and openness
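The columnar-storage point on slide 49 can be made concrete with a toy comparison. This sketch handles flat records only and ignores the nesting that Dremel's format supports; it is not Dremel's or Drill's actual storage layout, and the function names and sample fields are made up for illustration.

```python
def to_columns(rows):
    """Pivot a list of flat row dicts (all with the same keys) into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, field):
    """Row layout: every whole record is visited to read a single field."""
    return sum(row[field] for row in rows)


def sum_column_store(columns, field):
    """Column layout: only the requested column is scanned."""
    return sum(columns[field])


if __name__ == "__main__":
    rows = [
        {"user": "mjones", "url": "/index", "bytes": 512},
        {"user": "jsmith", "url": "/cart", "bytes": 2048},
        {"user": "mjones", "url": "/pay", "bytes": 1024},
    ]
    columns = to_columns(rows)
    print(sum_row_store(rows, "bytes"))
    print(sum_column_store(columns, "bytes"))
```

Both functions return the same total, but the column version never reads the `user` or `url` values. On disk that means far fewer bytes scanned for a query that touches one field, and no record assembly until a query actually needs whole records.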
  50. More Reading on Apache Drill
      • MapR and Apache Drill
      • Apache Drill project page
      • Google’s Dremel
      • Google’s BigQuery
      • MIT’s C-Store – a columnar database
      • Microsoft’s Dryad
         • Distributed execution engine
      • Google’s Protobufs