Philly DB MapR M7 - March 2013



  1. HBase and M7 Technical Overview
     Keys Botzum, Senior Principal Technologist, MapR Technologies
     March 2013
     ©MapR Technologies
  2. Agenda
     HBase
     MapR
     M7
     Containers
  3. HBase
     A sparse, distributed, persistent, indexed, and sorted map
     OR
     A NoSQL database
     OR
     A columnar data store
  4. Key-Value Store
     § Row key
       – Binary sortable value
     § Row content key (analogous to a column)
       – Column family (string)
       – Column qualifier (binary)
       – Version/timestamp (number)
     § A row key, column family, column qualifier, and version uniquely identify a particular cell
       – A cell contains a single binary value
  5. A Row
     [Diagram: a logical row (Row Key; columns C0 ... CN holding Value1 ... ValueN) is stored as one cell per value, each keyed by (Row Key, Column Family, Column Qualifier, Version) -> Value1 ... ValueN]
  6. Not a Traditional RDBMS
     § Weakly typed and schema-less (unstructured or perhaps semi-structured)
       – Almost everything is binary
     § No constraints
       – You can put any binary value in any cell
       – You can even put incompatible types in two different instances of the same column family:column qualifier
     § Columns (qualifiers) are created implicitly
     § Different rows can have different columns
     § No transactions/no ACID
       – The only unit of atomic operation is a single row
  7. API
     § APIs for querying (get), scanning, and updating (put)
       – Operate on row key, column family, qualifier, version, and values
       – A get can be partially specified and retrieves the union of matching results
         • If only the row key is specified, all values for that row are returned (with column family and qualifier)
         • If the row key and column family are specified, all values for that row and column family are returned
       – By default only the largest version (the most recent, if versions are timestamps) is returned
       – Scanning is just a get over a range of row keys
     § Version
       – Defaults to a timestamp, but any integer is acceptable
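The get/put/scan semantics above can be sketched with a small in-memory model. This is an illustration only, not the HBase client API: `ToyTable` and its method names are hypothetical, and the scan here treats the stop row as exclusive by assumption.

```python
import time
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory model of an HBase-style table (not the real API)."""

    def __init__(self):
        # (row key, column family, column qualifier) -> {version: value}
        self._cells = defaultdict(dict)

    def put(self, row, family, qualifier, value, version=None):
        # The version defaults to a millisecond timestamp, but any integer works.
        if version is None:
            version = int(time.time() * 1000)
        self._cells[(row, family, qualifier)][version] = value

    def get(self, row, family=None, qualifier=None):
        """Partially specified get: omitted parts match everything.

        Only the largest version of each matching cell is returned.
        """
        result = {}
        for (r, f, q), versions in self._cells.items():
            if r != row:
                continue
            if family is not None and f != family:
                continue
            if qualifier is not None and q != qualifier:
                continue
            result[(f, q)] = versions[max(versions)]
        return result

    def scan(self, start_row, stop_row):
        """A scan is a get over a sorted range of row keys (stop is exclusive here)."""
        rows = sorted({r for (r, _, _) in self._cells if start_row <= r < stop_row})
        for r in rows:
            yield r, self.get(r)
```

For example, after two puts to the same cell with versions 1 and 2, `get` on that row returns only the version-2 value, and passing `family=` narrows the result to one column family.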
  8. Columnar
     § Rather than storing table rows linearly on disk, with each row as a single byte range of fixed-size fields, store the columns of a row separately
       – Very efficient storage for sparse data sets (NULL is free)
       – Compression works better on similar data
       – Fetches of only a subset of a row are very efficient (less disk IO)
       – No fixed size on column values
       – No requirement to even define columns
     § Columns are grouped together into column families
       – Basically a file on disk
       – A unit of optimization
       – In HBase, adding a column is implicit; adding a column family is explicit
  9. HBase Table Architecture
     § Tables are divided into key ranges (regions)
     § Regions are served by nodes (RegionServers)
     § Columns are divided into access groups (column families)
     [Diagram: table grid with column families CF1-CF5 across and regions R1-R4 down]
  10. Storage Model Highlights
     § Data is stored in sorted order
       – A table contains rows
       – A sequence of rows is grouped together into a region
         • A region consists of various files related to those rows and is loaded into a region server
         • Regions are stored in HDFS for high availability
       – A single region server manages multiple regions
         • Region assignment can change: load balancing, failures, etc.
     § Clients connect to tables
       – The HBase runtime transparently determines the region (based on key ranges) and contacts the appropriate region server
     § At any given time exactly one region server provides access to a region
       – Master region servers (with ZooKeeper) manage that
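The key-range lookup described above can be sketched as a binary search over sorted region start keys. This is a minimal illustration, not how the HBase client is implemented: `RegionMap`, the server names, and the convention that the first region starts at the empty key are all assumptions.

```python
from bisect import bisect_right


class RegionMap:
    """Hypothetical client-side map from row key to the server holding its region."""

    def __init__(self, regions):
        # regions: list of (start_key, server_name). Each region covers keys from
        # its start key up to (but not including) the next region's start key.
        self._regions = sorted(regions, key=lambda region: region[0])
        self._starts = [start for start, _ in self._regions]

    def locate(self, row_key):
        # Find the last region whose start key is <= row_key.
        index = bisect_right(self._starts, row_key) - 1
        return self._regions[index][1]


region_map = RegionMap([(b"", "rs1"), (b"g", "rs2"), (b"p", "rs3")])
```

Because rows are stored in sorted order, the same structure also tells a scan which consecutive regions it must visit, which is why range scans stay efficient.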
  11. What's Great About This?
     § Very scalable
     § Easy to add region servers
     § Easy to move regions around
     § Scans are efficient
       – Unlike hashing-based models
     § Access via row key is very efficient
       – Note: there are no secondary indexes
     § No schema; can store whatever you want when you want
     § Strong consistency
     § Integrated with Hadoop
       – Map-reduce on HBase is straightforward
       – HDFS/MapR-FS provides data replication
  12. Data Storage Architecture
     § Data from a region column family is stored in an HFile
       – An HFile contains row key:column qualifier:version:value entries
       – Index at the end into the data: 64KB "blocks" by default
     § Update
       – New value is written persistently to the Write Ahead Log (WAL)
       – Cached in memory
       – When memory fills, write out a new HFile
     § Read
       – Checks in memory, then all of the HFiles
       – Read data is cached in memory
     § Delete
       – Create a tombstone record (purged at major compaction)
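The update, read, and delete paths above can be sketched as follows. This is a toy model under stated assumptions, not HBase code: `ToyRegionStore` is hypothetical, "memory fills" is modelled as a fixed entry count, and each flushed file is just a sorted list searched with `bisect`.

```python
from bisect import bisect_left

TOMBSTONE = object()  # assumed marker for a deleted key


class ToyRegionStore:
    """Hypothetical model of one region: WAL + in-memory store + flushed files."""

    def __init__(self, memstore_limit=3):
        self.wal = []        # append-only log of every update, in sequence order
        self.memstore = {}   # recent updates cached in memory
        self.hfiles = []     # immutable (sorted keys, values) pairs, oldest first
        self.memstore_limit = memstore_limit

    def put(self, key, value):
        self.wal.append((key, value))   # persist to the log first
        self.memstore[key] = value      # then cache in memory
        if len(self.memstore) >= self.memstore_limit:
            self.flush()                # "memory fills": write out a new file

    def delete(self, key):
        self.put(key, TOMBSTONE)        # a delete is just a tombstone record

    def flush(self):
        if self.memstore:
            keys = sorted(self.memstore)
            self.hfiles.append((keys, [self.memstore[k] for k in keys]))
            self.memstore = {}

    def get(self, key):
        # Check memory first, then every file from newest to oldest.
        if key in self.memstore:
            return self._visible(self.memstore[key])
        for keys, values in reversed(self.hfiles):
            i = bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return self._visible(values[i])
        return None

    @staticmethod
    def _visible(value):
        return None if value is TOMBSTONE else value
```

Note that a miss in memory forces `get` to look through every flushed file, which is the read cost that grows with the number of files.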
  13. Apache HBase HFile Structure
     [Diagram callouts]
     – Each cell is an individual key + value; a row repeats the key for each column
     – 64 Kbyte blocks are compressed
     – Key-value pairs are laid out in increasing order
     – An index into the compressed blocks is created as a btree
  14. HBase Region Operation
     § Typical region size is a few GB, sometimes even 10G or 20G
     § The region server (RS) holds data in memory until full, then writes a new HFile
       – Logical view of the database is constructed by layering these files, with the latest on top
     [Diagram: HFiles stacked from newest (top) to oldest (bottom) across the key range represented by this region]
  15. HBase Read Amplification
     § When a get/scan comes in, all the files have to be examined
       – Schema-less, so where is the column?
       – Done in memory and does not change what's on disk
         • Bloom filters do not help in scans
     [Diagram: HFiles stacked from newest to oldest]
     With 7 files, a 1K-record get() potentially takes about 30 seeks, 7 block fetches and decompressions, from HDFS. Even with the index in memory, 7 seeks and 7 block fetches are required.
  16. HBase Write Amplification
     § To reduce read amplification, HBase merges the HFiles periodically
       – A process called compaction
       – Runs automatically when there are too many files
       – Usually turned off due to I/O storms, which interfere with client access
       – And kicked off manually on weekends
     Major compaction reads all files and merges them into a single HFile
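A major compaction of the kind described above can be sketched as a merge in which the newest value for each key wins and tombstones are dropped. This is a simplified illustration, not HBase's implementation: `major_compact` and the `TOMBSTONE` marker are hypothetical, and files are plain lists of (key, value) pairs.

```python
TOMBSTONE = object()  # assumed marker for a deleted key


def major_compact(hfiles):
    """Merge all files (given oldest first) into one sorted file.

    Returns (compacted entries, entries read, entries written) so the
    extra I/O caused by the merge is visible.
    """
    entries_read = 0
    latest = {}
    for hfile in hfiles:                 # oldest first, so later files win
        for key, value in hfile:
            entries_read += 1
            latest[key] = value
    compacted = [
        (key, latest[key])
        for key in sorted(latest)
        if latest[key] is not TOMBSTONE  # tombstones are purged here
    ]
    return compacted, entries_read, len(compacted)
```

Every entry in every input file is read and every surviving entry is rewritten, even if it has not changed, which is where the I/O storms come from.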
  17. HBase Server Architecture
     [Diagram: the Client does a lookup via ZooKeeper, which coordinates with the HBase Master; the Client exchanges data with the HBase Region Server, which stores HFiles and the WAL through the HDFS Server on the Linux filesystem]
  18. WAL File
     § A persistent record of every update/insert in sequence order
       – Shared by all regions on one region server
       – WAL files are periodically rolled to limit size, but older WALs are still needed
       – A WAL file is no longer needed once every region with updates in that WAL file has flushed those updates from memory to an HFile
         • Remember that more HFiles slow the read path!
     § Must be replayed as part of the recovery process, since in-memory updates are "lost"
       – This is very expensive and delays bringing a region back online
  19. What's Not So Good
     Reliability
       • Complex coordination between ZK, HDFS, HBase Master, and Region Server during region movement
       • Compactions disrupt operations
       • Very slow crash recovery because of
         • Coordination complexity
         • WAL log reading (one log/server)
     Business continuity
       • Many administrative actions require downtime
       • Not well integrated into MapR-FS mirroring and snapshot functionality
  20. What's Not So Good
     Performance
       • Very long read/write path
       • Significant read and write amplification
       • Multiple JVMs in the read/write path: GC delays!
     Manageability
       • Compactions, splits and merges must be done manually (in reality)
       • Lots of "well known" problems maintaining a reliable cluster: splitting, compactions, region assignment, etc.
       • Practical limits on the number of regions per region server and the size of regions can make it hard to fully utilize hardware
  21. Region Assignment in Apache HBase
  22. Apache HBase on MapR
     Limited data management, data protection and disaster recovery for tables.
  23. Agenda
     HBase
     MapR
     M7
     Containers
  24. MapR
     A provider of enterprise-grade Hadoop with uniquely differentiated features
  25. MapR: The Enterprise Grade Distribution
  26. One Platform for Big Data
     Broad range of applications: Recommendation Engines, Fraud Detection, Billing, Logistics, Risk Modeling, Market Segmentation, Inventory Forecasting, …
     Batch, Interactive, Real-time
     Map Reduce, File-Based Applications, SQL, Database, Search, Stream Processing, …
     99.999% HA, Data Protection, Disaster Recovery, Scalability & Performance, Enterprise Integration, Multi-tenancy
  27. Dependable: Lights Out Data Center Ready
     Reliable Compute / Dependable Storage
     § Automated stateful failover
     § Business continuity with snapshots and mirrors
     § Automated re-replication
     § Recover to a point in time
     § Self-healing from HW and SW failures
     § End-to-end check summing
     § Load balancing
     § Strong consistency
     § No lost jobs or data
     § Data safe
     § 99999's of uptime
     § Mirror across sites to meet Recovery Time Objectives
  28. Fast: World Record Performance
     Benchmark                                      MapR 2.1.1      CDH 4.1.1       MapR Speed Increase
     Terasort (1x replication, compression disabled)
       Total                                        13m 35s         26m 6s          2X
       Map                                          7m 58s          21m 8s          3X
       Reduce                                       13m 32s         23m 37s         1.8X
     DFSIO throughput/node
       Read                                         1003 MB/s       656 MB/s        1.5X
       Write                                        924 MB/s        654 MB/s        1.4X
     YCSB (50% read, 50% update)
       Throughput                                   36,584.4 op/s   12,500.5 op/s   2.9X
       Runtime                                      3.80 hr         11.11 hr        2.9X
     YCSB (95% read, 5% update)
       Throughput                                   24,704.3 op/s   10,776.4 op/s   2.3X
       Runtime                                      0.56 hr         1.29 hr         2.3X
     MinuteSort Record: 1.5 TB in 60 seconds, 2103 nodes
     Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE
  29. The Cloud Leaders Pick MapR
     Amazon EMR is the largest Hadoop provider in revenue and # of clusters
     Google chose MapR to provide Hadoop on Google Compute Engine
  30. MapR Supports Broad Set of Customers
     Global Credit Card Issuer / Leading Retailer
     § Recommendation Engine
     § Customer Behavior Analysis
     § Customer targeting
     § Fraud detection and Prevention
     § Brand Monitoring
     § Viewer Behavioral analytics
     § Global threat analytics
     § Intrusion detection & prevention
     § Recommendation Engine
     § Virus analysis
     § Forensic analysis
     § Family tree connections
     § Clickstream Analysis
     § Patient care
     § Log analysis
     § Quality profiling/field monitoring
     § HBase failure analysis
     § Fraud Detection
     § Advertising exchange
     § Monitoring and measuring
     § Channel analytics analysis and optimization online behavior
     § Customer Revenue
     § Enterprise Grade Analytics
     § Customer targeting Platform
     § ETL Offload
     § Social media analysis
     § COOP features
  31. MapR Editions
     § Control System
     § Control System
     § All the Features of M5
     § NFS Access
     § NFS Access
     § Simplified Administration for HBase
     § Performance
     § Performance
     § Unlimited Nodes
     § High Availability
     § Increased Performance
     § Free
     § Snapshots & Mirroring
     § Consistent Low Latency
     § 24 X 7 Support
     § Unified Snapshots, Mirroring
     § Annual Subscription
     Also available through: Compute Engine
  32. Agenda
     HBase
     MapR
     M7
     Containers
  33. M7
     An integrated system for unstructured and structured data
  34. Introducing MapR M7
     § An integrated system
       – Unified namespace for files and tables
       – Built-in data management & protection
       – No extra administration
     § Architected for reliability and performance
       – Fewer layers
       – Single hop to data
       – No compactions, low I/O amplification
       – Seamless splits, automatic merges
       – Instant recovery
  35. Binary Compatible with HBase APIs
     § HBase applications work "as is" with M7
       – No need to recompile (binary compatible)
     § Can run M7 and HBase side-by-side on the same cluster
       – e.g., during a migration
       – Can access both an M7 table and an HBase table in the same program
     § Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice-versa
       % hbase org.apache.hadoop.hbase.mapreduce.CopyTable -- oldtable
  36. M7: Remove Layers, Simplify
     Take note! No JVM!
     [Diagram: MapR M7]
  37. M7: No Master and No RegionServers
     No JVM problems
     One hop to data
     Unified cache
     No extra daemons to manage
  38. Region Assignment in Apache HBase
     None of this complexity is present in MapR M7
  39. Unified Namespace for Files and Tables
     $ pwd
     /mapr/default/user/dave

     $ ls
     file1  file2  table1  table2

     $ hbase shell
     hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3'
     0 row(s) in 0.1570 seconds

     $ ls
     file1  file2  table1  table2  table3

     $ hadoop fs -ls /user/dave
     Found 5 items
     -rw-r--r--   3 mapr mapr         16 2012-09-28 08:34 /user/dave/file1
     -rw-r--r--   3 mapr mapr         22 2012-09-28 08:34 /user/dave/file2
     trwxr-xr-x   3 mapr mapr          2 2012-09-28 08:32 /user/dave/table1
     trwxr-xr-x   3 mapr mapr          2 2012-09-28 08:33 /user/dave/table2
     trwxr-xr-x   3 mapr mapr          2 2012-09-28 08:38 /user/dave/table3
  40. Tables for End Users
     § Users can create and manage their own tables
       – Unlimited # of tables
     § Tables can be created in any directory
       – Tables count towards volume and user quotas
     § No admin intervention needed
       – I can create a file or a directory without opening a ticket with the admin team, so why not a table?
       – Do stuff on the fly, with no stopping or restarting of servers
     § Automatic data protection and disaster recovery
       – Users can recover from snapshots/mirrors on their own
  41. M7: An Integrated System
  42. M7
     Comparative Analysis with Apache HBase, Level-DB and a BTree
  43. HBase Write Amplification Analysis
     § Assume 10G per region, write 10% per day, grow 10% per week
       – 1G of writes per day
       – After 7 days: 7 files of 1G and 1 file of 10G (only 1G is growth)
     § IO cost
       – Wrote 7G to WAL + 7G to HFiles
       – Compaction adds still more
         • read: 17G (= 7 x 1G + 1 x 10G)
         • write: 11G to the new HFile
       – WAF: wrote 7G "for real", but actual disk IO after compaction is read 17G + write 25G, and that's assuming no application reads!
     § IO cost of 1000 regions is similar to the above
       – read 17T, write 25T => major impact on node
     § Best practice: limit # of regions/node -> can't fully utilize storage
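The arithmetic on this slide can be reproduced in a few lines. The function and parameter names below are hypothetical; the numbers are the slide's own assumptions (10G region, 1G written per day, 1G net growth per week).

```python
def weekly_io_gb(region_gb=10, daily_write_gb=1, weekly_growth_gb=1, days=7):
    """Disk I/O for one region over a week, ending with one major compaction."""
    useful_write = daily_write_gb * days             # what the application wrote
    wal_write = daily_write_gb * days                # every update goes to the WAL
    flush_write = daily_write_gb * days              # and again to a flushed HFile
    compaction_read = flush_write + region_gb        # read all small files + the big one
    compaction_write = region_gb + weekly_growth_gb  # rewrite the merged file
    return {
        "useful_write": useful_write,
        "read": compaction_read,
        "write": wal_write + flush_write + compaction_write,
    }


io = weekly_io_gb()
```

With the default arguments this gives 7G of useful writes against 17G read and 25G written, matching the slide; multiplying by 1000 regions gives the 17T/25T figures.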
  44. Alternative: Level-DB
     § Tiered, logarithmic increase
       – L1: 2 x 1M files
       – L2: 10 x 1M
       – L3: 100 x 1M
       – L4: 1,000 x 1M, etc.
     § Compaction overhead
       – Avoids IO storms (I/O is done in smaller increments of ~10M)
       – But uses significantly more bandwidth than HBase
     § Read overhead is still high
       – 10-15 seeks, perhaps more if the lowest level is very large
       – 40K - 60K read from disk to retrieve a 1K record
  45. BTree Analysis
     § Read finds data directly, proven to be fastest
       – Interior nodes only hold keys
       – Very large branching factor
       – Values only at leaves
       – Thus index caches work
       – R = logN seeks, if no caching
       – A 1K record read will transfer about logN blocks from disk
     § Writes are slow on inserts
       – Inserted into the correct place right away
       – Otherwise a read will not find it
       – Requires the btree to be continuously rebalanced
       – Causes extreme random I/O in the insert path
       – W = 2.5x + logN seeks if no caching
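To make "R = logN seeks" concrete, the sketch below counts the levels a lookup descends when the logarithm is taken to the base of the branching factor. The function name and the example branching factors are assumptions for illustration; the slide only says the branching factor is "very large".

```python
def btree_seeks(num_keys, branching_factor):
    """Levels to descend to reach one of num_keys entries, with no caching.

    Equivalent to ceil(log_b(N)), computed with integers to avoid
    floating-point rounding at exact powers.
    """
    depth = 0
    reachable = 1
    while reachable < num_keys:
        reachable *= branching_factor
        depth += 1
    return depth


# With an assumed branching factor of 1000, a billion keys need 3 levels.
seeks = btree_seeks(10**9, 1000)
```

A large branching factor keeps the tree shallow, which is why caching the few interior levels leaves roughly one disk seek per read.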
  46. Log-Structured Merge Trees
     § LSM trees reduce insert cost by deferring and batching index changes
       – If you don't compact often, read performance is impacted
       – If you compact too often, write performance is impacted
     § B-Trees are great for reads
       – But expensive to update in real time
     Can we combine both ideas?
     Writes cannot be done better than W = 2.5x:
     write to log + write data to somewhere + update meta-data
     [Diagram: write path and read path across Memory (Index) and Disk (Log, Index)]
  47. M7 from MapR
     § Twisting BTrees
       – Leaves are variable size (8K - 8M or larger)
       – Can stay unbalanced for long periods of time
         • More inserts will balance it eventually
         • Automatically throttles updates to interior btree nodes
       – M7 inserts "close to" where the data is supposed to go
     § Reads
       – Uses the BTree structure to get "close" very fast
         • Very high branching with key-prefix compression
       – Utilizes a separate lower-level index to find it exactly
         • Updated "in-place": bloom filters for gets, range-maps for scans
     § Overhead
       – A 1K record read will transfer about 32K from disk in logN seeks
  48. M7 Provides Instant Recovery
     § Instead of having one WAL per region server, or even one per region, we have many micro-WALs per region
     § 0-40 microWALs per region
       – Idle WALs are "compacted", so most are empty
       – Region is up before all microWALs are recovered
       – Recovers the region in the background, in parallel
       – When a key is accessed, that microWAL is recovered inline
       – 1000-10000x faster recovery
     § Never performs the equivalent of an HBase major or minor compaction
     § Why doesn't HBase do this? M7 uses MapR-FS, not HDFS
       – No limit to # of files on disk
       – No limit to # of open files
       – I/O path translates random writes to sequential writes on disk
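The idea of bringing a region up before its logs are replayed can be sketched as below. This is a toy model, not M7's implementation: `ToyRegion`, the fixed count of four micro-WALs, and routing keys to a micro-WAL by CRC32 are all assumptions made for illustration; the slides do not say how keys map to microWALs.

```python
import zlib

NUM_MICRO_WALS = 4  # assumed; the slide says 0-40 per region


def wal_index(key: bytes) -> int:
    """Hypothetical routing of a key to one micro-WAL (deterministic hash)."""
    return zlib.crc32(key) % NUM_MICRO_WALS


class ToyRegion:
    """Region that serves reads immediately and replays micro-WALs lazily."""

    def __init__(self, micro_wals):
        self.micro_wals = micro_wals  # per micro-WAL list of (key, value) records
        # Empty (idle) micro-WALs need no recovery at all.
        self.pending = {i for i, wal in enumerate(micro_wals) if wal}
        self.data = {}

    def _recover(self, index):
        if index in self.pending:
            for key, value in self.micro_wals[index]:
                self.data[key] = value  # replay in log order; later records win
            self.pending.discard(index)

    def get(self, key):
        # Inline recovery: replay only the micro-WAL this key belongs to.
        self._recover(wal_index(key))
        return self.data.get(key)

    def recover_in_background(self):
        # Stand-in for the parallel background recovery of the remaining logs.
        for index in list(self.pending):
            self._recover(index)
```

The point of the sketch is that the cost of the first read after a crash is bounded by one small log rather than by one large shared log.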
  49. Summary
                             1K record-read amplification   Compaction                        Recovery
     HBase with 7 hfiles     30 seeks, 130K xfer            IO storms, good bandwidth         Huge WAL to recover
     HBase with 3 hfiles     15 seeks, 70K xfer             IO storms, high bandwidth         Huge WAL to recover
     LevelDB with 5 levels   13 seeks, 48K xfer             No I/O storms, very high b/w      WAL is tiny
     BTree                   logN seeks, logN xfer          No I/O storms, but 100% random    WAL is proportional to concurrency + cache
     MapR M7                 logN seeks, 32K xfer           No I/O storms, low bandwidth      microWALs allow recovery < 100ms
  50. M7: Fileservers Serve Regions
     § Region lives entirely inside a container
       – Does not coordinate through ZooKeeper
     § Containers support distributed transactions
       – With replication built in
     § Only coordination in the system is for splits
       – Between region-map and data-container
       – Already solved this problem for files and their chunks
  51. Agenda
     HBase
     MapR
     M7
     Containers
  52. What's a MapR container?
  53. MapR's Containers
     Files/directories are sharded into blocks, and placed in containers on disks
     Each container contains
       – Directories & files
       – Data blocks
       – BTrees
       – 100% random writes
     Containers are ~32 GB segments of disk, placed on nodes
     Patent Pending
  54. M7 Containers
     § Container holds many files
       – regular, dir, symlink, btree, chunk-map, region-map, …
       – all random-write capable
     § Container is replicated to servers
       – unit of resynchronization
     § Region lives entirely inside 1 container
       – all files + WALs + btrees + bloom-filters + range-maps
  55. Read-write Replication
     § Writes are synchronous
       – All copies have the same data
     § Data is replicated in a "chain" fashion
       – Better bandwidth; utilizes full-duplex network links well
     § Meta-data is replicated in a "star" manner
       – Better response time; bandwidth is not a concern
       – Data can also be done this way
     [Diagram: client1, client2, ..., clientN writing to replicated servers]
  56. Random Writing in MapR
     [Diagram: a client writing data asks the CLDB for a 64M block; the CLDB creates a container and picks a master and 2 replica slaves; the client attaches and writes the next chunk to S2. Servers S1-S5 hold containers with replica sets (S1, S2, S4), (S1, S3), (S1, S4, S5), (S2, S4, S5), (S2, S3, S5).]
  57. Container Balancing
     • Servers keep a bunch of containers "ready to go".
     • Writes get distributed around the cluster.
       – As data size increases, writes spread more, like dropping a pebble in a pond
       – Larger pebbles spread the ripples farther
       – Space is balanced by moving idle containers
  58. Failure Handling
     Containers are managed at the Container Location DataBase (CLDB) via heartbeats (HB) and container reports.
       – HB loss + upstream entity reports failure => server dead
       – Increment epoch at CLDB
       – Rearrange replication path
       – Exact same code for files and M7 tables
       – No ZK
  59. Architectural Params
     § Unit of I/O
       – 4K/8K (8K in MapR)
     § Unit of Chunking (a map-reduce split)
       – 10-100s of megabytes
     § Unit of Resync (a replica)
       – 10-100s of gigabytes
       – container in MapR
     § Unit of Administration (snap, repl, mirror, quota, backup)
       – 1 gigabyte - 1000s of terabytes
       – volume in MapR
       – what data is affected by my missing blocks?
     [Diagram: size scale 10^3 (i/o), 10^6 (map-red), 10^9 (resync), admin; labelled "HDFS block"]
  60. Other M7 Features
     § Smaller disk footprint
       – M7 never repeats the key or column name
     § Columnar layout
       – M7 supports 64 column families
       – In-memory column families
     § Online admin
       – M7 schema changes on the fly
       – delete/rename/redistribute tables
  61. Thank you!
     Questions?
  62. Examples: Reliability Issues
     § Compactions disrupt HBase operations: I/O bursts overwhelm nodes (http://
     § Very slow crash recovery: a RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)
     § Unreliable splitting: region splitting may cause data to be inconsistent and unavailable. (http://-experience-with-hbase-dynamic.html)
     § No client throttling: an HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
  63. Examples: Business Continuity Issues
     § No Snapshots: MapR provides all-or-nothing snapshots for HBase. The WALs are shared among tables, so single-table and selective multi-table snapshots are not possible. (HDFS-2802, HDFS-3370, HBASE-50, HBASE-6055)
     § Complex Backup Process: complex, unreliable and inefficient. (http://-disaster-recovery-and-whisky.html)
     § Administration Requires Downtime: the entire cluster must be taken down in order to merge regions. Tables must be disabled to change schema, replication and other properties. (HBASE-420, HBASE-1621, HBASE-5504, HBASE-5335, HBASE-3909)
  64. Examples: Performance Issues
     § Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)
     § Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from the local drives. (HBASE-4755, HBASE-4491)
     § Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1TB of HBase data, wasting disk space. (http://, http://-dos-and-donts/)
     § Limited # of tables: a single cluster can only handle several tens of tables effectively. (http://
  65. Examples: Manageability Issues
     § Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to trigger compactions manually. (http://
     § Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://-experience-with-hbase-dynamic.html)
     § Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.
     § Basic administration is complex: renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)