HBase and Accumulo | Washington DC Hadoop User Group

HBase and Accumulo are both open-source, Apache 2.0 licensed implementations of Google's BigTable infrastructure, running on Apache Hadoop.


  1. HBase and Accumulo
     Washington DC Hadoop User Group, Jan 25th, 2012
     Todd Lipcon, Software Engineer, Cloudera
     todd@cloudera.com / @tlipcon
     Copyright 2011 Cloudera Inc. All rights reserved
  2. Background – Overview
     • HBase and Accumulo are both open-source, Apache 2.0 licensed implementations of Google’s BigTable infrastructure, running on Apache Hadoop
     • Scalable, distributed storage
        • Scalable data storage at petabyte scale, storing trillions of rows distributed across hundreds or thousands of machines
        • Automatic fault tolerance and data distribution as machines crash or rejoin the cluster
        • Linear scaling of IOPS and data capacity by adding servers
     • Data model is a big sorted hierarchical map
  3. Sorted Map Datastores
     • Each row has a row key (like a Primary Key in RDBMS terms)
        • Users may query by exact row key or by range of row keys
        • Data is always stored and returned in sorted order
     • Each row has some number of columns
        • Each column has a qualifier and some piece of data. Like a Map<byte[], byte[]>
        • Different rows may have different sets of columns
        • Each cell has an associated timestamp and may retain a history of previous values
     • Columns are grouped into column families and locality groups
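To make the sorted-map model on this slide concrete, here is a minimal in-memory sketch in Python. It is not HBase or Accumulo client code: the class `SortedMapTable` and its methods `put`, `get` and `scan` are hypothetical names chosen for this illustration, and real systems keep this structure on disk and spread it across servers.

```python
import bisect


class SortedMapTable:
    """Toy table: row keys are kept sorted so range scans are cheap."""

    def __init__(self):
        self._keys = []   # sorted list of row keys (bytes)
        self._rows = {}   # row key -> {qualifier: [(timestamp, value), ...]}

    def put(self, row, qualifier, value, ts):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(qualifier, [])
        versions.append((ts, value))
        versions.sort(reverse=True)  # newest version first

    def get(self, row):
        """Exact row-key lookup: newest value of every column in the row."""
        columns = self._rows.get(row, {})
        return {q: versions[0][1] for q, versions in columns.items()}

    def scan(self, start, stop):
        """Range query over [start, stop), returned in sorted key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self.get(key)


# Usage: rows may have different columns, and a cell may have several versions.
table = SortedMapTable()
table.put(b"tlipcon", b"Hadoop", b"Committer", 2010)
table.put(b"tlipcon", b"Hadoop", b"PMC", 2011)
table.put(b"cutting", b"ASF", b"Director", 2011)
rows_in_order = [key for key, _ in table.scan(b"a", b"z")]
```

The sketch keeps every version of a cell; how many versions a real table retains is a configuration choice that the slide does not specify.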
  4. Sorted Map Datastore (logical view as “records”)

     Row key   Data
     cutting   info:  { ‘height’: ‘9ft’, ‘state’: ‘CA’ }
               roles: { ‘ASF’: ‘Director’, ‘Hadoop’: ‘Founder’ }
     tlipcon   info:  { ‘height’: ‘5ft7’, ‘state’: ‘CA’ }
               roles: { ‘Hadoop’: ‘Committer’@ts=2010, ‘Hadoop’: ‘PMC’@ts=2011, ‘Hive’: ‘Contributor’ }

     • Row key: implicit PRIMARY KEY in RDBMS terms
     • Data is all byte[] in HBase
     • Different types of data separated into different “column families”
     • Different rows may have different sets of columns (table is sparse)
     • A single cell might have different values at different timestamps
     • Useful for *-To-Many mappings
  5. Locality Groups
     • Different sets of columns may have different properties and access patterns
        • Perhaps a few columns are accessed all the time, whereas others are large and rarely needed
        • For example, a user’s metadata (1kb, accessed frequently) and their photo (1MB, cached by CDN and accessed rarely)
     • Put metadata in one locality group and photos in another
     • Locality groups stored separately on disk: access just the metadata without reading the photo
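The benefit described on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not either system's storage code: `LocalityGroupStore`, the group names and the `bytes_read` counter are all hypothetical, and each group is simply a separate dictionary standing in for a separate set of on-disk files.

```python
class LocalityGroupStore:
    """Toy store that keeps each locality group in its own container."""

    def __init__(self, column_to_group):
        self.column_to_group = dict(column_to_group)
        # One separate "file" (here: a dict) per locality group.
        self.groups = {g: {} for g in set(self.column_to_group.values())}
        self.bytes_read = 0  # crude stand-in for disk I/O

    def put(self, row, column, value):
        group = self.column_to_group[column]
        self.groups[group][(row, column)] = value

    def read(self, row, columns):
        """Read the requested columns, touching only the groups that hold them."""
        result = {}
        for group in {self.column_to_group[c] for c in columns}:
            for (r, c), value in self.groups[group].items():
                if r == row:
                    self.bytes_read += len(value)  # the row's whole group is read
                    if c in columns:
                        result[c] = value
        return result


# Usage: small metadata and a large photo live in different groups.
store = LocalityGroupStore({"name": "metadata", "photo": "photos"})
store.put("user1", "name", b"x" * 1024)             # about 1kb of metadata
store.put("user1", "photo", b"y" * (1024 * 1024))   # about 1MB photo
metadata = store.read("user1", ["name"])            # never touches the photo
```

After the `read` call, `store.bytes_read` counts only the metadata bytes, which is the point of separating the two groups.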
  6. Sorted Map Datastore (physical view as “cells”)

     info Column Family / Locality Group
     Row key   Column key     Timestamp       Cell value
     cutting   info:height    1273516197868   9ft
     cutting   info:state     1043871824184   CA
     tlipcon   info:height    1273878447049   5ft7
     tlipcon   info:state     1273616297446   CA

     roles Column Family / Locality Group
     Row key   Column key     Timestamp       Cell value
     cutting   roles:ASF      1273871823022   Director
     cutting   roles:Hadoop   1183746289103   Founder
     tlipcon   roles:Hadoop   1300062064923   PMC
     tlipcon   roles:Hadoop   1293388212294   Committer
     tlipcon   roles:Hive     1273616297446   Contributor

     • Sorted on disk by Row key, Col key, descending timestamp
     • Timestamps are milliseconds since unix epoch
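The on-disk ordering named on this slide (row key, column key, descending timestamp) can be reproduced with an ordinary sort key. The sketch below uses the roles cells from the slide; the function names `cell_order` and `latest_versions` are hypothetical and only illustrate why the newest version of a cell is always met first.

```python
# Cells from the "roles" group, deliberately out of order.
cells = [
    ("tlipcon", "roles:Hadoop", 1293388212294, "Committer"),
    ("cutting", "roles:Hadoop", 1183746289103, "Founder"),
    ("tlipcon", "roles:Hive", 1273616297446, "Contributor"),
    ("tlipcon", "roles:Hadoop", 1300062064923, "PMC"),
    ("cutting", "roles:ASF", 1273871823022, "Director"),
]


def cell_order(cell):
    """Row key ascending, column key ascending, timestamp descending."""
    row, column, timestamp, _value = cell
    return (row, column, -timestamp)


def latest_versions(sorted_cells):
    """Keep the first cell seen per (row, column), i.e. the newest one."""
    seen = set()
    for row, column, timestamp, value in sorted_cells:
        if (row, column) not in seen:
            seen.add((row, column))
            yield row, column, timestamp, value


on_disk = sorted(cells, key=cell_order)
current = list(latest_versions(on_disk))
```

Because timestamps sort in descending order, a reader that wants only current values can stop at the first entry for each column.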
  7. (image from Accumulo manual)
  8. Accumulo/HBase Terminology

     Accumulo                          HBase              Definition
     Tablet                            Region             A partition of a table (e.g. email inboxes starting with ‘a’-‘c’)
     TabletServer                      RegionServer       A server in the cluster which hosts a number of tablets/regions, providing read/write access
     Log/WAL                           HLog/WAL           Write-ahead log – used for durably logging edits
     Minor compaction                  Flush              Writing data from memory to disk
     Major compaction                  Minor compaction   Merging several on-disk files into a larger one
     Major compaction with all files   Major compaction   Merging all of the on-disk files into a larger one
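The flush and compaction rows of the terminology table can be illustrated with a short Python sketch. It is a simplified model, not code from either project: lists stand in for on-disk files, and the names `flush`, `compact` and `cell_order` are hypothetical. The sketch only merges sorted runs; real compactions may also drop deleted or expired cells.

```python
import heapq


def cell_order(cell):
    """Sort cells by row key, column key, then newest timestamp first."""
    row, column, timestamp, _value = cell
    return (row, column, -timestamp)


def flush(memtable):
    """HBase flush / Accumulo minor compaction: memory -> one sorted file."""
    return sorted(memtable, key=cell_order)


def compact(files):
    """Merge several already-sorted files into one larger sorted file."""
    return list(heapq.merge(*files, key=cell_order))


# Usage: two flushes produce two files, which a compaction merges.
file_1 = flush([("b", "info:x", 1, "old-b"), ("a", "info:x", 1, "old-a")])
file_2 = flush([("a", "info:x", 2, "new-a")])
merged = compact([file_1, file_2])
```

Passing only some of the files to `compact` corresponds to what HBase calls a minor compaction; passing all of them corresponds to a major compaction in HBase terms.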
  9. That’s all the intro we have time for…
     • Check out the excellent Accumulo manual at http://incubator.apache.org/accumulo
     • And the HBase manual at http://hbase.apache.org/book.html
     • Also some longer intro videos on Cloudera’s website, and an excellent O’Reilly book
  10. Commonalities (the non-controversial stuff)
      • Both systems scale well
         • Clusters with >1000 nodes, >1PB
         • Example HBase users: StumbleUpon, TrendMicro, Facebook, eBay, Flurry, ngmoco, Mozilla, Adobe, etc.
         • Example Accumulo users: ??????? (I don’t have clearance but I’m told they’re big and important)
      • Both systems perform well
         • Depending on tuning, one might beat the other at any given benchmark, but overall results seem comparable
      • Both open source with active development
  11. Commonalities (the non-controversial stuff)
      • Storage formats are very similar
         • Used to be the same, then diverged, then re-converged!
         • Multi-level BTrees, bloom filters, compression
         • Prefix compression currently missing in HBase, 95% complete for 0.94.0
      • Caching code very similar
         • Accumulo uses an older version of HBase’s LRUBlockCache
         • HBase has some recent improvements (off-heap cache), but I imagine Accumulo will grab them soon enough.
  12. General features
      • Both have good MapReduce integration
      • Both have a command-line shell
      • Both have a pretty good test suite
         • Accumulo used to be ahead here, but we traded off some ideas and use similar testing strategies now
      • Both use ZooKeeper for fault tolerant metadata storage, and support failover Masters
  13. Now for the fun part… BigTable shootout 2012
      • Warning: I am necessarily biased as an HBase committer.
      • I will be comparing the very latest versions
         • HBase 0.92.0 (released only 2 days ago!)
         • Accumulo 1.4 (not yet released, due out mid Feb?)
      • Please feel free to loudly disagree after the talk during the time allotted for questions – I am happy to be proven wrong! I’ll invite Aaron Cordova and John Vines up to help answer questions.
  14. Differences – Active contributors and users
      (plus various contractors thereof) (I ran out of space)
  15. Differences – User Mailing list activity
      500-600 messages per month (peak 1088)
      50-100 messages per month (peak 105) [but it’s new at Apache]
      Winner:
  16. Differences – Access Control
      • Accumulo has per-cell visibility labels as well as table ACLs
         • Each cell has an ACL of what users may see it (e.g. (TS|(SECRET&PROJECTX)))
         • Users who don’t have access can’t tell the cell even exists
         • Very useful for classified information!
      • HBase has column family ACLs but no built-in per-cell visibility support
         • Some early work to add visibility labels, but not done yet
      Winner:
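To show how a label expression such as (TS|(SECRET&PROJECTX)) can gate access to a cell, here is a small Python evaluator. It is a sketch of the idea only and is not Accumulo's implementation: the function names `tokenize` and `is_visible` are hypothetical, and letting `&` bind tighter than `|` when parentheses are omitted is an assumption of this sketch.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[()&|])")


def tokenize(expression):
    """Split a visibility expression into labels, operators and parentheses."""
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if a user holding `authorizations` may see the cell."""
    tokens = tokenize(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_atom()
            value = value and right
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return value
        if token in {")", "&", "|"}:
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations  # a plain label

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# Usage: a scan silently drops the cells the user is not allowed to see.
cells = [("row1", "(TS|(SECRET&PROJECTX))", "plan"), ("row2", "PUBLIC", "menu")]
visible = [c for c in cells if is_visible(c[1], {"PUBLIC", "SECRET"})]
```

Filtering at scan time, as in the last line, is what makes a hidden cell indistinguishable from a cell that does not exist.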
  17. Differences – Authentication
      • Accumulo has a built-in user database
         • Users are authenticated by username/password
         • Passed in plaintext over the wire
      • HBase optionally uses Kerberos
         • Central administration (e.g. via Active Directory)
         • Key-based secure credential exchange
         • Temporary delegation tokens are created for MR jobs, so even if a job’s data leaks, credentials are not compromised
         • Consistent with rest of Hadoop ecosystem
      Winner:
  18. Differences – Locality Groups
      • HBase has a 1:1 correspondence of Column Families and Locality Groups
         • Moving columns from one locality group to another after data has been inserted is impossible
      • Accumulo has a proper distinction and allows online reassignment of column-to-locality-group mappings
      Winner:
  19. Differences – extensibility frameworks
      • Accumulo has iterators
         • Allows custom processing to be inserted in the read path as well as into the table maintenance code. Provides neat features like automatic summary maintenance, for example.
      • HBase has coprocessors
         • Much more general framework that also subsumes triggers, stored procedures, and cluster management hooks (e.g. Access Control is an HBase coprocessor).
         • Generality has its cost: very difficult to do some things that are simple with iterators
         • Some iterator use cases can be done with HBase filters
      • I’ll call this one a tie
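The idea of inserting custom processing into the read path can be pictured as a stack of generators over a sorted scan. The Python sketch below is an analogy under stated assumptions, not Accumulo's iterator API: `scan`, `summing_iterator` and `apply_iterators` are hypothetical names, and the summing step stands in for the "automatic summary maintenance" mentioned on the slide.

```python
from itertools import groupby


def scan(cells):
    """Stand-in for a table scan: yields (key, value) pairs in sorted key order."""
    yield from sorted(cells, key=lambda kv: kv[0])


def summing_iterator(source):
    """Collapse consecutive entries that share a key into one summed entry."""
    for key, group in groupby(source, key=lambda kv: kv[0]):
        yield key, sum(value for _key, value in group)


def apply_iterators(source, iterators):
    """Stack each iterator on top of the previous one, lowest first."""
    for iterator in iterators:
        source = iterator(source)
    return source


# Usage: per-key counters are summed while the data is being read.
raw = [("clicks:home", 3), ("clicks:about", 1), ("clicks:home", 4)]
summary = list(apply_iterators(scan(raw), [summing_iterator]))
```

The summing step relies on the scan being sorted, since it only combines neighbouring entries. The same function could also run when files are rewritten, which is how a summary could be kept up to date without the client reading and rewriting values.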
  20. Differences – Web UI and Monitoring
      Winner:
  21. Differences – Write-ahead logging
      • HBase uses HDFS files as a WAL
         • Takes advantage of HDFS performance improvements as they are developed
         • Same trusted replication and checksumming schemes as HDFS
      • Accumulo has its own Logger implementation
         • Extra daemons to run
         • Does not leverage improvements in HDFS
         • Won’t re-replicate if loggers go down
      Winner:
  22. Differences – Other features
      • Accumulo has a nice mock Accumulo implementation
         • Nice for testing user software
      • Accumulo supports isolated scans on super-wide rows
         • HBase supports wide rows but isolation properties are lost
      • Accumulo supports tablet merging
         • If tablets get too small, they’ll merge with neighbors
      • Accumulo supports table snapshotting/cloning
      • Other sundry features: logical clocks, RPC tracing, RPC wire compatibility, and more.
  23. Differences – Other features
      • HBase has RPM and Debian packages as part of Apache BigTop
         • Integrated (and integration-tested) with Hive, Pig, and others
      • HBase has commercial support available from Cloudera, as well as several vendors and other projects building on top (Lily, OMID, etc)
      • HBase has first-class support for REST clients and thin Thrift clients
      • HBase has inter-cluster wide-area replication
      • HBase has significantly more advanced bloom filters and other such optimizations (thanks Facebook!)
  24. Summary
      • Neither system is better!
      • One system may very well be better for your use case, or for the community you want to interact with
      • Over time, the feature sets are converging
         • RFile vs HFile v2, Security, Caching, Compaction policies, Iterators/Coprocessors
      • Now that both projects are in Apache, open dialogue, code sharing, and friendly competition will help make both projects better!
  25. Thanks!
      Aaron Cordova and John Vines (Accumulo committers) will now join me for some discussion / questions
      Email: todd@cloudera.com
      Twitter: @tlipcon
