APACHE HBASE
Scott Leberknight
BACKGROUND
Google Bigtable
"Bigtable is a distributed storage
system for managing structured data
that is designed to scale to a very
large size: petabytes of data across
thousands of commodity
servers. Many projects at Google
store data in Bigtable including web
indexing, Google Earth, and Google
Finance."


                  - Bigtable: A Distributed Storage System
                                        for Structured Data
                                 http://labs.google.com/papers/bigtable.html
"A Bigtable is a sparse, distributed, persistent
                    multidimensional sorted map"



               - Bigtable: A Distributed Storage System
                                     for Structured Data
                              http://labs.google.com/papers/bigtable.html
wtf?
distributed


    sparse


column-oriented


   versioned
"The map is indexed by a row key,
column key, and a timestamp; each
value in the map is an uninterpreted array
of bytes."
                   - Bigtable: A Distributed Storage System
                                         for Structured Data
                       http://labs.google.com/papers/bigtable.html




 (row key, column key, timestamp) => value
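One way to picture the "multidimensional sorted map" is as nested sorted maps. The sketch below is a toy in-memory model, not HBase or Bigtable code: the class name SortedMapSketch and its methods are made up for illustration, and plain Java strings stand in for row and column keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of (row key, column key, timestamp) => value. Hypothetical, not an HBase API. */
final class SortedMapSketch {
    // row key -> column key -> timestamp -> uninterpreted value bytes
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    void put(String rowKey, String columnKey, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(columnKey, k -> new TreeMap<>())
            .put(timestamp, value.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns up to maxVersions values for one cell, newest first. */
    List<String> getVersions(String rowKey, String columnKey, int maxVersions) {
        List<String> out = new ArrayList<>();
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(rowKey);
        if (columns == null || !columns.containsKey(columnKey)) {
            return out; // sparse: a cell that was never written simply is not there
        }
        for (byte[] value : columns.get(columnKey).descendingMap().values()) {
            if (out.size() == maxVersions) {
                break;
            }
            out.add(new String(value, StandardCharsets.UTF_8));
        }
        return out;
    }

    /** Returns the newest value for one cell, or null if the cell does not exist. */
    String getLatest(String rowKey, String columnKey) {
        List<String> newest = getVersions(rowKey, columnKey, 1);
        return newest.isEmpty() ? null : newest.get(0);
    }

    public static void main(String[] args) {
        SortedMapSketch map = new SortedMapSketch();
        map.put("20120407145045", "info:author", 2L, "John");
        map.put("20120407145045", "info:author", 6L, "John Doe");
        System.out.println(map.getVersions("20120407145045", "info:author", 3));
        // prints [John Doe, John]
    }
}
```

Keeping every level sorted is what makes row-range scans and "newest version first" lookups cheap in this model.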
Key Concepts:
row key => 20120407152657

column family => "personal:"

column key => "personal:givenName",
              "personal:surname"

timestamp => 1239124584398
Row Key       Timestamp         Column Family "info:"                Column Family "content:"
20120407145045      t7       "info:summary"     "An intro to..."
                    t6        "info:author"       "John Doe"
                    t5                                               "Google's Bigtable is..."
                    t4                                               "Google Bigtable is..."
                    t3       "info:category"     "Persistence"
                    t2        "info:author"          "John"
                    t1         "info:title"    "Intro to Bigtable"
20120320162535      t4       "info:category"     "Persistence"
                    t3                                                   "CouchDB is..."
                    t2        "info:author"       "Bob Smith"
                    t1         "info:title"    "Doc-oriented..."
Get row 20120407145045...
Row Key       Timestamp         Column Family "info:"                Column Family "content:"
20120407145045      t7       "info:summary"     "An intro to..."
                    t6        "info:author"       "John Doe"
                    t5                                               "Google's Bigtable is..."
                    t4                                               "Google Bigtable is..."
                    t3       "info:category"     "Persistence"
                    t2        "info:author"          "John"
                    t1         "info:title"    "Intro to Bigtable"
20120320162535      t4       "info:category"     "Persistence"
                    t3                                                   "CouchDB is..."
                    t2        "info:author"       "Bob Smith"
                    t1         "info:title"    "Doc-oriented..."
Use HBase when you need random, realtime read/write
access to your Big Data. This project's goal is the
hosting of very large tables -- billions of rows X
millions of columns -- atop clusters of commodity
hardware. HBase is an open-source, distributed,
versioned, column-oriented store modeled after
Google's Bigtable.

                                   - http://hbase.apache.org/
HBase Shell
hbase(main):001:0> create 'blog', 'info', 'content'
0 row(s) in 4.3640 seconds
hbase(main):002:0> put 'blog', '20120320162535', 'info:title', 'Document-oriented storage using CouchDB'
0 row(s) in 0.0330 seconds
hbase(main):003:0> put 'blog', '20120320162535', 'info:author', 'Bob Smith'
0 row(s) in 0.0030 seconds
hbase(main):004:0> put 'blog', '20120320162535', 'content:', 'CouchDB is a document-oriented...'
0 row(s) in 0.0030 seconds
hbase(main):005:0> put 'blog', '20120320162535', 'info:category', 'Persistence'
0 row(s) in 0.0030 seconds
hbase(main):006:0> get 'blog', '20120320162535'
COLUMN                       CELL
 content:                    timestamp=1239135042862, value=CouchDB is a doc...
 info:author                 timestamp=1239135042755, value=Bob Smith
 info:category               timestamp=1239135042982, value=Persistence
 info:title                  timestamp=1239135042623, value=Document-oriented...
4 row(s) in 0.0140 seconds
HBase Shell



hbase(main):015:0> get 'blog', '20120407145045', {COLUMN=>'info:author', VERSIONS=>3 }
timestamp=1239135325074, value=John Doe
timestamp=1239135324741, value=John
2 row(s) in 0.0060 seconds
hbase(main):016:0> scan 'blog', { STARTROW => '20120300', STOPROW => '20120400' }
ROW                     COLUMN+CELL
 20120320162535         column=content:, timestamp=1239135042862, value=CouchDB is...
 20120320162535         column=info:author, timestamp=1239135042755, value=Bob Smith
 20120320162535         column=info:category, timestamp=1239135042982, value=Persistence
 20120320162535         column=info:title, timestamp=1239135042623, value=Document...
4 row(s) in 0.0230 seconds
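The scan above returns only the March 2012 row because row keys are kept in sorted order, the start row is inclusive, and the stop row is exclusive. A minimal sketch of that range behaviour, using a plain java.util.TreeMap as a stand-in for the table (the class and method names are hypothetical; HBase compares raw bytes, which matches Java string ordering for ASCII keys like these):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy stand-in for a row-key range scan. Not HBase code. */
final class RangeScanSketch {
    /** Returns the row keys in [startRow, stopRow), in sorted order. */
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> blog = new TreeMap<>();
        blog.put("20120407145045", "Intro to Bigtable");
        blog.put("20120320162535", "Document-oriented storage using CouchDB");

        System.out.println(scan(blog, "20120300", "20120400"));
        // prints [20120320162535]
    }
}
```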
Got byte[]?
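Since every key and value is an uninterpreted byte array, the client decides how to encode and decode them. The Java slides that follow use HBase's own Bytes.toBytes helper for this; the sketch below only illustrates the idea with the standard library (UTF-8 for strings, 8-byte big-endian for longs). The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Standard-library illustration of turning values into byte[] and back. Not the HBase Bytes class. */
final class ByteEncodingSketch {
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static byte[] encode(long value) {
        // ByteBuffer writes big-endian by default, so the result is 8 bytes, high byte first
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] name = encode("Connor");
        byte[] timestamp = encode(1239124584398L);
        System.out.println(decodeString(name) + " / " + decodeLong(timestamp));
    }
}
```

The store never knows whether a cell holds a string, a number, or a serialized object; reader and writer simply have to agree on the encoding.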
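import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Create a new table
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

// One table with three column families
String tableName = "people";
HTableDescriptor desc = new HTableDescriptor(tableName);
desc.addFamily(new HColumnDescriptor("personal"));
desc.addFamily(new HColumnDescriptor("contactinfo"));
desc.addFamily(new HColumnDescriptor("creditcard"));
admin.createTable(desc);

System.out.printf("%s is available? %b%n",
  tableName, admin.isTableAvailable(tableName));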
import static org.apache.hadoop.hbase.util.Bytes.toBytes;

// Add some data into 'people' table
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "people");
// All cells in a single Put share one row key
Put put = new Put(toBytes("connor-john-m-43299"));
put.add(toBytes("personal"), toBytes("givenName"),
        toBytes("John"));
put.add(toBytes("personal"), toBytes("mi"), toBytes("M"));
put.add(toBytes("personal"), toBytes("surname"),
        toBytes("Connor"));
put.add(toBytes("contactinfo"), toBytes("email"),
        toBytes("john.connor@gmail.com"));
table.put(put);
table.flushCommits();
table.close();
Finding data:

    get (by row key)


    scan (by row key ranges, filtering)
// Get a row. Ask for only the data you need.
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "people");
Get get = new Get(toBytes("connor-john-m-43299"));
get.setMaxVersions(2);
get.addFamily(toBytes("personal"));
get.addColumn(toBytes("contactinfo"), toBytes("email"));
Result result = table.get(get);
// Update existing values, and add a new one
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "people");
Put put = new Put(toBytes("connor-john-m-43299"));
put.add(toBytes("personal"), toBytes("surname"),
        toBytes("Smith"));
put.add(toBytes("contactinfo"), toBytes("email"),
        toBytes("john.m.smith@gmail.com"));
put.add(toBytes("contactinfo"), toBytes("address"),
        toBytes("San Diego, CA"));
table.put(put);
table.flushCommits();
table.close();
// Scan rows, starting at the first row key >= "smith-"
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "people");
long numRowsPerPage = 10;  // page size; this value is only an example
Scan scan = new Scan(toBytes("smith-"));
scan.addColumn(toBytes("personal"), toBytes("givenName"));
scan.addColumn(toBytes("contactinfo"), toBytes("email"));
scan.addColumn(toBytes("contactinfo"), toBytes("address"));
scan.setFilter(new PageFilter(numRowsPerPage));
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result result : scanner) {
    // process result...
  }
} finally {
  scanner.close();
  table.close();
}
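A scan created with only a start row keeps going past the "smith-" rows until a filter or the end of the table stops it. To visit only keys that begin with a prefix, one common approach is to pass a stop row computed from the prefix. The helper below is a hypothetical sketch of that computation, not part of the HBase API; its result could be supplied as the scan's exclusive stop row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Computes an exclusive stop row for "all keys starting with this prefix". Hypothetical helper. */
final class PrefixScanSketch {
    static byte[] stopRowForPrefix(byte[] prefix) {
        // Skip trailing 0xFF bytes, which cannot be incremented
        int i = prefix.length - 1;
        while (i >= 0 && prefix[i] == (byte) 0xFF) {
            i--;
        }
        if (i < 0) {
            return new byte[0]; // no upper bound exists; empty means "scan to the end" here
        }
        byte[] stop = Arrays.copyOf(prefix, i + 1);
        stop[i]++; // smallest key that sorts after every key with the prefix
        return stop;
    }

    /** Convenience overload for ASCII prefixes. */
    static String stopRowForPrefix(String prefix) {
        byte[] stop = stopRowForPrefix(prefix.getBytes(StandardCharsets.UTF_8));
        return new String(stop, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(stopRowForPrefix("smith-"));
        // prints smith.
    }
}
```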
Data Modeling

    Row key design

    Match to data access patterns

    Wide vs. narrow rows
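Because rows can only be found by key or key range, the row key format has to follow the queries you plan to run. The sketch below builds the two key styles used earlier in these slides: a fixed-width timestamp key (sorts chronologically, so date-range scans work) and a surname-first composite key (so a scan from "smith-" reaches the Smiths). The class and method names are hypothetical, and treating the trailing 43299 as a numeric id is an assumption.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical row key builders matching the key shapes shown in the slides. */
final class RowKeySketch {
    // Fixed width matters: string order then equals chronological order
    private static final DateTimeFormatter TIMESTAMP_KEY =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    static String timestampKey(LocalDateTime time) {
        return time.format(TIMESTAMP_KEY);
    }

    // Surname first, so rows for one surname sit next to each other
    static String personKey(String surname, String givenName, String middleInitial, int id) {
        return String.join("-", surname, givenName, middleInitial, Integer.toString(id))
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(timestampKey(LocalDateTime.of(2012, 4, 7, 14, 50, 45)));
        // prints 20120407145045
        System.out.println(personKey("Connor", "John", "M", 43299));
        // prints connor-john-m-43299
    }
}
```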
References

    shop.oreilly.com/product/0636920014348.do

    http://shop.oreilly.com/product/0636920021773.do
    (3rd edition pub date is May 29, 2012)

    hbase.apache.org
(my info)




scott.leberknight at nearinfinity.com
www.nearinfinity.com/blogs/
twitter: sleberknight

HBase Lightning Talk
