Since Facebook publicly confirmed its adoption of HBase, HBase has been on everyone's lips in the discussion around the new "NoSQL" class of databases. In this talk, Lars introduces HBase and presents a comprehensive overview, covering the history of HBase, the underlying architecture, the available interfaces, and the integration with Hadoop.
5. Why Hadoop/HBase?
• Datasets are constantly growing and ingest rates keep soaring
• Yahoo! has 140PB+ and 42k+ machines
• Facebook adds 500TB+ per day, 100PB+ raw data, on
tens of thousands of machines
• Are you “throwing” data away today?
• Traditional databases are expensive to scale and
inherently difficult to distribute
• Commodity hardware is cheap and powerful
• $1000 buys you 4-8 cores/4GB/1TB
• A single 600GB 15k RPM SAS drive alone is nearly $500
• Need for random access and batch processing
• Hadoop only supports batch/streaming
6. History of Hadoop/HBase
• Google solved its scalability problems
• “The Google File System” published October 2003
• Hadoop DFS
• “MapReduce: Simplified Data Processing on Large
Clusters” published December 2004
• Hadoop MapReduce
• “BigTable: A Distributed Storage System for
Structured Data” published November 2006
• HBase
7. Hadoop Introduction
• Two main components
• Hadoop Distributed File System (HDFS)
• A scalable, fault-tolerant, high performance distributed file
system capable of running on commodity hardware
• Hadoop MapReduce
• Software framework for distributed computation
• Significant adoption
• Used in production in hundreds of organizations
• Primary contributors: Yahoo!, Facebook, Cloudera
8. HDFS: Hadoop Distributed File System
• Reliably store petabytes of replicated data across
thousands of nodes
• Data divided into 64MB blocks, each block replicated
three times
• Master/Slave architecture
• Master NameNode contains block locations
• Slave DataNode manages block on local file system
• Built on commodity hardware
• No 15k RPM disks or RAID required (nor wanted!)
9. MapReduce
• Distributed programming model to reliably
process petabytes of data using its locality
• Built-in bindings for Java and C++ (Hadoop Pipes)
• Can be used with any language via Hadoop
Streaming
• Inspired by map and reduce functions in
functional programming
Input → Map() → Copy/Sort → Reduce() → Output
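As a concrete illustration of the Input → Map() → Copy/Sort → Reduce() → Output flow, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API (not part of the original slides; the class names and tokenizing logic are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token of every input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);
    }
  }
}

// Reduce phase: after the copy/sort step, sum all counts per word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));
  }
}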
10. Hadoop…
• … is designed to store and stream extremely large
datasets in batch
• … is not intended for realtime querying
• … does not support random access
• … does not handle billions of small files well
• Files smaller than the default block size of 64MB
• Their “inode” metadata is kept in memory on the master (NameNode)
• … does not treat structured data any differently from
unstructured or complex data
That is why we have HBase!
11. Why HBase and not …?
• Question: Why HBase and not <put-your-favorite-
nosql-solution-here>?
• What else is there?
• Key/value stores
• Document-oriented stores
• Column-oriented stores
• Graph-oriented stores
• Features to ask for
• In memory or persistent?
• Strict or eventual consistency?
• Distributed or single machine (or afterthought)?
• Designed for read and/or write speeds?
• How does it scale? (if that is what you need)
12. What is HBase?
• Distributed
• Column-Oriented
• Multi-Dimensional
• High-Availability (CAP anyone?)
• High-Performance
• Storage System
Project Goals
Billions of Rows * Millions of Columns * Thousands of
Versions
Petabytes across thousands of commodity servers
13. HBase is not…
• An SQL Database
• No joins, no query engine, no types, no SQL
• Transactions and secondary indexes exist only as add-ons and are
still immature
• A drop-in replacement for your RDBMS
• You must be OK with RDBMS anti-schema
• Denormalized data
• Wide and sparsely populated tables
• Just say “no” to your inner DBA
Keyword: Impedance Match
24. HBase Tables
• Tables are sorted by the Row Key in
lexicographical order
• Table schema only defines its Column Families
• Each family consists of any number of Columns
• Each column consists of any number of Versions
• Columns only exist when inserted, NULLs are free
• Columns within a family are sorted and stored
together
• Everything except the table name is a byte[]
(Table, Row, Family:Column, Timestamp) → Value
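A small sketch (not from the deck) of how this addressing looks through the Java client; the table, family, column, and timestamp values are made up, and the code follows the 0.90-era HTable API:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellAddressing {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "webtable"); // Table
    Get get = new Get(Bytes.toBytes("com.cloudera.www"));               // Row
    get.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("language"));    // Family:Column
    get.setTimeStamp(1234567890000L);                                   // Timestamp (one version)
    Result result = table.get(get);                                     // → Value
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("language"))));
    table.close();
  }
}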
25. Column Family vs. Column
• Use only a few column families
• Too many families cause many files that need to stay open per
region, plus class overhead per family
• Best used when logical separation between data
and meta columns
• Sorting per family can be used to convey
application logic or access pattern
26. HBase Architecture
• Table is made up of any number of regions
• Region is specified by its startKey and endKey
• Empty table: (Table, NULL, NULL)
• Two-region table: (Table, NULL, “com.cloudera.www”)
and (Table, “com.cloudera.www”, NULL)
• Each region may live on a different node and is
made up of several HDFS files and blocks, each
of which is replicated by Hadoop
27. HBase Architecture (cont.)
• Two types of HBase nodes:
Master and RegionServer
• Special tables -ROOT- and .META. store schema
information and region locations
• Master server responsible for RegionServer
monitoring as well as assignment and load
balancing of regions
• Uses ZooKeeper as its distributed coordination
service
• Manages Master election and server availability
28. Web Crawl Example
• Canonical use-case for BigTable
• Store web crawl data
• Table webtable with family content and meta
• Row key is the reversed URL, with Columns
• content:data stores the raw crawled data
• meta:language stores http language header
• meta:type stores http content-type header
• While processing raw data for hyperlinks and images,
add families links and images
• links:<rurl> column for each hyperlink
• images:<rurl> column for each image
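A hedged sketch of how the webtable schema and one crawled row could be created with the Java client (HBaseAdmin/HTable of the 0.90-era API; the sample values are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WebTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Create "webtable" with the content and meta families
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor descriptor = new HTableDescriptor("webtable");
    descriptor.addFamily(new HColumnDescriptor("content"));
    descriptor.addFamily(new HColumnDescriptor("meta"));
    admin.createTable(descriptor);

    // Store one crawled page under its reversed URL
    HTable table = new HTable(conf, "webtable");
    Put put = new Put(Bytes.toBytes("com.cloudera.www"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), Bytes.toBytes("<html>...</html>"));
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("language"), Bytes.toBytes("en"));
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("type"), Bytes.toBytes("text/html"));
    table.put(put);
    table.close();
  }
}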
29. HBase Clients
• Native Java Client/API
• Non-Java Clients
• REST server
• Avro server
• Thrift server
• Jython, Scala, Groovy DSL
• TableInputFormat/TableOutputFormat for
MapReduce
• HBase as MapReduce source and/or target
• HBase Shell
• JRuby shell adding get, put, scan and admin calls
30. Java API
• CRUD
• get: retrieve an entire or a partial row (R)
• put: create and update a row (CU)
• delete: delete a cell, column, columns, or row (D)
Result get(Get get) throws IOException;
void put(Put put) throws IOException;
void delete(Delete delete) throws IOException;
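A short usage sketch of these calls (a fragment, not from the slides; it assumes an open HTable named table for an existing table with family cf):

// Create/update (CU)
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
table.put(put);

// Read (R): whole row, or narrow it with addColumn()/addFamily()
Get get = new Get(Bytes.toBytes("row1"));
Result result = table.get(get);
byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));

// Delete (D): a single column here; an empty Delete removes the whole row
Delete delete = new Delete(Bytes.toBytes("row1"));
delete.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
table.delete(delete);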
31. Java API (cont.)
• CRUD+SI
• scan: Scan any number of rows (S)
• increment: Increment a column value (I)
ResultScanner getScanner(Scan scan) throws IOException;
Result increment(Increment increment) throws IOException;
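A fragment sketching both calls (illustrative names; assumes the same open HTable as before):

// Scan (S): iterate over a key range within one family
Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row9"));
scan.addFamily(Bytes.toBytes("cf"));
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result row : scanner) {
    System.out.println(Bytes.toString(row.getRow()));
  }
} finally {
  scanner.close();
}

// Increment (I): atomically add 1 to a counter column
Increment increment = new Increment(Bytes.toBytes("row1"));
increment.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);
Result counters = table.increment(increment);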
32. Java API (cont.)
• CRUD+SI+CAS
• Atomic compare-and-swap (CAS)
• Combined get, check, and put operation
• Helps to overcome lack of full transactions
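A hedged fragment of the compare-and-swap call on HTable (row, family, and values are made up):

// Apply the Put only if cf:status currently holds the expected value "pending"
Put update = new Put(Bytes.toBytes("row1"));
update.add(Bytes.toBytes("cf"), Bytes.toBytes("status"), Bytes.toBytes("done"));
boolean applied = table.checkAndPut(
    Bytes.toBytes("row1"),     // row to check
    Bytes.toBytes("cf"),       // family
    Bytes.toBytes("status"),   // qualifier
    Bytes.toBytes("pending"),  // expected current value
    update);                   // mutation applied only if the check succeeds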
33. Batch Operations
• Support Get, Put, and Delete
• Reduce network round-trips
• If possible, batch operations to the server to gain
better overall throughput
void batch(List<Row> actions, Object[] results)
throws IOException, InterruptedException;
Object[] batch(List<Row> actions)
throws IOException, InterruptedException;
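A fragment showing one mixed batch (illustrative rows; assumes java.util imports and the same open HTable):

// One round-trip carrying a Put, a Get, and a Delete
List<Row> actions = new ArrayList<Row>();
Put put = new Put(Bytes.toBytes("row1"));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("v1"));
actions.add(put);
actions.add(new Get(Bytes.toBytes("row2")));
actions.add(new Delete(Bytes.toBytes("row3")));

Object[] results = new Object[actions.size()];
table.batch(actions, results);  // each slot holds the per-action result or error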
34. Filters
• Can be used with Get and Scan operations
• Server side hinting
• Reduce data transferred to client
• Filters are no guarantee of fast scans
• Still full table scan in worst-case scenario
• Might have to implement your own
• Filters can hint next row key
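A fragment of a filtered scan as a sketch (PrefixFilter and SingleColumnValueFilter are standard filters; the key prefix and column values are made up):

// Only return rows whose key starts with the prefix and whose meta:type is text/html
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
filters.addFilter(new PrefixFilter(Bytes.toBytes("com.cloudera")));
filters.addFilter(new SingleColumnValueFilter(
    Bytes.toBytes("meta"), Bytes.toBytes("type"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("text/html")));
scan.setFilter(filters);
ResultScanner scanner = table.getScanner(scan);  // filtering happens server-side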
35. HBase Extensions
• Hive, Pig, Cascading
• Hadoop-targeted MapReduce tools with HBase
integration
• Sqoop
• Read and write to HBase for further processing in
Hadoop
• HBase Explorer, Nutch, Heritrix
• SpringData
• Toad
36. History of HBase
• November 2006
• Google releases paper on BigTable
• February 2007
• Initial HBase prototype created as Hadoop contrib
• October 2007
• First “useable” HBase (Hadoop 0.15.0)
• January 2008
• Hadoop becomes TLP, HBase becomes subproject
• October 2008
• HBase 0.18.1 released
• January 2009
• HBase 0.19.0
• September 2009
• HBase 0.20.0 released (Performance Release)
• May 2010
• HBase becomes TLP
• June 2010
• HBase 0.89.20100621, first developer release
• May 2011
• HBase 0.90.3 release
41. HBase Architecture (cont.)
• Based on Log-Structured Merge-Trees (LSM-Trees)
• Inserts are written to a write-ahead log first
• Data is stored in memory and flushed to disk on
regular intervals or based on size
• Small flushes are merged in the background to keep
number of files small
• Reads check the in-memory store first, then the disk-based
files
• Deletes are handled with “tombstone” markers
• Atomicity at the row level, no matter how many columns
• Keeps the locking model simple
44. MapReduce with HBase
• Framework to use HBase as source and/or sink for
MapReduce jobs
• Thin layer over native Java API
• Provides helper class to set up jobs easier
TableMapReduceUtil.initTableMapperJob(
    "test", scan, MyMapper.class,
    ImmutableBytesWritable.class,
    Result.class, job);
TableMapReduceUtil.initTableReducerJob(
    "table", MyReducer.class, job);
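A hedged sketch of the MyMapper referenced above, matching the output classes passed to initTableMapperJob (the pass-through logic is illustrative):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;

// Receives one (row key, Result) pair per HBase row and forwards it unchanged;
// a real job would transform the row here
public class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
      throws IOException, InterruptedException {
    context.write(rowKey, columns);
  }
}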
45. MapReduce with HBase (cont.)
• Special use-case in regards to Hadoop
• Tables are sorted and have unique keys
• Often we do not need a Reducer phase
• Combiner not needed
• Need to make sure load is distributed properly by
randomizing keys (or use bulk import)
• Partial or full table scans possible
• Scans are very efficient as they make use of block
caches
• But then make sure you do not create too much cache churn; better
to switch block caching off when doing full table scans
• Can use filters to limit rows being processed
46. TableInputFormat
• Transforms an HBase table into a source for
MapReduce jobs
• Internally uses a TableRecordReader which
wraps a Scan instance
• Supports restarts to handle temporary issues
• Splits table by region boundaries and stores
current region locality
47. TableOutputFormat
• Allows using an HBase table as an output target
• Put and Delete support from mapper or reducer
class
• Uses TableOutputCommitter to write data
• Disables auto-flush on the table to make use of the
client-side write buffer
• Handles final flush in close()
48. HFileOutputFormat
• Used to bulk load data into HBase
• Bypasses normal API and generates low-level
store files
• Prepares files for final bulk insert
• Needs special handling of sort order and
partitioning
• Only supports one column family (for now)
• Can load bulk updates into existing tables
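A hedged job-setup fragment for the bulk-load path; configureIncrementalLoad() takes care of the sort order and partitioning mentioned above, while MyBulkLoadMapper, the paths, and the table name are hypothetical:

// The mapper is expected to emit (ImmutableBytesWritable rowKey, Put put) pairs
Job job = new Job(conf, "bulk-load-webtable");
job.setMapperClass(MyBulkLoadMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
FileInputFormat.addInputPath(job, new Path("/input/crawl"));
FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

// Wires in HFileOutputFormat, a total-order partitioner, and the sorting reducer
// based on the current region boundaries of the target table
HTable table = new HTable(conf, "webtable");
HFileOutputFormat.configureIncrementalLoad(job, table);
job.waitForCompletion(true);

// Afterwards the generated store files are moved into the table,
// e.g. with the completebulkload tool (LoadIncrementalHFiles)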
49. MapReduce Helper
• TableMapReduceUtil
• IdentityTableMapper
• Passes on key and value, where value is a Result
instance and key is set to value.getRow()
• IdentityTableReducer
• Stores values into HBase, must be Put or Delete
instances
• HRegionPartitioner
• Not set by default; use it to control partitioning at the
Hadoop level
50. Custom MapReduce over Tables
• No requirement to use provided framework
• Can read from or write to one or many tables in
mapper and reducer
• Can split on arbitrary boundaries instead of regions
• Make sure to use write buffer in OutputFormat to
get best performance (do not forget to call
flushCommits() at the end!)
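A fragment illustrating the write-buffer advice above (buffer size and table name are arbitrary):

// Enable the client-side write buffer when writing directly to HTable
HTable table = new HTable(conf, "target");
table.setAutoFlush(false);                    // buffer puts on the client
table.setWriteBufferSize(12 * 1024 * 1024);   // e.g. 12 MB before an automatic flush
// ... many table.put(...) calls inside map() or reduce() ...
table.flushCommits();                         // flush the remainder in close()/cleanup()
table.close();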
52. Advanced Techniques
• Key/Table Design
• DDI
• Salting (see the sketch after this list)
• Hashing vs. Sequential Keys
• ColumnFamily vs. Column
• Using BloomFilter
• Data Locality
• checkAndPut() and checkAndDelete()
• Coprocessors
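As an illustration of the salting idea from the list above, a minimal fragment (the bucket count and key layout are arbitrary choices, not a prescribed scheme; assumes an open HTable named table):

// Prefix sequential keys with a salt so writes spread over all regions
int buckets = 16;  // arbitrary number of salt buckets
byte[] originalKey = Bytes.toBytes("20110524-event-000123");
int salt = (Arrays.hashCode(originalKey) & Integer.MAX_VALUE) % buckets;
byte[] saltedKey = Bytes.add(Bytes.toBytes(String.format("%02d-", salt)), originalKey);

Put put = new Put(saltedKey);
put.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), Bytes.toBytes("payload"));
table.put(put);
// Reads and scans must fan out over all 16 salt prefixes to cover a logical key range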
53. Coprocessors
• New addition to feature set
• Based on talk by Jeff Dean at LADIS 2009
• Run arbitrary code on each region in RegionServer
• High level call interface for clients
• Calls are addressed to rows or ranges of rows; the
coprocessor client library resolves their locations
• Calls spanning multiple rows are automatically split
• Provides model for distributed services
• Automatic scaling, load balancing, request routing
54. Coprocessors in HBase
• Use for efficient computational parallelism
• Secondary indexing (HBASE-2038)
• Column Aggregates (HBASE-1512)
• SQL-like sum(), avg(), max(), min(), etc.
• Access control (HBASE-3025, HBASE-3045)
• Provide basic access control
• Table Metacolumns
• New filtering
• predicate pushdown
• Table/Region access statistics
• HLog extensions (HBASE-3257)
55. Coprocessor and RegionObserver
• The Coprocessor interface defines these hooks
• preOpen, postOpen: Called before and after the
region is reported as online to the master
• preFlush, postFlush: Called before and after the
memstore is flushed into a new store file
• preCompact, postCompact: Called before and after
compaction
• preSplit, postSplit: Called before and after the region is split
• preClose, postClose: Called before and after the
region is reported as closed to the master
56. Coprocessor and RegionObserver
• The RegionObserver interface defines these hooks:
• preGet, postGet: Called before and after a client makes a Get
request
• preExists, postExists: Called before and after the client tests for
existence using a Get
• prePut, postPut: Called before and after the client stores a value
• preDelete, postDelete: Called before and after the client deletes a
value
• preScannerOpen, postScannerOpen: Called before and after the
client opens a new scanner
• preScannerNext, postScannerNext: Called before and after the
client asks for the next row on a scanner
• preScannerClose, postScannerClose: Called before and after the
client closes a scanner
• preCheckAndPut, postCheckAndPut: Called before and after the
client calls checkAndPut()
• preCheckAndDelete, postCheckAndDelete: Called before and after
the client calls checkAndDelete()
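A hedged example of a RegionObserver using the prePut hook; the exact method signatures shifted between releases, so this follows roughly the 0.92-era API, and the audit family is assumed to exist in the table:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Stamps every Put with an audit:ts column before it is applied to the region
public class AuditObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    put.add(Bytes.toBytes("audit"), Bytes.toBytes("ts"),
        Bytes.toBytes(System.currentTimeMillis()));
  }
}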
58. Current Project Status
• HBase 0.90.x “Advanced Concepts”
• Master Rewrite – More ZooKeeper
• Intra Row Scanning
• Further optimizations on algorithms and data
structures
CDH3
• HBase 0.92.x “Coprocessors”
• Multi-DC Replication
• Discretionary Access Control
• Coprocessors
CDH4
59. Current Project Status (cont.)
• HBase 0.94.x “Performance Release”
• Read CRC Improvements
• Seek Optimizations
• WAL Compression
• Prefix Compression (aka Block Encoding)
• Atomic Append
• Atomic put+delete
• Multi Increment and Multi Append
• Per-region (i.e. local) Multi-Row Transactions
• WALPlayer
CDH4.x (soon)
60. Current Project Status (cont.)
• HBase 0.96.x “The Singularity”
• Protobuf RPC
• Rolling Upgrades
• Multiversion Access
• Metrics V2
• Preview Technologies
• Snapshots
• PrefixTrie Block Encoding
CDH5 ?