2. What is HBase
An open-source, distributed, versioned, column-oriented store, implemented
in Java and modeled after Google's Bigtable
Hadoop: A distributed system for large-scale storage and parallel computing
HDFS: A distributed file system that provides high throughput access to application data.
ZooKeeper: A high-performance coordination service for distributed applications.
3. Why HBase is needed
Big Data: billions of rows × millions of columns
Scalability: Linear scalability across hundreds or thousands of machines
Read/write performance:
put: MemStore (merged into data files later) and WAL (sequential appends instead of random writes)
get and scan: Block cache and Bloom filters
Failure handling: see http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
Schema: Loosely-structured {key, value} data
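As an illustration of the Bloom filter mentioned under read performance, here is a minimal sketch in Python. It is a toy model, not HBase's implementation: the class name, bit-array size, and hashing scheme are all assumptions chosen for readability. The idea it shows is that a store file can cheaply answer "this row key is definitely not here", so a get can skip reading that file.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):  # made-up row keys
    bloom.add(row_key)
```

A reader would call `bloom.might_contain(row_key)` before opening a file and skip the file when the answer is False.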
4. How does HBase work
(Table, RowKey, Family, Column, Timestamp) → Value
An HBase table is a three-dimensional sorted map
Each family consists of any number of columns
Each column consists of any number of versions
Sort order: row (asc), column (asc), timestamp (desc)
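The data model above can be sketched as a small in-memory map. This is only a toy under stated assumptions: `ToyTable` and its methods are hypothetical names, timestamps are plain integers, and nothing here is the HBase API. It shows the two points from the slide: each cell is addressed by (row, family, column, timestamp), and cells sort by row ascending, column ascending, timestamp descending, so the newest version comes first.

```python
class ToyTable:
    """Toy model of (row, family, column, timestamp) -> value."""

    def __init__(self):
        self.cells = {}

    def put(self, row, family, column, timestamp, value):
        self.cells[(row, family, column, timestamp)] = value

    def scan(self):
        # Row asc, family/column asc, timestamp desc (newest version first).
        def sort_key(item):
            (row, family, column, ts), _ = item
            return (row, family, column, -ts)

        return sorted(self.cells.items(), key=sort_key)

    def get(self, row, family, column, max_versions=1):
        versions = [
            (ts, value)
            for (r, f, c, ts), value in self.scan()
            if (r, f, c) == (row, family, column)
        ]
        return versions[:max_versions]


table = ToyTable()
table.put("row2", "info", "name", 1, "bob")
table.put("row1", "info", "name", 1, "alice")
table.put("row1", "info", "name", 2, "alicia")
table.put("row1", "info", "age", 1, "30")

print(table.get("row1", "info", "name"))  # newest version only
```

Asking for `max_versions=2` returns both stored versions of the cell, newest first.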
5.
6. HMaster
Assignment, load balancing, splitting
Dispatch Regions to RegionServers.
Assign RegionServers.
Not part of the read/write path
Highly available with ZooKeeper and standbys
7. HRegionServer
StoreFile is stored in HDFS as HFile
Table (HBase table)
  Region (Regions for the table)
    Store (Store per ColumnFamily for each Region for the table)
      MemStore (MemStore for each Store for each Region for the table)
      StoreFile (StoreFiles for each Store for each Region for the table)
        Block (Blocks within a StoreFile within a Store for each Region for the table)
8. MemStore & HLog
Data is written to the MemStore and the HLog first.
That is, writes go to the in-memory cache and the log first;
data is then flushed from the cache to files, which are merged later.
The HLog is used for recovery.
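The write path described on this slide can be sketched roughly as follows. This is a toy model, not HBase code: `ToyStore`, its flush threshold, and the in-memory lists standing in for the HLog and the store files are all assumptions. It shows the order of operations (log append, then cache update), flushing to an immutable sorted file, merging files later, and rebuilding the cache from the log after a crash.

```python
class ToyStore:
    """Toy write path: append to a log, buffer in memory, flush when full."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for the HLog (append-only)
        self.memstore = {}     # stands in for the MemStore
        self.store_files = []  # each flush produces one immutable sorted file
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log first, as a sequential append
        self.memstore[key] = value     # 2. then update the in-memory cache
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}
            self.wal = []  # flushed edits no longer need replaying

    def compact(self):
        # Merge all files into one; newer files win on duplicate keys.
        merged = {}
        for store_file in self.store_files:
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):  # newest file first
            for k, v in store_file:
                if k == key:
                    return v
        return None

    @classmethod
    def recover(cls, wal, store_files, flush_threshold=3):
        # After a crash the MemStore is gone; replay the log to rebuild it.
        store = cls(flush_threshold)
        store.store_files = list(store_files)
        store.wal = list(wal)
        for key, value in wal:
            store.memstore[key] = value
        return store


store = ToyStore(flush_threshold=3)
for i in range(4):
    store.put(f"k{i}", f"v{i}")

# Simulate a crash: only the log and the flushed files survive.
recovered = ToyStore.recover(store.wal, store.store_files)
```

In this sketch the fourth put is still only in the cache and the log when the simulated crash happens, and `recover` brings it back by replaying the log.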
9. ZooKeeper
Tree-structured index:
A ZooKeeper file keeps the pointer to the -ROOT- region.
-ROOT- stores the index: the positions of the .META. regions.
.META. stores table info: the position of each region on each region server.
Stores the HBase schema (table info, column family info)
Fully cached in RAM
Monitors RegionServer liveness
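The tree-structured index above amounts to a three-step lookup: ZooKeeper gives the location of -ROOT-, -ROOT- gives the right .META. region, and .META. gives the region server for the row. The sketch below models that with plain Python lists. Every server name, key format, and dictionary key in it is invented for illustration and does not reflect HBase's actual catalog layout.

```python
import bisect


def find_entry(index, key):
    """Return the value of the last entry whose start key is <= key."""
    start_keys = [start for start, _ in index]
    i = bisect.bisect_right(start_keys, key) - 1
    return index[max(i, 0)][1]


# Hypothetical cluster state; all names and addresses are made up.
zookeeper = {"root-region-location": "rs-a"}

# -ROOT- maps start keys of .META. regions to the server hosting each one.
root_region = {"rs-a": [("", "rs-b"), ("users,m", "rs-c")]}

# Each .META. region maps "table,start_row" keys to the hosting server.
meta_regions = {
    "rs-b": [("users,", "rs-1"), ("users,g", "rs-2")],
    "rs-c": [("users,m", "rs-3"), ("users,t", "rs-4")],
}


def locate(table, row):
    key = f"{table},{row}"
    root_server = zookeeper["root-region-location"]           # step 1
    meta_server = find_entry(root_region[root_server], key)   # step 2
    return find_entry(meta_regions[meta_server], key)         # step 3


print(locate("users", "alice"))
print(locate("users", "nina"))
```

Because each level is sorted by start key, every step is a binary search, which is why keeping the index in RAM makes lookups cheap.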
10. HClient (Gateway of HBase)
Cache the region positions.
read:
Batch loading, scan caching, scan attribute (column family or column) selection
write: AutoFlush, turning off the WAL on Puts
HBase client pool
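Two of the client-side ideas on this slide, caching region positions and buffering writes when AutoFlush is off, can be sketched together. This is a hedged toy model: `ToyClient`, the `locate` and `send` callables, and the region layout are hypothetical and are not the HBase client API. It shows that a cached region location avoids repeated lookups, and that buffered puts are grouped into one call per region server.

```python
class ToyClient:
    """Toy client with a region-location cache and a write buffer."""

    def __init__(self, locate, send, auto_flush=True, buffer_size=3):
        self.locate = locate      # slow lookup: row -> (start, end, server)
        self.send = send          # callable(server, puts), one call per batch
        self.auto_flush = auto_flush
        self.buffer_size = buffer_size
        self.region_cache = []    # cached (start, end, server) ranges
        self.write_buffer = []
        self.lookups = 0          # counts cache misses

    def _server_for(self, row):
        for start, end, server in self.region_cache:
            if start <= row and (end == "" or row < end):
                return server     # cache hit, no lookup needed
        self.lookups += 1
        region = self.locate(row)
        self.region_cache.append(region)
        return region[2]

    def put(self, row, value):
        self.write_buffer.append((row, value))
        if self.auto_flush or len(self.write_buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        batches = {}
        for row, value in self.write_buffer:
            batches.setdefault(self._server_for(row), []).append((row, value))
        for server, puts in batches.items():
            self.send(server, puts)
        self.write_buffer = []


# Invented layout: two regions split at "m"; "" means an open-ended range.
REGIONS = [("", "m", "rs-1"), ("m", "", "rs-2")]


def locate(row):
    for start, end, server in REGIONS:
        if start <= row and (end == "" or row < end):
            return (start, end, server)


rpcs = []
client = ToyClient(locate, lambda server, puts: rpcs.append((server, puts)),
                   auto_flush=False, buffer_size=4)
for row in ("apple", "banana", "melon", "zebra"):
    client.put(row, "x")
```

With `auto_flush=True` the same four puts would each trigger their own send, which is the cost that buffering is meant to avoid.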