Prepared by, Vetri.V

WHAT IS HBASE?
 HBase is a database: the Hadoop database. It is indexed by rowkey, column key, and timestamp.
 HBase stores structured and semi-structured data naturally, so you can load it with tweets, parsed log files, and a catalog of all your products right along with their customer reviews.
 It can store unstructured data too, as long as it's not too large.
 HBase is designed to run on a cluster of computers instead of a single computer. The cluster can be built using commodity hardware; HBase scales horizontally as you add more machines to the cluster.
 Each node in the cluster provides a bit of storage, a bit of cache, and a bit of computation as well. This makes HBase incredibly flexible and forgiving. No node is unique, so if one of those machines breaks down, you simply replace it with another.
 This adds up to a powerful, scalable approach to data that, until now, hasn't been commonly available to mere mortals.

HBASE DATA MODEL:
These six concepts form the foundation of HBase.

Table:
 HBase organizes data into tables. Table names are Strings and composed of characters that are safe for use in a file system path.

Row:
 Within a table, data is stored according to its row. Rows are identified uniquely by their rowkey. Rowkeys don't have a data type and are always treated as a byte[].

Column family:
 Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase.
 For this reason, they must be defined up front and aren't easily modified. Every row in a table has the same column families, although a row need not store data in all of them. Column family names are Strings and composed of characters that are safe for use in a file system path.

Column qualifier:
 Data within a column family is addressed via its column qualifier, or column. Column qualifiers need not be specified in advance, and need not be consistent between rows.
 Like rowkeys, column qualifiers don't have a data type and are always treated as a byte[].
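The coordinate system above can be pictured as nested maps. The following is a minimal Python sketch of that idea (an illustration only, not the HBase API): every value is addressed by rowkey, column family, column qualifier, and timestamp, and everything is bytes.

```python
# Sketch of the HBase data model: {rowkey: {family: {qualifier: {timestamp: value}}}}
# All coordinates and values are bytes, mirroring HBase's byte[] treatment.

table = {}

def put(table, row, family, qualifier, value, ts):
    """Store a value at the full coordinate (row, family:qualifier, ts)."""
    (table.setdefault(row, {})
          .setdefault(family, {})
          .setdefault(qualifier, {}))[ts] = value

put(table, b"first", b"cf", b"message", b"hello HBase", 1)
print(table[b"first"][b"cf"][b"message"][1])  # b'hello HBase'
```

Note that a "NULL column" simply has no entry in the inner map, which is why missing columns cost nothing to store.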
Cell:
 A combination of rowkey, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell's value. Values also don't have a data type and are always treated as a byte[].

Version:
 Values within a cell are versioned. Versions are identified by their timestamp, a long. When a version isn't specified, the current timestamp is used as the basis for the operation. The number of cell value versions retained by HBase is configured per column family. The default number of cell versions is three.

HBASE ARCHITECTURE:

HBase Tables and Regions:
A table is made up of any number of regions. A region is specified by its startKey and endKey.
 Empty table: (Table, NULL, NULL)
 Two-region table: (Table, NULL, "com.ABC.www") and (Table, "com.ABC.www", NULL)
Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop.

HBase Tables:
 Tables are sorted by row in lexicographical order
 A table schema only defines its column families
 Each family consists of any number of columns
 Each column consists of any number of versions
 Columns only exist when inserted; NULLs are free
 Columns within a family are sorted and stored together
 Everything except table names is a byte[]
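Because rows are kept in lexicographic order and each region covers a half-open key range [startKey, endKey), locating the region that serves a given rowkey is a sorted-boundary lookup. A small sketch (an assumption about the mechanism, not HBase source code), using the two-region table from the slide:

```python
# Sketch: finding the region for a rowkey via its sorted start keys.
# "" stands in for the NULL start key of the table's first region.
import bisect

region_starts = ["", "com.ABC.www"]  # two regions: [NULL, com.ABC.www) and [com.ABC.www, NULL)

def region_for(rowkey):
    """Return the index of the region whose [start, end) range holds rowkey."""
    return bisect.bisect_right(region_starts, rowkey) - 1

print(region_for("com.AAA.www"))  # 0 -> first region
print(region_for("com.XYZ.www"))  # 1 -> second region
```

This lexicographic routing is why rowkey design matters: keys that sort close together land in the same region.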
 HBase table format: (Row, Family:Column, Timestamp) -> Value

HBase uses HDFS as its reliable storage layer. HDFS handles checksums, replication, and failover.

HBase consists of:
 A Java API, plus gateways for REST, Thrift, and Avro
 A Master that manages the cluster
 RegionServers that manage the data
 ZooKeeper, which acts as the cluster's "neural network" and coordinates the cluster

Data is stored in memory and flushed to disk at regular intervals or based on size.
 Small flushes are merged in the background to keep the number of files small
 Reads check the memory stores first and the disk-based files second
 Deletes are handled with "tombstone" markers

MemStores:
After data is written to the WAL, the RegionServer saves KeyValues in a memory store.
 Flush to disk is based on size: hbase.hregion.memstore.flush.size
 The default size is 64MB
 A snapshot mechanism writes the flush to disk while still serving reads and accepting new data at the same time

Compactions:
There are two types: minor and major compactions.
Minor compactions:
 Combine the last "few" flushes
 Triggered by the number of storage files
Major compactions:
 Rewrite all storage files
 Drop deleted data and values exceeding the TTL and/or the number of versions

Key Cardinality:
The best performance is gained from reading and writing by row key.
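The MemStore-flush-then-compact write path described above can be sketched in a few lines of Python (purely illustrative; thresholds, file formats, and WAL handling are far richer in real HBase):

```python
# Sketch of the write path: writes accumulate in a memory store, are flushed
# to an immutable "file" when a size threshold is hit, and a minor compaction
# merges the flushed files back into one.

FLUSH_SIZE = 3        # toy stand-in for hbase.hregion.memstore.flush.size (64MB default)

memstore = {}         # in-memory store (sorted in real HBase)
store_files = []      # each flush produces one immutable file

def write(key, value):
    memstore[key] = value
    if len(memstore) >= FLUSH_SIZE:
        store_files.append(dict(sorted(memstore.items())))  # flush a sorted snapshot
        memstore.clear()

def minor_compact():
    """Merge flushed files into one; later files win on key collisions."""
    merged = {}
    for f in store_files:
        merged.update(f)
    store_files[:] = [merged]

for i in range(7):
    write(f"row{i}", i)
print(len(store_files), len(memstore))  # 2 flushed files, 1 key still in memory
minor_compact()
print(len(store_files))                 # 1
```

Reads in this model would consult `memstore` first and then the `store_files`, newest to oldest, which is exactly why keeping the file count low matters.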
 Time-range-bound reads can skip store files
 So can Bloom filters
 Selecting column families reduces the amount of data to be scanned

Fold, Store, and Shift:
All values are stored with their full coordinates, including: Row Key, Column Family, Column Qualifier, and Timestamp.
 Folds columns into "row per column"
 NULLs are cost free, as nothing is stored
 Versions are multiple "rows" in the folded table

DDI:
Stands for Denormalization, Duplication, and Intelligent Keys.

Other building blocks covered here: the block cache, region splits, and the HBase shell and its commands.

HBase Install:
$ mkdir hbase-install
$ cd hbase-install
$ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
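The "fold" described above can be made concrete: a logical row becomes one entry per non-NULL column, each carrying its full coordinates, in one sorted list. A small sketch (illustrative only; real HBase serializes these as KeyValue byte arrays):

```python
# Sketch of folding a logical row into "row per column" entries,
# each stored with its full coordinates (row, family:qualifier, ts, value).

row = {"info:name": "vetri", "info:email": None, "info:pwd": "secret"}

folded = sorted(
    ("row1", fam_qual, 1, value)
    for fam_qual, value in row.items()
    if value is not None          # NULLs are cost free: nothing is stored
)
print(len(folded))  # 2 entries; the NULL email column occupies no space
```

A second version of a column would simply be one more entry in the same sorted list with a different timestamp, which is the "versions are multiple rows in the folded table" point.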
$ tar xvfz hbase-0.92.1.tar.gz
$ $HBASE_HOME/bin/start-hbase.sh

Configuration changes in HBase:
 Go to hbase-env.sh and edit JAVA_HOME
 Next, go to hbase-site.xml and edit the following:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://eattributes:54310/hbase</value>
    <description>The directory shared by region servers.
      Should be fully qualified to include the filesystem to use,
      e.g. hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
    </description>
  </property>
  <!--
  <property>
    <name>hbase.master</name>
    <value>master:60000</value>
    <description>The host and port that the HBase master runs at.
    </description>
  </property>
  -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in: true for fully distributed.
    </description>
  </property>
</configuration>

Starting the HBase shell:
$ hbase shell
hbase(main):001:0> list
TABLE
0 row(s) in 0.5710 seconds

General HBase shell commands:
 Show cluster status. Can be 'summary', 'simple', or 'detailed'. The default is 'summary'.
hbase> status
hbase> status 'simple'
hbase> status 'summary'
hbase> status 'detailed'
hbase> version
hbase> whoami

Table management commands:

Create a table:
hbase(main):002:0> create 'mytable', 'cf'
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds

WRITING DATA:
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'

READING DATA:
hbase(main):007:0> get 'mytable', 'first'
hbase(main):008:0> scan 'mytable'

Describe a table:
hbase(main):003:0> describe 'users'
DESCRIPTION ENABLED
{NAME => 'users', FAMILIES => [{NAME => 'info', true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0330 seconds

Disable:
hbase> disable 'users'
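The put/get/scan session above can be mimicked in a few lines of Python to show the semantics (a toy simulation, not the HBase shell or client): put writes one cell, get fetches one row, and scan walks rows in sorted rowkey order.

```python
# Toy re-creation of the shell session: one table, cells addressed
# by rowkey and 'family:qualifier' column, scans in rowkey order.

tables = {"mytable": {}}

def put(table, row, column, value):
    tables[table].setdefault(row, {})[column] = value

def get(table, row):
    return tables[table].get(row, {})

def scan(table):
    return sorted(tables[table].items())  # rows come back in lexicographic order

put("mytable", "first", "cf:message", "hello HBase")
print(get("mytable", "first"))  # {'cf:message': 'hello HBase'}
print(scan("mytable"))
```

The sorted scan is the important semantic: HBase has no secondary indexes here, so range scans over well-designed rowkeys are the primary access pattern beyond single-row gets.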
disable_all: Disable all tables matching the given regex
hbase> disable_all 'users.*'

is_disabled: Verifies whether the named table is disabled
hbase> is_disabled 'users'

drop: Drop the named table. The table must first be disabled
hbase> drop 'users'

drop_all: Drop all tables matching the given regex
hbase> drop_all 'users.*'

enable:
hbase> enable 'users'

enable_all:
hbase> enable_all 'users.*'

is_enabled:
hbase> is_enabled 'users'

exists:
hbase> exists 'users'

list:
hbase> list
hbase> list 'abc.*'

show_filters: Show all the filters in HBase.
count:
 Count the number of rows in a table. The return value is the number of rows. This operation may take a LONG time (run '$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a counting MapReduce job instead).
 The current count is shown every 1000 rows by default; the count interval may be optionally specified. Scan caching is enabled on count scans by default, with a default cache size of 10 rows. If your rows are small, you may want to increase this parameter.
Examples:
hbase> count 'users'
hbase> count 'users', INTERVAL => 100000
hbase> count 'users', CACHE => 1000
hbase> count 'users', INTERVAL => 10, CACHE => 1000

put:
hbase> put 'users', 'r1', 'c1', 'value', ts1

Configurable block size:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKSIZE => '65536'}

Block cache:
 Some workloads don't benefit from putting data into a read cache: for instance, if a certain table or column family is only accessed by sequential scans, or isn't accessed much and you don't care whether Gets or Scans take a little longer.
 By default, the block cache is enabled. You can disable it at table creation time or by altering the table:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKCACHE => 'false'}

Aggressive caching:
 You can choose some column families to have a higher priority in the block cache (an LRU cache).
 This comes in handy if you expect more random reads on one column family compared to another. This configuration is also done at table-instantiation time:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', IN_MEMORY => 'true'}
The default value for the IN_MEMORY parameter is false.
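The block cache is described above as an LRU cache, and that eviction policy is easy to sketch (an illustration of the policy only; HBase's real block cache is considerably more elaborate, with priority tiers for IN_MEMORY families):

```python
# Sketch of an LRU block cache: reads promote a block to most-recently-used,
# and inserting past capacity evicts the least-recently-used block.
from collections import OrderedDict

class LruBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None                      # cache miss: caller reads from disk
        self.blocks.move_to_end(key)         # promote to most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

cache = LruBlockCache(2)
cache.put("b1", "data1")
cache.put("b2", "data2")
cache.get("b1")                              # touch b1, so b2 becomes the LRU entry
cache.put("b3", "data3")                     # evicts b2
print(cache.get("b2"))  # None
```

This also shows why scan-only column families should disable the cache: a long sequential scan would churn through and evict everything the random-read workloads had warmed.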
Bloom filters:
hbase(main):007:0> create 'mytable', {NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}
 The default value for the BLOOMFILTER parameter is NONE.
 A row-level bloom filter is enabled with ROW, and a qualifier-level bloom filter is enabled with ROWCOL.
 The row-level bloom filter checks for the non-existence of the particular rowkey in the block, and the qualifier-level bloom filter checks for the non-existence of the row and column qualifier combination.
 The overhead of the ROWCOL bloom filter is higher than that of the ROW bloom filter.

TTL (Time To Live):
 You can set the TTL while creating the table like this:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', TTL => '18000'}
This command sets the TTL on the column family colfam1 to 18,000 seconds = 5 hours. Data in colfam1 that is older than 5 hours is deleted during the next major compaction.

Compression:
 You can enable compression on a column family when creating tables like this:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', COMPRESSION => 'SNAPPY'}
Note that data is compressed only on disk. It's kept uncompressed in memory (MemStore or block cache) and while transferring over the network.

Cell versioning:
 Versions are also configurable at the column family level and can be specified at table-instantiation time:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1, TTL => '18000'}
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 5, MIN_VERSIONS => '1'}

Description of a table:
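A toy Bloom filter makes the non-existence check above concrete (a sketch of the data structure in general, not HBase's implementation): it can answer "definitely not present" without touching disk, or "maybe present", and it never produces a false negative.

```python
# Toy Bloom filter: k hash positions per key in a fixed bit array.
# Added keys always test positive; absent keys almost always test negative.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row1")                 # ROW-level: the key is just the rowkey
bf.add("row2:cf:col")          # ROWCOL-level: the key is rowkey + qualifier
print(bf.might_contain("row1"))     # True
print(bf.might_contain("missing"))  # almost certainly False: skip this block
```

The ROW vs ROWCOL overhead difference follows directly: ROWCOL must record one entry per row-and-qualifier combination rather than one per row, so its filter is larger.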
hbase(main):004:0> describe 'follows'
DESCRIPTION ENABLED
{NAME => 'follows', coprocessor$1 => 'file:///U true
users/ndimiduk/repos/hbaseia-twitbase/target/twitbase-
1.0.0.jar|HBaseIA.TwitBase.coprocessors.FollowsObserver|1001|',
FAMILIES => [{NAME => 'f', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}
1 row(s) in 0.0330 seconds

Getting help on a command:
hbase(main):003:0> help 'status'

SPLITTING TABLES:
hbase(main):019:0> split 'mytable', 'G'

ALTERING TABLES:
hbase(main):020:0> alter 't', NAME => 'f', VERSIONS => 1

TRUNCATING TABLES:
hbase(main):023:0> truncate 't'
Truncating 't' table (it may take a while):
- Disabling table...
- Dropping table...
- Creating table...
0 row(s) in 14.3190 seconds

THANK YOU…