HBASE Overview


  1. 1. HBASE: THE SCALABLE DATA STORE (Sampath Rachakonda)
  2. 2. Agenda: Evolution of HBASE; Overview; Data Model; Architecture; HBase and ZooKeeper.
  3. 3. Evolution of HBASE: File systems (tapes: linear or sequential access; disks: random access, governed by seek time and transfer rate); DBMS; RDBMS; now NoSQL.
  4. 4. Hadoop: It mainly comprises two things, HDFS and MapReduce. HDFS is a scalable, fault-tolerant, high-performance distributed file system that can run on commodity hardware. MapReduce is a software framework for distributed computation. Master/slave architecture. Limitations: batch processing; sequential data look-up; not intended for real-time querying; no support for random access.
  5. 5. NOSQL: Massive data volumes. Schema evolution: a fixed schema is almost impossible to maintain for a web-scale database; with NoSQL, schema changes can be introduced into systems gradually. Extreme query load: the bottleneck is joins.
  6. 6. Why HBASE? Column-oriented store. Distributed: designed to serve large tables. Horizontally scalable. High performance and availability. Storage system. The base goal of HBase is billions of rows, millions of columns, and thousands of versions. Supports random, real-time CRUD operations, unlike HDFS.
  7. 7. Who uses HBase? Facebook, Adobe, Twitter, Yahoo, Meetup, Netflix, and many more.
  8. 8. When to use HBASE? Good for large amounts of data: hundreds of millions or billions of rows. You have to have enough hardware. Large numbers of client requests. Single random selects and range scans by key. Great for variable schemas. Analytical.
  9. 9. HBASE Data Model: Data is stored in tables. Tables contain rows; each row is referenced by a unique key. The key is an array of bytes, so anything can be a key. Rows are made of columns, which are grouped into column families. Data is stored in cells, each identified by row x column family x column. Tables are sorted by row key in lexicographical order.
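One way to picture this data model is as a sorted map from row key to (family, column) to value. The sketch below is a toy in plain Python, not the HBase client API; `ToyTable` and its `put`, `get`, and `scan` methods are hypothetical names chosen for illustration. It mainly shows what byte-wise lexicographic ordering of row keys means for range scans.

```python
from bisect import bisect_left, insort


class ToyTable:
    """Toy in-memory model of an HBase-style table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed by the schema
        self._keys = []                 # row keys, kept in byte-wise sorted order
        self._rows = {}                 # row key -> {(family, column): value}

    def put(self, row, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row not in self._rows:
            insort(self._keys, row)     # keep row keys sorted on insert
            self._rows[row] = {}
        self._rows[row][(family, column)] = value

    def get(self, row, family, column):
        return self._rows.get(row, {}).get((family, column))

    def scan(self, start, stop):
        """Yield row keys in [start, stop), in sorted order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i]
            i += 1


table = ToyTable(families=["user"])
for key in [b"row2", b"row10", b"row1"]:
    table.put(key, "user", "name", b"someone")

# Byte-wise ordering puts b"row10" before b"row2".
print(list(table.scan(b"row1", b"row3")))  # [b'row1', b'row10', b'row2']
```

Because keys compare as bytes rather than as numbers, numeric identifiers used as row keys are often zero-padded so that scan order matches numeric order.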
  10. 10. HBASE Families: Columns are grouped into families and labeled as “family:column”, for example “user:name”. Different features are applied per family: the columns of a family are stored together (HFile/StoreFile), and compression is set per family. The table schema defines its column families. Each family can consist of any number of columns and versions. A column exists only once it is inserted, so NULLs are free. Columns within a family are sorted and stored together.
  11. 11. HBASE Timestamps: Cell values are versioned, and 3 versions are kept by default. Versions are stored in decreasing timestamp order, so a read returns the latest version first, which is the current value. A value is addressed by Table + RowKey + Family + Column + Timestamp, and this index is always unique.
  12. 12. HBASE Cells Example: how values are stored.

      Row Key  Timestamp  Name Family             Address Family
                          first_name  last_name   number  address
      row1     t1         Bob         Smith
               t5                                 10      First Lane
               t10                                30      Other Lane
               t15                                7       Last Street
      row2     t20        Mary        Tompson
               t22                                77      One Street
               t30                    Thompson
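The versioning behaviour in the example can be sketched with a small toy class. This is an illustration in plain Python, not HBase code: `VersionedCell` is a hypothetical name, the timestamps t5, t10, and t15 are replaced by the integers 5, 10, and 15, and the limit of three versions follows the default stated in the slides.

```python
class VersionedCell:
    """Toy model of one cell that keeps a bounded number of versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        self._versions.append((timestamp, value))
        # Newest first, then drop anything beyond the version limit.
        self._versions.sort(key=lambda tv: tv[0], reverse=True)
        del self._versions[self.max_versions:]

    def get(self, as_of=None):
        """Return the newest value, or the newest one at or before `as_of`."""
        for timestamp, value in self._versions:
            if as_of is None or timestamp <= as_of:
                return value
        return None

    def versions(self):
        return list(self._versions)


# row1's address cell from the example, with integer timestamps.
address = VersionedCell()
address.put(5, "First Lane")
address.put(10, "Other Lane")
address.put(15, "Last Street")

print(address.get())          # Last Street (newest version)
print(address.get(as_of=12))  # Other Lane (value as of timestamp 12)
```

Writing a fourth version to this toy cell drops the oldest one, which mirrors the idea that only a bounded number of versions is retained.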
  13. 13. HBASE Architecture: A table is made up of regions. A region is a range of rows stored together; regions split dynamically as they become too big and merge when they are too small. The Master server is responsible for managing the HBase cluster (i.e., the Region Servers). HBase stores its data in HDFS and so relies on HDFS for high availability and fault tolerance. ZooKeeper is used for distributed coordination.
  14. 14. HBASE Architecture: [architecture diagram; not included in this transcript]
  15. 15. HBASE Regions: A region is a range of keys from a start key (inclusive) to an end key (exclusive). Initially there is one region; once added data exceeds the configured maximum size (256 MB by default), the region is split. The number of regions per server varies from 10 to 10,000, depending on the hardware of each region server. Splitting data into regions helps in several ways: fast recovery when a region fails; load balancing when a server is overloaded; splitting is fast.
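Since each region covers a contiguous key range, finding the region for a row key amounts to a binary search over region start keys. The following is a minimal sketch under that assumption, in plain Python; `ToyRegionMap` and its methods are hypothetical names, and real HBase tracks much more per region (server location, state, and so on).

```python
from bisect import bisect_right


class ToyRegionMap:
    """Toy model: a table as a sorted list of region start keys."""

    def __init__(self):
        # One region covering everything; b"" stands for "no lower bound".
        self.start_keys = [b""]

    def locate(self, row_key):
        """Return the index of the region whose range contains row_key."""
        return bisect_right(self.start_keys, row_key) - 1

    def bounds(self, index):
        """Return (start, end) of a region; end is None for the last region."""
        start = self.start_keys[index]
        has_next = index + 1 < len(self.start_keys)
        end = self.start_keys[index + 1] if has_next else None
        return start, end

    def split(self, split_key):
        """Split the region containing split_key into two at split_key."""
        index = self.locate(split_key)
        if self.start_keys[index] == split_key:
            raise ValueError("split key must fall strictly inside a region")
        self.start_keys.insert(index + 1, split_key)


regions = ToyRegionMap()
regions.split(b"m")
regions.split(b"t")

print(regions.locate(b"apple"))  # 0: falls in [b"", b"m")
print(regions.locate(b"m"))      # 1: the start key is inclusive
print(regions.bounds(1))         # (b'm', b't'): the end key is exclusive
```

The split here only records a new boundary, which hints at why splitting can be cheap: no row has to change its key, only the mapping from key ranges to regions changes.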
  16. 16. HBASE Data Storage: When data is added, it is written to the WAL (Write-Ahead Log) and also to memory (the Memstore). When the data in the Memstore exceeds its maximum size, it is flushed to an HFile. The RegionServer still serves reads and writes during the flush, writing new values to the WAL and Memstore. An HFile is essentially a key-value map. Because HDFS doesn't support updates to an existing file, HFiles are immutable. A delete marker is saved to indicate that a record has been removed.
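The write path described on this slide can be sketched as a toy store that appends to a log, updates an in-memory map, and flushes to immutable sorted files. This is an illustrative model in plain Python, not HBase code: `ToyStore` is a hypothetical name, the flush threshold counts entries rather than bytes, and trimming the log on flush is a simplification.

```python
class ToyStore:
    """Toy write path: WAL append, Memstore update, flush to immutable files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # max Memstore entries (toy unit)
        self.wal = []        # append-only log of edits not yet flushed
        self.memstore = {}   # in-memory key -> value
        self.hfiles = []     # each flush adds one immutable, sorted tuple

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log the edit for durability
        self.memstore[key] = value      # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the Memstore out as one sorted, immutable file.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}
        self.wal = []  # simplification: flushed edits no longer need the log

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):   # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=3)
store.put(b"b", 1)
store.put(b"a", 2)
store.put(b"c", 3)   # third entry triggers a flush
store.put(b"a", 9)   # newer value lives in the Memstore, the file is untouched

print(store.get(b"a"))    # 9
print(store.get(b"b"))    # 1
print(len(store.hfiles))  # 1
```

Note that the update to `b"a"` never modifies the flushed file; the newer value simply shadows the older one at read time, which is how immutable files and updates can coexist.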
  17. 17. HBASE Data Storage (Contd.): Periodic compactions are performed to control the number of HFiles and to keep the cluster balanced. Minor compaction: smaller HFiles are merged into larger HFiles; this is fast because the data is already sorted; delete markers are not applied. Major compaction: scans all entries and applies deletes as necessary; merges all HFiles of a region that lie within a column family into a single file.
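The difference between the two kinds of compaction can be sketched with sorted lists standing in for HFiles. This is a simplified model in plain Python, not HBase code: entries are hypothetical `(key, timestamp, value)` tuples, `TOMBSTONE` stands in for a delete marker, and the major compaction here keeps only the newest version of each key.

```python
import heapq

TOMBSTONE = object()  # stands in for a delete marker


def _order(entry):
    key, timestamp, _value = entry
    return (key, -timestamp)  # by key, newest version first


def minor_compact(hfiles):
    """Merge several sorted files into one; delete markers are carried along."""
    return list(heapq.merge(*hfiles, key=_order))


def major_compact(hfiles):
    """Merge all files, keep the newest entry per key, and apply deletes."""
    result = []
    last_key = None
    for key, timestamp, value in heapq.merge(*hfiles, key=_order):
        if key == last_key:
            continue              # an older version of a key already handled
        last_key = key
        if value is not TOMBSTONE:
            result.append((key, timestamp, value))
    return result


older = [(b"a", 1, "x"), (b"b", 1, "y")]
newer = [(b"a", 2, TOMBSTONE), (b"c", 2, "z")]

print(len(minor_compact([older, newer])))  # 4: nothing is dropped
print(major_compact([older, newer]))       # [(b'b', 1, 'y'), (b'c', 2, 'z')]
```

Because every input file is already sorted, both functions are a single streaming merge; the major compaction is the only one that can discard the deleted row `b"a"` and its marker, since it sees every file at once.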
  18. 18. HBASE Master: Manages regions and their locations: assigns regions, balances the workload, recovers regions if a region server becomes unavailable, and uses ZooKeeper as its distributed coordination service. Clients communicate directly with Region Servers. Performs schema management and changes: adding/removing tables and column families.
  19. 19. HBASE and ZooKeeper: HBase uses ZooKeeper for region assignments. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It offers a file-like API that performs operations on directories and files (znodes). Clients connect to ZooKeeper with a session; the session is maintained via heartbeats. Clients listening for updates are notified of deleted nodes and new nodes.
  20. 20. HBASE and ZooKeeper (Contd.): Each region server creates an ephemeral node. The Master monitors these nodes to discover available region servers and to detect server failures. ZooKeeper is also used to make sure that only one master is registered. HBase cannot run in distributed mode without ZooKeeper.
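The role of sessions and ephemeral nodes can be illustrated with a small simulation. This is a toy in plain Python, not the ZooKeeper API: `ToyCoordinator`, the node paths such as `/master` and `/rs/rs-1`, and the integer clock are all assumptions made for the sketch.

```python
class ToyCoordinator:
    """Toy stand-in for ZooKeeper-style sessions and ephemeral nodes."""

    def __init__(self, session_timeout=3):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # session id -> time of last heartbeat
        self.ephemeral = {}       # node path -> owning session id

    def heartbeat(self, session, now):
        self.last_heartbeat[session] = now

    def create_ephemeral(self, path, session):
        """Create the node if absent; return True if this session now owns it."""
        if path in self.ephemeral:
            return False
        self.ephemeral[path] = session
        return True

    def expire_sessions(self, now):
        """Drop sessions that missed their heartbeat; return the removed paths."""
        dead = {s for s, t in self.last_heartbeat.items()
                if now - t > self.session_timeout}
        removed = sorted(p for p, s in self.ephemeral.items() if s in dead)
        for path in removed:
            del self.ephemeral[path]
        for session in dead:
            del self.last_heartbeat[session]
        return removed


zk = ToyCoordinator(session_timeout=3)
for session in ("master-1", "master-2", "rs-1", "rs-2"):
    zk.heartbeat(session, now=0)

print(zk.create_ephemeral("/master", "master-1"))  # True: first master wins
print(zk.create_ephemeral("/master", "master-2"))  # False: only one master
zk.create_ephemeral("/rs/rs-1", "rs-1")
zk.create_ephemeral("/rs/rs-2", "rs-2")

# rs-2 stops sending heartbeats; everyone else checks in at time 4.
for session in ("master-1", "master-2", "rs-1"):
    zk.heartbeat(session, now=4)
print(zk.expire_sessions(now=4))  # ['/rs/rs-2']
```

In this model a region server failure shows up as its node disappearing, which is the signal a master would watch for, and the "create if absent" rule on a single path is what keeps a second master from registering.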
  21. 21. HBASE Access: HBase Shell. Native Java API: the fastest and most capable option. Avro Server: requires running an Avro server. HBql: SQL-like syntax for HBase.