Atilim University
Big Data Analytics
Dr. Ziya Karakaya
Mirwais Doost
AGENDA
• What is HBase
• HBase Features
• Applications of HBase
• HBase vs RDBMS
• HBase Storage
• HBase Architectural Components
What is HBase
Introduction to HBase
In the past, data volumes were smaller and the data was mostly structured.
Structured data: this data could easily be stored in a relational database (RDBMS).
Then the Internet evolved, and huge volumes of structured and semi-structured data were generated.
Semi-structured data: storing and processing this data in an RDBMS became a major problem.
Solution
Apache HBase was the solution for storing and processing this semi-structured data.
What is HBase?
HBase is a column-oriented database management system modeled on Bigtable, Google's NoSQL database. It runs on top of HDFS.
1. Open-source project that is horizontally scalable
2. NoSQL database written in Java that provides fast querying
3. Well suited to sparse datasets (which can contain missing or NA values)
Applications of HBase
Medical
• HBase is used for storing genome sequences
• Storing the disease history of people or an area
E-Commerce
• HBase is used for storing logs about customer search history
• Performs analytics and targeted advertising for better business insights
Sports
• HBase stores match details and the history of each match
• Uses this data for better prediction
HBase vs RDBMS
HBase
• Does not have a fixed schema (schema-less); defines only column families
• Works well with structured and semi-structured data
• Can hold denormalized data (which can contain missing or NA values)
• Built for wide tables that can be scaled horizontally
RDBMS
• Has a fixed schema that describes the structure of the tables
• Works well with structured data
• Typically stores normalized data
• Built for tables that are hard to scale
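HBase Storage
HBase column-oriented storage
Row id | Column Family 1       | Column Family 2       | Column Family 3
       | Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 | Col 7 | Col 8 | Col 9
Rows shown: Row 1, Row 2, Row 3
Legend: Row id is the row key; Column Family 1 to 3 are column families; Col 1 to Col 9 are column qualifiers; each row and column intersection is a cell.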
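HBase column-oriented storage
Row id (row key) | Personal data (column family) | Professional data (column family)
empid            | name   | city    | age        | Designation        | salary
1                | Angela | Chicago | 31         | Big Data Architect | $70,000
2                | Dwayne | Boston  | 35         | Web Developer      | $65,000
3                | David  | Seattle | 29         | Data Analytics     | $55,000
Legend: empid is the row key; name, city, age, Designation and salary are column qualifiers; each value is a cell.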
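The row key, column family, column qualifier and cell layout above can be pictured as nested maps. The following is only a minimal sketch of that logical model in plain Python, not HBase's storage format or API. The names `employee_table` and `get_cell` are hypothetical, and row "4" is an invented row added only to show a sparse entry.

```python
# Logical view only: row key -> column family -> column qualifier -> cell value.
# `employee_table` and `get_cell` are illustrative names, not HBase identifiers.
employee_table = {
    "1": {
        "personal": {"name": "Angela", "city": "Chicago", "age": "31"},
        "professional": {"designation": "Big Data Architect", "salary": "$70,000"},
    },
    "2": {
        "personal": {"name": "Dwayne", "city": "Boston", "age": "35"},
        "professional": {"designation": "Web Developer", "salary": "$65,000"},
    },
    "3": {
        "personal": {"name": "David", "city": "Seattle", "age": "29"},
        "professional": {"designation": "Data Analytics", "salary": "$55,000"},
    },
}


def get_cell(table, row_key, family, qualifier):
    """Return the cell value, or None if that cell was never written."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


# Hypothetical sparse row: only one cell exists, nothing is stored for the rest.
employee_table["4"] = {"personal": {"name": "Maria"}}

print(get_cell(employee_table, "1", "professional", "designation"))
print(get_cell(employee_table, "4", "professional", "salary"))
```

Missing cells simply do not exist in the map, which is why this kind of layout suits sparse datasets.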
HBase Architectural Components
• HMaster: the HBase Master assigns regions and performs load balancing
• Apache ZooKeeper: used for monitoring
• Region Server (three shown): each has an HLog and hosts Regions; each Region contains a Store made up of Memory and HFiles
• HDFS: the storage layer underneath the Region Servers
HBase Architectural Components - Regions
• HBase tables are divided horizontally by row key range into "Regions"
• A region contains all rows in the table between the region's start key and end key
• Regions are assigned to the nodes in the cluster, called "Region Servers"
• These servers serve data for reads and writes
[Diagram: Region Server 1 hosts Region 1 and Region 2; Region Server 2 hosts Region 3 and Region 4; each region holds the rows between its startKey and endKey; a client sends a get request to a Region Server.]
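Because each region covers a contiguous row key range, finding the region for a key amounts to a search over sorted start keys. The sketch below illustrates that idea with Python's standard `bisect` module. The boundary keys ("g", "n", "t") and the function names are assumptions made up for this example, and the region-to-server mapping mirrors the diagram.

```python
from bisect import bisect_right

# Hypothetical boundaries: each region covers [start_key, next start_key).
# An empty start key means "from the beginning of the table".
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["Region 1", "Region 2", "Region 3", "Region 4"]
REGION_TO_SERVER = {
    "Region 1": "Region Server 1",
    "Region 2": "Region Server 1",
    "Region 3": "Region Server 2",
    "Region 4": "Region Server 2",
}


def find_region(row_key):
    """Return the region whose key range contains row_key."""
    # The last start key that is <= row_key identifies the region.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]


def find_server(row_key):
    """Return the Region Server that serves reads and writes for row_key."""
    return REGION_TO_SERVER[find_region(row_key)]


print(find_region("apple"), "on", find_server("apple"))
print(find_region("zebra"), "on", find_server("zebra"))
```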
HBase Architectural Components - HMaster
• Region assignment and Data Definition Language operations (create, delete) are handled by the HMaster
• The HMaster assigns and re-assigns regions for recovery or load balancing, and monitors all servers
[Diagram: the client sends create, delete and update table requests to the HMaster, which monitors the Region Servers and assigns regions to them.]
HBase is a distributed environment in which the HMaster alone is not sufficient to manage everything; hence, ZooKeeper was introduced.
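Region assignment and re-assignment can be illustrated with a toy balancer. This is a simplified sketch under the assumption that "load" means the number of regions per server. It is not the algorithm HBase's balancer uses, and the function names are hypothetical.

```python
def assign_regions(regions, servers):
    """Assign each region to the server that currently holds the fewest regions."""
    assignment = {server: [] for server in servers}
    for region in regions:
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


def reassign_after_failure(assignment, failed_server):
    """Move the failed server's regions onto the surviving servers."""
    orphaned = assignment.pop(failed_server)
    survivors = list(assignment)
    for region in orphaned:
        target = min(survivors, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


regions = ["Region 1", "Region 2", "Region 3", "Region 4"]
servers = ["Region Server 1", "Region Server 2"]

assignment = assign_regions(regions, servers)
print(assignment)

# Recovery: Region Server 2 is reported as failed, so its regions are moved.
assignment = reassign_after_failure(assignment, "Region Server 2")
print(assignment)
```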
HBase Architectural Components - ZooKeeper
• ZooKeeper is a distributed coordination service that maintains server state in the cluster
• ZooKeeper tracks which servers are alive and available, and provides server failure notification
[Diagram: the active HMaster, the inactive HMaster and the Region Servers send heartbeats to ZooKeeper.]
The active HMaster sends a heartbeat signal to ZooKeeper to indicate that it is active, and the Region Servers send their status to ZooKeeper to indicate that they are ready for read and write operations.
The inactive HMaster acts as a backup: if the active HMaster fails, the inactive HMaster takes over.
HBase Architectural Components work together
• HMaster: one master is active
• ZooKeeper: active HMaster selection and Region Server sessions
[Diagram: the HMaster and the Region Servers each send heartbeats to ZooKeeper, which keeps an ephemeral node for each of them.]
The active HMaster and the Region Servers each connect to ZooKeeper with a session. ZooKeeper maintains ephemeral nodes for active sessions, kept alive via heartbeats, to indicate that the servers are up and running.
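The interplay of sessions, heartbeats, ephemeral nodes and master failover can be sketched with a toy coordination service. This is not the ZooKeeper API: the `CoordinationService` class, its methods, the node names and the use of explicit integer timestamps are all assumptions chosen to keep the example deterministic.

```python
class CoordinationService:
    """Toy stand-in for ZooKeeper: ephemeral nodes vanish when heartbeats stop."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}   # ephemeral node name -> time of last heartbeat
        self.active_master = None

    def heartbeat(self, node, now):
        """Create or refresh the ephemeral node for this session."""
        self.last_heartbeat[node] = now

    def try_become_master(self, node, now):
        """The first candidate becomes active; later ones stay inactive (backup)."""
        self.heartbeat(node, now)
        if self.active_master is None:
            self.active_master = node
        return self.active_master == node

    def expire_sessions(self, now):
        """Delete ephemeral nodes whose session timed out and return their names."""
        expired = [name for name, seen in self.last_heartbeat.items()
                   if now - seen > self.session_timeout]
        for name in expired:
            del self.last_heartbeat[name]
            if name == self.active_master:
                self.active_master = None
        return expired

    def live_nodes(self):
        return sorted(self.last_heartbeat)


zk = CoordinationService(session_timeout=3)
zk.try_become_master("master-a", now=0)   # becomes the active master
zk.try_become_master("master-b", now=0)   # stays inactive as a backup
zk.heartbeat("regionserver-1", now=0)

# master-a stops sending heartbeats; the others keep their sessions alive.
zk.heartbeat("master-b", now=4)
zk.heartbeat("regionserver-1", now=4)
expired = zk.expire_sessions(now=4)

# The backup notices that the active master is gone and takes over.
took_over = zk.try_become_master("master-b", now=5)
print(expired, zk.active_master, zk.live_nodes())
```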
HBase Read or Write
There is a special HBase catalog table called the META table, which holds the location of the regions in the cluster. The location of the META table itself is stored in ZooKeeper.
Here is what happens the first time a client reads or writes data to HBase:
The client requests the location of the META table from ZooKeeper, which returns the Region Server that hosts the META table.
[Diagram: Client, ZooKeeper, and two Region Servers, each running on a DataNode.]
The client then queries the META table to get the Region Server that serves the row key it wants to access.
The client caches this information, along with the META table location, in its meta cache.
Finally, the client gets the row from, or puts the row to, the corresponding Region Server.
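The lookup sequence (ZooKeeper, then META, then the Region Server, with client-side caching) can be sketched as follows. The code is a toy in-memory model rather than the HBase client API. `ToyCluster`, `ToyClient`, the split key "n" and the request counters are hypothetical and exist only to make the caching visible.

```python
class ToyCluster:
    """Hypothetical in-memory stand-in for ZooKeeper, META and the Region Servers."""

    def __init__(self):
        self.zookeeper = {"meta_location": "Region Server 1"}
        # META entries: (start_key, end_key, server); end_key None = end of table.
        self.meta = [("", "n", "Region Server 1"), ("n", None, "Region Server 2")]
        self.rows = {"Region Server 1": {}, "Region Server 2": {}}
        self.zookeeper_requests = 0
        self.meta_requests = 0


class ToyClient:
    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_location = None   # cached location of the META table
        self.region_cache = []      # cached (start_key, end_key, server) entries

    def _locate(self, row_key):
        # A cached region range answers the question without any remote lookup.
        for start, end, server in self.region_cache:
            if start <= row_key and (end is None or row_key < end):
                return server
        # First access only: ask ZooKeeper which server hosts the META table.
        if self.meta_location is None:
            self.cluster.zookeeper_requests += 1
            self.meta_location = self.cluster.zookeeper["meta_location"]
        # Ask the META table which Region Server owns the row key, then cache it.
        self.cluster.meta_requests += 1
        for start, end, server in self.cluster.meta:
            if start <= row_key and (end is None or row_key < end):
                self.region_cache.append((start, end, server))
                return server
        raise KeyError(row_key)

    def put(self, row_key, value):
        self.cluster.rows[self._locate(row_key)][row_key] = value

    def get(self, row_key):
        return self.cluster.rows[self._locate(row_key)].get(row_key)


cluster = ToyCluster()
client = ToyClient(cluster)
client.put("apple", "v1")   # first access: ZooKeeper lookup plus META lookup
client.get("apple")         # served from the client's cache
client.put("zebra", "v2")   # new region: META lookup only
print(cluster.zookeeper_requests, cluster.meta_requests)
```

After the first access, later requests for keys in an already known region go straight to the Region Server.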
HBase Meta Table
• The META table is a special HBase catalog table that maintains a list of all the regions in the HBase storage system
• The META table is used to find the region for a given table key
META table structure:
Row key            | Value
table, key, region | region server
[Diagram: a Region Server hosts the META table alongside Regions 1 to 4.]
HBase Write Mechanism
1. When a client issues a put request, the data is first written to the write-ahead log (WAL).
The Write Ahead Log (WAL) is a file used to store new data that has not yet been put on permanent storage. It is used for recovery in case of failure.
[Diagram: the Client writes to a Region Server on an HDFS DataNode; the server holds the WAL and a Region with two MemStores, each backed by HFiles.]
2. Once the data is written to the WAL, it is copied to the MemStore.
The MemStore is the write cache that stores new data that has not yet been written to disk. There is one MemStore per column family per region.
3. Once the data is placed in the MemStore, the client receives the acknowledgment (ACK).
4. When the MemStore reaches its threshold, it flushes (commits) the data into an HFile.
HFiles store the rows of data as sorted KeyValues on disk.
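The four write steps can be traced in a small simulation. This is only a sketch: `ToyRegion` is a hypothetical class, the flush threshold is counted in entries rather than bytes to keep the example short, and the flush runs inline rather than in the background.

```python
class ToyRegion:
    """Toy model of the write path: WAL, MemStore, ACK, flush to an HFile."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold   # assumption: number of entries
        self.wal = []        # append-only log used for recovery
        self.memstore = {}   # in-memory write cache
        self.hfiles = []     # each HFile is a list of (key, value) sorted by key

    def put(self, row_key, value):
        self.wal.append((row_key, value))    # step 1: write to the WAL
        self.memstore[row_key] = value       # step 2: copy into the MemStore
        ack = "ACK"                          # step 3: acknowledge the client
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                     # step 4: threshold reached
        return ack

    def flush(self):
        """Write the MemStore contents to a new HFile, sorted by key."""
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}


region = ToyRegion(flush_threshold=2)
first_ack = region.put("row-2", "b")
second_ack = region.put("row-1", "a")   # reaches the threshold and triggers a flush
print(first_ack, second_ack, region.hfiles)
```

The rows arrive out of order but end up sorted in the HFile, which matches the sorted KeyValue layout described above.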
HBase Features
• Scalable: data can be scaled across various nodes because it is stored in HDFS
• Automatic failure support: the Write Ahead Log provides automatic support against failures across the cluster
• Consistent reads and writes: HBase provides consistent reads and writes of data
• Java API for client access: provides an easy-to-use Java API for clients
Many thanks for your attention!
