HBase Intro. Anty.Rao, July 13, 2012. Big Data Engineering Team, Hanborq Inc.

HBase Introduction


Introduction of HBase, for training.

Published in: Technology

Transcript of "HBase Introduction"

  1. HBase Intro. Anty.Rao, July 13, 2012. Big Data Engineering Team, Hanborq Inc.
  2. Outline
     • What is HBase
     • Data Model
     • Physical Structures
     • HBase Architecture
     • Q/A
  3. Apache HBase
     HBase is an open source, distributed, sorted map modeled after Google’s BigTable.
  4. Why HBase
     • HDFS
       – Files in HDFS are immutable; updates are not supported
     • HBase = HDFS + random read/write
     • HBase uses HDFS for storage
     • “Log-structured merge tree”
       – Similar to “log-structured file systems”
       – Same storage pattern as Cassandra
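
The log-structured merge idea on slide 4 can be illustrated with a toy store. This is only a sketch under stated assumptions, not HBase's implementation: the class `TinyLSM`, its `flush_threshold` parameter, and the in-memory tuples standing in for immutable files on HDFS are all hypothetical. Writes land in a mutable buffer, the buffer is flushed as an immutable sorted run, and reads consult the newest data first.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store (hypothetical, for illustration only)."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}   # newest writes: mutable, in memory
        self.files = []      # immutable sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Random writes only ever touch the in-memory buffer.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the buffer as one sorted run that is never modified again
        # (a tuple here stands in for an immutable file).
        if self.memstore:
            self.files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Newest data wins: buffer first, then runs from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for run in reversed(self.files):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one; newer runs overwrite older ones.
        merged = {}
        for run in self.files:
            merged.update(run)
        self.files = [tuple(sorted(merged.items()))] if merged else []
```

An "update" never rewrites an existing run; it is simply a newer entry that shadows the older one until compaction merges them.
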
  5. Data Model
     • Tables are sorted by row key
     • The table schema defines only its column families
       – Each family consists of any number of columns
       – Each column consists of any number of versions
       – Columns exist only when inserted; NULLs are free
       – Columns within a family are sorted and stored together
     • Everything except the table name is a byte[]
     • (Row, Family:Column, Timestamp) → Value
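
As a rough illustration of the data model on slide 5, the mapping (Row, Family:Column, Timestamp) → Value can be sketched as nested maps. The class `TinyTable` and its method names are hypothetical and are not the HBase client API; the sketch assumes byte strings for rows, columns, and values, and integer timestamps.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of (row, family:column, timestamp) -> value (hypothetical)."""

    def __init__(self, families):
        # The schema names only the column families.
        self.families = set(families)
        # row -> b"family:qualifier" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        family = column.split(b":", 1)[0]
        if family not in self.families:
            raise KeyError("unknown column family")
        # Columns come into existence on first insert; absent columns cost nothing.
        self.rows[row][column][timestamp] = value

    def get(self, row, column, max_versions=1):
        # Return up to max_versions (timestamp, value) pairs, newest first.
        versions = self.rows.get(row, {}).get(column, {})
        return sorted(versions.items(), reverse=True)[:max_versions]

    def scan(self):
        # Rows in sorted order, columns sorted within a row, newest version first.
        for row in sorted(self.rows):
            for column in sorted(self.rows[row]):
                for ts, value in sorted(self.rows[row][column].items(), reverse=True):
                    yield row, column, ts, value
```
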
  6. Operators
     • Operations are based on row keys
     • Operations:
       – Put
       – Get
       – Scan
       – Delete
         • Just writes a tombstone marker
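
The note that a delete is just a tombstone marker can be sketched as follows. This is a simplified, hypothetical model (`TinyCells` and `TOMBSTONE` are invented names): a delete appends a marker that hides every version at or below its timestamp, and the hidden data is only physically dropped when the log is compacted.

```python
TOMBSTONE = object()  # sentinel marking a delete (hypothetical)


class TinyCells:
    """Append-only cell log with tombstone deletes (illustrative sketch)."""

    def __init__(self):
        self.log = []  # entries: (row, timestamp, value or TOMBSTONE)

    def put(self, row, ts, value):
        self.log.append((row, ts, value))

    def delete(self, row, ts):
        # Nothing is erased; a marker is appended instead.
        self.log.append((row, ts, TOMBSTONE))

    def _cutoff(self, row):
        # Newest tombstone timestamp for this row, or None if never deleted.
        stamps = [ts for r, ts, v in self.log if r == row and v is TOMBSTONE]
        return max(stamps, default=None)

    def _live(self, row):
        cutoff = self._cutoff(row)
        return [(r, ts, v) for r, ts, v in self.log
                if r == row and v is not TOMBSTONE
                and (cutoff is None or ts > cutoff)]

    def get(self, row):
        # Newest version that is not masked by a tombstone.
        live = self._live(row)
        return max(live, key=lambda e: e[1])[2] if live else None

    def compact(self):
        # Only now are masked versions and the markers themselves removed.
        kept = []
        for row in sorted({r for r, _, _ in self.log}):
            kept.extend(self._live(row))
        self.log = kept
```
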
  7. How a row is physically stored: KeyValue

     Row Key       Column Key          Timestamp   Cell
     com.cnn.www   anchor:cnnsi.com    t9          CNN
     com.cnn.www   anchor:my.look.ca   t8          CNN.com
     com.cnn.www   contents:           t6          <html>…
     com.cnn.www   contents:           t5          <html>…
     com.cnn.www   contents:           t3          <html>…
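
The ordering of the KeyValue entries on slide 7 can be reproduced with a sort key: ascending row key, ascending column key, then descending timestamp so the newest version of a cell comes first. The function name `keyvalue_sort_key` and the integer timestamps are assumptions made for this sketch; the cell values are placeholders taken from the slide.

```python
def keyvalue_sort_key(kv):
    """Order by row, then column, then newest timestamp first."""
    row, column, timestamp, _value = kv
    return (row, column, -timestamp)


# Cells from the slide, deliberately listed out of order.
cells = [
    ("com.cnn.www", "contents:", 3, "<html>..."),
    ("com.cnn.www", "anchor:my.look.ca", 8, "CNN.com"),
    ("com.cnn.www", "contents:", 6, "<html>..."),
    ("com.cnn.www", "anchor:cnnsi.com", 9, "CNN"),
    ("com.cnn.www", "contents:", 5, "<html>..."),
]

ordered = sorted(cells, key=keyvalue_sort_key)
```

Because versions of the same cell sit next to each other with the newest first, a reader looking for the current value can stop at the first match.
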
  8. How data is physically stored: HFile
     http://www.slideshare.net/schubertzhang/hfile-a-blockindexed-file-format-to-store-sorted-keyvalue-pairs
  9. Data Organization: Region
     • Region: the unit of distribution and availability
     • Regions are split when they grow too large
     • Max region size is a tuning parameter
       – Too low: prevents parallel scalability
       – Too high: makes things slow
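
Region splitting can be sketched as cutting a sorted key range in two once it exceeds a size limit. Everything here is an assumption for illustration: the `Region` class, the `split_if_too_large` function, the byte-count size estimate, and the choice of the median row key as the split point (the slides do not say how HBase picks it).

```python
class Region:
    """A contiguous key range [start_key, end_key); "" means unbounded (hypothetical)."""

    def __init__(self, start_key, end_key, rows=None):
        self.start_key = start_key
        self.end_key = end_key
        self.rows = dict(rows or {})

    def size(self):
        # Crude size estimate: total characters in keys and values.
        return sum(len(k) + len(v) for k, v in self.rows.items())


def split_if_too_large(region, max_region_size):
    """Return [region] unchanged, or two daughter regions split at the median key."""
    if region.size() <= max_region_size or len(region.rows) < 2:
        return [region]
    keys = sorted(region.rows)
    mid = keys[len(keys) // 2]
    left = Region(region.start_key, mid,
                  {k: region.rows[k] for k in keys if k < mid})
    right = Region(mid, region.end_key,
                   {k: region.rows[k] for k in keys if k >= mid})
    return [left, right]
```

The two daughters cover exactly the parent's key range, so each can be assigned to a different region server.
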
  10. Read/Write Path
  11. Architecture Overview
  12. Write-Ahead-Log Flow
  13. Three Major Components
      • Master
      • HRegionServer
      • Client
  14. Master
      • Master duties
        – Bootstrapping and the initial bulk assignment of regions
        – Load balancing
        – Splitting WALs and assigning regions
        – Bringing the regions of crashed servers back online
      • What the Master does not do
        – Does not handle any write requests (it is not a DB master)
        – Does not handle location-finding requests
        – Is not involved in the read/write path
        – Even when the master(s) are down, the cluster can still respond to read/write requests
        – Generally does very little most of the time
  15. Master is stateless
      • All data and state information is stored in HDFS and ZooKeeper
      • The Master is not a SPOF!
  16. HRegionServer
      • Sends heartbeats (load info) to the Master
      • Write requests
      • Read requests
      • Flush
      • Compaction
      • Region splits
  17. HBase Client
      • Caches write requests
      • Looks up the region server location when writing and reading
        – First locates -ROOT-
        – Then the .META. region
        – Then the user region
      • Makes an RPC call to the region server
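
The client-side lookup on slide 17 can be sketched as a two-step catalog search followed by caching. This is a minimal sketch, not the real client: the catalog contents in `ROOT` and `META`, the server names, and the `Client` class are all hypothetical. The catalog tables were named -ROOT- and .META. in HBase releases of that era.

```python
import bisect

# Hypothetical catalog contents, keyed by region start key ("" = first region).
ROOT = [("", "rs1")]                          # -ROOT-: server hosting each .META. region
META = {"rs1": [("", "rs2"), ("m", "rs3")]}   # .META.: server hosting each user region


def find(entries, row):
    """Return the entry with the greatest start key that is <= row."""
    starts = [start for start, _ in entries]
    return entries[bisect.bisect_right(starts, row) - 1]


class Client:
    def __init__(self):
        self.cache = []            # cached (start_key, end_key, server) triples
        self.catalog_lookups = 0   # how often the catalog had to be consulted

    def locate(self, row):
        # Serve from the cache when a known region covers the row.
        for start, end, server in self.cache:
            if start <= row and (end == "" or row < end):
                return server
        self.catalog_lookups += 1
        _, meta_server = find(ROOT, row)       # step 1: -ROOT- points at .META.
        regions = META[meta_server]
        start, server = find(regions, row)     # step 2: .META. points at the user region
        idx = regions.index((start, server))
        end = regions[idx + 1][0] if idx + 1 < len(regions) else ""
        self.cache.append((start, end, server))
        return server                          # step 3: the RPC would go to this server
```

After the first lookup for a region, later reads and writes to rows in the same key range skip the catalog entirely, which keeps the catalog tables off the hot path.
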
  18. Q/A
  19. THANK YOU! Anty.Rao (ant.rao@gmail.com)
