Introduction to HBase

  1. Introduction to HBase
       Byeongweon Moon / REDDUCK
       byeongweon.moon@reddduck.com
  2. HBase Key Points
       • Clustered, commodity(-ish) hardware
       • Mostly schema-less
       • Dynamic distribution
       • Spread writes out over the cluster
  3. 3. HBase Distributed database modeled on Bigtable  Bigtable : A Distributed Storage System for Structured Data by Chang et al. Runs on top of Hadoop Core Layers on HDFS for storage Native connections to MapReduce Distributed, High Availability, High Performance, Strong Consistency
  4. HBase (cont.)
       • Column-oriented store
           • Wide table costs only the data stored
           • NULLs in a row are "free"
           • Good compression: columns of similar type
           • Column name is arbitrary
       • Rows stored in sorted order
       • Can random read and write
       • Goal of billions of rows × millions of cells
           • Petabytes of data across thousands of servers
  5. Column-Oriented Storage
  6. !HBase
       • "NoSQL" database
           • No joins
           • No sophisticated query engine
           • No transactions (sort of)
           • No column typing
           • No SQL, no ODBC/JDBC, etc.
       • Not a replacement for RDBMS
       • Matching impedance
  7. Why HBase?
       • Datasets are reaching petabytes
       • Traditional databases are expensive to scale and difficult to distribute
       • Commodity hardware is cheap and powerful
       • Need for random access as well as batch processing (Hadoop alone does not offer random access)
  8. 8. Tables Table is split into roughly equal sized “regions” Each region is a contiguous range of keys Regions split as they grow, thus dynamically adjusting to your data set
  9. Table (cont.)
       • Tables are sorted by row key
       • Table schema defines column families
           • Families consist of any number of columns
           • Columns consist of any number of versions
           • Everything except the table name is byte[]
       • (Table, Row, Family:Column, Timestamp) -> Value
  10. Table (cont.)
       • As a data structure:
         SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )
  11. HBase Open Source Stack
       • ZooKeeper: small-data coordination service
       • HBase: database storage engine
       • HDFS: distributed file system
       • Hadoop: asynchronous MapReduce jobs
  12. Server Architecture
       • Similar to HDFS
           • Master == Namenode
           • Regionserver == Datanode
       • These are often run alongside each other!
       • Difference: HBase stores its state in HDFS
       • HDFS provides robust data storage across machines, insulating against failure
       • Master and Regionserver are fairly stateless and machine-independent
  13. Region Assignment
       • Each region from every table is assigned to a Regionserver
       • Master duties:
           • Responsible for assignment and handling regionserver problems (if any!)
           • When machines fail, move regions
           • When regions split, move regions to balance
           • Could move regions to respond to load
           • Can run multiple backup masters
  14. 14. Master The master does NOT  Handle any write request (not a DB master!)  Handle location finding requests  Not involved in the read/write path  Generally does very little most of the time
  15. Distributed Coordination
       • ZooKeeper is used to manage master election and server availability
       • Set up as a cluster, it provides distributed coordination primitives
       • An excellent tool for building cluster management systems
  16. HBase Architecture
       http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
  17. How Data Is Actually Stored
  18. Write-Ahead Log
       http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
  19. HLog
  20. Demo
  21. HBase - Roadmap
       • HBase 0.92.0
           • Coprocessors
           • Distributed Log Splitting
           • Running Tasks in UI
           • Performance Improvements
       • HBase 0.94.0
           • Security
           • Secondary Indexes
           • Search Integration
           • HFile v2
  22. References
       • http://ofps.oreilly.com/titles/9781449396107/index.html
       • http://hbase.apache.org/book.html#quickstart
       • http://www.larsgeorge.com/2010/02/fosdem-2010-nosql-talk.html
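
The nested-map view of a table on slides 9 and 10 can be made concrete with a small in-memory sketch. This only illustrates the logical model (Row, Family:Column, Timestamp) -> Value. It is not HBase's storage format or its client API. The class ToyTable and its methods put, get, and scan are hypothetical names invented for this example, and Python is used only for brevity.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical in-memory model of (Row, Family:Column, Timestamp) -> Value.
# A cell is a list of (timestamp, value) versions, kept newest first.
Cell = List[Tuple[int, bytes]]


class ToyTable:
    def __init__(self) -> None:
        self._row_keys: List[bytes] = []                  # row keys, kept sorted
        self._rows: Dict[bytes, Dict[bytes, Cell]] = {}   # row -> column -> versions

    def put(self, row: bytes, column: bytes, value: bytes, ts: int) -> None:
        """Store one version of a cell; absent columns cost nothing."""
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)   # newest first

    def get(self, row: bytes, column: bytes,
            ts: Optional[int] = None) -> Optional[bytes]:
        """Return the newest value at or before ts (the latest if ts is None)."""
        for version_ts, value in self._rows.get(row, {}).get(column, []):
            if ts is None or version_ts <= ts:
                return value
        return None

    def scan(self, start: bytes,
             stop: bytes) -> Iterator[Tuple[bytes, Dict[bytes, bytes]]]:
        """Yield rows with start <= key < stop, in sorted row-key order."""
        lo = bisect.bisect_left(self._row_keys, start)
        hi = bisect.bisect_left(self._row_keys, stop)
        for key in self._row_keys[lo:hi]:
            columns = sorted(self._rows[key].items())
            yield key, {col: versions[0][1] for col, versions in columns}
```

Because rows are kept in sorted key order, a range scan is a slice of the key list, and a read with a timestamp returns the newest version that is not newer than that timestamp.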

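
The region behavior described on slides 8 and 13 (each region is a contiguous key range, and regions split as they grow) can be sketched in the same toy style. Everything here is an assumption made for illustration: ToyRegions, locate, and put are hypothetical names, and the split threshold is a row count chosen to keep the example small. It is not a description of how HBase decides when or where to split.

```python
import bisect
from typing import List


class ToyRegions:
    """Hypothetical model: region i holds keys in [starts[i], starts[i + 1])."""

    def __init__(self, max_rows_per_region: int = 4) -> None:
        self.starts: List[bytes] = [b""]      # first region starts at the empty key
        self.rows: List[List[bytes]] = [[]]   # sorted row keys held by each region
        self.max_rows = max_rows_per_region   # assumed split threshold (row count)

    def locate(self, key: bytes) -> int:
        """Return the index of the region whose key range contains key."""
        return bisect.bisect_right(self.starts, key) - 1

    def put(self, key: bytes) -> None:
        """Add a row key to its region, splitting the region if it grows too big."""
        i = self.locate(key)
        region = self.rows[i]
        pos = bisect.bisect_left(region, key)
        if pos == len(region) or region[pos] != key:
            region.insert(pos, key)
        if len(region) > self.max_rows:
            self._split(i)

    def _split(self, i: int) -> None:
        """Split region i at its middle key into two contiguous regions."""
        region = self.rows[i]
        mid = len(region) // 2
        self.starts.insert(i + 1, region[mid])
        self.rows[i:i + 1] = [region[:mid], region[mid:]]
```

Locating a key is a binary search over the sorted region start keys, which is why contiguous, sorted key ranges make lookups cheap even as the number of regions grows.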