Intro to HBase                      Alex Baranau, Sematext International, 2012Monday, July 9, 12
About Me                     Software Engineer at Sematext International                     http://blog.sematext.com/auth...
Agenda                     What is HBase?                     How to use HBase?                     When to use HBase?Mond...
What is HBase?Monday, July 9, 12
What: HBase is...                     Open-source non-relational distributed                     column-oriented database ...
What HBase is NOT                     Not an SQL database                     Not relational                     No joins ...
What: Features-1                     Linear scalability, capable of                     storing hundreds of terabytes of d...
What: Part of Hadoop                                 ecosystem                        Provides realtime random read/write ...
What: Features-2                     Integrates nicely with Hadoop MapReduce (both                     as source and desti...
How to use HBase?Monday, July 9, 12
How: the Data                         Row keys uninterpreted byte arrays                         Columns grouped in column...
How: Writing the Data                      Row updates are atomic                      Updates across multiple rows are NO...
How: Reading the Data                      Reader will always read the last written (and committed)                      v...
How: MapReduce Integration                     Out of the box integration with Hadoop                     MapReduce       ...
How: Sharding the Data                      Automatic and configurable sharding of                      tables:           ...
How: Setup: Components                      HBase components                                              ZooKeeper       ...
How: Setup: Hadoop Cluster                         Typical Hadoop+HBase setup                                             ...
How: Setup: Automatic Failover                     DataNode failures handled by HDFS                     (replication)    ...
When to Use HBase?Monday, July 9, 12
When: What HBase is good at                     Serving large amount of data: built                     to scale from the ...
When: HBase vs ...                     Favors consistency over availability                     Part of a Hadoop ecosystem...
When: Use-cases                     Audit logging systems                       track user actions                       a...
When: Use-cases                     Real-time analytics, OLAP                       real-time counters                    ...
When: Use-cases                     Monitoring system exampleMonday, July 9, 12
When: Use-cases                     Messages-centered systems                       twitter-like messages/statuses        ...
Future                     Making stable enough to substitute                     RDBMS in mission critical cases         ...
Qs?                     (next: Intro into HBase Internals)                            Sematext is hiring!Monday, July 9, 12
Upcoming SlideShare
Loading in...5
×

Intro to HBase

19,751

Published on

Introduction to HBase briefly covers the following topics: what is HBase, how and when to used it.

Published in: Technology, Education
4 Comments
35 Likes
Statistics
Notes
No Downloads
Views
Total Views
19,751
On Slideshare
0
From Embeds
0
Number of Embeds
15
Actions
Shares
0
Downloads
619
Comments
4
Likes
35
Embeds 0
No embeds

No notes for slide

Intro to HBase

  1. 1. Intro to HBase Alex Baranau, Sematext International, 2012Monday, July 9, 12
  2. 2. About Me Software Engineer at Sematext International http://blog.sematext.com/author/abaranau @abaranau http://github.com/sematext (abaranau)Monday, July 9, 12
  3. 3. Agenda What is HBase? How to use HBase? When to use HBase?Monday, July 9, 12
  4. 4. What is HBase?Monday, July 9, 12
  5. 5. What: HBase is... Open-source non-relational distributed column-oriented database modeled after Google’s BigTable. Think of it as a sparse, consistent, distributed, multidimensional, sorted map: labeled tables of rows row consist of key-value cells: (row key, column family, column, timestamp) -> valueMonday, July 9, 12
  6. 6. What HBase is NOT Not an SQL database Not relational No joins No fancy query language and no sophisticated query engine No transactions out-of-the box No secondary indices out-of-the box Not a drop-in replacement for your RDBMSMonday, July 9, 12
  7. 7. What: Features-1 Linear scalability, capable of storing hundreds of terabytes of data Automatic and configurable sharding of tables Automatic failover support Strictly consistent reads and writesMonday, July 9, 12
  8. 8. What: Part of Hadoop ecosystem Provides realtime random read/write access to data stored in HDFS read HBase write Data read write Data Consumer Producer HDFS writeMonday, July 9, 12
  9. 9. What: Features-2 Integrates nicely with Hadoop MapReduce (both as source and destination) Easy Java API for client access Thrift gateway and REST APIs Bulk import of large amount of data Replication across clusters & backup options Block cache and Bloom filters for real-time queries and many more...Monday, July 9, 12
  10. 10. How to use HBase?Monday, July 9, 12
  11. 11. How: the Data Row keys uninterpreted byte arrays Columns grouped in columnfamilies (CFs) CFs defined statically upon table creation Cell is uninterpreted byte array and a timestamp Rows are ordered Different data All values stores as and accessed by separated into CFs byte arrays row key Row Key Data Rows can have geo:{‘country’:‘Belarus’,‘region’:‘Minsk’} different Minsk demography:{‘population’:‘1,937,000’@ts=2011} columns geo:{‘country’:‘USA’,‘state’:’NY’} Cell can have New_York_City demography:{‘population’:‘8,175,133’@ts=2010, multiple ‘population’:‘8,244,910’@ts=2011} versions Data can be Suva geo:{‘country’:‘Fiji’} very “sparse”Monday, July 9, 12
  12. 12. How: Writing the Data Row updates are atomic Updates across multiple rows are NOT atomic, no transaction support out of the box HBase stores N versions of a cell (default 3) Tables are usually “sparse”, not all columns populated in a rowMonday, July 9, 12
  13. 13. How: Reading the Data Reader will always read the last written (and committed) values Reading single row: Get Reading multiple rows: Scan (very fast) Scan usually defines start key and stop key Rows are ordered, easy to do partial key scan Row Key Data ‘login_2012-03-01.00:09:17’ d:{‘user’:‘alex’} ... ... ‘login_2012-03-01.23:59:35’ d:{‘user’:‘otis’} ‘login_2012-03-02.00:00:21’ d:{‘user’:‘david’} Query predicate pushed down via server-side FiltersMonday, July 9, 12
  14. 14. How: MapReduce Integration Out of the box integration with Hadoop MapReduce Data from HBase table can be source for MR job MR job can write data into HBase MR job can write data into HDFS directly and then output files can be very quickly loaded into HBase via “Bulk Loading” functionalityMonday, July 9, 12
  15. 15. How: Sharding the Data Automatic and configurable sharding of tables: Tables partitioned into Regions Region defined by start & end row keys Regions are the “atoms” of distribution Regions are assigned to RegionServers (HBase cluster slaves)Monday, July 9, 12
  16. 16. How: Setup: Components HBase components ZooKeeper ZooKeeper ZooKeeper client HMaster HMaster RegionServer RegionServer RegionServer RegionServer RegionServerMonday, July 9, 12
  17. 17. How: Setup: Hadoop Cluster Typical Hadoop+HBase setup Master Node HDFS NameNode JobTracker MapReduce HBase HMaster RegionServer RegionServer Slave TaskTracker TaskTracker Nodes DataNode DataNode Slave Node Slave NodeMonday, July 9, 12
  18. 18. How: Setup: Automatic Failover DataNode failures handled by HDFS (replication) RSs failures (incl. caused by whole server failure) handled automatically Master re-assignes Regions to available RSs HMaster failover: automatic with multiple HMastersMonday, July 9, 12
  19. 19. When to Use HBase?Monday, July 9, 12
  20. 20. When: What HBase is good at Serving large amount of data: built to scale from the get-go fast random access to the data Write-heavy applications* Append-style writing (inserting/ overwriting new data) rather than heavy read-modify-write operations** * clients should handle the loss of HTable client-side buffer ** see https://github.com/sematext/HBaseHUTMonday, July 9, 12
  21. 21. When: HBase vs ... Favors consistency over availability Part of a Hadoop ecosystem Great community; adopted by tech giants like Facebook, Twitter, Yahoo!, Adobe, etc.Monday, July 9, 12
  22. 22. When: Use-cases Audit logging systems track user actions answer questions/queries like: what are the last 10 actions made by user? row key: userId_timestamp which users logged into system yesterday? row key: action_timestamp_userIdMonday, July 9, 12
  23. 23. When: Use-cases Real-time analytics, OLAP real-time counters interactive reports showing trends, breakdowns, etc time-series databasesMonday, July 9, 12
  24. 24. When: Use-cases Monitoring system exampleMonday, July 9, 12
  25. 25. When: Use-cases Messages-centered systems twitter-like messages/statuses Content management systems serving content out of HBase Canonical use-case: webtable (pages stored during crawling the web) And othersMonday, July 9, 12
  26. 26. Future Making stable enough to substitute RDBMS in mission critical cases Easier system management Performance improvementsMonday, July 9, 12
  27. 27. Qs? (next: Intro into HBase Internals) Sematext is hiring!Monday, July 9, 12
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×