
HBase introduction talk


This is the introductory presentation on HBase given by Hayden Marchant at the monthly Amobee Tech Talk.

In this session, we'll learn about HBase, a NoSQL database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns.

HBase is an open-source, non-relational, distributed, column-oriented database. It is linearly scalable and designed to run on commodity hardware. HBase clusters can span hundreds or even thousands of nodes, serving extraordinary amounts of information. Tight integration with Hadoop allows powerful analytical processing of data residing in HBase.

Published in: Software

  1. Introduction to HBase - Hayden Marchant
  2. Agenda ● What is HBase? ● Hadoop Overview ● HBase Architecture 101 ● Use Cases ● Usage in Amobee ● Questions
  3. Apache HBase ● Open source ● Sparse, multi-dimensional, sorted map datastore ● Modeled after Google BigTable ● Key features: – Distributed storage across a cluster of machines – Random, online read/write data access – Schema-less data model (NoSQL) – Self-managed data partitions
  4. Apache Hadoop Dependencies ● Hadoop Distributed File System (HDFS) – Distributed, fault-tolerant, throughput-optimized data storage – The Google File System, 2003, Ghemawat et al. ● Apache ZooKeeper (ZK) – Distributed, available, reliable coordination system – The Chubby Lock Service …, 2006, Burrows – http://research.google.com/archive/chubby.html ● Apache Hadoop MapReduce (MR) – Distributed, fault-tolerant, batch-oriented data processing – MapReduce: …, 2004, Dean and Ghemawat – http://research.google.com/archive/mapreduce.html
  5. What is Hadoop? ● Solution for Big Data – Deals with the complexities of high volume, velocity and variety of data ● Set of open-source projects ● Transforms commodity hardware into a service that: – Stores petabytes of data reliably – Allows huge distributed computations ● Key attributes – Redundant and reliable (no data loss) – Extremely powerful – Batch-processing centric – Easy to program – Runs on commodity hardware
  6. HDFS Overview
  7. MapReduce overview MAP: (K1, V1) => list(K2, V2) REDUCE: (K2, list(V2)) => list(K3, V3)
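The two signatures above can be sketched in plain Python. This is a conceptual word-count sketch, not Hadoop's actual API; `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative names:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # MAP: (K1, V1) -> list of (K2, V2); emit (word, 1) per occurrence
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # REDUCE: (K2, list(V2)) -> list of (K3, V3); sum the counts
    return [(word, sum(counts))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle phase: group all mapper output by intermediate key K2
    grouped = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    # Reduce phase, processing keys in sorted order like Hadoop does
    results = []
    for k2, values in sorted(grouped.items()):
        results.extend(reduce_fn(k2, values))
    return results

docs = [("d1", "hbase hadoop hbase"), ("d2", "hadoop")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('hadoop', 2), ('hbase', 2)]
```

In real Hadoop the shuffle happens across the network between TaskTrackers; here it is just a dictionary.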
  8. MapReduce – an example
  9. What's in a Hadoop machine? ● The MapReduce server on a machine is called a TaskTracker ● The HDFS server on a machine is called a DataNode
  10. Hadoop Cluster ● Multiple machines running Hadoop form a cluster: a TaskTracker/DataNode pair on each worker, coordinated by a JobTracker and a NameNode
  11. Glossary ● Region – A subset of a table's rows, like a range partition – Automatically partitioned ● RegionServer (slave) – Serves data (from regions) for reads and writes ● Master – Responsible for coordination of region servers – Assigns regions, detects failures of region servers – Controls admin functions
  12. HBase Distribution ● Store and access data across 1 to 1000s of commodity servers (region servers) ● Automatic failover based on Apache ZooKeeper ● Linear scaling of capacity and IOPS by adding servers
  13. Cluster Deployment
  14. Sorted Map Datastore ● Not a relational data store (very light schema) ● Table consists of rows, each of which has a primary key (“rowkey”) ● Each row can have any number of columns – like a Map<byte[],byte[]> ● Rows are stored in sorted order
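As a rough mental model (not HBase's real storage engine), a table can be pictured as a sorted map from rowkey to a column map of byte strings; `put` and `rows_in_order` below are hypothetical helpers:

```python
# Conceptual model only: a table as rowkey -> {column: value},
# with all keys and values kept as byte strings, as in HBase.
table = {}

def put(rowkey, columns):
    # A row accepts any columns at write time: no fixed schema
    table.setdefault(rowkey, {}).update(columns)

def rows_in_order():
    # Rows come back in sorted rowkey order, like an HBase scan
    return sorted(table.items())

put(b"user#2", {b"info:name": b"bob"})
put(b"user#1", {b"info:name": b"alice", b"info:age": b"30"})  # extra column, no schema change

print([row for row, _ in rows_in_order()])  # [b'user#1', b'user#2']
```

The sorted order is what makes range scans over contiguous rowkeys cheap.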
  15. Sorted Map Datastore (logical view as “records”)
  16. Sorted Map Datastore (physical view as “cells”)
  17. Column Families ● Different sets of columns may have different properties and access patterns ● Configurable by column family – Compression (none, gzip, snappy) – Version retention policies – Cache priority ● CFs stored separately on disk: access one without wasting IO on the other
  18. Accessing HBase ● Java API (thick client) ● Shell ● REST/HTTP ● MapReduce ● Hive/Pig for analytics ● Various other SQL engines
  19. HBase API ● get(row) ● put(row, Map<column,value>) ● scan(row range, filter) ● increment(row, columns) ● … (checkAndPut, delete, etc.)
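To illustrate the semantics of these operations, here is a toy in-memory sketch. The `MiniHBase` class is invented for illustration; real HBase access goes through the Java client, shell, or REST, not this code:

```python
class MiniHBase:
    """Toy in-memory sketch of the API surface listed above (not real HBase)."""

    def __init__(self):
        self.rows = {}  # rowkey -> {column: value}

    def put(self, row, columns):
        # Insert or update columns for one row
        self.rows.setdefault(row, {}).update(columns)

    def get(self, row):
        # Fetch one row's columns by exact rowkey
        return self.rows.get(row, {})

    def scan(self, start, stop, filter_fn=None):
        # Iterate rows in sorted rowkey order within [start, stop),
        # optionally applying a server-side-style filter
        for row in sorted(self.rows):
            if start <= row < stop and (filter_fn is None or filter_fn(row, self.rows[row])):
                yield row, self.rows[row]

    def increment(self, row, column, amount=1):
        # Atomic counter semantics (trivially atomic here: single thread)
        cols = self.rows.setdefault(row, {})
        cols[column] = cols.get(column, 0) + amount
        return cols[column]

t = MiniHBase()
t.put("u1", {"info:name": "alice"})
t.increment("u1", "stats:logins")
print(t.get("u1"))            # {'info:name': 'alice', 'stats:logins': 1}
print(list(t.scan("u0", "u2")))
```

Note that `scan` takes a row range rather than a single key: that, plus the sorted layout, is what distinguishes it from `get`.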
  20. Quick Demo
  21. Scaling with regions
  22. Scaling with regions (ctd...)
  23. Physical Architecture
  24. Read & Write paths
  25. MapReduce over HBase ● MapReduce jobs can access HBase in parallel – Read – Write ● High-level parallelism
  26. Use Cases
  27. SaaS Audit Logging ● Online service requires per-user audit logs ● Row key userid_sessionid_timestamp allows efficient range-scan lookups for per-user history fetch ● Server-side filter allows for efficient queries ● MapReduce for analytic questions about user behaviour.
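Assuming rowkeys of the form userid_sessionid_timestamp as on the slide, the per-user lookup can be mimicked as a prefix range scan over a sorted key list (the sample keys are made up):

```python
import bisect

# Hypothetical audit-log rowkeys: userid_sessionid_timestamp, kept sorted
rowkeys = sorted([
    "u100_s1_1000", "u100_s1_1005", "u100_s2_1100",
    "u200_s9_1000",
])

def prefix_scan(keys, prefix):
    # An HBase scan with start=prefix and stop just past the prefix
    # fetches one user's history without touching other users' rows.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return keys[lo:hi]

print(prefix_scan(rowkeys, "u100_"))
# ['u100_s1_1000', 'u100_s1_1005', 'u100_s2_1100']
```

Because rows are stored sorted, the scan reads a contiguous slice: cost is proportional to one user's history, not the whole table.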
  28. OpenTSDB ● Scalable time-series store and metrics collector ● Thousands of machines generating hundreds of operational metrics ● Thousands of writes/second ● Web interface displays graphs per metric for a time period ● Schema: – Row key: metricid_hourofday – Col: {timestamp} – Val: {metric measurement}
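A sketch of how one cell might be addressed under the slide's schema. The `tsdb_cell` helper and the exact hour-bucket format are assumptions for illustration, not OpenTSDB's real byte-level encoding:

```python
from datetime import datetime, timezone

def tsdb_cell(metric_id, ts_epoch, value):
    # Row key groups a metric's points into hour buckets (metricid_hourofday);
    # the column qualifier is the timestamp, the cell value the measurement.
    dt = datetime.fromtimestamp(ts_epoch, tz=timezone.utc)
    hour_bucket = dt.strftime("%Y%m%d%H")
    rowkey = f"{metric_id}_{hour_bucket}"
    return rowkey, str(ts_epoch), str(value)

print(tsdb_cell("cpu.load", 1700000000, 0.73))
# ('cpu.load_2023111422', '1700000000', '0.73')
```

Bucketing by hour keeps each row bounded in size while still letting a time-range query scan a contiguous run of rows for one metric.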
  29. Amobee ● User Profile Database – > 1.3 billion profiles – 10s of properties for each user ● Batch/real-time updates of profiles ● Provide real-time access to profiles for ad targeting ● Central backbone for DMP
  30. Use HBase if... ● You need random write, random read, or both ● You need to do many thousands of operations per second on multiple TB of data ● Your access patterns are well-known and simple
  31. Don't use HBase if... ● You only append to your dataset and tend to read the whole thing ● You primarily do ad-hoc analytics (i.e. ill-defined access patterns) ● Your data easily fits on one beefy node
  32. Where is HBase going? ● Preparation for version 1.0 ● Customized balance of Consistency/Availability/Persistence ● Namespaces ● Cell-level security ● MapReduce on snapshots
  33. Questions?
