HBASE – THE SCALABLE
DATA STORE
An Introduction to HBase
JAX UK, October 2012

Lars George
Director EMEA Services
About Me

•  Director EMEA Services @ Cloudera
    •  Consulting on Hadoop projects (everywhere)
•  Apache Committer
    •  HBase and Whirr
•  O’Reilly Author
    •  HBase – The Definitive Guide
      •  Now in Japanese!

•  Contact
    •  lars@cloudera.com
    •  @larsgeorge
Agenda

•  Introduction to HBase
•  HBase Architecture
•  MapReduce with HBase
•  Advanced Techniques
•  Current Project Status
INTRODUCTION TO HBASE
Why Hadoop/HBase?

•  Datasets are constantly growing and intake soars
    •  Yahoo! has 140PB+ and 42k+ machines
    •  Facebook adds 500TB+ per day, 100PB+ raw data, on
       tens of thousands of machines
    •  Are you “throwing” data away today?
•  Traditional databases are expensive to scale and
   inherently difficult to distribute
•  Commodity hardware is cheap and powerful
   •  $1000 buys you 4-8 cores/4GB/1TB
   •  600GB 15k RPM SAS nearly $500
•  Need for random access and batch processing
    •  Hadoop only supports batch/streaming
History of Hadoop/HBase

•  Google solved its scalability problems
    •  “The Google File System” published October 2003
      •  Hadoop DFS
   •  “MapReduce: Simplified Data Processing on Large
     Clusters” published December 2004
      •  Hadoop MapReduce
   •  “BigTable: A Distributed Storage System for
     Structured Data” published November 2006
      •  HBase
Hadoop Introduction

•  Two main components
    •  Hadoop Distributed File System (HDFS)
       •  A scalable, fault-tolerant, high performance distributed file
         system capable of running on commodity hardware
   •  Hadoop MapReduce
       •  Software framework for distributed computation

•  Significant adoption
    •  Used in production in hundreds of organizations
    •  Primary contributors: Yahoo!, Facebook, Cloudera
HDFS: Hadoop Distributed File System

•  Reliably store petabytes of replicated data across
 thousands of nodes
   •  Data divided into 64MB blocks, each block replicated
     three times
•  Master/Slave architecture
    •  Master NameNode contains block locations
    •  Slave DataNode manages blocks on the local file system
•  Built on commodity hardware
    •  No 15k RPM disks or RAID required (nor wanted!)
MapReduce

•  Distributed programming model to reliably
 process petabytes of data using data locality
   •  Built-in bindings for Java and C
   •  Can be used with any language via Hadoop
     Streaming
•  Inspired by map and reduce functions in
 functional programming

 Input → Map() → Copy/Sort → Reduce() → Output
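The Input → Map() → Copy/Sort → Reduce() → Output flow can be illustrated on a single machine with plain Java streams. This is only a sketch of the programming model, not Hadoop code: the class name WordCountSketch and its count() method are made up for illustration, and no Hadoop classes are involved.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Single-machine illustration of the map/sort/reduce flow; not Hadoop code. */
class WordCountSketch {
    static Map<String, Long> count(String... lines) {
        return Arrays.stream(lines)
                // Map(): turn each input line into a stream of words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                // Copy/Sort: group equal words together in a sorted map
                // Reduce(): count the members of each group
                .collect(Collectors.groupingBy(
                        word -> word, TreeMap::new, Collectors.counting()));
    }
}
```

In Hadoop the same three steps run on many machines, and the framework moves the grouped keys between the map and reduce tasks.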
Hadoop…

•  … is designed to store and stream extremely large
   datasets in batch
•  … is not intended for realtime querying
•  … does not support random access
•  … does not handle billions of small files well
   •  Small means less than the default block size of 64MB
   •  Keeps “inodes” in memory on master
•  … does not support structured data any better than
 unstructured or complex data

              That is why we have HBase!
Why HBase and not …?

•  Question: Why HBase and not <put-your-favorite-
   nosql-solution-here>?
•  What else is there?
   •    Key/value stores
   •    Document-oriented stores
   •    Column-oriented stores
   •    Graph-oriented stores
•  Features to ask for
    •  In memory or persistent?
    •  Strict or eventual consistency?
    •  Distributed or single machine (or afterthought)?
    •  Designed for read and/or write speeds?
    •  How does it scale? (if that is what you need)
What is HBase?

•  Distributed
•  Column-Oriented
•  Multi-Dimensional
•  High-Availability (CAP anyone?)
•  High-Performance
•  Storage System

                       Project Goals
   Billions of Rows * Millions of Columns * Thousands of
                            Versions
    Petabytes across thousands of commodity servers
HBase is not…

•  An SQL Database
    •  No joins, no query engine, no types, no SQL
    •  Transactions and secondary indexes exist only as
       add-ons, and those are immature
•  A drop-in replacement for your RDBMS
•  You must be OK with RDBMS anti-schema
    •  Denormalized data
    •  Wide and sparsely populated tables
    •  Just say “no” to your inner DBA


               Keyword: Impedance Match
HBase Tables

•  Tables are sorted by the Row Key in
   lexicographical order
•  Table schema only defines its Column Families
  •  Each family consists of any number of Columns
  •  Each column consists of any number of Versions
  •  Columns only exist when inserted, NULLs are free
  •  Columns within a family are sorted and stored
     together
  •  Everything except table names is byte[]


(Table, Row, Family:Column, Timestamp) → Value
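One way to picture this mapping is as nested sorted maps. The sketch below is a toy in-memory model, not the HBase API: the class ToyTable and its methods are hypothetical, and it uses String keys for readability where HBase uses byte[] and compares raw bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory model of one table; illustrative only, not HBase code. */
class ToyTable {
    // row key -> "family:column" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Returns the newest version of a cell, or null if it was never written. */
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) return null;
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    /** Row keys in lexicographical order, the order a scan would see them. */
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }
}
```

Note the consequence of lexicographical ordering: "row-10" sorts before "row-2", which is why numeric parts of row keys are usually zero-padded.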
Column Family vs. Column

•  Use only a few column families
    •  Each family adds files that need to stay open per region,
       plus class overhead per family
•  Best used when there is a logical separation between data
   and meta columns
•  Sorting per family can be used to convey
   application logic or access pattern
HBase Architecture

•  Table is made up of any number of regions
•  Region is specified by its startKey and endKey
    •  Empty table: (Table, NULL, NULL)
    •  Two-region table: (Table, NULL, “com.cloudera.www”)
       and (Table, “com.cloudera.www”, NULL)
•  Each region may live on a different node and is
 made up of several HDFS files and blocks, each
 of which is replicated by Hadoop
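Finding the region for a row key then amounts to a sorted-map lookup on the start keys. The following is a minimal sketch with made-up names (ToyRegionMap, regionFor), not the actual client code; it assumes String keys and uses the empty string to stand in for the NULL start key of the first region.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy region lookup by start key; illustrative only, not HBase code. */
class ToyRegionMap {
    // start key -> region name; "" stands in for the NULL start key
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    /** A row belongs to the region with the greatest start key <= row key. */
    String regionFor(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no region covers " + rowKey);
        }
        return entry.getValue();
    }
}
```

With the two-region table above, a row key of "com.apache.hbase" falls into the first region and "com.cloudera.www" itself is the first key of the second one, because the start key is inclusive.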
HBase Architecture (cont.)

•  Two types of HBase nodes:
        Master and RegionServer
•  Special tables -ROOT- and .META. store schema
   information and region locations
•  Master server responsible for RegionServer
   monitoring as well as assignment and load
   balancing of regions
•  Uses ZooKeeper as its distributed coordination
   service
  •  Manages Master election and server availability
Web Crawl Example

•  Canonical use-case for BigTable
•  Store web crawl data
    •  Table webtable with family content and meta
    •  Row is reversed URL with Columns
      •  content:data stores the raw crawled data
      •  meta:language stores http language header
      •  meta:type stores http content-type header
   •  While processing raw data for hyperlinks and images,
     add families links and images
      •  links:<rurl> column for each hyperlink
      •  images:<rurl> column for each image
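The reversed URL row key can be built with a few lines of plain Java. This is a sketch only: the class and method names are hypothetical, and it handles just the host part, assuming a dot-separated host name without scheme, port or path.

```java
/** Builds a reversed-host row key; illustrative helper, not part of HBase. */
class ReversedUrlKey {
    /** Reverses the dot-separated labels: www.cloudera.com -> com.cloudera.www */
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        return key.toString();
    }
}
```

Since tables are sorted by row key, reversing the host keeps all pages of one domain next to each other: www.cloudera.com and blog.cloudera.com both end up under the prefix com.cloudera.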
HBase Clients

•  Native Java Client/API
•  Non-Java Clients
    •  REST server
    •  Avro server
    •  Thrift server
    •  Jython, Scala, Groovy DSL
•  TableInputFormat/TableOutputFormat for
 MapReduce
   •  HBase as MapReduce source and/or target
•  HBase Shell
    •  JRuby shell adding get, put, scan and admin calls
Java API

•  CRUD
    •  get: retrieve an entire, or partial row (R)
    •  put: create and update a row (CU)
    •  delete: delete a cell, column, columns, or row (D)


      Result get(Get get) throws IOException;

      void put(Put put) throws IOException;

      void delete(Delete delete) throws IOException;
Java API (cont.)

•  CRUD+SI
    •  scan:      Scan any number of rows (S)
    •  increment: Increment a column value (I)




ResultScanner getScanner(Scan scan) throws IOException;

Result increment(Increment increment) throws IOException;
Java API (cont.)

•  CRUD+SI+CAS
    •  Atomic compare-and-swap (CAS)


•  Combined get, check, and put operation
•  Helps to overcome lack of full transactions
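The semantics of such a combined check and put can be sketched with a toy in-memory model. This is not the HBase client API: ToyCheckAndPut and its String-based cells are assumptions for illustration, with the JDK's ConcurrentMap supplying the atomic compare step.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy model of check-and-put semantics; illustrative only, not HBase code. */
class ToyCheckAndPut {
    private final ConcurrentMap<String, String> cells = new ConcurrentHashMap<>();

    /**
     * Writes newValue only if the cell currently holds expected.
     * Passing expected == null means "the cell must not exist yet".
     * Returns true if the write was applied.
     */
    boolean checkAndPut(String cell, String expected, String newValue) {
        if (expected == null) {
            return cells.putIfAbsent(cell, newValue) == null;
        }
        return cells.replace(cell, expected, newValue);
    }

    String get(String cell) {
        return cells.get(cell);
    }
}
```

A typical use is a claim-style update: two clients race to set an owner column, only the one whose check still matches wins, and the loser sees false and can re-read and retry.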
Batch Operations

•  Support Get, Put, and Delete
•  Reduce network round-trips
•  If possible, batch operation to the server to gain
 better overall throughput

    void batch(List<Row> actions, Object[] results)
      throws IOException, InterruptedException;

    Object[] batch(List<Row> actions)
      throws IOException, InterruptedException;
Filters

•  Can be used with Get and Scan operations
•  Server side hinting
•  Reduce data transferred to client
•  Filters are no guarantee for fast scans
    •  Still full table scan in worst-case scenario
    •  Might have to implement your own
•  Filters can hint next row key
HBase Extensions

•  Hive, Pig, Cascading
    •  Hadoop-targeted MapReduce tools with HBase
       integration
•  Sqoop
    •  Read and write to HBase for further processing in
       Hadoop
•  HBase Explorer, Nutch, Heritrix
•  SpringData
•  Toad
History of HBase
•  November 2006
     •  Google releases paper on BigTable
•  February 2007
     •  Initial HBase prototype created as Hadoop contrib
•  October 2007
     •  First “useable” HBase (Hadoop 0.15.0)
•  January 2008
     •  Hadoop becomes TLP, HBase becomes subproject
•  October 2008
     •  HBase 0.18.1 released
•  January 2009
     •  HBase 0.19.0
•  September 2009
     •  HBase 0.20.0 released (Performance Release)
•  May 2010
     •  HBase becomes TLP
•  June 2010
     •  HBase 0.89.20100621, first developer release
•  May 2011
     •  HBase 0.90.3 release
HBase Users

•  Adobe
•  eBay
•  Facebook
•  Mozilla (Socorro)
•  Trend Micro (Advanced Threat Research)
•  Twitter
•  Yahoo!
•  …
HBASE ARCHITECTURE
HBase Architecture (cont.)

•  Based on Log-Structured Merge-Trees (LSM-Trees)
•  Inserts are written to the write-ahead log first
•  Data is stored in memory and flushed to disk at
   regular intervals or based on size
•  Small flushes are merged in the background to keep
   the number of files small
•  Reads check the memory stores first and the disk-based
   files second
•  Deletes are handled with “tombstone” markers
•  Atomicity on row level no matter how many columns
   •  keeps locking model easy
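The write and read path above can be condensed into a toy store. This is a sketch under simplifying assumptions, not HBase code: ToyLsmStore is a made-up class, the log is a plain in-memory list, "files" are sorted maps, flushing is size-based only, and the tombstone is a reserved marker value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy log-structured store: WAL, memstore, flushed files, tombstones. */
class ToyLsmStore {
    // Reserved marker value; assumed never to be written by users.
    private static final String TOMBSTONE = "\u0000<deleted>";

    private final List<String> writeAheadLog = new ArrayList<>();
    private TreeMap<String, String> memstore = new TreeMap<>();
    private final List<TreeMap<String, String>> storeFiles = new ArrayList<>(); // newest last
    private final int flushThreshold;

    ToyLsmStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        writeAheadLog.add(key + "=" + value); // 1. append to the log first
        memstore.put(key, value);             // 2. then update the sorted memory store
        if (memstore.size() >= flushThreshold) {
            flush();                          // 3. flush based on size
        }
    }

    /** Deletes write a tombstone marker instead of removing data in place. */
    void delete(String key) {
        put(key, TOMBSTONE);
    }

    void flush() {
        if (memstore.isEmpty()) return;
        storeFiles.add(memstore);             // the memstore becomes an immutable "file"
        memstore = new TreeMap<>();
    }

    /** Reads check memory first, then the files from newest to oldest. */
    String get(String key) {
        String value = memstore.get(key);
        for (int i = storeFiles.size() - 1; value == null && i >= 0; i--) {
            value = storeFiles.get(i).get(key);
        }
        return (value == null || value.equals(TOMBSTONE)) ? null : value;
    }

    /** Merges all files into one; newer entries win, tombstones are dropped. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : storeFiles) {
            merged.putAll(file);              // oldest first, so newer values overwrite
        }
        merged.values().removeIf(TOMBSTONE::equals);
        storeFiles.clear();
        if (!merged.isEmpty()) storeFiles.add(merged);
    }

    int storeFileCount() {
        return storeFiles.size();
    }
}
```

The design point to notice is that nothing is ever updated in place: writes and deletes only append, and the cleanup work is deferred to the background merge.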
Write Ahead Log
MAPREDUCE WITH HBASE
MapReduce with HBase

•  Framework to use HBase as source and/or sink for
   MapReduce jobs
•  Thin layer over native Java API
•  Provides a helper class to set up jobs more easily

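   // Source: table name, Scan, mapper class, mapper output key/value classes
   TableMapReduceUtil.initTableMapperJob(
      "test", scan, MyMapper.class,
      ImmutableBytesWritable.class,
      Result.class, job);

   // Sink: table name and reducer class
   TableMapReduceUtil.initTableReducerJob(
      "table", MyReducer.class, job);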
MapReduce with HBase (cont.)

•  Special use-case with regard to Hadoop
•  Tables are sorted and have unique keys
    •  Often we do not need a Reducer phase
    •  Combiner not needed
•  Need to make sure load is distributed properly by
   randomizing keys (or use bulk import)
•  Partial or full table scans possible
•  Scans are very efficient as they make use of block
   caches
   •  But then make sure you do not create too much churn, or
     better switch caching off when doing full table scans
•  Can use filters to limit rows being processed
TableInputFormat

•  Transforms an HBase table into a source for
   MapReduce jobs
•  Internally uses a TableRecordReader which
   wraps a Scan instance
   •  Supports restarts to handle temporary issues
•  Splits table by region boundaries and stores
 current region locality
TableOutputFormat

•  Allows an HBase table to be used as output target
•  Put and Delete support from mapper or reducer
   class
•  Uses TableOutputCommitter to write data
•  Disables auto-commit on table to make use of
   client side write buffer
•  Handles final flush in close()
HFileOutputFormat

•  Used to bulk load data into HBase
•  Bypasses normal API and generates low-level
   store files
•  Prepares files for final bulk insert
•  Needs special handling of sort order and
   partitioning
•  Only supports one column family (for now)
•  Can load bulk updates into existing tables
MapReduce Helper

•  TableMapReduceUtil
•  IdentityTableMapper
     •  Passes on key and value, where value is a Result
        instance and key is set to value.getRow()
•  IdentityTableReducer
     •  Stores values into HBase, must be Put or Delete
        instances
•  HRegionPartitioner
    •  Not set by default, use it to control partitioning on
       Hadoop level
Custom MapReduce over Tables

•  No requirement to use provided framework
•  Can read from or write to one or many tables in
   mapper and reducer
•  Can split on arbitrary boundaries instead of regions
•  Make sure to use write buffer in OutputFormat to
   get best performance (do not forget to call
   flushCommits() at the end!)
ADVANCED TECHNIQUES
Advanced Techniques

•  Key/Table Design
•  DDI
•  Salting
•  Hashing vs. Sequential Keys
•  ColumnFamily vs. Column
•  Using BloomFilter
•  Data Locality
•  checkAndPut() and checkAndDelete()
•  Coprocessors
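As an example of one item on this list, salting prefixes a sequential row key with a bucket number so that consecutive writes spread over several regions. The sketch below is one possible way to do it, not a prescribed scheme: the class SaltedKey, the two-digit prefix format and the use of String.hashCode() are all assumptions made for illustration.

```java
import java.util.Locale;

/** Adds a hash-derived bucket prefix to a row key; illustrative only. */
class SaltedKey {
    /** Returns e.g. "NN-<rowKey>" where NN is a bucket in [0, buckets). */
    static String salt(String rowKey, int buckets) {
        // floorMod keeps the bucket non-negative even for negative hash codes
        int bucket = Math.floorMod(rowKey.hashCode(), buckets);
        return String.format(Locale.ROOT, "%02d-%s", bucket, rowKey);
    }
}
```

The trade-off is on the read side: the same key always maps to the same bucket, so point gets still work, but a range scan over the original key order now has to run once per bucket and merge the results.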
Coprocessors

•  New addition to feature set
•  Based on talk by Jeff Dean at LADIS 2009
    •  Run arbitrary code on each region in RegionServer
    •  High level call interface for clients
       •  Calls are addressed to rows or ranges of rows while
          Coprocessors client library resolves locations
       •  Calls to multiple rows are atomically split
   •  Provides model for distributed services
       •  Automatic scaling, load balancing, request routing
Coprocessors in HBase

•  Use for efficient computational parallelism
•  Secondary indexing (HBASE-2038)
•  Column Aggregates (HBASE-1512)
    •  SQL-like sum(), avg(), max(), min(), etc.
•  Access control (HBASE-3025, HBASE-3045)
    •  Provide basic access control
•  Table Metacolumns
•  New filtering
    •  predicate pushdown
•  Table/Region access statistics
•  HLog extensions (HBASE-3257)
Coprocessor and RegionObserver

•  The Coprocessor interface defines these hooks
    •  preOpen, postOpen: Called before and after the
       region is reported as online to the master
    •  preFlush, postFlush: Called before and after the
       memstore is flushed into a new store file
    •  preCompact, postCompact: Called before and after
       compaction
    •  preSplit, postSplit: Called before and after the region is split
    •  preClose, postClose: Called before and after the
       region is reported as closed to the master
Coprocessor and RegionObserver

•  The RegionObserver interface defines these hooks:
    •  preGet, postGet: Called before and after a client makes a Get
       request
    •  preExists, postExists: Called before and after the client tests for
       existence using a Get
    •  prePut, postPut: Called before and after the client stores a value
    •  preDelete, postDelete: Called before and after the client deletes a
       value
    •  preScannerOpen, postScannerOpen: Called before and after the
       client opens a new scanner
    •  preScannerNext, postScannerNext: Called before and after the
       client asks for the next row on a scanner
    •  preScannerClose, postScannerClose: Called before and after the
       client closes a scanner
    •  preCheckAndPut, postCheckAndPut: Called before and after the
       client calls checkAndPut()
    •  preCheckAndDelete, postCheckAndDelete: Called before and after
       the client calls checkAndDelete()
PROJECT STATUS
Current Project Status

•  HBase 0.90.x “Advanced Concepts”
    •  Master Rewrite – More Zookeeper
    •  Intra Row Scanning
    •  Further optimizations on algorithms and data
       structures
           CDH3
•  HBase 0.92.x “Coprocessors”
    •  Multi-DC Replication
    •  Discretionary Access Control
    •  Coprocessors
           CDH4
Current Project Status (cont.)

•  HBase 0.94.x “Performance Release”
    •  Read CRC Improvements
    •  Seek Optimizations
    •  WAL Compression
    •  Prefix Compression (aka Block Encoding)
    •  Atomic Append
    •  Atomic put+delete
    •  Multi Increment and Multi Append
    •  Per-region (i.e. local) Multi-Row Transactions
    •  WALPlayer

         CDH4.x    (soon)
Current Project Status (cont.)

•  HBase 0.96.x “The Singularity”
    •  Protobuf RPC
      •  Rolling Upgrades
      •  Multiversion Access
  •  Metrics V2
  •  Preview Technologies
      •  Snapshots
      •  PrefixTrie Block Encoding



        CDH5 ?
Questions?

 
HBase and Hadoop at Urban Airship
HBase and Hadoop at Urban AirshipHBase and Hadoop at Urban Airship
HBase and Hadoop at Urban Airship
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Unit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptxUnit II Hadoop Ecosystem_Updated.pptx
Unit II Hadoop Ecosystem_Updated.pptx
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql database
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 

Intro to HBase - Lars George

  • 1. HBASE – THE SCALABLE DATA STORE An Introduction to HBase JAX UK, October 2012 Lars George Director EMEA Services
  • 2. About Me •  Director EMEA Services @ Cloudera •  Consulting on Hadoop projects (everywhere) •  Apache Committer •  HBase and Whirr •  O’Reilly Author •  HBase – The Definitive Guide •  Now in Japanese! (The Japanese edition is out, too!) •  Contact •  lars@cloudera.com •  @larsgeorge
  • 3. Agenda •  Introduction to HBase •  HBase Architecture •  MapReduce with HBase •  Advanced Techniques •  Current Project Status
  • 5. Why Hadoop/HBase? •  Datasets are constantly growing and intake soars •  Yahoo! has 140PB+ and 42k+ machines •  Facebook adds 500TB+ per day, 100PB+ raw data, on tens of thousands of machines •  Are you “throwing” data away today? •  Traditional databases are expensive to scale and inherently difficult to distribute •  Commodity hardware is cheap and powerful •  $1000 buys you 4-8 cores/4GB/1TB •  600GB 15k RPM SAS nearly $500 •  Need for random access and batch processing •  Hadoop only supports batch/streaming
  • 6. History of Hadoop/HBase •  Google solved its scalability problems •  “The Google File System” published October 2003 •  Hadoop DFS •  “MapReduce: Simplified Data Processing on Large Clusters” published December 2004 •  Hadoop MapReduce •  “BigTable: A Distributed Storage System for Structured Data” published November 2006 •  HBase
  • 7. Hadoop Introduction •  Two main components •  Hadoop Distributed File System (HDFS) •  A scalable, fault-tolerant, high performance distributed file system capable of running on commodity hardware •  Hadoop MapReduce •  Software framework for distributed computation •  Significant adoption •  Used in production in hundreds of organizations •  Primary contributors: Yahoo!, Facebook, Cloudera
  • 8. HDFS: Hadoop Distributed File System •  Reliably store petabytes of replicated data across thousands of nodes •  Data divided into 64MB blocks, each block replicated three times •  Master/Slave architecture •  Master NameNode contains block locations •  Slave DataNode manages block on local file system •  Built on commodity hardware •  No 15k RPM disks or RAID required (nor wanted!)
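The block and replication figures on this slide lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the class name `HdfsBlockMath` and the 1 GB example file are assumptions, while the 64MB block size and the replication factor of three come from the slide.

```java
// Back-of-the-envelope HDFS sizing (hypothetical helper, not part of Hadoop).
final class HdfsBlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB default from the slide
    static final int REPLICATION = 3;                 // each block is stored three times

    // Number of blocks needed to hold a file (the last block may be partial).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total number of block replicas the DataNodes have to store.
    static long replicaCount(long fileSizeBytes) {
        return blockCount(fileSizeBytes) * REPLICATION;
    }

    // Raw bytes used across all replicas; a partial block only occupies its real size.
    static long rawBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }
}
```

For a 1 GB file this gives 16 blocks, 48 block replicas, and 3 GB of raw storage, with the NameNode tracking the location of each replica.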
  • 9. MapReduce •  Distributed programming model to reliably process petabytes of data using its locality •  Built-in bindings for Java and C •  Can be used with any language via Hadoop Streaming •  Inspired by map and reduce functions in functional programming Input → Map() → Copy/Sort → Reduce() → Output
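The Input → Map() → Copy/Sort → Reduce() → Output pipeline can be illustrated with a single-process word count. This is a minimal sketch of the programming model only, assuming everything fits in memory; it does not use the Hadoop API, and the class `MiniMapReduce` and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the MapReduce phases (not the Hadoop API).
final class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Copy/sort phase: group all values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all phases over the input lines and return word -> count.
    static TreeMap<String, Integer> run(List<String> input) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            mapped.addAll(map(line));
        }
        TreeMap<String, Integer> output = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> output.put(word, reduce(ones)));
        return output;
    }
}
```

In a real cluster the map and reduce calls run on many machines and the copy/sort step moves data over the network; the sketch only shows how the phases hand data to each other.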
  • 10. Hadoop… •  … is designed to store and stream extremely large datasets in batch •  … is not intended for realtime querying •  … does not support random access •  … does not handle billions of small files well •  Files smaller than the default block size of 64MB •  Keeps “inodes” in memory on master •  … does not support structured data any better than unstructured or complex data That is why we have HBase!
  • 11. Why HBase and not …? •  Question: Why HBase and not <put-your-favorite-nosql-solution-here>? •  What else is there? •  Key/value stores •  Document-oriented stores •  Column-oriented stores •  Graph-oriented stores •  Features to ask for •  In memory or persistent? •  Strict or eventual consistency? •  Distributed or single machine (or afterthought)? •  Designed for read and/or write speeds? •  How does it scale? (if that is what you need)
  • 12. What is HBase? •  Distributed •  Column-Oriented •  Multi-Dimensional •  High-Availability (CAP anyone?) •  High-Performance •  Storage System Project Goals Billions of Rows * Millions of Columns * Thousands of Versions Petabytes across thousands of commodity servers
  • 13. HBase is not… •  An SQL Database •  No joins, no query engine, no types, no SQL •  Transactions and secondary indexes only as add-ons but immature •  A drop-in replacement for your RDBMS •  You must be OK with RDBMS anti-schema •  Denormalized data •  Wide and sparsely populated tables •  Just say “no” to your inner DBA Keyword: Impedance Match
  • 24. HBase Tables •  Tables are sorted by the Row Key in lexicographical order •  Table schema only defines its Column Families •  Each family consists of any number of Columns •  Each column consists of any number of Versions •  Columns only exist when inserted, NULLs are free •  Columns within a family are sorted and stored together •  Everything except table names is byte[] (Table, Row, Family:Column, Timestamp) → Value
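Because row keys are plain byte[] values sorted lexicographically, keys that look numeric do not sort numerically. The sketch below uses only the JDK to show the effect and one common workaround (zero-padding); the class `RowKeyOrder` and its methods are hypothetical helpers, not HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrates lexicographic ordering of byte[] row keys (hypothetical helper).
final class RowKeyOrder {
    // Compare two row keys byte by byte, treating each byte as unsigned (0..255).
    static int compare(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    // Return the keys in the order a byte-sorted table would store them.
    static List<String> sorted(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort((x, y) -> compare(x.getBytes(StandardCharsets.UTF_8),
                                    y.getBytes(StandardCharsets.UTF_8)));
        return copy;
    }

    // Left-pad a non-negative number with zeros so byte order matches numeric order.
    static String padded(long n, int width) {
        return String.format("%0" + width + "d", n);
    }
}
```

Sorting `row1`, `row2`, and `row10` this way places `row10` before `row2`, whereas the padded forms `002` and `010` keep their numeric order. This matters whenever scans are expected to return rows in a meaningful sequence.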
  • 25. Column Family vs. Column •  Use only a few column families •  Causes many files that need to stay open per region plus class overhead per family •  Best used when logical separation between data and meta columns •  Sorting per family can be used to convey application logic or access pattern
  • 26. HBase Architecture •  Table is made up of any number of regions •  Region is specified by its startKey and endKey •  Empty table: (Table, NULL, NULL) •  Two-region table: (Table, NULL, “com.cloudera.www”) and (Table, “com.cloudera.www”, NULL) •  Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop
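Since regions are contiguous key ranges, finding the region for a row amounts to finding the greatest start key that is not larger than the row key. The following sketch models that with a `TreeMap`; it is an assumption-laden illustration (string keys instead of byte[], an empty string in place of the NULL start key, invented server names), not how the HBase client is implemented.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of locating a region by its start key (not the HBase client code).
final class RegionLookup {
    // Start key -> hypothetical server name. The empty string stands in for the
    // NULL start key of the first region. String order equals byte order only
    // for plain ASCII keys, which is all this sketch assumes.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    // The region holding a row is the one with the greatest start key <= row key.
    String serverFor(String rowKey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        if (e == null) {
            throw new IllegalStateException("no region covers " + rowKey);
        }
        return e.getValue();
    }
}
```

With the two-region table from the slide, a row such as `com.apache.hbase` falls into the first region, while `com.cloudera.www` itself and everything after it falls into the second, because the start key is inclusive and the end key exclusive.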
  • 27. HBase Architecture (cont.) •  Two types of HBase nodes: Master and RegionServer •  Special tables -ROOT- and .META. store schema information and region locations •  Master server responsible for RegionServer monitoring as well as assignment and load balancing of regions •  Uses ZooKeeper as its distributed coordination service •  Manages Master election and server availability
  • 28. Web Crawl Example •  Canonical use-case for BigTable •  Store web crawl data •  Table webtable with family content and meta •  Row is reversed URL with Columns •  content:data stores the raw crawled data •  meta:language stores http language header •  meta:type stores http content-type header •  While processing raw data for hyperlinks and images, add families links and images •  links:<rurl> column for each hyperlink •  images:<rurl> column for each image
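Reversing the host name keeps pages of the same domain next to each other in the sorted table, so one scan can cover a whole site. Below is a minimal sketch of building such a row key; the class `ReversedUrl` is hypothetical, and appending the path to the key is an assumption of this example rather than something the slide specifies.

```java
// Builds reversed-host row keys for a web table (hypothetical helper).
final class ReversedUrl {
    // Reverse the dot-separated host name: "www.cloudera.com" -> "com.cloudera.www".
    static String reverseHost(String host) {
        String[] parts = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    // Row key from host and path; keeping the path is an assumption of this sketch.
    static String rowKey(String host, String path) {
        return reverseHost(host) + path;
    }
}
```

Keys for `www.cloudera.com` and `blog.cloudera.com` then share the prefix `com.cloudera.`, which is what makes domain-wide range scans cheap.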
  • 29. HBase Clients •  Native Java Client/API •  Non-Java Clients •  REST server •  Avro server •  Thrift server •  Jython, Scala, Groovy DSL •  TableInputFormat/TableOutputFormat for MapReduce •  HBase as MapReduce source and/or target •  HBase Shell •  JRuby shell adding get, put, scan and admin calls
  • 30. Java API •  CRUD •  get: retrieve an entire, or partial row (R) •  put: create and update a row (CU) •  delete: delete a cell, column, columns, or row (D) Result get(Get get) throws IOException; void put(Put put) throws IOException; void delete(Delete delete) throws IOException;
  • 31. Java API (cont.) •  CRUD+SI •  scan: Scan any number of rows (S) •  increment: Increment a column value (I) ResultScanner getScanner(Scan scan) throws IOException; Result increment(Increment increment) throws IOException;
  • 32. Java API (cont.) •  CRUD+SI+CAS •  Atomic compare-and-swap (CAS) •  Combined get, check, and put operation •  Helps to overcome lack of full transactions
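The combined get, check, and put can be pictured with a small in-memory model. This is a sketch of the semantics only, not the HBase client API: the class `CheckAndPutModel`, the string cell names, and the use of `null` for "cell does not exist" are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// In-memory model of compare-and-swap semantics (not the HBase API).
final class CheckAndPutModel {
    private final Map<String, String> cells = new HashMap<>();

    // Atomically: if the current value equals `expected` (null means "cell absent"),
    // store `newValue` and return true; otherwise change nothing and return false.
    synchronized boolean checkAndPut(String cell, String expected, String newValue) {
        if (!Objects.equals(cells.get(cell), expected)) {
            return false;
        }
        cells.put(cell, newValue);
        return true;
    }

    synchronized String get(String cell) {
        return cells.get(cell);
    }
}
```

Two clients that both try to claim an absent cell illustrate the point: only the first call with `expected == null` succeeds, and the second caller learns from the `false` return value that it has to re-read and retry. That is how the operation stands in for a small transaction on a single row.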
  • 33. Batch Operations •  Support Get, Put, and Delete •  Reduce network round-trips •  If possible, batch operation to the server to gain better overall throughput void batch(List<Row> actions, Object[] results) throws IOException, InterruptedException; Object[] batch(List<Row> actions) throws IOException, InterruptedException;
  • 34. Filters •  Can be used with Get and Scan operations •  Server side hinting •  Reduce data transferred to client •  Filters are no guarantee for fast scans •  Still full table scan in worst-case scenario •  Might have to implement your own •  Filters can hint next row key
  • 35. HBase Extensions •  Hive, Pig, Cascading •  Hadoop-targeted MapReduce tools with HBase integration •  Sqoop •  Read and write to HBase for further processing in Hadoop •  HBase Explorer, Nutch, Heritrix •  SpringData •  Toad
  • 36. History of HBase •  November 2006 •  Google releases paper on BigTable •  February 2007 •  Initial HBase prototype created as Hadoop contrib •  October 2007 •  First “useable” HBase (Hadoop 0.15.0) •  January 2008 •  Hadoop becomes TLP, HBase becomes subproject •  October 2008 •  HBase 0.18.1 released •  January 2009 •  HBase 0.19.0 •  September 2009 •  HBase 0.20.0 released (Performance Release) •  May 2010 •  HBase becomes TLP •  June 2010 •  HBase 0.89.20100621, first developer release •  May 2011 •  HBase 0.90.3 release
  • 37. HBase Users •  Adobe •  eBay •  Facebook •  Mozilla (Socorro) •  Trend Micro (Advanced Threat Research) •  Twitter •  Yahoo! •  …
  • 41. HBase Architecture (cont.) •  Based on Log-Structured Merge-Trees (LSM-Trees) •  Inserts are done in write-ahead log first •  Data is stored in memory and flushed to disk on regular intervals or based on size •  Small flushes are merged in the background to keep number of files small •  Reads read memory stores first and then disk based files second •  Deletes are handled with “tombstone” markers •  Atomicity on row level no matter how many columns •  keeps locking model easy
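The write and read path described on this slide can be condensed into a toy store. The sketch below is a simplified model under several assumptions: it omits the write-ahead log, flushes purely on entry count, keeps the "files" in memory, and uses an invented marker string as the tombstone. The class `MiniLsmStore` is hypothetical and not related to HBase's actual classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

// Toy log-structured merge store: memstore, flushed files, tombstones, compaction.
final class MiniLsmStore {
    private static final String TOMBSTONE = "\u0000__deleted__"; // illustrative marker value
    private final int flushThreshold;
    private TreeMap<String, String> memstore = new TreeMap<>();
    private final List<TreeMap<String, String>> files = new ArrayList<>(); // oldest first

    MiniLsmStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    // Writes go to the sorted in-memory store and trigger a flush based on size.
    void put(String key, String value) {
        memstore.put(key, value);
        if (memstore.size() >= flushThreshold) {
            flush();
        }
    }

    // Deletes write a tombstone instead of touching data in older files.
    void delete(String key) {
        put(key, TOMBSTONE);
    }

    // Freeze the memstore as an immutable sorted "file" and start a new one.
    void flush() {
        if (memstore.isEmpty()) {
            return;
        }
        files.add(memstore);
        memstore = new TreeMap<>();
    }

    // Reads check the memstore first, then the files from newest to oldest.
    Optional<String> get(String key) {
        String v = memstore.get(key);
        for (int i = files.size() - 1; v == null && i >= 0; i--) {
            v = files.get(i).get(key);
        }
        return (v == null || v.equals(TOMBSTONE)) ? Optional.empty() : Optional.of(v);
    }

    // Merge all files into one; newer entries win and tombstones are dropped.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> f : files) {
            merged.putAll(f); // oldest first, so newer values overwrite older ones
        }
        merged.values().removeIf(TOMBSTONE::equals);
        files.clear();
        if (!merged.isEmpty()) {
            files.add(merged);
        }
    }

    int fileCount() {
        return files.size();
    }
}
```

The design choice worth noticing is that neither updates nor deletes modify an existing file. Everything is appended, and the background merge is the only place where old values and tombstones disappear.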
  • 44. MapReduce with HBase •  Framework to use HBase as source and/or sink for MapReduce jobs •  Thin layer over native Java API •  Provides helper class to set up jobs more easily TableMapReduceUtil.initTableMapperJob("test", scan, MyMapper.class, ImmutableBytesWritable.class, RowResult.class, job); TableMapReduceUtil.initTableReducerJob("table", MyReducer.class, job);
  • 45. MapReduce with HBase (cont.) •  Special use-case in regards to Hadoop •  Tables are sorted and have unique keys •  Often we do not need a Reducer phase •  Combiner not needed •  Need to make sure load is distributed properly by randomizing keys (or use bulk import) •  Partial or full table scans possible •  Scans are very efficient as they make use of block caches •  But then make sure you do not create too much churn, or better switch caching off when doing full table scans •  Can use filters to limit rows being processed
  • 46. TableInputFormat •  Transforms an HBase table into a source for MapReduce jobs •  Internally uses a TableRecordReader which wraps a Scan instance •  Supports restarts to handle temporary issues •  Splits table by region boundaries and stores current region locality
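To illustrate what splitting by region boundaries means, the following sketch cuts a scan range into one piece per overlapping region. It models the idea only and is not the TableInputFormat implementation; the class RegionSplits, String keys and the use of "" for an unbounded start or stop are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of cutting a scan into per-region splits. */
class RegionSplits {
    /** One unit of work: scan [start, stop); "" means unbounded. */
    static final class Split {
        final String start;
        final String stop;
        Split(String start, String stop) { this.start = start; this.stop = stop; }
        @Override public String toString() { return "[" + start + "," + stop + ")"; }
    }

    /**
     * regionStartKeys: sorted start keys of consecutive regions, the first being "".
     * scanStart / scanStop: requested scan range, "" meaning unbounded.
     */
    static List<Split> splits(List<String> regionStartKeys, String scanStart, String scanStop) {
        List<Split> out = new ArrayList<>();
        for (int i = 0; i < regionStartKeys.size(); i++) {
            String rStart = regionStartKeys.get(i);
            String rStop = i + 1 < regionStartKeys.size() ? regionStartKeys.get(i + 1) : "";
            // Skip regions that lie entirely before or after the scan range.
            boolean endsBeforeScan = !rStop.isEmpty() && rStop.compareTo(scanStart) <= 0;
            boolean startsAfterScan = !scanStop.isEmpty() && rStart.compareTo(scanStop) >= 0;
            if (endsBeforeScan || startsAfterScan) continue;
            // Clip the region to the scan range.
            String start = rStart.compareTo(scanStart) > 0 ? rStart : scanStart;
            String stop = scanStop.isEmpty() ? rStop
                    : rStop.isEmpty() ? scanStop
                    : (rStop.compareTo(scanStop) < 0 ? rStop : scanStop);
            out.add(new Split(start, stop));
        }
        return out;
    }
}
```

With region start keys "", "g" and "p", a scan from "c" to "k" yields the two splits [c,g) and [g,k), so each map task reads from exactly one region.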
  • 47. TableOutputFormat •  Allows using an HBase table as output target •  Put and Delete support from mapper or reducer class •  Uses TableOutputCommitter to write data •  Disables auto-commit on table to make use of client side write buffer •  Handles final flush in close()
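The effect of a client-side write buffer with a final flush in close() can be shown with a small model. This is not the HBase client: the class ToyBufferedTable is hypothetical, mutations are plain strings, and the buffer is bounded by mutation count, whereas the HBase client bounds its write buffer by size in bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a table client that batches writes. */
class ToyBufferedTable implements AutoCloseable {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBuffered;
    // Each inner list stands in for one round trip to a server.
    private final List<List<String>> sentBatches = new ArrayList<>();

    ToyBufferedTable(int maxBuffered) { this.maxBuffered = maxBuffered; }

    /** Buffers the mutation; sends a batch only when the buffer is full. */
    void put(String mutation) {
        buffer.add(mutation);
        if (buffer.size() >= maxBuffered) flushCommits();
    }

    /** Sends whatever is buffered as one batch. */
    void flushCommits() {
        if (buffer.isEmpty()) return;
        sentBatches.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** The final flush: without it the tail of the buffer would be lost. */
    @Override public void close() { flushCommits(); }

    int rpcCount() { return sentBatches.size(); }

    int mutationsSent() {
        int n = 0;
        for (List<String> batch : sentBatches) n += batch.size();
        return n;
    }
}
```

Seven puts with a buffer of three result in two batches while writing and a third on close(), instead of seven separate round trips.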
  • 48. HFileOutputFormat •  Used to bulk load data into HBase •  Bypasses normal API and generates low-level store files •  Prepares files for final bulk insert •  Needs special handling of sort order and partitioning •  Only supports one column family (for now) •  Can load bulk updates into existing tables
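The partitioning requirement can be sketched as follows: every row key has to be routed to the reducer that owns the matching region, so that each reducer writes one sorted, non-overlapping key range. The class below is a hypothetical stand-in built on a sorted map, not the partitioner HBase ships; String keys and "" as the first region's start key are assumptions.

```java
import java.util.List;
import java.util.TreeMap;

/** Hypothetical partitioner that maps a row key to the index of its region. */
class ToyRegionPartitioner {
    private final TreeMap<String, Integer> startKeyToPartition = new TreeMap<>();

    /** sortedRegionStartKeys must begin with "" (the start of the table). */
    ToyRegionPartitioner(List<String> sortedRegionStartKeys) {
        if (sortedRegionStartKeys.isEmpty() || !sortedRegionStartKeys.get(0).isEmpty()) {
            throw new IllegalArgumentException("first region must start at \"\"");
        }
        for (int i = 0; i < sortedRegionStartKeys.size(); i++) {
            startKeyToPartition.put(sortedRegionStartKeys.get(i), i);
        }
    }

    /** The owning region is the one with the greatest start key <= rowKey. */
    int partitionFor(String rowKey) {
        return startKeyToPartition.floorEntry(rowKey).getValue();
    }
}
```

Because partition i only ever receives keys smaller than those of partition i + 1, sorting within each partition is enough to produce files that are globally ordered.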
  • 49. MapReduce Helper •  TableMapReduceUtil •  IdentityTableMapper •  Passes on key and value, where value is a Result instance and key is set to value.getRow() •  IdentityTableReducer •  Stores values into HBase, must be Put or Delete instances •  HRegionPartitioner •  Not set by default, use it to control partitioning on Hadoop level
  • 50. Custom MapReduce over Tables •  No requirement to use provided framework •  Can read from or write to one or many tables in mapper and reducer •  Can split not on regions but arbitrary boundaries •  Make sure to use write buffer in OutputFormat to get best performance (do not forget to call flushCommits() at the end!)
  • 52. Advanced Techniques •  Key/Table Design •  DDI •  Salting •  Hashing vs. Sequential Keys •  ColumnFamily vs. Column •  Using BloomFilter •  Data Locality •  checkAndPut() and checkAndDelete() •  Coprocessors
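checkAndPut() and checkAndDelete() are compare-and-set operations: the mutation is applied only if the current value matches an expected one. The semantics can be modeled in a few lines of plain Java. The class ToyAtomicCells below is hypothetical and works on a single in-memory map; the real calls address a row, column family and qualifier.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory model of compare-and-set style mutations. */
class ToyAtomicCells {
    private final ConcurrentHashMap<String, String> cells = new ConcurrentHashMap<>();

    /**
     * Stores newValue only if the current value equals expected.
     * expected == null means "only if the cell does not exist yet".
     */
    boolean checkAndPut(String key, String expected, String newValue) {
        if (expected == null) {
            return cells.putIfAbsent(key, newValue) == null;
        }
        return cells.replace(key, expected, newValue);
    }

    /** Removes the cell only if its current value equals expected. */
    boolean checkAndDelete(String key, String expected) {
        return cells.remove(key, expected);
    }

    String get(String key) { return cells.get(key); }
}
```

A typical use is an optimistic update loop: read the value, compute the new one, call checkAndPut() with the value that was read, and retry if it returns false.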
  • 53. Coprocessors •  New addition to feature set •  Based on talk by Jeff Dean at LADIS 2009 •  Run arbitrary code on each region in RegionServer •  High level call interface for clients •  Calls are addressed to rows or ranges of rows while Coprocessors client library resolves locations •  Calls to multiple rows are atomically split •  Provides model for distributed services •  Automatic scaling, load balancing, request routing
  • 54. Coprocessors in HBase •  Use for efficient computational parallelism •  Secondary indexing (HBASE-2038) •  Column Aggregates (HBASE-1512) •  SQL-like sum(), avg(), max(), min(), etc. •  Access control (HBASE-3025, HBASE-3045) •  Provide basic access control •  Table Metacolumns •  New filtering •  predicate pushdown •  Table/Region access statistics •  HLog extensions (HBASE-3257)
  • 55. Coprocessor and RegionObserver •  The Coprocessor interface defines these hooks: •  preOpen, postOpen: Called before and after the region is reported as online to the master •  preFlush, postFlush: Called before and after the memstore is flushed into a new store file •  preCompact, postCompact: Called before and after compaction •  preSplit, postSplit: Called before and after the region is split •  preClose, postClose: Called before and after the region is reported as closed to the master
  • 56. Coprocessor and RegionObserver •  The RegionObserver interface defines these hooks: •  preGet, postGet: Called before and after a client makes a Get request •  preExists, postExists: Called before and after the client tests for existence using a Get •  prePut, postPut: Called before and after the client stores a value •  preDelete, postDelete: Called before and after the client deletes a value •  preScannerOpen, postScannerOpen: Called before and after the client opens a new scanner •  preScannerNext, postScannerNext: Called before and after the client asks for the next row on a scanner •  preScannerClose, postScannerClose: Called before and after the client closes a scanner •  preCheckAndPut, postCheckAndPut: Called before and after the client calls checkAndPut() •  preCheckAndDelete, postCheckAndDelete: Called before and after the client calls checkAndDelete()
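The pre/post hook pattern behind these observers can be sketched without HBase. Everything below is a simplified, hypothetical model: ToyRegionObserver, ToyRegion and AuditObserver are invented names, only the put hooks are shown, and the real RegionObserver methods take additional context arguments and have different signatures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, heavily simplified observer with put hooks only. */
interface ToyRegionObserver {
    /** Called before the value is stored; return false to reject the put. */
    default boolean prePut(String row, String value) { return true; }

    /** Called after the value has been stored. */
    default void postPut(String row, String value) { }
}

/** Hypothetical region that runs registered observers around each put. */
class ToyRegion {
    private final Map<String, String> rows = new HashMap<>();
    private final List<ToyRegionObserver> observers = new ArrayList<>();

    void register(ToyRegionObserver observer) { observers.add(observer); }

    boolean put(String row, String value) {
        for (ToyRegionObserver o : observers) {
            if (!o.prePut(row, value)) return false; // vetoed, nothing is stored
        }
        rows.put(row, value);
        for (ToyRegionObserver o : observers) o.postPut(row, value);
        return true;
    }

    String get(String row) { return rows.get(row); }
}

/** Example observer: rejects rows starting with "sys." and counts stored puts. */
class AuditObserver implements ToyRegionObserver {
    private int completedPuts = 0;

    @Override public boolean prePut(String row, String value) {
        return !row.startsWith("sys.");
    }

    @Override public void postPut(String row, String value) { completedPuts++; }

    int completedPuts() { return completedPuts; }
}
```

A pre hook is the natural place for checks such as access control, while a post hook suits follow-up work such as maintaining a secondary index or statistics.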
  • 58. Current Project Status •  HBase 0.90.x “Advanced Concepts” (CDH3) •  Master Rewrite – More ZooKeeper •  Intra Row Scanning •  Further optimizations on algorithms and data structures •  HBase 0.92.x “Coprocessors” (CDH4) •  Multi-DC Replication •  Discretionary Access Control •  Coprocessors
  • 59. Current Project Status (cont.) •  HBase 0.94.x “Performance Release” (CDH4.x, soon) •  Read CRC Improvements •  Seek Optimizations •  WAL Compression •  Prefix Compression (aka Block Encoding) •  Atomic Append •  Atomic put+delete •  Multi Increment and Multi Append •  Per-region (i.e. local) Multi-Row Transactions •  WALPlayer
  • 60. Current Project Status (cont.) •  HBase 0.96.x “The Singularity” (CDH5?) •  Protobuf RPC •  Rolling Upgrades •  Multiversion Access •  Metrics V2 •  Preview Technologies •  Snapshots •  PrefixTrie Block Encoding