HBase in Practice
Lars George – Partner and Co-Founder @ OpenCore
DataWorks Summit 2017 - Munich
NoSQL is no SQL is SQL?
About Us
• Partner & Co-Founder at OpenCore
• Before that
• Lars: EMEA Chief Architect at Cloudera (5+ years)
• Hadoop since 2007
• Apache Committer & Apache Member
• HBase (also in PMC)
• Lars: O’Reilly Author: HBase – The Definitive Guide
• Contact
• lars.george@opencore.com
• @larsgeorge
Website: www.opencore.com
Agenda
• Brief Intro To Core Concepts
• Access Options
• Data Modelling
• Performance Tuning
• Use-Cases
• Summary
Introduction To Core Concepts
HBase Tables
• From a user’s perspective, HBase is similar to a database or a spreadsheet
• There are rows and columns, storing values
• By default, asking for a specific row/column combination returns the current value (that is, the last value stored there)
HBase Tables
• HBase can have a different schema per row
• Could be called schema-less
• Primary access by the user-given row key and column name
• Sorting of rows and columns by their key (aka names)
HBase Tables
• Each row/column coordinate is tagged with a version number, allowing multi-versioned values
• Version is usually the current time (as epoch)
• API lets user ask for versions (specific, by count, or by ranges)
• Up to 2B versions
HBase Tables
• Table data is cut into pieces to distribute over the cluster
• Regions split table into shards at size boundaries
• Families split within regions to group sets of columns together
• At least one of each is needed
Scalability – Regions as Shards
• A region is served by exactly one region server
• Every region server serves many regions
• Table data is spread over servers
• Distribution of I/O
• Assignment is based on configurable logic
• Balancing cluster load
• Clients talk directly to region servers
Column Family-Oriented
• Group multiple columns into physically separated locations
• Apply different properties to each family
• TTL, compression, versions, …
• Useful to separate distinct data sets that are related
• Also useful to separate larger blobs from metadata
Data Management
• What is available is tracked in three locations
• System catalog table hbase:meta
• Files in HDFS directories
• Open region instances on servers
• System aligns these locations
• Sometimes (very rarely) a repair may be needed using HBase Fsck
• Redundant information is useful to repair corrupt tables
HBase really is…
• A distributed Hash Map
• Imagine a complex, concatenated key made up of the user-given row key, the column name, and the timestamp (version)
• Complex key points to actual value, that is, the cell
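To make the "complex key points to a cell" idea concrete, here is a minimal Python sketch that models a table as a plain dictionary keyed by (row key, column name, timestamp). The `put` and `get` helpers and the sample values are hypothetical and are not the HBase client API; the sketch only shows that a read without an explicit version returns the newest cell.

```python
# Toy model only: a dict keyed by (row key, column name, timestamp).
# put() and get() are hypothetical helpers, not HBase client calls.
cells = {}

def put(row, column, timestamp, value):
    cells[(row, column, timestamp)] = value

def get(row, column, max_versions=1):
    # Collect all versions stored at this row/column, newest first.
    versions = sorted(
        ((ts, val) for (r, c, ts), val in cells.items() if r == row and c == column),
        reverse=True,
    )
    return versions[:max_versions]

put("row-1", "cf:name", 100, "old")
put("row-1", "cf:name", 200, "new")
put("row-2", "cf:name", 150, "other")

latest = get("row-1", "cf:name")                    # newest version only
history = get("row-1", "cf:name", max_versions=2)   # two newest versions
```

A real store keeps these keys sorted on disk rather than in a hash table, which is what makes range scans possible.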
Fold, Store, and Shift
• Logical rows in tables are really stored as flat key-value pairs
• Each carries full coordinates
• Pertinent information can be freely placed in cell to improve lookup
• HBase is a column-family grouped key-value store
HFile Format Information
• All data is stored in a custom (open-source) format, called HFile
• Data is stored in blocks (64KB default)
• Trade-off between lookups and I/O throughput
• Compression, encoding applied _after_ limit check
• Index, filter, and metadata are stored in separate blocks
• Fixed trailer allows traversal of file structure
• Newer versions introduce multilayered index and filter structures
• Only the master index is loaded up front; partial index blocks are loaded on demand
• Reading data requires deserialization of block into cells
• Kind of Amdahl’s Law applies
HBase Architecture
• One Master and many Worker servers
• Clients mostly communicate with workers
• Workers store actual data
• Memstore for accruing
• HFile for persistence
• WAL for fail-safety
• Data provided as regions
• HDFS is backing store
• But could be another
HBase Architecture (cont.)
• Based on Log-Structured Merge-Trees (LSM-Trees)
• Inserts are done in write-ahead log first
• Data is stored in memory and flushed to disk at regular intervals or based on size
• Small flushes are merged in the background to keep number of files small
• Reads check the memory stores first and the disk-based files second
• Deletes are handled with “tombstone” markers
• Atomicity on row level no matter how many columns
• Keeps locking model easy
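The write path on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HBase code: `TinyLsmStore`, `flush_size`, and `max_files` are made-up names, the "files" are in-memory dictionaries, and compaction simply merges all files into one.

```python
class TinyLsmStore:
    """Toy LSM store: a WAL list, an in-memory memstore, and immutable 'files'."""

    def __init__(self, flush_size=2, max_files=2):
        self.wal = []            # write-ahead log, appended to first
        self.memstore = {}       # in-memory buffer of recent writes
        self.files = []          # flushed files, newest last
        self.flush_size = flush_size
        self.max_files = max_files

    def put(self, key, value):
        self.wal.append((key, value))    # log first for fail-safety
        self.memstore[key] = value       # then buffer in memory
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        # Write the memstore out as a new sorted, immutable file.
        self.files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        if len(self.files) > self.max_files:
            self.compact()

    def compact(self):
        # Merge all files, oldest first, so newer values overwrite older ones.
        merged = {}
        for f in self.files:
            merged.update(f)
        self.files = [dict(sorted(merged.items()))]

    def get(self, key):
        if key in self.memstore:          # memory first
            return self.memstore[key]
        for f in reversed(self.files):    # then files, newest first
            if key in f:
                return f[key]
        return None

store = TinyLsmStore()
for i, key in enumerate(["a", "b", "c", "d", "a", "e"]):
    store.put(key, i)
```

After the six puts the third flush has triggered a compaction, so a single file holds the newest value for every key.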
Merge Reads
• Read Memstore & StoreFiles using separate scanners
• Merge matching cells into single row “view”
• Deletes mask existing data
• Bloom filters help skip StoreFiles
• Reads may have to span many files
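A merge read can be approximated as follows. This is a simplified Python sketch, not the actual scanner implementation: `merge_read` is a hypothetical function, each source is a list of `(column, timestamp, kind, value)` tuples, and a tombstone is assumed to mask every version of its column at or below its own timestamp.

```python
# Toy merge read over one row. Each source holds cells as
# (column, timestamp, kind, value); kind is "put" or "delete" (tombstone).
# Assumption: timestamps are non-negative integers.

def merge_read(sources):
    newest_put = {}      # column -> (timestamp, value)
    newest_delete = {}   # column -> timestamp of the newest tombstone
    for source in sources:
        for column, ts, kind, value in source:
            if kind == "delete":
                newest_delete[column] = max(ts, newest_delete.get(column, ts))
            elif column not in newest_put or ts > newest_put[column][0]:
                newest_put[column] = (ts, value)
    # Keep only cells that are newer than any tombstone on the same column.
    return {
        column: value
        for column, (ts, value) in newest_put.items()
        if ts > newest_delete.get(column, -1)
    }

memstore = [("cf:a", 300, "put", "a3"), ("cf:b", 250, "delete", None)]
store_file_1 = [("cf:a", 100, "put", "a1"), ("cf:b", 200, "put", "b2")]
store_file_2 = [("cf:c", 50, "put", "c1")]

row_view = merge_read([memstore, store_file_1, store_file_2])
```

Here `cf:b` disappears from the row view because the tombstone in the memstore is newer than the value in the store file, even though that value still exists on disk.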
APIs and Access Options
HBase Clients
• Native Java Client/API
• Non-Java Clients
• REST server
• Thrift server
• Jython, Groovy DSL
• Spark
• TableInputFormat/TableOutputFormat for MapReduce
• HBase as MapReduce source and/or target
• Also available for table snapshots
• HBase Shell
• JRuby shell adding get, put, scan etc. and admin calls
• Phoenix, Impala, Hive, …
Java API
From Wikipedia:
• CRUD: “In computer programming, create, read, update, and delete are the
four basic functions of persistent storage.”
• Other variations of CRUD include
• BREAD (Browse, Read, Edit, Add, Delete)
• MADS (Modify, Add, Delete, Show)
• DAVE (Delete, Add, View, Edit)
• CRAP (Create, Retrieve, Alter, Purge)
Java API (cont.)
• CRUD
• put: Create and update a row (CU)
• get: Retrieve an entire or partial row (R)
• delete: Delete a cell, column, columns, or row (D)
• CRUD+SI
• scan: Scan any number of rows (S)
• increment: Increment a column value (I)
• CRUD+SI+CAS
• Atomic compare-and-swap (CAS)
• Combined get, check, and put operation
• Helps to overcome lack of full transactions
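The compare-and-swap idea can be illustrated with a toy row object in Python. `Row` and `check_and_put` are hypothetical names chosen for this sketch and do not mirror the Java client signatures; the point is that the check and the write happen under one row-level lock, so two competing writers cannot both succeed.

```python
import threading

class Row:
    """Toy single row with a row-level lock; not an HBase class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._columns = {}

    def check_and_put(self, check_column, expected, put_column, new_value):
        # Compare and write under one lock so no other writer can interleave.
        with self._lock:
            if self._columns.get(check_column) != expected:
                return False
            self._columns[put_column] = new_value
            return True

    def get(self, column):
        with self._lock:
            return self._columns.get(column)

row = Row()
# Both callers expect the column to be absent (None); only the first one wins.
first = row.check_and_put("cf:state", None, "cf:state", "reserved")
second = row.check_and_put("cf:state", None, "cf:state", "reserved-by-other")
```

The losing caller gets `False` back and can re-read the row and retry, which is how this pattern stands in for a small transaction.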
Java API (cont.)
• Batch Operations
• Support Get, Put, and Delete
• Reduce network round-trips
• If possible, batch operations to the server to gain better overall throughput
• Filters
• Can be used with Get and Scan operations
• Server side hinting
• Reduce data transferred to client
• Filters are no guarantee for fast scans
• Still full table scan in worst-case scenario
• Might have to implement your own
• Filters can hint next row key
Data Modeling
Where’s your data at?
Key Cardinality
• The best performance is gained from using row keys
• Time range bound reads can skip store files
• So can Bloom Filters
• Selecting column families reduces the amount of data to be scanned
• Pure value based access is a full table scan
• Filters often are too, but reduce network traffic
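How a time-range bound read skips store files can be sketched like this. It is a simplified Python model: `StoreFile`, `files_to_read`, and the sample timestamps are assumptions for illustration, relying only on the idea that each file records the minimum and maximum timestamp of the cells it contains.

```python
from dataclasses import dataclass

@dataclass
class StoreFile:
    name: str
    min_ts: int   # oldest cell timestamp in the file
    max_ts: int   # newest cell timestamp in the file

def files_to_read(store_files, start_ts, end_ts):
    # Keep files whose [min_ts, max_ts] overlaps the half-open range [start_ts, end_ts).
    return [f.name for f in store_files if f.min_ts < end_ts and f.max_ts >= start_ts]

store_files = [
    StoreFile("old", 0, 99),
    StoreFile("mid", 100, 199),
    StoreFile("new", 200, 299),
]

selected = files_to_read(store_files, 150, 250)
```

The file named `old` is never opened for this read, which is the saving the slide refers to.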
Key/Table Design
• Crucial to gain best performance
• Why do I need to know? Well, an RDBMS also only works well when columns are indexed and the query plan is OK
• Absence of secondary indexes forces use of row key or column name sorting
• Transfer multiple indexes into one
• Generate large table -> Good since it fits the architecture and spreads across the cluster
• DDI
• Stands for Denormalization, Duplication and Intelligent Keys
• Needed to overcome trade-offs of architecture
• Denormalization -> Replacement for JOINs
• Duplication -> Design for reads
• Intelligent Keys -> Implement indexing and sorting, optimize reads
Pre-materialize Everything
• Achieve one read per customer request if possible
• Otherwise keep the number of reads as low as possible
• Reads take between 10ms (cache miss) and 1ms (cache hit)
• Use MapReduce or Spark to compute exact values in batch
• Store and merge updates live
• Use increment() methods
Motto: “Design for Reads”
Tall-Narrow vs. Flat-Wide Tables
• Rows do not split
• Might end up with one row per region
• Same storage footprint
• Put more details into the row key
• Sometimes dummy column only
• Make use of partial key scans
• Tall with Scans, Wide with Gets
• Atomicity only on row level
• Examples
• Large graphs, stored as adjacency matrix (narrow)
• Message inbox (wide)
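A partial key scan over a tall-narrow layout can be sketched in Python as a prefix search over sorted row keys. The key format `<user>-<message id>` and the `prefix_scan` helper are hypothetical; the sketch relies only on rows being sorted by key, as described earlier in the deck.

```python
import bisect

# Tall-narrow layout (assumed for illustration): one row per message,
# with the row key "<user>-<message id>". Rows are kept sorted by key.
rows = sorted([
    "alice-0001", "alice-0002",
    "bob-0001", "bob-0002", "bob-0003",
    "carol-0001",
])

def prefix_scan(sorted_keys, prefix):
    # Seek to the first key >= prefix, then read until the prefix no longer matches.
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result

bob_messages = prefix_scan(rows, "bob-")
```

In the flat-wide alternative, all of one user's messages would sit in a single row as columns and be fetched with one get, at the cost of that row never being split across regions.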
Sequential Keys
<timestamp><more key>: {CF: {CQ: {TS : Val}}}
• Hotspotting on regions is bad!
• Instead do one of the following:
• Salting
• Prefix <timestamp> with distributed value
• Binning or bucketing rows across regions
• Key field swap/promotion
• Move <more key> before the timestamp (see OpenTSDB)
• Randomization
• Move <timestamp> out of key or prefix with MD5 hash
• Might also be mitigated by overall spread of workloads
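Salting can be sketched as follows. This is a minimal Python illustration, not a prescribed key format: `NUM_BUCKETS`, `salted_key`, the zero-padded layout, and the `sensor-1` value are all assumptions. The salt is derived from the key itself via MD5, so it can be recomputed for point reads.

```python
import hashlib

NUM_BUCKETS = 8  # assumed bucket count, e.g. one bucket per presplit region

def salted_key(timestamp, more_key, buckets=NUM_BUCKETS):
    # Original sequential key: zero-padded timestamp followed by the rest of the key.
    raw = f"{timestamp:013d}-{more_key}"
    # Derive the bucket from a hash of the key so it is stable and recomputable.
    digest = hashlib.md5(raw.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{raw}"

# Consecutive timestamps no longer sort next to each other.
keys = [salted_key(ts, "sensor-1") for ts in range(1000)]
buckets_used = {key.split("-", 1)[0] for key in keys}
```

The trade-off is on the read side: a scan over a time range now needs one scan per bucket, with the results merged by the client.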
Key Design Choices
• Based on access pattern, either use sequential or random keys
• Often a combination of both is needed
• Overcome architectural limitations
• Neither is necessarily bad
• Use bulk import for sequential keys and reads
• Random keys are good for random access patterns
Checklist
• Design for Use-Case
• Read, Write, or Both?
• Avoid Hotspotting
• Hash leading key part, or use salting/bucketing
• Use bulk loading where possible
• Monitor your servers!
• Presplit tables
• Try prefix encoding when values are small
• Otherwise use compression (or both)
• For Reads: Restrict yourself
• Specify what you need, i.e. columns, families, time range
• Shift details to appropriate position
• Composite Keys
• Column Qualifiers
Performance Tuning
1000 knobs to turn… 20 are important?
Everything is Pluggable
• Cell
• Memstore
• Flush Policy
• Compaction Policy
• Cache
• WAL
• RPC handling
• …
Cluster Tuning
• First, tune the global settings
• Heap size and GC algorithm
• Memory share for reads and writes
• Enable Block Cache
• Number of RPC handlers
• Load Balancer
• Default flush and compaction strategy
• Thread pools (10+)
• Next, tune the per-table and family settings
• Region sizes
• Block sizes
• Compression and encoding
• Compactions
• …
Region Balancer Tuning
• A background process in the HBase Master tracks load on servers
• The load balancer moves regions occasionally
• Multiple implementations exist
• Simple counts number of regions
• Stochastic determines cost
• Favored Node pins HDFS block replicas
• Can be tuned further
• Cluster-wide setting!
RPC Tuning
• Default is one queue for all types of requests
• Can be split into separate queues for reads and writes
• Read queue can be further split into reads and scans
-> Stricter resource limits, but may avoid cross-starvation
Key Tuning
• Design keys to match use-case
• Sequential, salted, or random
• Use sorting to convey meaning
• Colocate related data
• Spread load over all servers
• Clever key design can make use of distribution: aging-out regions
Compaction Tuning
• Default compaction settings are aggressive
• Set for update use-case
• For insert use-cases, Blooms are effective
• Allows tuning down compactions
• Saves resources by reducing write amplification
• More store files also enable faster full table scans when scans are bound by time range
• Server can ignore older files
• Large regions may be eligible for advanced compaction strategies
• Stripe or date-tiered compactions
• Reduce rewrites to fraction of region size
Use-Cases
What works well, what does not, and what is so-so
Placing the Use-Case
Big Data Workloads
[Figure: systems placed on two axes. Vertical axis: Low latency to Batch. Horizontal axis: Random Access, Short Scan, Full Scan.]
• Systems shown: HDFS + MR (Hive/Pig); HBase; HBase + Snapshots -> HDFS + MR/Spark; HDFS + SQL; HBase + MR/Spark
Big Data Workloads
[Figure: same chart with use-cases overlaid. Vertical axis: Low latency to Batch. Horizontal axis: Random Access, Short Scan, Full Scan.]
• Systems shown: HDFS + MR/Spark (Hive/Pig); HBase; HBase + Snapshots -> HDFS + MR/Spark; HDFS + SQL; HBase + MR/Spark
• Use-cases shown: Current Metrics; Graph data; Simple Entities; Hybrid Entity Time series + Rollup serving; Messages; Analytic archive; Hybrid Entity Time series + Rollup generation; Index building; Entity Time series
Summary
Wrapping it up…
What matters…
• For optimal performance, two things need to be considered:
• Optimize the cluster and table settings
• Choose the matching key schema
• Ensure load is spread over tables and cluster nodes
• HBase works best for random access and bound scans
• HBase can be optimized for larger scans, but its sweet spot is short burst scans (can be parallelized too) and random point gets
• Java heap space limits addressable space
• Play with region sizes, compaction strategies, and key design to maximize result
• Using HBase for a suitable use-case will make for a happy customer…
• Conversely, forcing it into unsuitable use-cases may cause trouble
Questions?
Thank You!
@larsgeorge

 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Recently uploaded

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
Globus
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 

HBase in Practice

  • 1. HBase in Practice Lars George – Partner and Co-Founder @ OpenCore DataWorks Summit 2017 - Munich NoSQL is no SQL is SQL?
  • 2. About Us • Partner & Co-Founder at OpenCore • Before that • Lars: EMEA Chief Architect at Cloudera (5+ years) • Hadoop since 2007 • Apache Committer & Apache Member • HBase (also in PMC) • Lars: O’Reilly Author: HBase – The Definitive Guide • Contact • lars.george@opencore.com • @larsgeorge Website: www.opencore.com
  • 3. Agenda • Brief Intro To Core Concepts • Access Options • Data Modelling • Performance Tuning • Use-Cases • Summary
  • 5. HBase Tables • From user perspective, HBase is similar to a database, or spreadsheet • There are rows and columns, storing values • By default asking for a specific row/column combination returns the current value (that is, the last value stored there)
  • 6. HBase Tables • HBase can have a different schema per row • Could be called schema-less • Primary access by the user given row key and column name • Sorting of rows and columns by their key (aka names)
  • 7. HBase Tables • Each row/column coordinate is tagged with a version number, allowing multi-versioned values • Version is usually the current time (as epoch) • API lets user ask for versions (specific, by count, or by ranges) • Up to 2B versions
  • 8. HBase Tables • Table data is cut into pieces to distribute over cluster • Regions split table into shards at size boundaries • Families split within regions to group sets of columns together • At least one of each is needed
  • 9. Scalability – Regions as Shards • A region is served by exactly one region server • Every region server serves many regions • Table data is spread over servers • Distribution of I/O • Assignment is based on configurable logic • Balancing cluster load • Clients talk directly to region servers
  • 10. Column Family-Oriented • Group multiple columns into physically separated locations • Apply different properties to each family • TTL, compression, versions, … • Useful to separate distinct data sets that are related • Also useful to separate larger blob from meta data
  • 11. Data Management • What is available is tracked in three locations • System catalog table hbase:meta • Files in HDFS directories • Open region instances on servers • System aligns these locations • Sometimes (very rarely) a repair may be needed using HBase Fsck • Redundant information is useful to repair corrupt tables
  • 12. HBase really is…. • A distributed Hash Map • Imagine a complex, concatenated key including the user given row key and column name, the timestamp (version) • Complex key points to actual value, that is, the cell
  • 13. Fold, Store, and Shift • Logical rows in tables are really stored as flat key-value pairs • Each carries full coordinates • Pertinent information can be freely placed in cell to improve lookup • HBase is a column-family grouped key-value store
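The "concatenated key" idea from slides 12 and 13 can be illustrated with a minimal in-memory sketch. This is a toy model, not the HBase client API: `ToyStore` and its methods are hypothetical names, and the negated timestamp is just one simple way to make newer versions sort first.

```python
from bisect import bisect_left, insort


class ToyStore:
    """Toy model of a sorted key-value store with composite keys (not HBase itself)."""

    def __init__(self):
        self._keys = []    # sorted composite keys: (row, family, qualifier, -timestamp)
        self._values = {}  # composite key -> cell value

    def put(self, row, family, qualifier, timestamp, value):
        # Negating the timestamp makes newer versions sort before older ones.
        key = (row, family, qualifier, -timestamp)
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, row, family, qualifier, max_versions=1):
        # A 3-tuple prefix sorts just before every 4-tuple key that starts with it.
        prefix = (row, family, qualifier)
        start = bisect_left(self._keys, prefix)
        versions = []
        for key in self._keys[start:]:
            if key[:3] != prefix or len(versions) == max_versions:
                break
            versions.append((-key[3], self._values[key]))
        return versions


store = ToyStore()
store.put("row1", "cf", "name", 1, "old")
store.put("row1", "cf", "name", 2, "new")
store.put("row0", "cf", "name", 5, "x")
latest = store.get("row1", "cf", "name")
both = store.get("row1", "cf", "name", max_versions=2)
```

Every cell carries its full coordinates, so a "row" is only a contiguous run of keys sharing the same first component.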
  • 14. HFile Format Information • All data is stored in a custom (open-source) format, called HFile • Data is stored in blocks (64KB default) • Trade-off between lookups and I/O throughput • Compression, encoding applied _after_ limit check • Index, filter and meta data is stored in separate blocks • Fixed trailer allows traversal of file structure • Newer versions introduce multilayered index and filter structures • Only load master index and load partial index blocks on demand • Reading data requires deserialization of block into cells • Kind of Amdahl’s Law applies
  • 15. HBase Architecture • One Master and many Worker servers • Clients mostly communicate with workers • Workers store actual data • Memstore for accruing • HFile for persistence • WAL for fail-safety • Data provided as regions • HDFS is backing store • But could be another
  • 17. HBase Architecture (cont.) • Based on Log-Structured Merge-Trees (LSM-Trees) • Inserts are done in write-ahead log first • Data is stored in memory and flushed to disk at regular intervals or based on size • Small flushes are merged in the background to keep number of files small • Reads check the memory stores first and the disk-based files second • Deletes are handled with “tombstone” markers • Atomicity on row level no matter how many columns • Keeps locking model easy
  • 18. Merge Reads • Read Memstore & StoreFiles using separate scanners • Merge matching cells into single row “view” • Deletes mask existing data • Bloom filters help skip StoreFiles • Reads may have to span many files
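The merge read described on slide 18 can be sketched roughly as follows. This is a simplified model under stated assumptions: cells are `(column, timestamp, value)` tuples, each source is already sorted by column and then by newest timestamp first, and `TOMBSTONE` is a hypothetical stand-in for a delete marker. Real HBase scanners handle several delete types and multiple versions; this sketch only returns the newest visible value per column.

```python
import heapq

TOMBSTONE = object()  # hypothetical delete marker for this sketch


def merge_read(memstore, store_files):
    """Merge sorted cell lists from memory and files into one row view."""
    sources = [memstore, *store_files]
    # Each source is sorted by (column, newest timestamp first), so a k-way
    # merge yields the newest cell of every column before its older versions.
    merged = heapq.merge(*sources, key=lambda cell: (cell[0], -cell[1]))
    row = {}
    decided = set()
    for column, _timestamp, value in merged:
        if column in decided:
            continue  # a newer cell or tombstone already decided this column
        decided.add(column)
        if value is not TOMBSTONE:
            row[column] = value
    return row


memstore = [("cf:a", 3, "a3"), ("cf:b", 4, TOMBSTONE)]
file1 = [("cf:a", 1, "a1"), ("cf:b", 2, "b2"), ("cf:c", 1, "c1")]
view = merge_read(memstore, [file1])
```

In this example the newer `cf:a` from memory wins over the file version, and the tombstone hides `cf:b` even though a store file still contains it.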
  • 19. APIs and Access Options
  • 20. HBase Clients • Native Java Client/API • Non-Java Clients • REST server • Thrift server • Jython, Groovy DSL • Spark • TableInputFormat/TableOutputFormat for MapReduce • HBase as MapReduce source and/or target • Also available for table snapshots • HBase Shell • JRuby shell adding get, put, scan etc. and admin calls • Phoenix, Impala, Hive, …
  • 21. Java API From Wikipedia: • CRUD: “In computer programming, create, read, update, and delete are the four basic functions of persistent storage.” • Other variations of CRUD include • BREAD (Browse, Read, Edit, Add, Delete) • MADS (Modify, Add, Delete, Show) • DAVE (Delete, Add, View, Edit) • CRAP (Create, Retrieve, Alter, Purge)
  • 22. Java API (cont.) • CRUD • put: Create and update a row (CU) • get: Retrieve an entire, or partial row (R) • delete: Delete a cell, column, columns, or row (D) • CRUD+SI • scan: Scan any number of rows (S) • increment: Increment a column value (I) • CRUD+SI+CAS • Atomic compare-and-swap (CAS) • Combined get, check, and put operation • Helps to overcome lack of full transactions
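The compare-and-swap semantics from slide 22 (a combined get, check, and put) can be modelled with a small sketch. `ToyRow` and `check_and_put` are hypothetical names for illustration, not the HBase Java API; a lock stands in for the row-level atomicity the server provides.

```python
import threading


class ToyRow:
    """Toy single-row model with an atomic check-and-put (not the HBase API)."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()  # stands in for row-level atomicity

    def check_and_put(self, column, expected, new_value):
        # Get, check, and put happen as one atomic step.
        with self._lock:
            if self._cells.get(column) != expected:
                return False  # another writer got there first; caller may retry
            self._cells[column] = new_value
            return True

    def get(self, column):
        with self._lock:
            return self._cells.get(column)


row = ToyRow()
first = row.check_and_put("cf:state", None, "claimed")   # column absent: succeeds
second = row.check_and_put("cf:state", None, "claimed")  # already set: fails
```

A caller that receives `False` re-reads the current value and decides whether to retry, which is how this pattern compensates for the lack of full transactions.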
  • 23. Java API (cont.) • Batch Operations • Support Get, Put, and Delete • Reduce network round-trips • If possible, batch operation to the server to gain better overall throughput • Filters • Can be used with Get and Scan operations • Server side hinting • Reduce data transferred to client • Filters are no guarantee for fast scans • Still full table scan in worst-case scenario • Might have to implement your own • Filters can hint next row key
  • 25. Key Cardinality • The best performance is gained from using row keys • Time range bound reads can skip store files • So can Bloom Filters • Selecting column families reduces the amount of data to be scanned • Pure value based access is a full table scan • Filters often are too, but reduce network traffic
  • 26. Key/Table Design • Crucial to gain best performance • Why do I need to know? Well, you also need to know that an RDBMS only works well when columns are indexed and the query plan is OK • Absence of secondary indexes forces use of row key or column name sorting • Transfer multiple indexes into one • Generate large table -> Good since fits architecture and spreads across cluster • DDI • Stands for Denormalization, Duplication and Intelligent Keys • Needed to overcome trade-offs of architecture • Denormalization -> Replacement for JOINs • Duplication -> Design for reads • Intelligent Keys -> Implement indexing and sorting, optimize reads
  • 27. Pre-materialize Everything • Achieve one read per customer request if possible • Otherwise keep at lowest number • Reads between 10ms (cache miss) and 1ms (cache hit) • Use MapReduce or Spark to compute exacts in batch • Store and merge updates live • Use increment() methods Motto: “Design for Reads”
  • 28. Tall-Narrow vs. Flat-Wide Tables • Rows do not split • Might end up with one row per region • Same storage footprint • Put more details into the row key • Sometimes dummy column only • Make use of partial key scans • Tall with Scans, Wide with Gets • Atomicity only on row level • Examples • Large graphs, stored as adjacency matrix (narrow) • Message inbox (wide)
  • 29. Sequential Keys <timestamp><more key>: {CF: {CQ: {TS : Val}}} • Hotspotting on regions is bad! • Instead do one of the following: • Salting • Prefix <timestamp> with distributed value • Binning or bucketing rows across regions • Key field swap/promotion • Move <more key> before the timestamp (see OpenTSDB) • Randomization • Move <timestamp> out of key or prefix with MD5 hash • Might also be mitigated by overall spread of workloads
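Salting, as listed on slide 29, might look like the sketch below. The bucket count, the `|` separator, and the choice to derive the salt from the non-time part of the key are all assumptions made for illustration; the slides do not prescribe a layout.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical; would usually match the number of pre-split regions


def salted_key(timestamp_ms, more_key, buckets=NUM_BUCKETS):
    """Prefix a sequential <timestamp><more key> row key with a bucket number."""
    # Deriving the salt from the non-time part keeps one entity in one bucket.
    digest = hashlib.md5(more_key.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}|{timestamp_ms:013d}|{more_key}".encode("utf-8")


def scan_prefixes(buckets=NUM_BUCKETS):
    """A time-range read has to fan out over every bucket prefix."""
    return [f"{bucket:02d}|".encode("utf-8") for bucket in range(buckets)]


key = salted_key(1491300000000, "sensor-42")
prefixes = scan_prefixes()
```

The trade-off is visible in `scan_prefixes`: writes spread over all buckets, but a scan by time range becomes one scan per bucket whose results must be merged on the client.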
  • 30. Key Design Choices • Based on access pattern, either use sequential or random keys • Often a combination of both is needed • Overcome architectural limitations • Neither is necessarily bad • Use bulk import for sequential keys and reads • Random keys are good for random access patterns
  • 31. Checklist • Design for Use-Case • Read, Write, or Both? • Avoid Hotspotting • Hash leading key part, or use salting/bucketing • Use bulk loading where possible • Monitor your servers! • Presplit tables • Try prefix encoding when values are small • Otherwise use compression (or both) • For Reads: Restrict yourself • Specify what you need, i.e. columns, families, time range • Shift details to appropriate position • Composite Keys • Column Qualifiers
  • 32. Performance Tuning 1000 knobs to turn… 20 are important?
  • 33. Everything is Pluggable • Cell • Memstore • Flush Policy • Compaction Policy • Cache • WAL • RPC handling • …
  • 34. Cluster Tuning • First, tune the global settings • Heap size and GC algorithm • Memory share for reads and writes • Enable Block Cache • Number of RPC handlers • Load Balancer • Default flush and compaction strategy • Thread pools (10+) • Next, tune the per-table and family settings • Region sizes • Block sizes • Compression and encoding • Compactions • …
  • 35. Region Balancer Tuning • A background process in the HBase Master is tracking load on servers • The load balancer moves regions occasionally • Multiple implementations exist • Simple counts number of regions • Stochastic determines cost • Favored Node pins HDFS block replicas • Can be tuned further • Cluster-wide setting!
  • 36. RPC Tuning • Default is one queue for all types of requests • Can be split into separate queues for reads and writes • Read queue can be further split into reads and scans -> Stricter resource limits, but may avoid cross-starvation
  • 37. Key Tuning • Design keys to match use-case • Sequential, salted, or random • Use sorting to convey meaning • Colocate related data • Spread load over all servers • Clever key design can make use of distribution: aging-out regions
  • 38. Compaction Tuning • Default compaction settings are aggressive • Set for update use-case • For insert use-cases, Blooms are effective • Allows compactions to be tuned down • Saves resources by reducing write amplification • More store files also enable faster full table scans when scans are time range bound • Server can ignore older files • Large regions may be eligible for advanced compaction strategies • Stripe or date-tiered compactions • Reduce rewrites to fraction of region size
  • 39. Use-Cases What works well, what does not, and what is so-so
  • 41. Big Data Workloads [Diagram; axis labels: Low latency, Batch, Random Access, Short Scan, Full Scan; systems shown: HDFS + MR (Hive/Pig), HBase, HBase + Snapshots -> HDFS + MR/Spark, HDFS + SQL, HBase + MR/Spark]
  • 42. Big Data Workloads [Diagram; axis labels: Low latency, Batch, Random Access, Short Scan, Full Scan; systems shown: HDFS + MR/Spark (Hive/Pig), HBase, HBase + Snapshots -> HDFS + MR/Spark, HDFS + SQL, HBase + MR/Spark; workloads placed on the diagram: Current Metrics, Graph data, Simple Entities, Hybrid Entity Time series + Rollup serving, Messages, Analytic archive, Hybrid Entity Time series + Rollup generation, Index building, Entity Time series]
  • 44. What matters… • For optimal performance, two things need to be considered: • Optimize the cluster and table settings • Choose the matching key schema • Ensure load is spread over tables and cluster nodes • HBase works best for random access and bound scans • HBase can be optimized for larger scans, but its sweet spot is short burst scans (can be parallelized too) and random point gets • Java heap space limits addressable space • Play with region sizes, compaction strategies, and key design to maximize result • Using HBase for a suitable use-case will make for a happy customer… • Conversely, forcing it into non-suitable use-cases may be cause for trouble

Editor's Notes

  1. For Developers & End-Users – Apache Phoenix, Spark
  2. Importance of Row Key structure
  3. Time-series Data etc.
  4. Time-series Data etc.