1 
HBASE: overview 
Jean-Baptiste Poullet 
Consultant @Stat'Rgy
2 
Contents 
● What is HBase ? 
● HBase vs RDBMS (like MySQL or PostgreSQL) 
● Backup ? CRUD operations ? ACID compliant ? 
● Hardware/OS 
● HBase DB Design 
● UI ? Let's do a demo.
3 
What is HBase ? 
● Wikipedia definition: HBase is an open source, non-relational, 
distributed database modeled after Google's BigTable and 
written in Java. It is developed as part of Apache Software 
Foundation's Apache Hadoop project and runs on top of HDFS 
(Hadoop Distributed Filesystem), providing BigTable-like 
capabilities for Hadoop. That is, it provides a fault-tolerant way of 
storing large quantities of sparse data (small amounts of 
information caught within a large collection of empty or 
unimportant data, such as finding the 50 largest items in a group 
of 2 billion records, or finding the non-zero items representing less 
than 0.1% of a huge collection).
4 
HBase is used by the largest companies
5 
HBase features 
No real indexes 
● Rows are stored sequentially, as are the columns within each row. Therefore there are no issues with index bloat, and insert performance is 
independent of table size. 
Automatic partitioning 
● As your tables grow, they are automatically split into regions and distributed across all available nodes. 
Scale linearly and automatically with new nodes 
● Add a node, point it to the existing cluster, and run the region server. Regions rebalance automatically and load spreads evenly. 
Commodity hardware 
● Clusters are built on $1,000–$5,000 nodes rather than $50,000 nodes. RDBMSs are I/O hungry and require more costly hardware. 
Fault tolerance 
● With lots of nodes, each one is relatively insignificant. No need to worry about individual node downtime. 
Batch processing 
● MapReduce integration allows fully parallel, distributed jobs against your data with locality awareness (see the MapReduce sketch below).
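To make the batch-processing point concrete, here is a minimal, hedged sketch of a map-only MapReduce job that scans an HBase table through TableMapReduceUtil and counts rows with a job counter. It assumes the classic TableMapper API; the table name "mytable" is a placeholder.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountSketch {

  // The mapper is fed one HBase row per call; data locality is handled by the framework.
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("sketch", "rows").increment(1);   // count rows via a job counter
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hbase-row-count");        // Job.getInstance(...) on newer Hadoop
    job.setJarByClass(RowCountSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPC round-trips during the scan
    scan.setCacheBlocks(false);  // avoid polluting the block cache with a full table scan

    TableMapReduceUtil.initTableMapperJob("mytable", scan, CountMapper.class,
        NullWritable.class, NullWritable.class, job);   // "mytable" is a placeholder
    job.setOutputFormatClass(NullOutputFormat.class);   // map-only, no file output
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```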
6 
HBase vs RDBMS 
Why should I migrate to HBase ? 
● Scalability / dealing with sparse matrix 
– In RDBMS, NULL cells need to be set and occupy space 
– In HBase, NULL cells are simply not stored 
When ? 
If you stay up at night worrying about your database (uptime, scale, or speed), then you should seriously 
consider making a jump from the RDBMS world to HBase. 
How ? 
● ETL (Sqoop, Scalding/Cascading, Scala, Python, BI ETL tools, etc.)
7 
CRUD operations in HBase 
CRUD operations for many clients (a client-API sketch follows below) 
Single-row transactions (multi-row transactions are possible since version 0.94 if the 
rows are on the same region server) 
Selecting specific columns and versions is possible 
Atomic read-modify-write on stored data => concurrent access is not an issue 
Co-processors are the equivalent of stored procedures in an RDBMS: they 
allow user code to be pushed into the address space of the server, 
give access to server-local data, and 
implement lightweight batch jobs, data pre-processing and data summarization 
HFiles are persistent, ordered, immutable maps from key to value 
Deleting data: a delete marker (tombstone marker) is written to indicate that a given key is 
deleted. During reads, data marked as deleted is skipped. 
DDI stands for Denormalization, Duplication and Intelligent Keys 
• Denormalization: a replacement for JOINs 
• Duplication: design for reads 
• Intelligent Keys: implement indexing and sorting, optimize reads
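As an illustration of these operations, here is a minimal sketch using the 0.94-era Java client API (HTable, Put.add); the table name "mytable", family "cf" and the qualifiers are placeholders, and newer HBase versions use Connection/Table and Put.addColumn instead.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CrudSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    HTable table = new HTable(conf, "mytable");          // "mytable" and "cf" are placeholder names
    try {
      // CREATE / UPDATE: a Put against a single row is atomic, even across column families.
      Put put = new Put(Bytes.toBytes("row-1"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-1"));
      table.put(put);

      // READ: select a specific column; older versions can be requested with get.setMaxVersions().
      Get get = new Get(Bytes.toBytes("row-1"));
      get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));
      Result result = table.get(get);
      System.out.println(
          Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

      // DELETE: writes a tombstone marker; the data disappears from reads immediately
      // and is physically removed at the next major compaction.
      Delete delete = new Delete(Bytes.toBytes("row-1"));
      table.delete(delete);
    } finally {
      table.close();
    }
  }
}
```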
8 
Is HBase ACID ? 
● ACID = Atomicity, Consistency, Isolation, and Durability 
● HBase guarantees: 
– Atomic: All row level operations within a table are atomic. This guarantee 
is maintained even when there’s more than one column family within a row. 
– Consistency: Scan operations return a consistent view of the data stored 
in HBase at some point in the past. Concurrent client interaction could 
update a row during a multi-row scan, but all rows returned by a scan 
operation will always contain valid data from some point in the past. 
– Durability: Any data that can be retrieved from HBase has also been made 
durable to disk (persisted to HDFS, in other words). 
When ACID properties are required by HBase clients, design the 
HBase schema such that cross row or cross table data operations 
are not required. Keeping data within a row provides atomicity.
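A small sketch of the single-row atomicity guarantee in practice, again against the 0.94-era HTable API: checkAndPut applies a Put only if a cell still holds an expected value, and incrementColumnValue bumps a counter, both under the row lock. Table, family and qualifier names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicRowSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "accounts");   // "accounts", "cf", "status" are placeholder names
    try {
      byte[] row = Bytes.toBytes("user-42");
      byte[] cf  = Bytes.toBytes("cf");

      // Atomic compare-and-set on a single row: the Put is applied only if the
      // current value of cf:status equals "pending".
      Put activate = new Put(row);
      activate.add(cf, Bytes.toBytes("status"), Bytes.toBytes("active"));
      boolean applied = table.checkAndPut(row, cf, Bytes.toBytes("status"),
          Bytes.toBytes("pending"), activate);
      System.out.println("checkAndPut applied: " + applied);

      // Atomic counter update on a single cell, a common read-modify-write pattern.
      long newCount = table.incrementColumnValue(row, cf, Bytes.toBytes("logins"), 1L);
      System.out.println("login count: " + newCount);
    } finally {
      table.close();
    }
  }
}
```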
9 
HBase cluster – Failure Candidates 
● Data Center: geo distributed data 
● Cluster: avoid redundant clusters; rather, have one big cluster with high redundancy 
● Rack: Hadoop has built-in rack awareness 
● Network Switch: redundant network within each node 
● Power Strip: redundant power within each node 
● Region Server or Data Node: can be added/removed dynamically for regular 
maintenance => hence a replication factor of 3 or 4 is needed 
● Zookeeper Node: ZooKeeper nodes are distributed and can be added/removed 
dynamically; there must be an odd number of them because of the quorum (best practice: 5 or 7) 
● HBase Master or Name Node: multiple HMasters (best practice: 2-3, 1 per rack)
10 
Backup built-in 
● HBase is highly distributed and has built-in versioning and a data 
retention policy 
– No need to backup just for redundancy 
– Point-in-time restore: 
● Use TTL/Table/CF/C and keep the history for X hours/days 
– Accidental deletes: 
● Use 'KeepDeletedCells' to keep all deleted data 
HDFS is a key enabling technology not only for Hadoop but also for HBase. By 
storing data in HDFS, HBase offers reliability, availability, seamless scalability, 
high performance and much more — all on cost effective distributed servers.
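As a hedged sketch of how TTL-based retention and KeepDeletedCells might be configured through the 0.94-era admin API (later versions use an enum for KeepDeletedCells and the TableName/descriptor-builder API); the table name "events", family "cf" and retention values are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RetentionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor table = new HTableDescriptor("events");   // placeholder table name
      HColumnDescriptor cf = new HColumnDescriptor("cf");        // placeholder family name
      cf.setMaxVersions(10);              // keep up to 10 versions per cell
      cf.setTimeToLive(7 * 24 * 3600);    // drop cells older than 7 days (TTL is in seconds)
      cf.setKeepDeletedCells(true);       // retain deleted cells so accidental deletes can be recovered
      table.addFamily(cf);
      admin.createTable(table);
    } finally {
      admin.close();
    }
  }
}
```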
11 
Backup - Tools 
● Use export/import tool: 
– Based on timestamps; use it for point-in-time backup/restore 
● Use region snapshots 
– Take HFile snapshots and copy them over to new storage 
location 
– Copy HLog files for point-in-time roll-forward from the snapshot time 
(replay them using WALPlayer after the import) 
● Table snapshots (0.94.6+)
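A minimal sketch of the table-snapshot workflow through HBaseAdmin (available from 0.94.6 onwards); the snapshot and table names are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Take a named snapshot of a table (cheap: it records references to the current HFiles).
      admin.snapshot("events_snap_20150101", "events");   // placeholder snapshot and table names

      // Restore the table to the snapshot state (the table must be disabled first) ...
      admin.disableTable("events");
      admin.restoreSnapshot("events_snap_20150101");
      admin.enableTable("events");

      // ... or materialise the snapshot as a brand new table without touching the original.
      admin.cloneSnapshot("events_snap_20150101", "events_restored");
    } finally {
      admin.close();
    }
  }
}
```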
12 
Hardware/Disk/OS best practices 
● 1U or 2U preferred, avoid 4U or NAS or expensive systems 
● JBOD on slaves, RAID 1+0 on masters 
● No SSDs, No virtualized storage 
● Good number of cores (4-16), HyperThreading enabled on CPUs 
● Good amount of RAM (24-72G) 
● Dual 1G network, 10G or InfiniBand 
● SATA, 7/10/15K, the cheaper the better 
● Use drives with RAID firmware for faster error detection, and let disks fail fast on hardware errors 
● Ext3/Ext4/XFS 
● RHEL or CentOS or Ubuntu 
● Swappiness=0 and no swap files 
● Automation with Puppet (e.g. for deploying an HBase cluster) and Fabric (e.g. for deploying new HBase 
release with zero downtime)
13 
Alerting system 
● A proper alerting system is needed 
– JMX exposes all metrics 
– Ops Dashboards (Ganglia, Cacti, OpenTSDB, NewRelic) 
– Small Dashboard for critical events 
– Define proper levels for escalation 
– Critical 
● Losing a Master or ZooKeeper node 
● +/- 10% drop in performance or latency 
● Key thresholds (load, swap, I/O) 
● Losing 2 or more slave nodes 
● Disk failures 
● Unbalanced nodes 
● FATAL errors in logs
14 
Tables in HBase 
• Tables are sorted by Row in lexicographical order 
• Table schema only defines its column families 
• Each family consists of any number of columns 
• Each column consists of any number of versions 
• Columns only exist when inserted, NULLs are free 
• Columns within a family are sorted and stored together 
• Everything except the table name is raw bytes (byte[]) 
KeyValue: 
(Table, Row, Family:Column, Timestamp) -> Value 
KeyValue instances are not split across blocks. 
For example, if there is an 8 MB KeyValue, 
even if the block size is 64 KB this KeyValue will 
be read in as a coherent block. For more 
information, see the KeyValue source code. 
The KeyValue format inside a byte array is: 
• keylength 
• valuelength 
• key 
• value 
The Key is further decomposed as: 
• rowlength 
• row (i.e., the rowkey) 
• columnfamilylength 
• columnfamily 
• columnqualifier 
• timestamp 
• keytype (e.g., Put, Delete, 
DeleteColumn, DeleteFamily)
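To make the KeyValue layout tangible, here is a small sketch that builds a KeyValue by hand and reads its parts back through the 0.94-era accessors (later versions deprecate these in favour of the Cell/CellUtil API); all row, family, qualifier and value strings are placeholders:

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueSketch {
  public static void main(String[] args) {
    // Build a KeyValue by hand: (row, family, qualifier, timestamp, value).
    KeyValue kv = new KeyValue(
        Bytes.toBytes("row-1"),          // rowkey
        Bytes.toBytes("cf"),             // column family
        Bytes.toBytes("col"),            // column qualifier
        1383859000000L,                  // timestamp in ms (placeholder value)
        Bytes.toBytes("some value"));    // cell value

    // The accessors mirror the key layout listed above.
    System.out.println("row       : " + Bytes.toString(kv.getRow()));
    System.out.println("family    : " + Bytes.toString(kv.getFamily()));
    System.out.println("qualifier : " + Bytes.toString(kv.getQualifier()));
    System.out.println("timestamp : " + kv.getTimestamp());
    System.out.println("value     : " + Bytes.toString(kv.getValue()));
    System.out.println("key type  : " + KeyValue.Type.codeToType(kv.getType()));  // e.g. Put
  }
}
```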
15 
What about the schema design ? 
Schema design is a combination of 
• Designing the keys (rows and columns) 
• Segregating data into column families 
• Choosing compression and block sizes 
CONFIG file: conf/hbase-site.xml
Designing the keys: READ or WRITE design 
● Sequential keys ([timestamp]) would be more appropriate for BridgeIris, since the 
writing process can be done in batch mode. 
● Interactive queries require fast access to the data. 
● Risk of hotspotting on regions with continuous sequential writes (OK if bulk loads 
are used instead). 
16
17 
Designing the keys
18 
Designing keys 
• Tall-Narrow Tables (many rows, few columns) vs Flat-Wide Tables (few rows, 
many columns) 
 Tall-Narrow is recommended 
 Store part of the cell data in the row key 
• Rows do not split across regions => avoid overly large rows. 
• Keep dimensions that are queried together in the same column family, since 
those columns will be stored in the same low-level storage file (HFile on HDFS). 
• Atomicity is at the row level => not an issue in BridgeIris: we can build the 
row/column keys such that we never need to update several rows together (see the row-key sketch below).
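As a hedged sketch of the tall-narrow, hotspot-avoiding key design discussed above: part of the cell data (a sample id and a position, both hypothetical fields, not the actual BridgeIris schema) is pushed into the row key, and a short hash prefix salts the key so sequential writes spread across regions.

```java
import java.security.MessageDigest;

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySketch {
  // Tall-narrow design: identifying data goes into the row key instead of into wide rows.
  // The 2-byte hashed prefix (a placeholder salt length) spreads writes across regions.
  public static byte[] rowKey(String sampleId, long position) throws Exception {
    byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(sampleId));
    byte[] prefix = new byte[] { hash[0], hash[1] };
    return Bytes.add(prefix, Bytes.toBytes(sampleId + ":"), Bytes.toBytes(position));
  }

  public static void main(String[] args) throws Exception {
    // Keys for the same sample stay contiguous (good for scans), while different
    // samples land on different regions because of the hashed prefix.
    System.out.println(Bytes.toStringBinary(rowKey("sample-0001", 1383834L)));
    System.out.println(Bytes.toStringBinary(rowKey("sample-0002", 1383834L)));
  }
}
```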
What about the cluster and HBase config ? 
19 
• Data node and region server should be co-located. Same cluster 
• Replication: at least 3 => OK with HDFS 
• Too many or too small regions are not good. 
• When does a region split ? Region size ? Keep default or set to 1 GB 
• A region splits when a store grows larger than hbase.hregion.max.filesize (10 GB in HBase 
v0.94 as used by EMR) after a major compaction. For a 10-node cluster it is better to have 10 
regions of 0.4 GB than one big region of 4 GB, but too many regions generate memory 
overhead (MSLAB requires 2 MB per family per region). 
• How is a region assigned to a region server ? Keep default 
– Automated to ensure a balance between the region servers (manual command in HBase 
shell: balance_switch, hbase.balancer.period property) 
• What is the best block size ? Keep default 
– The block size can be configured for each column family (default 64 KB). 
– Column families can be flagged as in-memory (quick read access) => are there columns that 
will almost always be requested by the user ? 
• Should blocks be compressed ? How ? No compression, or Snappy if 
needed 
– Compression can be set per column family. GZIP (built in), SNAPPY (to be installed on 
all nodes). GZIP gives better compression but is slower; if compression is used, SNAPPY 
would be more appropriate (see the column-family tuning sketch below).
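A hedged sketch of per-column-family tuning through the 0.94-era admin API (the Compression class moved to org.apache.hadoop.hbase.io.compress in later releases); the table name, family names and block size are placeholders chosen for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class FamilyTuningSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor table = new HTableDescriptor("coverage");   // placeholder table name

      // Hot, frequently requested columns: keep in memory, default 64 KB block size.
      HColumnDescriptor hot = new HColumnDescriptor("hot");
      hot.setInMemory(true);

      // Bulk data: Snappy compression (native libs must be installed on all nodes);
      // larger blocks favour sequential scans over random point reads.
      HColumnDescriptor bulk = new HColumnDescriptor("bulk");
      bulk.setCompressionType(Compression.Algorithm.SNAPPY);
      bulk.setBlocksize(128 * 1024);

      table.addFamily(hot);
      table.addFamily(bulk);
      admin.createTable(table);
    } finally {
      admin.close();
    }
  }
}
```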
20 
Benchmarking is key 
● No single configuration fits all 
● Simulate use cases and run the tests: 
– Bulk loading 
– Random access, read/write 
– Batch processing 
– Scan, filter 
● Factors that can hurt performance 
– Replication factor 
– Zookeeper nodes 
– Network latency 
– Slower disks, CPUs 
– Hot regions, Bad row keys or Bulk loading without pre-splits
21 
MySQL to HBase 
Row key      Column family:{column qualifier:version:value} 
0000000001   gatk_change_stats:  {'chr':1383859:'5', 'pos':1383834:'3932', …} 
             gatk_gene_coverage: {'id_project':38398:'38', 'gene_symbol':3938:'ENSG00003433'} 
0000000002   gatk_change_stats:  {'chr':1383859:'2', 'pos':1383834:'3232', …} 
             gatk_gene_coverage: {'id_project':38398:'8', 'gene_symbol':3938:'ENSG000033890'} 
SQOOP 
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html#_connecting_to_a_database_server
22 
Some demo ...
23 
Thanks !
