SlideShare a Scribd company logo
1 of 21
Download to read offline
© Hortonworks Inc. 2011
Apache HBase
For Architects
Nick Dimiduk
Member of Technical Staff, HBase
Seattle Technical Forum, 2013-05-15
Page 1
© Hortonworks Inc. 2011
Page 2
Architecting the Future of Big Data
© Hortonworks Inc. 2011
Agenda
•  Background
–  (how did we get here?)
•  High-level Architecture
–  (where are we?)
•  Anatomy of a RegionServer
–  (how does this thing work?)
•  TL;DR
–  (what did we learn?)
•  Resources
–  (where do we go from here?)
Page 3
Architecting the Future of Big Data
© Hortonworks Inc. 2011
Background
Architecting the Future of Big Data
Page 4
© Hortonworks Inc. 2011
Apache Hadoop in Review
•  Apache Hadoop Distributed Filesystem (HDFS)
–  Distributed, fault-tolerant, throughput-optimized data storage
–  Uses a filesystem analogy, not structured tables
–  The Google File System, 2003, Ghemawat et al.
–  http://research.google.com/archive/gfs.html
•  Apache Hadoop MapReduce (MR)
–  Distributed, fault-tolerant, batch-oriented data processing
–  Line- or record-oriented processing of the entire dataset *
–  “[Application] schema on read”
–  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and
Ghemawat
–  http://research.google.com/archive/mapreduce.html
Page 5
Architecting the Future of Big Data
* For more on writing MapReduce applications, see “MapReduce
Patterns, Algorithms, and Use Cases”
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
© Hortonworks Inc. 2011
So what is HBase anyway?
•  BigTable paper from Google, 2006, Dean et al.
–  “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”
–  http://research.google.com/archive/bigtable.html
•  Key Features:
–  Distributed storage across cluster of machines
–  Random, online read and write data access
–  Schemaless data model (“NoSQL”)
–  Self-managed data partitions
Page 6
Architecting the Future of Big Data
© Hortonworks Inc. 2011
High-level Architecture
Architecting the Future of Big Data
Page 7
© Hortonworks Inc. 2011
Page 9
Architecting the Future of Big Data
Logical Architecture
Distributed, persistent partitions of a BigTable
a
b
d
c
e
f
h
g
i
j
l
k
m
n
p
o
Table A
Region 1
Region 2
Region 3
Region 4
Region Server 7
Table A, Region 1
Table A, Region 2
Table G, Region 1070
Table L, Region 25
Region Server 86
Table A, Region 3
Table C, Region 30
Table F, Region 160
Table F, Region 776
Region Server 367
Table A, Region 4
Table C, Region 17
Table E, Region 52
Table P, Region 1116
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
© Hortonworks Inc. 2011
Page 11
Architecting the Future of Big Data
Physical Architecture
Distribution and Data Path
...
Zoo
Keeper
Zoo
Keeper
Zoo
Keeper
HBase
Client
JavaApp
HBase
Client
JavaApp
HBase
Client
HBase Shell
HBase
Client
REST/Thrift
Gateway
HBase
Client
JavaApp
HBase
Client
JavaApp
Region
Server
Data
Node
Region
Server
Data
Node
...
Region
Server
Data
Node
Region
Server
Data
Node
HBase
Master
Name
Node
Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in data path.
© Hortonworks Inc. 2011
Page 13
Architecting the Future of Big Data
Logical Data Model
A sparse, multi-dimensional, sorted map
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can me multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
1368387247 [3.6 kb png data]"thumb"cf2b
a
cf1
1368394583 7
1368394261 "hello"
"bar"
1368394583 22
1368394925 13.6
1368393847 "world"
"foo"
cf2
1368387684 "almost the loneliest number"1.0001
1368396302 "fourth of July""2011-07-04"
Table A
rowkey
column
family
column
qualifier
timestamp value
© Hortonworks Inc. 2011
Anatomy of a
RegionServer
Architecting the Future of Big Data
Page 14
© Hortonworks Inc. 2011
Page 16
Architecting the Future of Big Data
RegionServer
HDFS
HLog
(WAL)
HRegion
HStore
StoreFile
HFile
StoreFile
HFile
MemStore
...
...
HStore
BlockCache
HRegion
...
HStoreHStore
...
Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and WAL are persisted on HDFS.
Storage Machinery
Implementing the data model
© Hortonworks Inc. 2011
TL;DR
Architecting the Future of Big Data
Page 21
© Hortonworks Inc. 2011
For what kinds of workloads is it well suited?
•  It depends on how you tune it, but…
•  HBase is good for:
–  Large datasets
–  Sparse datasets
–  Loosely coupled (denormalized) records
–  Lots of concurrent clients
•  Try to avoid:
–  Small datasets (unless you have lots of them)
–  Highly relational records
–  Schema designs requiring transactions *
Page 22
Architecting the Future of Big Data
* Transactions might not be as necessary as you think, see “Eric
Brewer on why banks are BASE not ACID”
http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-
banks-are-base-not-acid-availability.html
** Or maybe not, “We believe it is better to have application
programmers deal with performance problems due to overuse of
transactions as bottlenecks arise, rather than always coding around
the lack of transactions.” – Google Spanner paper, http://
research.google.com/archive/spanner.html
© Hortonworks Inc. 2011
How does it integrate with my infrastructure?
•  Horizontally scale application data
–  Highly concurrent, read/write access
–  Consistent, persisted shared state
–  Distributed online data processing via Coprocessors (experimental)
•  Gateway between online services and offline storage/analysis
–  Staging area to receive new data
–  Serve online, indexed “views” on datasets from HDFS
–  Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
Page 23
Architecting the Future of Big Data
© Hortonworks Inc. 2011
What data semantics does it provide?
•  GET, PUT, DELETE key-value operations
•  SCAN for queries
•  INCREMENT, CAS server-side atomic operations
•  Row-level write atomicity
•  MapReduce integration
–  Online API (today)
–  Bulkload (today)
–  Snapshots (coming)
Page 24
Architecting the Future of Big Data
© Hortonworks Inc. 2011
What about operational concerns?
•  Provision hardware with more spindles/TB
•  Balance memory and IO for reads
–  Contention between random and sequential access
–  Configure Block size, BlockCache, compression, codecs based on access patterns
–  Additional resources
–  “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
–  “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-
hbase.html
•  Balance IO for writes
–  Configure C1 (compactions, region size, compression, pre-splits, &c.) based on
write pattern
–  Balance IO contention between maintaining C1 and serving reads
–  Additional resources
–  “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/
2012/07/16/hbase-memstore-what-you-should-know/
–  “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-
hbase-flushes-and-compactions/
Page 25
Architecting the Future of Big Data
© Hortonworks Inc. 2011
Resources
Architecting the Future of Big Data
Page 26
© Hortonworks Inc. 2011
Join the Community!
•  hbase.apache.org
–  blogs.apache.org/hbase/
•  Mailing lists
–  hbase.apache.org/mail-lists.html
–  user@hbase.apache.org
•  IRC
–  irc.freenode.net#hbase
•  JIRA
–  issues.apache.org/jira/browse/HBASE
•  Source
–  git clone git://git.apache.org/hbase.git
–  svn checkout http://svn.apache.org/repos/asf/hbase/trunk hbase
•  Conference Season
–  HBaseCon 2013, June 13, hbasecon.com
–  Hadoop Summit, June 26-27, hadoopsummit.org
Page 27
Architecting the Future of Big Data
© Hortonworks Inc. 2011
HBase@Hortonworks
•  Mean Time To Recovery (MTTR)
–  HDFS improvements, faster recovery of META, log replay instead of log splitting,
improving failure detection
•  Testing
–  Integration test suite, system tests, destructive testing, ChaosMonkey, load tests,
Namenode HA, test coverage and consistency
•  Compaction Improvements
–  Pluggable compaction, tier based compaction, stripe / leveldb compactions, etc
•  IPC / Wire compatibility
–  Migration to Google’s Protocol Buffers
•  HBase MapReduce improvements (Import / Export, etc)
–  Performance improvements, API uniformity/usability
•  Hardening 0.94
–  Assignment Manager, Log splitting, Region splits, Replication
•  Not to mention:
–  Windows support, Security, Snapshots, Hadoop2, 0.96, LOTS of bug fixes and
community reviews
Page 28
Architecting the Future of Big Data
© Hortonworks Inc. 2011
Thanks!
Architecting the Future of Big Data
Page 29
M A N N I N G
Nick Dimiduk
Amandeep Khurana
FOREWORD BY
Michael Stack
hbaseinaction.com
Nick Dimiduk
github.com/ndimiduk
@xefyr
n10k.com

More Related Content

What's hot

Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaEdureka!
 
Working with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDBWorking with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDBScaleGrid.io
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneErik Krogen
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsCloudera, Inc.
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path HBaseCon
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentationHyphen Call
 

What's hot (20)

Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | EdurekaPig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
Pig Tutorial | Twitter Case Study | Apache Pig Script and Commands | Edureka
 
Working with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDBWorking with JSON Data in PostgreSQL vs. MongoDB
Working with JSON Data in PostgreSQL vs. MongoDB
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Apache hive
Apache hiveApache hive
Apache hive
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Mongodb
MongodbMongodb
Mongodb
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentation
 

Viewers also liked

Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case StudiesEvan Liu
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesData Con LA
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101Nick Dimiduk
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesHBaseCon
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the BasicsHBaseCon
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guidelarsgeorge
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars GeorgeJAX London
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 ReleaseNick Dimiduk
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)tatsuya6502
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseCloudera, Inc.
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 

Viewers also liked (20)

Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBase
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 

Similar to HBase for Architects

Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseCloudera, Inc.
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseRishabh Dugar
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitData Con LA
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseCloudera, Inc.
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统yongboy
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...Michael Stack
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolutionDataWorks Summit
 

Similar to HBase for Architects (20)

Hbase mhug 2015
Hbase mhug 2015Hbase mhug 2015
Hbase mhug 2015
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql database
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hive
HiveHive
Hive
 
Horizon for Big Data
Horizon for Big DataHorizon for Big Data
Horizon for Big Data
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 

More from Nick Dimiduk

Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixNick Dimiduk
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014Nick Dimiduk
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low LatencyNick Dimiduk
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)Nick Dimiduk
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the CloudNick Dimiduk
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)Nick Dimiduk
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLNick Dimiduk
 

More from Nick Dimiduk (10)

Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - Phoenix
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
HBase Data Types
HBase Data TypesHBase Data Types
HBase Data Types
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the Cloud
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQL
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

HBase for Architects

  • 1. © Hortonworks Inc. 2011 Apache HBase For Architects Nick Dimiduk Member of Technical Staff, HBase Seattle Technical Forum, 2013-05-15 Page 1
  • 2. © Hortonworks Inc. 2011 Page 2 Architecting the Future of Big Data
  • 3. © Hortonworks Inc. 2011 Agenda •  Background –  (how did we get here?) •  High-level Architecture –  (where are we?) •  Anatomy of a RegionServer –  (how does this thing work?) •  TL;DR –  (what did we learn?) •  Resources –  (where do we go from here?) Page 3 Architecting the Future of Big Data
  • 4. © Hortonworks Inc. 2011 Background Architecting the Future of Big Data Page 4
  • 5. © Hortonworks Inc. 2011 Apache Hadoop in Review •  Apache Hadoop Distributed Filesystem (HDFS) –  Distributed, fault-tolerant, throughput-optimized data storage –  Uses a filesystem analogy, not structured tables –  The Google File System, 2003, Ghemawat et al. –  http://research.google.com/archive/gfs.html •  Apache Hadoop MapReduce (MR) –  Distributed, fault-tolerant, batch-oriented data processing –  Line- or record-oriented processing of the entire dataset * –  “[Application] schema on read” –  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat –  http://research.google.com/archive/mapreduce.html Page 5 Architecting the Future of Big Data * For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 6. © Hortonworks Inc. 2011 So what is HBase anyway? •  BigTable paper from Google, 2006, Dean et al. –  “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” –  http://research.google.com/archive/bigtable.html •  Key Features: –  Distributed storage across cluster of machines –  Random, online read and write data access –  Schemaless data model (“NoSQL”) –  Self-managed data partitions Page 6 Architecting the Future of Big Data
  • 7. © Hortonworks Inc. 2011 High-level Architecture Architecting the Future of Big Data Page 7
  • 8. © Hortonworks Inc. 2011 Page 9 Architecting the Future of Big Data Logical Architecture Distributed, persistent partitions of a BigTable a b d c e f h g i j l k m n p o Table A Region 1 Region 2 Region 3 Region 4 Region Server 7 Table A, Region 1 Table A, Region 2 Table G, Region 1070 Table L, Region 25 Region Server 86 Table A, Region 3 Table C, Region 30 Table F, Region 160 Table F, Region 776 Region Server 367 Table A, Region 4 Table C, Region 17 Table E, Region 52 Table P, Region 1116 Legend: - A single table is partitioned into Regions of roughly equal size. - Regions are assigned to Region Servers across the cluster. - Region Servers host roughly the same number of regions.
  • 9. © Hortonworks Inc. 2011 Page 11 Architecting the Future of Big Data Physical Architecture Distribution and Data Path ... Zoo Keeper Zoo Keeper Zoo Keeper HBase Client JavaApp HBase Client JavaApp HBase Client HBase Shell HBase Client REST/Thrift Gateway HBase Client JavaApp HBase Client JavaApp Region Server Data Node Region Server Data Node ... Region Server Data Node Region Server Data Node HBase Master Name Node Legend: - An HBase RegionServer is collocated with an HDFS DataNode. - HBase clients communicate directly with Region Servers for sending and receiving data. - HMaster manages Region assignment and handles DDL operations. - Online configuration state is maintained in ZooKeeper. - HMaster and ZooKeeper are NOT involved in data path.
  • 10. © Hortonworks Inc. 2011 Page 13 Architecting the Future of Big Data Logical Data Model A sparse, multi-dimensional, sorted map Legend: - Rows are sorted by rowkey. - Within a row, values are located by column family and qualifier. - Values also carry a timestamp; there can me multiple versions of a value. - Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes. 1368387247 [3.6 kb png data]"thumb"cf2b a cf1 1368394583 7 1368394261 "hello" "bar" 1368394583 22 1368394925 13.6 1368393847 "world" "foo" cf2 1368387684 "almost the loneliest number"1.0001 1368396302 "fourth of July""2011-07-04" Table A rowkey column family column qualifier timestamp value
  • 11. © Hortonworks Inc. 2011 Anatomy of a RegionServer Architecting the Future of Big Data Page 14
  • 12. © Hortonworks Inc. 2011 Page 16 Architecting the Future of Big Data RegionServer HDFS HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion ... HStoreHStore ... Legend: - A RegionServer contains a single WAL, single BlockCache, and multiple Regions. - A Region contains multiple Stores, one for each Column Family. - A Store consists of multiple StoreFiles and a MemStore. - A StoreFile corresponds to a single HFile. - HFiles and WAL are persisted on HDFS. Storage Machinery Implementing the data model
  • 13. © Hortonworks Inc. 2011 TL;DR Architecting the Future of Big Data Page 21
  • 14. © Hortonworks Inc. 2011 For what kinds of workloads is it well suited? •  It depends on how you tune it, but… •  HBase is good for: –  Large datasets –  Sparse datasets –  Loosely coupled (denormalized) records –  Lots of concurrent clients •  Try to avoid: –  Small datasets (unless you have lots of them) –  Highly relational records –  Schema designs requiring transactions * Page 22 Architecting the Future of Big Data * Transactions might not be as necessary as you think, see “Eric Brewer on why banks are BASE not ACID” http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why- banks-are-base-not-acid-availability.html ** Or maybe not, “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.” – Google Spanner paper, http:// research.google.com/archive/spanner.html
  • 15. © Hortonworks Inc. 2011 How does it integrate with my infrastructure? •  Horizontally scale application data –  Highly concurrent, read/write access –  Consistent, persisted shared state –  Distributed online data processing via Coprocessors (experimental) •  Gateway between online services and offline storage/analysis –  Staging area to receive new data –  Serve online, indexed “views” on datasets from HDFS –  Glue between batch (HDFS, MR1) and online (CEP, Storm) systems Page 23 Architecting the Future of Big Data
  • 16. © Hortonworks Inc. 2011 What data semantics does it provide? •  GET, PUT, DELETE key-value operations •  SCAN for queries •  INCREMENT, CAS server-side atomic operations •  Row-level write atomicity •  MapReduce integration –  Online API (today) –  Bulkload (today) –  Snapshots (coming) Page 24 Architecting the Future of Big Data
  • 17. © Hortonworks Inc. 2011 What about operational concerns? •  Provision hardware with more spindles/TB •  Balance memory and IO for reads –  Contention between random and sequential access –  Configure Block size, BlockCache, compression, codecs based on access patterns –  Additional resources –  “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners –  “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in- hbase.html •  Balance IO for writes –  Configure C1 (compactions, region size, compression, pre-splits, &c.) based on write pattern –  Balance IO contention between maintaining C1 and serving reads –  Additional resources –  “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/ 2012/07/16/hbase-memstore-what-you-should-know/ –  “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing- hbase-flushes-and-compactions/ Page 25 Architecting the Future of Big Data
  • 18. © Hortonworks Inc. 2011 Resources Architecting the Future of Big Data Page 26
  • 19. © Hortonworks Inc. 2011 Join the Community! •  hbase.apache.org –  blogs.apache.org/hbase/ •  Mailing lists –  hbase.apache.org/mail-lists.html –  user@hbase.apache.org •  IRC –  irc.freenode.net#hbase •  JIRA –  issues.apache.org/jira/browse/HBASE •  Source –  git clone git://git.apache.org/hbase.git –  svn checkout http://svn.apache.org/repos/asf/hbase/trunk hbase •  Conference Season –  HBaseCon 2013, June 13, hbasecon.com –  Hadoop Summit, June 26-27, hadoopsummit.org Page 27 Architecting the Future of Big Data
  • 20. © Hortonworks Inc. 2011 HBase@Hortonworks •  Mean Time To Recovery (MTTR) –  HDFS improvements, faster recovery of META, log replay instead of log splitting, improving failure detection •  Testing –  Integration test suite, system tests, destructive testing, ChaosMonkey, load tests, Namenode HA, test coverage and consistency •  Compaction Improvements –  Pluggable compaction, tier based compaction, stripe / leveldb compactions, etc •  IPC / Wire compatibility –  Migration to Google’s Protocol Buffers •  HBase MapReduce improvements (Import / Export, etc) –  Performance improvements, API uniformity/usability •  Hardening 0.94 –  Assignment Manager, Log splitting, Region splits, Replication •  Not to mention: –  Windows support, Security, Snapshots, Hadoop2, 0.96, LOTS of bug fixes and community reviews Page 28 Architecting the Future of Big Data
  • 21. © Hortonworks Inc. 2011 Thanks! Architecting the Future of Big Data Page 29 M A N N I N G Nick Dimiduk Amandeep Khurana FOREWORD BY Michael Stack hbaseinaction.com Nick Dimiduk github.com/ndimiduk @xefyr n10k.com