© Hortonworks Inc. 2011
Apache HBase
For Architects
Nick Dimiduk
Member of Technical Staff, HBase
Seattle Technical Forum, 2013-05-15
Architecting the Future of Big Data
Agenda
•  Background
–  (how did we get here?)
•  High-level Architecture
–  (where are we?)
•  Anatomy of a RegionServer
–  (how does this thing work?)
•  TL;DR
–  (what did we learn?)
•  Resources
–  (where do we go from here?)
Background
Apache Hadoop in Review
•  Apache Hadoop Distributed Filesystem (HDFS)
–  Distributed, fault-tolerant, throughput-optimized data storage
–  Uses a filesystem analogy, not structured tables
–  The Google File System, 2003, Ghemawat et al.
–  http://research.google.com/archive/gfs.html
•  Apache Hadoop MapReduce (MR)
–  Distributed, fault-tolerant, batch-oriented data processing
–  Line- or record-oriented processing of the entire dataset *
–  “[Application] schema on read”
–  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
–  http://research.google.com/archive/mapreduce.html
* For more on writing MapReduce applications, see “MapReduce
Patterns, Algorithms, and Use Cases”
http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
So what is HBase anyway?
•  Bigtable paper from Google, 2006, Chang et al.
–  “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”
–  http://research.google.com/archive/bigtable.html
•  Key Features:
–  Distributed storage across cluster of machines
–  Random, online read and write data access
–  Schemaless data model (“NoSQL”)
–  Self-managed data partitions
High-level Architecture
Logical Architecture
Distributed, persistent partitions of a BigTable
Table A: rowkeys a through p, in sorted order, partitioned into Region 1 through Region 4
•  Region Server 7
–  Table A, Region 1
–  Table A, Region 2
–  Table G, Region 1070
–  Table L, Region 25
•  Region Server 86
–  Table A, Region 3
–  Table C, Region 30
–  Table F, Region 160
–  Table F, Region 776
•  Region Server 367
–  Table A, Region 4
–  Table C, Region 17
–  Table E, Region 52
–  Table P, Region 1116
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
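The legend above implies a two-step lookup: find the Region whose key range contains a rowkey, then find the Region Server hosting that Region. The following is a minimal Python sketch of that idea, not HBase code. The split points ("e", "i", "m") are an assumption, since the slide only shows rowkeys a through p spread over four Regions, and the function name `locate` is hypothetical. The server assignment is the one drawn on the slide.

```python
import bisect
from typing import Tuple

# Start key of each Region of Table A, in sorted order.
# Assumption: these split points are illustrative. The first Region
# starts at the empty key, so every rowkey falls into some Region.
REGION_START_KEYS = ["", "e", "i", "m"]
REGION_NAMES = ["Region 1", "Region 2", "Region 3", "Region 4"]

# Region-to-server assignment as drawn on the slide.
ASSIGNMENT = {
    "Region 1": "Region Server 7",
    "Region 2": "Region Server 7",
    "Region 3": "Region Server 86",
    "Region 4": "Region Server 367",
}

def locate(rowkey: str) -> Tuple[str, str]:
    """Return (region, server) for a rowkey of Table A."""
    # bisect_right finds the first start key greater than rowkey;
    # the Region just before it is the one whose range contains rowkey.
    idx = bisect.bisect_right(REGION_START_KEYS, rowkey) - 1
    region = REGION_NAMES[idx]
    return region, ASSIGNMENT[region]

if __name__ == "__main__":
    for key in ["a", "g", "k", "p"]:
        print(key, locate(key))
```

Because rows are sorted by rowkey, a binary search over Region start keys is enough; no per-row index is needed to route a request.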
Physical Architecture
Distribution and Data Path
•  ZooKeeper (three nodes shown)
•  HBase clients, embedded in:
–  Java applications
–  HBase Shell
–  REST/Thrift gateway
•  Worker nodes, each running a Region Server alongside an HDFS Data Node
•  HBase Master
•  HDFS Name Node
Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in data path.
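The data path in the legend above can be sketched as a client that resolves a location once, caches it, and then talks to the Region Server directly. This is a toy Python model, not the HBase client. The class names, the `locate_fn` callback standing in for the location lookup, and the `lookups` counter are all hypothetical. It also caches per rowkey for brevity, where a real client would cache per Region key range.

```python
class RegionServer:
    """Toy Region Server: stores rows for the key ranges it serves."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, rowkey, value):
        self.rows[rowkey] = value

    def get(self, rowkey):
        return self.rows.get(rowkey)


class Client:
    """Caches locations so reads and writes go straight to a Region Server."""

    def __init__(self, locate_fn):
        self._locate = locate_fn   # stands in for the location lookup
        self._cache = {}           # rowkey -> RegionServer
        self.lookups = 0           # times the lookup was consulted

    def _server_for(self, rowkey):
        if rowkey not in self._cache:
            self.lookups += 1
            self._cache[rowkey] = self._locate(rowkey)
        return self._cache[rowkey]

    def put(self, rowkey, value):
        # No master involved: the request goes directly to the server.
        self._server_for(rowkey).put(rowkey, value)

    def get(self, rowkey):
        return self._server_for(rowkey).get(rowkey)


if __name__ == "__main__":
    rs1, rs2 = RegionServer("rs1"), RegionServer("rs2")
    client = Client(lambda key: rs1 if key < "m" else rs2)
    client.put("apple", 1)
    client.put("zebra", 2)
    print(client.get("apple"), client.get("zebra"), client.lookups)
```

After the first request for a key, repeated reads and writes never consult the lookup again, which is the sense in which the master stays out of the data path.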
Logical Data Model
A sparse, multi-dimensional, sorted map
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
Table A
rowkey  column family  column qualifier  timestamp   value
a       cf1            "bar"             1368394583  7
                                         1368394261  "hello"
                       "foo"             1368394583  22
                                         1368394925  13.6
                                         1368393847  "world"
        cf2            1.0001            1368387684  "almost the loneliest number"
                       "2011-07-04"      1368396302  "fourth of July"
b       cf2            "thumb"           1368387247  [3.6 kb png data]
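One way to picture the "sparse, multi-dimensional, sorted map" is as nested maps keyed by rowkey, column family, qualifier, and timestamp. The Python sketch below uses the cells shown on the slide; which value belongs to which qualifier is my reading of the diagram, and encoding every qualifier and value as a byte string is an assumption. The helper names `get` and `scan` are hypothetical and only mimic the shape of the real operations.

```python
# rowkey -> column family -> qualifier -> {timestamp: value}
# Everything is bytes, since HBase treats qualifiers and values as opaque.
table_a = {
    b"a": {
        b"cf1": {
            b"bar": {1368394583: b"7", 1368394261: b"hello"},
            b"foo": {1368394583: b"22", 1368394925: b"13.6",
                     1368393847: b"world"},
        },
        b"cf2": {
            b"1.0001": {1368387684: b"almost the loneliest number"},
            b"2011-07-04": {1368396302: b"fourth of July"},
        },
    },
    # Sparse: row b has no cells in cf1 at all, and that costs nothing.
    b"b": {
        b"cf2": {b"thumb": {1368387247: b"<3.6 kb png data>"}},
    },
}

def get(table, rowkey, family, qualifier, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table.get(rowkey, {}).get(family, {}).get(qualifier, {})
    newest_first = sorted(versions.items(), reverse=True)
    return newest_first[:max_versions]

def scan(table, start_row, stop_row):
    """Yield rowkeys in sorted order within [start_row, stop_row)."""
    for rowkey in sorted(table):
        if start_row <= rowkey < stop_row:
            yield rowkey

if __name__ == "__main__":
    print(get(table_a, b"a", b"cf1", b"foo"))
    print(get(table_a, b"a", b"cf1", b"bar", max_versions=2))
    print(list(scan(table_a, b"a", b"c")))
```

A read without an explicit timestamp returns the newest version, and a missing cell is simply an absent key rather than a stored null.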
Anatomy of a RegionServer
Storage Machinery
Implementing the data model
RegionServer
    HLog (WAL)
    BlockCache
    HRegion
        HStore
            MemStore
            StoreFile (HFile)
            StoreFile (HFile)
            ...
        HStore
        ...
    HRegion
    ...
HDFS
Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and WAL are persisted on HDFS.
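How the pieces in this legend cooperate can be sketched with a toy Store in Python. This is an illustration under stated assumptions, not HBase's implementation: the write ordering (WAL first, then MemStore) is standard HBase behaviour rather than something stated on the slide, the flush threshold counts entries where HBase uses bytes, and the BlockCache and compactions are left out. The class and method names are hypothetical.

```python
class Store:
    """Toy Store: one MemStore plus immutable, sorted StoreFiles."""

    def __init__(self, wal, flush_threshold=3):
        self.wal = wal              # write-ahead log shared by the server
        self.memstore = {}          # in-memory, mutable
        self.storefiles = []        # sorted tuples of (key, value), oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. record the edit durably
        self.memstore[key] = value      # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as a new sorted, immutable StoreFile."""
        if self.memstore:
            self.storefiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Newest data wins: check the MemStore, then files newest first.
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):
            for k, v in storefile:      # linear scan keeps the sketch short
                if k == key:
                    return v
        return None


if __name__ == "__main__":
    wal = []
    store = Store(wal, flush_threshold=2)
    store.put("k1", "v1")
    store.put("k2", "v2")   # reaches the threshold and triggers a flush
    store.put("k1", "v3")   # newer value lives in the MemStore
    print(store.get("k1"), store.get("k2"), len(store.storefiles))
```

A read may have to consult the MemStore and several StoreFiles, which is why the number of files per Store matters for read latency.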
TL;DR
For what kinds of workloads is it well suited?
•  It depends on how you tune it, but…
•  HBase is good for:
–  Large datasets
–  Sparse datasets
–  Loosely coupled (denormalized) records
–  Lots of concurrent clients
•  Try to avoid:
–  Small datasets (unless you have lots of them)
–  Highly relational records
–  Schema designs requiring transactions *
* Transactions might not be as necessary as you think, see “Eric Brewer on why banks are BASE not ACID”
http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html
** Or maybe not, “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.” – Google Spanner paper, http://research.google.com/archive/spanner.html
How does it integrate with my infrastructure?
•  Horizontally scale application data
–  Highly concurrent, read/write access
–  Consistent, persisted shared state
–  Distributed online data processing via Coprocessors (experimental)
•  Gateway between online services and offline storage/analysis
–  Staging area to receive new data
–  Serve online, indexed “views” on datasets from HDFS
–  Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
What data semantics does it provide?
•  GET, PUT, DELETE key-value operations
•  SCAN for queries
•  INCREMENT, CAS server-side atomic operations
•  Row-level write atomicity
•  MapReduce integration
–  Online API (today)
–  Bulkload (today)
–  Snapshots (coming)
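The operations listed above can be modelled with a small in-memory table in Python. This is a sketch of the semantics only, not the HBase client API: the class `ToyTable`, its method names, and the per-row lock used to stand in for row-level atomicity are all assumptions made for illustration.

```python
import threading

class ToyTable:
    """In-memory stand-in showing row-level atomic operations."""

    def __init__(self):
        self._rows = {}                       # rowkey -> {column: value}
        self._locks = {}                      # rowkey -> lock
        self._locks_guard = threading.Lock()  # protects the lock table

    def _lock(self, rowkey):
        with self._locks_guard:
            return self._locks.setdefault(rowkey, threading.Lock())

    def put(self, rowkey, columns):
        # All columns of one row change together or not at all.
        with self._lock(rowkey):
            self._rows.setdefault(rowkey, {}).update(columns)

    def get(self, rowkey):
        with self._lock(rowkey):
            return dict(self._rows.get(rowkey, {}))

    def delete(self, rowkey):
        with self._lock(rowkey):
            self._rows.pop(rowkey, None)

    def scan(self, start, stop):
        """Yield (rowkey, row) in sorted order within [start, stop)."""
        for rowkey in sorted(self._rows):
            if start <= rowkey < stop:
                yield rowkey, self.get(rowkey)

    def increment(self, rowkey, column, amount=1):
        # Read-modify-write happens under the row lock, so no update is lost.
        with self._lock(rowkey):
            row = self._rows.setdefault(rowkey, {})
            row[column] = row.get(column, 0) + amount
            return row[column]

    def check_and_put(self, rowkey, column, expected, columns):
        # Compare-and-set: write only if the current value matches.
        with self._lock(rowkey):
            if self._rows.get(rowkey, {}).get(column) != expected:
                return False
            self._rows.setdefault(rowkey, {}).update(columns)
            return True


if __name__ == "__main__":
    t = ToyTable()
    t.put("r1", {"cf:a": 1, "cf:b": 2})
    print(t.increment("r1", "cf:a", 5))
    print(t.check_and_put("r1", "cf:b", 2, {"cf:b": 3}))
    print(t.check_and_put("r1", "cf:b", 2, {"cf:b": 4}))
```

Note what the model does not offer: there is no way to change two rows atomically, which matches the "row-level write atomicity" bullet above.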
What about operational concerns?
•  Provision hardware with more spindles/TB
•  Balance memory and IO for reads
–  Contention between random and sequential access
–  Configure Block size, BlockCache, compression, codecs based on access patterns
–  Additional resources
–  “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
–  “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
•  Balance IO for writes
–  Configure C1 (compactions, region size, compression, pre-splits, &c.) based on write pattern
–  Balance IO contention between maintaining C1 and serving reads
–  Additional resources
–  “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
–  “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
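Pre-splits, mentioned above among the write-side settings, mean choosing Region boundaries when a table is created so that early writes spread across Region Servers. As a rough Python sketch, assume rowkeys begin with a fixed-width hexadecimal prefix (for example a hash of the natural key); that key design and the helper name `hex_split_points` are assumptions, not something the slides prescribe.

```python
def hex_split_points(num_regions, width=2):
    """Return num_regions - 1 evenly spaced split keys over a hex prefix space.

    With width=2 the prefix space is 00..ff (256 values).
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    space = 16 ** width
    return [
        format(space * i // num_regions, "0{}x".format(width))
        for i in range(1, num_regions)
    ]

if __name__ == "__main__":
    print(hex_split_points(4))
    print(hex_split_points(16, width=1))
```

Evenly spaced split points only balance load if the prefixes themselves are evenly distributed, so this fits hashed or salted keys better than naturally ordered ones such as timestamps.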
Resources
Join the Community!
•  hbase.apache.org
–  blogs.apache.org/hbase/
•  Mailing lists
–  hbase.apache.org/mail-lists.html
–  user@hbase.apache.org
•  IRC
–  irc.freenode.net#hbase
•  JIRA
–  issues.apache.org/jira/browse/HBASE
•  Source
–  git clone git://git.apache.org/hbase.git
–  svn checkout http://svn.apache.org/repos/asf/hbase/trunk hbase
•  Conference Season
–  HBaseCon 2013, June 13, hbasecon.com
–  Hadoop Summit, June 26-27, hadoopsummit.org
HBase@Hortonworks
•  Mean Time To Recovery (MTTR)
–  HDFS improvements, faster recovery of META, log replay instead of log splitting, improving failure detection
•  Testing
–  Integration test suite, system tests, destructive testing, ChaosMonkey, load tests, NameNode HA, test coverage and consistency
•  Compaction Improvements
–  Pluggable compaction, tier-based compaction, stripe / LevelDB compactions, etc.
•  IPC / Wire compatibility
–  Migration to Google’s Protocol Buffers
•  HBase MapReduce improvements (Import / Export, etc.)
–  Performance improvements, API uniformity/usability
•  Hardening 0.94
–  Assignment Manager, Log splitting, Region splits, Replication
•  Not to mention:
–  Windows support, Security, Snapshots, Hadoop2, 0.96, LOTS of bug fixes and community reviews
Thanks!
HBase in Action (Manning), Nick Dimiduk and Amandeep Khurana, foreword by Michael Stack
hbaseinaction.com
Nick Dimiduk
github.com/ndimiduk
@xefyr
n10k.com

More Related Content

What's hot

BigData_Chp4: NOSQL
BigData_Chp4: NOSQLBigData_Chp4: NOSQL
BigData_Chp4: NOSQLLilia Sfaxi
 
BI : Analyse des Données avec Mondrian
BI : Analyse des Données avec Mondrian BI : Analyse des Données avec Mondrian
BI : Analyse des Données avec Mondrian Lilia Sfaxi
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Apache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantApache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantALTIC Altic
 
BigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingBigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingLilia Sfaxi
 
Attacking thru HTTP Host header
Attacking thru HTTP Host headerAttacking thru HTTP Host header
Attacking thru HTTP Host headerSergey Belov
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBaseCloudera, Inc.
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Les bases de l'HTML / CSS
Les bases de l'HTML / CSSLes bases de l'HTML / CSS
Les bases de l'HTML / CSSSamuel Robert
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking VN
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...
Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...
Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...ENSET, Université Hassan II Casablanca
 
exercices business intelligence
exercices business intelligence exercices business intelligence
exercices business intelligence Yassine Badri
 
Vault - Secret and Key Management
Vault - Secret and Key ManagementVault - Secret and Key Management
Vault - Secret and Key ManagementAnthony Ikeda
 
Modélisation de données pour MongoDB
Modélisation de données pour MongoDBModélisation de données pour MongoDB
Modélisation de données pour MongoDBMongoDB
 

What's hot (20)

BigData_Chp4: NOSQL
BigData_Chp4: NOSQLBigData_Chp4: NOSQL
BigData_Chp4: NOSQL
 
BI : Analyse des Données avec Mondrian
BI : Analyse des Données avec Mondrian BI : Analyse des Données avec Mondrian
BI : Analyse des Données avec Mondrian
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Apache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performantApache Kafka, Un système distribué de messagerie hautement performant
Apache Kafka, Un système distribué de messagerie hautement performant
 
Une introduction à Hive
Une introduction à HiveUne introduction à Hive
Une introduction à Hive
 
BigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingBigData_Chp3: Data Processing
BigData_Chp3: Data Processing
 
MongoDB
MongoDBMongoDB
MongoDB
 
Attacking thru HTTP Host header
Attacking thru HTTP Host headerAttacking thru HTTP Host header
Attacking thru HTTP Host header
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Les bases de l'HTML / CSS
Les bases de l'HTML / CSSLes bases de l'HTML / CSS
Les bases de l'HTML / CSS
 
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database clusterGrokking Techtalk #40: Consistency and Availability tradeoff in database cluster
Grokking Techtalk #40: Consistency and Availability tradeoff in database cluster
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...
Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...
Mise en oeuvre des Frameworks de Machines et Deep Learning pour les Applicati...
 
exercices business intelligence
exercices business intelligence exercices business intelligence
exercices business intelligence
 
Vault - Secret and Key Management
Vault - Secret and Key ManagementVault - Secret and Key Management
Vault - Secret and Key Management
 
Modélisation de données pour MongoDB
Modélisation de données pour MongoDBModélisation de données pour MongoDB
Modélisation de données pour MongoDB
 
Hadoop
HadoopHadoop
Hadoop
 

Viewers also liked

Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)alexbaranau
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case StudiesEvan Liu
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesData Con LA
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignCloudera, Inc.
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101Nick Dimiduk
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesHBaseCon
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the BasicsHBaseCon
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guidelarsgeorge
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars GeorgeJAX London
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 ReleaseNick Dimiduk
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)tatsuya6502
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseCloudera, Inc.
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseEdureka!
 

Viewers also liked (20)

Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)Intro to HBase Internals & Schema Design (for HBase users)
Intro to HBase Internals & Schema Design (for HBase users)
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies20090713 Hbase Schema Design Case Studies
20090713 Hbase Schema Design Case Studies
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
Hadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema DesignHadoop World 2011: Advanced HBase Schema Design
Hadoop World 2011: Advanced HBase Schema Design
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the Basics
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
 
HBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBaseHBaseCon 2013: Full-Text Indexing for Apache HBase
HBaseCon 2013: Full-Text Indexing for Apache HBase
 
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL databaseHBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
HBase Vs Cassandra Vs MongoDB - Choosing the right NoSQL database
 

Similar to HBase for Architects

Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshotsenissoz
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionAdam Muise
 
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseCloudera, Inc.
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseRishabh Dugar
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitData Con LA
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseCloudera, Inc.
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Ashish Narasimham
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统yongboy
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase强 王
 
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...Michael Stack
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolutionDataWorks Summit
 

Similar to HBase for Architects (20)

Hbase mhug 2015
Hbase mhug 2015Hbase mhug 2015
Hbase mhug 2015
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Mapreduce over snapshots
Mapreduce over snapshotsMapreduce over snapshots
Mapreduce over snapshots
 
Sept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical IntroductionSept 17 2013 - THUG - HBase a Technical Introduction
Sept 17 2013 - THUG - HBase a Technical Introduction
 
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
 
Techincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql databaseTechincal Talk Hbase-Ditributed,no-sql database
Techincal Talk Hbase-Ditributed,no-sql database
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hive
HiveHive
Hive
 
Horizon for Big Data
Horizon for Big DataHorizon for Big Data
Horizon for Big Data
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
 
Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30Big data processing engines, Atlanta Meetup 4/30
Big data processing engines, Atlanta Meetup 4/30
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ...
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 

More from Nick Dimiduk

Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixNick Dimiduk
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014Nick Dimiduk
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low LatencyNick Dimiduk
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)Nick Dimiduk
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the CloudNick Dimiduk
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)Nick Dimiduk
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLNick Dimiduk
 

More from Nick Dimiduk (10)

Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - Phoenix
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
HBase Data Types
HBase Data TypesHBase Data Types
HBase Data Types
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the Cloud
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQL
 

HBase for Architects

  • 1. © Hortonworks Inc. 2011 Apache HBase For Architects Nick Dimiduk Member of Technical Staff, HBase Seattle Technical Forum, 2013-05-15 Page 1
  • 2. © Hortonworks Inc. 2011 Page 2 Architecting the Future of Big Data
  • 3. Agenda
      •  Background
         –  (how did we get here?)
      •  High-level Architecture
         –  (where are we?)
      •  Anatomy of a RegionServer
         –  (how does this thing work?)
      •  TL;DR
         –  (what did we learn?)
      •  Resources
         –  (where do we go from here?)
  • 4. Background
  • 5. Apache Hadoop in Review
      •  Apache Hadoop Distributed Filesystem (HDFS)
         –  Distributed, fault-tolerant, throughput-optimized data storage
         –  Uses a filesystem analogy, not structured tables
         –  The Google File System, 2003, Ghemawat et al.
         –  http://research.google.com/archive/gfs.html
      •  Apache Hadoop MapReduce (MR)
         –  Distributed, fault-tolerant, batch-oriented data processing
         –  Line- or record-oriented processing of the entire dataset *
         –  “[Application] schema on read”
         –  MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
         –  http://research.google.com/archive/mapreduce.html
      * For more on writing MapReduce applications, see “MapReduce Patterns, Algorithms, and Use Cases” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
  • 6. So what is HBase anyway?
      •  BigTable paper from Google, 2006, Dean et al.
         –  “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”
         –  http://research.google.com/archive/bigtable.html
      •  Key Features:
         –  Distributed storage across cluster of machines
         –  Random, online read and write data access
         –  Schemaless data model (“NoSQL”)
         –  Self-managed data partitions
  • 7. High-level Architecture
  • 8. Logical Architecture: distributed, persistent partitions of a BigTable
      [Diagram: Table A, with rowkeys a through p, is partitioned into Region 1, Region 2, Region 3, and Region 4.]
      Region Server 7: Table A, Region 1; Table A, Region 2; Table G, Region 1070; Table L, Region 25
      Region Server 86: Table A, Region 3; Table C, Region 30; Table F, Region 160; Table F, Region 776
      Region Server 367: Table A, Region 4; Table C, Region 17; Table E, Region 52; Table P, Region 1116
      Legend:
      - A single table is partitioned into Regions of roughly equal size.
      - Regions are assigned to Region Servers across the cluster.
      - Region Servers host roughly the same number of regions.
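The partitioning described in the legend can be illustrated with a small lookup table. This is a minimal sketch of the idea only, not HBase's actual lookup path: the region boundaries (`""`, `"e"`, `"i"`, `"m"`) are assumptions chosen so that each region holds four of the diagram's rowkeys, and the `locate` helper is hypothetical.

```python
from bisect import bisect_right

# Assumed start keys: each region covers [start_key, next_start_key).
# The empty string stands for "start of table".
REGION_START_KEYS = ["", "e", "i", "m"]
REGION_NAMES = ["Region 1", "Region 2", "Region 3", "Region 4"]

# Region-to-server assignment for Table A, copied from the diagram.
ASSIGNMENT = {
    "Region 1": "Region Server 7",
    "Region 2": "Region Server 7",
    "Region 3": "Region Server 86",
    "Region 4": "Region Server 367",
}


def locate(rowkey):
    """Return (region, server) for a rowkey (hypothetical helper)."""
    # The owning region is the last one whose start key is <= rowkey.
    idx = bisect_right(REGION_START_KEYS, rowkey) - 1
    region = REGION_NAMES[idx]
    return region, ASSIGNMENT[region]


if __name__ == "__main__":
    for key in ["a", "h", "i", "p"]:
        print(key, locate(key))
```

Because rowkeys are kept in sorted order, a binary search over region start keys is enough to find the owning region, which is why contiguous key ranges are cheap to route.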
  • 9. Physical Architecture: distribution and data path
      [Diagram: HBase clients (Java apps, HBase Shell, REST/Thrift Gateway); a ZooKeeper ensemble; multiple Region Servers, each paired with a Data Node; HBase Master; Name Node.]
      Legend:
      - An HBase RegionServer is collocated with an HDFS DataNode.
      - HBase clients communicate directly with Region Servers for sending and receiving data.
      - HMaster manages Region assignment and handles DDL operations.
      - Online configuration state is maintained in ZooKeeper.
      - HMaster and ZooKeeper are NOT involved in data path.
  • 10. Logical Data Model: a sparse, multi-dimensional, sorted map
      Table A:
      rowkey | column family | column qualifier | timestamp  | value
      a      | cf1           | "bar"            | 1368394583 | 7
      a      | cf1           | "bar"            | 1368394261 | "hello"
      a      | cf1           | "foo"            | 1368394583 | 22
      a      | cf1           | "foo"            | 1368394925 | 13.6
      a      | cf1           | "foo"            | 1368393847 | "world"
      a      | cf2           | 1.0001           | 1368387684 | "almost the loneliest number"
      a      | cf2           | "2011-07-04"     | 1368396302 | "fourth of July"
      b      | cf2           | "thumb"          | 1368387247 | [3.6 kb png data]
      Legend:
      - Rows are sorted by rowkey.
      - Within a row, values are located by column family and qualifier.
      - Values also carry a timestamp; there can be multiple versions of a value.
      - Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
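One way to picture the "sparse, multi-dimensional, sorted map" is as nested dictionaries keyed by rowkey, column family, qualifier, and timestamp. The sketch below is an in-memory illustration under that assumption; the `SparseTable` class and its methods are hypothetical and are not the HBase client API. The sample cells are taken from the slide.

```python
class SparseTable:
    """Toy model: rowkey -> family -> qualifier -> timestamp -> value."""

    def __init__(self, families):
        # Column families are fixed up front; qualifiers are free-form.
        self.families = set(families)
        self.rows = {}

    def put(self, rowkey, family, qualifier, timestamp, value):
        if family not in self.families:
            raise KeyError("unknown column family: %r" % family)
        versions = (
            self.rows.setdefault(rowkey, {})
            .setdefault(family, {})
            .setdefault(qualifier, {})
        )
        versions[timestamp] = value

    def get(self, rowkey, family, qualifier, timestamp=None):
        versions = self.rows[rowkey][family][qualifier]
        if timestamp is None:
            # Default to the newest version of the cell.
            timestamp = max(versions)
        return versions[timestamp]

    def scan(self):
        # Rows come back in rowkey order.
        return sorted(self.rows)


table = SparseTable(["cf1", "cf2"])
table.put(b"b", "cf2", b"thumb", 1368387247, b"<png bytes>")
table.put(b"a", "cf1", b"foo", 1368394583, b"22")
table.put(b"a", "cf1", b"foo", 1368394925, b"13.6")
table.put(b"a", "cf1", b"foo", 1368393847, b"world")

if __name__ == "__main__":
    print(table.get(b"a", "cf1", b"foo"))
    print(table.scan())
```

Row `b` stores nothing under `cf1`, and that absence costs nothing: only cells that were written exist, which is what "sparse" means here.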
  • 11. Anatomy of a RegionServer
  • 12. Storage Machinery: implementing the data model
      [Diagram: a RegionServer holding an HLog (WAL), a BlockCache, and multiple HRegions; each HRegion holds multiple HStores; each HStore holds a MemStore and multiple StoreFiles (HFiles); HFiles and the WAL sit on HDFS.]
      Legend:
      - A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
      - A Region contains multiple Stores, one for each Column Family.
      - A Store consists of multiple StoreFiles and a MemStore.
      - A StoreFile corresponds to a single HFile.
      - HFiles and WAL are persisted on HDFS.
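The relationship between the WAL, the MemStore, and the StoreFiles can be sketched as a tiny log-structured store. This is a simplified illustration, not HBase code: the `Store` class, the list standing in for the WAL, and the `flush_threshold` parameter are all assumptions made for the example, and BlockCache and compactions are left out.

```python
class Store:
    """Toy store: one in-memory MemStore plus immutable sorted StoreFiles."""

    def __init__(self, flush_threshold=2):
        self.memstore = {}
        self.store_files = []  # oldest first; each is a sorted list of (key, value)
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def put(self, key, value, wal):
        # Record the edit in the write-ahead log before touching memory.
        wal.append((key, value))
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the MemStore as a new immutable, sorted StoreFile.
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Newest data wins: MemStore first, then StoreFiles newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            found = dict(store_file).get(key)
            if found is not None:
                return found
        return None


if __name__ == "__main__":
    wal = []
    store = Store(flush_threshold=2)
    store.put("a", "1", wal)
    store.put("b", "2", wal)  # reaches the threshold and triggers a flush
    store.put("a", "3", wal)  # newer value stays in the MemStore
    print(store.get("a"), store.get("b"), len(store.store_files))
```

Reads may have to consult several files per Store, which is one reason the number of StoreFiles matters for read performance.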
  • 13. TL;DR
  • 14. For what kinds of workloads is it well suited?
      •  It depends on how you tune it, but…
      •  HBase is good for:
         –  Large datasets
         –  Sparse datasets
         –  Loosely coupled (denormalized) records
         –  Lots of concurrent clients
      •  Try to avoid:
         –  Small datasets (unless you have lots of them)
         –  Highly relational records
         –  Schema designs requiring transactions *
      * Transactions might not be as necessary as you think, see “Eric Brewer on why banks are BASE not ACID” http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html
      ** Or maybe not, “We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.” – Google Spanner paper, http://research.google.com/archive/spanner.html
  • 15. How does it integrate with my infrastructure?
      •  Horizontally scale application data
         –  Highly concurrent, read/write access
         –  Consistent, persisted shared state
         –  Distributed online data processing via Coprocessors (experimental)
      •  Gateway between online services and offline storage/analysis
         –  Staging area to receive new data
         –  Serve online, indexed “views” on datasets from HDFS
         –  Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
  • 16. What data semantics does it provide?
      •  GET, PUT, DELETE key-value operations
      •  SCAN for queries
      •  INCREMENT, CAS server-side atomic operations
      •  Row-level write atomicity
      •  MapReduce integration
         –  Online API (today)
         –  Bulkload (today)
         –  Snapshots (coming)
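The operations listed on this slide can be mimicked with a small in-memory table to make their semantics concrete. The following is a sketch only: `MiniTable` and its method names are hypothetical and do not correspond to the HBase client API, and a single lock stands in for the server-side atomicity that HBase provides per row.

```python
import threading


class MiniTable:
    """Toy key-value table showing GET, PUT, DELETE, SCAN, INCREMENT, CAS."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # stand-in for server-side atomicity

    def put(self, rowkey, column, value):
        with self._lock:
            self._rows.setdefault(rowkey, {})[column] = value

    def get(self, rowkey, column):
        with self._lock:
            return self._rows.get(rowkey, {}).get(column)

    def delete(self, rowkey, column):
        with self._lock:
            self._rows.get(rowkey, {}).pop(column, None)

    def scan(self, start, stop):
        # Return non-empty rows with start <= rowkey < stop, in key order.
        with self._lock:
            return [
                (key, dict(self._rows[key]))
                for key in sorted(self._rows)
                if start <= key < stop and self._rows[key]
            ]

    def increment(self, rowkey, column, amount=1):
        # Read-modify-write under one lock, so no update is lost.
        with self._lock:
            row = self._rows.setdefault(rowkey, {})
            row[column] = row.get(column, 0) + amount
            return row[column]

    def check_and_put(self, rowkey, column, expected, new_value):
        # Compare-and-set: write only if the current value matches.
        with self._lock:
            if self._rows.get(rowkey, {}).get(column) != expected:
                return False
            self._rows.setdefault(rowkey, {})[column] = new_value
            return True


if __name__ == "__main__":
    t = MiniTable()
    t.put("row1", "cf:a", "x")
    print(t.increment("row1", "cf:n", 5))
    print(t.check_and_put("row1", "cf:a", "x", "y"))
    print(t.scan("row1", "row2"))
```

INCREMENT and CAS are useful precisely because the read and the write happen together on the server; a client doing a GET followed by a PUT could lose a concurrent update.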
  • 17. What about operational concerns?
      •  Provision hardware with more spindles/TB
      •  Balance memory and IO for reads
         –  Contention between random and sequential access
         –  Configure Block size, BlockCache, compression, codecs based on access patterns
         –  Additional resources
            –  “HBase: Performance Tuners,” http://labs.ericsson.com/blog/hbase-performance-tuners
            –  “Scanning in HBase,” http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
      •  Balance IO for writes
         –  Configure C1 (compactions, region size, compression, pre-splits, &c.) based on write pattern
         –  Balance IO contention between maintaining C1 and serving reads
         –  Additional resources
            –  “Configuring HBase Memstore: what you should know,” http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
            –  “Visualizing HBase Flushes And Compactions,” http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
  • 18. Resources
  • 19. Join the Community!
      •  hbase.apache.org
         –  blogs.apache.org/hbase/
      •  Mailing lists
         –  hbase.apache.org/mail-lists.html
         –  user@hbase.apache.org
      •  IRC
         –  irc.freenode.net#hbase
      •  JIRA
         –  issues.apache.org/jira/browse/HBASE
      •  Source
         –  git clone git://git.apache.org/hbase.git
         –  svn checkout http://svn.apache.org/repos/asf/hbase/trunk hbase
      •  Conference Season
         –  HBaseCon 2013, June 13, hbasecon.com
         –  Hadoop Summit, June 26-27, hadoopsummit.org
  • 20. HBase@Hortonworks
      •  Mean Time To Recovery (MTTR)
         –  HDFS improvements, faster recovery of META, log replay instead of log splitting, improving failure detection
      •  Testing
         –  Integration test suite, system tests, destructive testing, ChaosMonkey, load tests, Namenode HA, test coverage and consistency
      •  Compaction Improvements
         –  Pluggable compaction, tier based compaction, stripe / leveldb compactions, etc
      •  IPC / Wire compatibility
         –  Migration to Google’s Protocol Buffers
      •  HBase MapReduce improvements (Import / Export, etc)
         –  Performance improvements, API uniformity/usability
      •  Hardening 0.94
         –  Assignment Manager, Log splitting, Region splits, Replication
      •  Not to mention:
         –  Windows support, Security, Snapshots, Hadoop2, 0.96, LOTS of bug fixes and community reviews
  • 21. Thanks!
      •  Book (Manning): Nick Dimiduk and Amandeep Khurana, foreword by Michael Stack, hbaseinaction.com
      •  Nick Dimiduk
         –  github.com/ndimiduk
         –  @xefyr
         –  n10k.com