Apache HBase
For Architects
Nick Dimiduk, Hortonworks
Strata/Hadoop World Barcelona, 2014-11-21
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 1
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License. Page 2
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Agenda
• Background
– (how did we get here?)
• TL;DR
– (don’t waste my time!)
• High-level Architecture
– (where are we?)
• Anatomy of a RegionServer
– (how does this thing work?)
• By Example
– (how do I use it?)
• Resources
– (where do we go from here?)
Page 3
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Background
Page 4
So what is HBase anyway?
• BigTable paper from Google, 2006, Dean et al.
– “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”
– http://research.google.com/archive/bigtable.html
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
• Key Features:
– Distributed storage across cluster of machines
– Random, online read and write data access
– Schemaless data model (“NoSQL”)
– Self-managed data partitions
Page 5
Apache Hadoop Dependencies
• Apache Hadoop Distributed Filesystem (HDFS)
– Distributed, fault-tolerant, throughput-optimized data storage
– The Google File System, 2003, Ghemawat et al.
– http://research.google.com/archive/gfs.html
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
• Apache Zookeeper (ZK)
– Distributed, available, reliable coordination system
– The Chubby Lock Service …, 2006, Burrows
– http://research.google.com/archive/chubby.html
• Apache Hadoop MapReduce (MR)
– Distributed, fault-tolerant, batch-oriented data processing
– MapReduce: …, 2004, Dean and Ghemawat
– http://research.google.com/archive/mapreduce.html
Page 6
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
TL;DR
Page 7
So what is HBase anyway?
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 8
C1 tree C0 tree
Disk Memory
Figure 2.1. Schematic picture of an LSM-tree of two components
Figure 2.1 reproduced from O’Neil, Patrick, et al. "The log-structured
merge-tree (LSM-tree)." Acta Informatica 33.4 (1996): 351-385.
So what is HBase anyway?
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 9
DataNode RegionServer C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together.
primarily random primarily reads random and writes. reads and In other writes. is also a part is also of the a part workloads, of the workloads, TaskTrackers, Servers can Servers run together.
can run together.
So what is HBase anyway?
DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 DataNode RegionServer DataNode RegionServer DataNode C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 C1 Figure 3.7 HBase Figure RegionServer 3.7 HBase and RegionServer HDFS DataNode and DataNode RegionServer DataNode RegionServer C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 10
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
DataNode RegionServer DataNode C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
DataNode RegionServer C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together.
can run together.
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together.
can Servers run together.
can run together.
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together.
can Servers run together.
can run together.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together.
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
So what is HBase anyway?
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Servers can run together.
Servers can run together.
store/access data on HDFS. The master process does the distribution of regions among
RegionServers, and each RegionServer typically hosts multiple regions.
store/access data on HDFS. The master process does the distribution of regions among
RegionServers, and each RegionServer typically hosts multiple regions.
store/access data on HDFS. The master process does the distribution of regions among
RegionServers, and each RegionServer typically hosts multiple regions.
store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions.
store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together.
can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together.
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
DataNode RegionServer DataNode RegionServer DataNode RegionServer
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Servers can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
C1 tree C0 tree
primarily Disk Memory
C1 tree C0 tree
random reads and writes. In other deployments, where the MapReduce pro-cessing
Disk Memory Cache
C1 tree C0 tree
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also Disk Memory
C1 tree C0 tree
a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
is also a part of Disk C1 tree the Memory
C0 tree
workloads, TaskTrackers, DataNodes, and HBase Region-
DataNode RegionServer DataNode RegionServer C1 tree C0 tree
C1 tree C0 tree
Disk Memory
Disk Memory
C1 tree C0 tree
C1 tree C0 tree
Disk Memory Cache
Disk Memory Cache
C1 tree C0 tree
C1 tree C0 tree
Disk Memory
Disk Memory
C1 tree C0 tree
C1 tree C0 tree
Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
DataNode RegionServer DataNode RegionServer DataNode RegionServer
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
Servers can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
C1 tree C0 tree
primarily Disk Memory
C1 tree C0 tree
random reads and writes. In other deployments, where the MapReduce pro-cessing
Disk Memory Cache
Servers can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
C1 tree C0 tree
C1 tree C0 tree
primarily Disk Memory
C1 tree C0 tree
random Disk Memory
C1 tree C0 tree
reads and writes. In other deployments, where the MapReduce pro-cessing
Disk Memory Cache
Disk Memory Cache
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer Disk Memory
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com>
DataNode RegionServer DataNode RegionServer Disk Memory
Disk Memory
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure C1 tree C0 tree
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also Disk Memory
C1 tree C0 tree
a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
Figure C1 tree C0 tree
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory
C1 tree C0 tree
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer Disk Memory
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree
Disk Memory
C1 tree C0 tree
DataNode RegionServer DataNode RegionServer Disk Memory
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Servers can run together.
Servers can run together.
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Servers can run together.
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure C1 tree C0 tree
3.7 HBase C1 tree C0 tree
RegionServer tree C0 tree
and HDFS tree DataNode C0 tree
processes are typically collocated on Disk Memory
Disk Memory
Disk Memory
Disk Memory
C1 tree C0 tree
C1 tree C0 tree
C1 tree C0 tree
C1 tree C0 tree
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com>
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
DataNode RegionServer DataNode RegionServer C1 tree C0 tree
C1 tree C0 tree
C1 tree C0 tree
Disk Memory
Disk Memory
Disk Memory
C1 tree C0 tree
C1 tree C0 tree
C1 tree C0 tree
Disk Memory Cache
Disk Memory Cache
Disk Memory Cache
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree
C1 Disk Memory
C1 tree C0 tree
Disk Memory
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
Disk Memory
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 11
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Figure 3.7 HBase C1 tree C0 tree
RegionServer and HDFS DataNode processes are typically collocated on the same host.
Disk Memory
C1 tree C0 tree
Disk Memory
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, and each RegionServer typically hosts multiple regions.
RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together.
can run together.
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together.
can Servers run together.
can run together.
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together.
can Servers run together.
can run together.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called regions.
Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together.
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
So what is HBase anyway?
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Servers can run together.
Servers can run together.
store/access data on HDFS. The master process does the distribution of regions among
RegionServers, and each RegionServer typically hosts multiple regions.
store/access data on HDFS. The master process does the distribution of regions among
RegionServers, and each RegionServer typically hosts multiple regions.
store/access data on HDFS. The master process does the distribution of regions among
RegionServers, and each RegionServer typically hosts multiple regions.
store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions.
store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together.
can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to all clients as
Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together.
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
DataNode RegionServer DataNode RegionServer DataNode RegionServer
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
a single namespace, all RegionServers have access to the same persisted files in the file
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Servers can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
C1 tree C0 tree
primarily Disk Memory
C1 tree C0 tree
random reads and writes. In other deployments, where the MapReduce pro-cessing
Disk Memory Cache
C1 tree C0 tree
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
can theoretically read and write to local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also Disk Memory
C1 tree C0 tree
a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
is also a part of Disk C1 tree the Memory
C0 tree
workloads, TaskTrackers, DataNodes, and HBase Region-
DataNode RegionServer DataNode RegionServer C1 tree C0 tree
C1 tree C0 tree
Disk Memory
Disk Memory
C1 tree C0 tree
C1 tree C0 tree
Disk Memory Cache
Disk Memory Cache
C1 tree C0 tree
C1 tree C0 tree
Disk Memory
Disk Memory
C1 tree C0 tree
C1 tree C0 tree
Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
DataNode RegionServer DataNode RegionServer DataNode RegionServer
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
Servers can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
C1 tree C0 tree
primarily Disk Memory
C1 tree C0 tree
random reads and writes. In other deployments, where the MapReduce pro-cessing
Disk Memory Cache
Servers can run together.
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
C1 tree C0 tree
C1 tree C0 tree
primarily Disk Memory
C1 tree C0 tree
random Disk Memory
C1 tree C0 tree
reads and writes. In other deployments, where the MapReduce pro-cessing
Disk Memory Cache
Disk Memory Cache
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer Disk Memory
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com>
DataNode RegionServer DataNode RegionServer Disk Memory
Disk Memory
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure C1 tree C0 tree
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also Disk Memory
C1 tree C0 tree
a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
Figure C1 tree C0 tree
3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory
C1 tree C0 tree
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer Disk Memory
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree
Disk Memory
C1 tree C0 tree
DataNode RegionServer DataNode RegionServer Disk Memory
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Servers can run together.
Servers can run together.
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Servers can run together.
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure C1 tree C0 tree
3.7 HBase C1 RegionServer and HDFS DataNode processes are typically collocated on Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com>
"2011-07-04" Disk Memory
C1 tree C0 tree
1368396302 "fourth of July"
tree C0 tree
Disk Memory
C1 tree C0 tree
tree C0 tree
Disk Memory
C1 tree C0 tree
tree C0 tree
Disk Memory
C1 tree C0 tree
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
DataNode RegionServer DataNode RegionServer C1 tree C0 tree
C1 tree C0 tree
C1 tree C0 tree
Disk Memory
Disk Memory
Disk Memory
C1 tree C0 tree
C1 tree C0 tree
C1 tree C0 tree
Disk Memory Cache
Disk Memory Cache
Disk Memory Cache
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree
C1 Disk Memory
C1 tree C0 tree
Disk Memory
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer
Disk Memory
Disk Memory
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 12
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Figure 3.7 HBase C1 tree C0 tree
RegionServer and HDFS DataNode processes are typically collocated on the same host.
Disk Memory
C1 tree C0 tree
Disk Memory
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Licensed to Nick Dimiduk <ndimiduk@gmail.com>
Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
a
cf1
1368394583 7
1368394261 "hello"
"bar"
1368394583 22
1368394925 13.6
1368393847 "world"
"foo"
cf2
1.0001 1368387684 "almost the loneliest number"
Logical Data Model
1368394583 7
1368394261 "hello"
"bar"
1368394583 22
1368394925 13.6
1368393847 "world"
"foo"
"2011-07-04" 1368396302 "fourth of July"
1.0001 1368387684 "almost the loneliest number"
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 14
a
cf1
cf2
b cf2 "thumb" 1368387247 [3.6 kb png data]
Table A
rowkey
column
family
column
qualifier
timestamp value
Rows
Column Families
Logical Architecture
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 15
Table A
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
Region 1
Region 2
Region 3
Region 4
Region Server 7
Table A, Region 1
Table A, Region 2
Table G, Region 1070
Table L, Region 25
Region Server 86
Table A, Region 3
Table C, Region 30
Table F, Region 160
Table F, Region 776
Region Server 367
Table A, Region 4
Table C, Region 17
Table E, Region 52
Table P, Region 1116
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
and RegionServers, you can use the data locality property; that can theoretically read and write to the local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of HBase deployments, the MapReduce framework isn’t deployed at all primarily random reads and writes. In other deployments, where the is also a part of the workloads, TaskTrackers, DataNodes, and Servers can run together.
Nodes Nodes DataNode RegionServer DataNode DataNode RegionServer RegionServer DataNode DataNode RegionServer Figure 3.7 HBase RegionServer Figure 3.7 and HDFS HBase DataNode RegionServer processes and HDFS are typically DataNode collocated processes and RegionServers, you can use the data locality property; can theoretically read and write to the local DataNode You may wonder where the TaskTrackers are in this HBase deployments, the MapReduce framework isn’t deployed primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, DataNodes, Servers can run together.
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same Nodes and RegionServers, you can use the data can theoretically read and write to the local You may wonder where the TaskTrackers HBase deployments, the MapReduce framework primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together.
system and can therefore host any region (figure 3.8). By physically collocating Data-
Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers
Physical Architecture
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the workload primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together.
can theoretically read and write to the local DataNode as the primary DataNode.
You may wonder where the TaskTrackers are in this scheme of things. In some
HBase deployments, the MapReduce framework isn’t deployed at all if the workload is
primarily random reads and writes. In other deployments, where the MapReduce pro-cessing
is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region-
Zoo
Keeper
Zoo
Keeper
DataNode RegionServer DataNode RegionServer DataNode RegionServer
DataNode RegionServer DataNode RegionServer Region
Region
Region
Server
Server
Server
Data
Data
Data
Node
Node
Node
Figure Licensed Attribution-3.7 under a ShareAlike HBase Creative Commons
3.0 Unported RegionServer License.
and HDFS DataNode processes Page 16
are typically Region
Server
Data
Node
...
can theoretically read and write You may wonder where the TaskTrackers HBase deployments, the MapReduce primarily random reads and writes. is also a part of the workloads, Servers can run together.
DataNode RegionServer Master
Master
Figure 3.7 HBase RegionServer and HDFS Servers can run together.
DataNode RegionServer DataNode RegionServer DataNode RegionServer
Name
Node
DataNode RegionServer DataNode RegionServer HBase
Client
Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.HDFS
HBase
User API
• {rowkey => {family => {qualifier => {version => value}}}}
– Think: nested TreeMap (Java), OrderedDictionary (C#), OrderedDict (Python)
• Basic data operations: GET, PUT, DELETE
• SCAN over range of key-values
– benefit of the sorted rowkey business
– this is how you implement any kind of "complex query” *
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
• GET, SCAN support Filters
– Push application logic to RegionServers
• INCREMENT, APPEND, CheckAnd{Put,Delete}
– Server-side, atomic data operations, can be contentious!
Page 17
* This is also a foundational component in what we refer to as
“schema design” in this “schemaless” database.
Anatomy of a RegionServer
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 18
So what is HBase anyway?
C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory Cache
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 19
DataNode RegionServer C1 tree C0 tree
Disk Memory
C1 tree C0 tree
Disk Memory
Web-scale Database
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 26
App
server
App
server
App
server
App
server
App
server
Counters
Sessions
User profiles
Social Media
Application
Data
Latency
Write Read
Throughput
Search Search
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
“BigIndex”
Page 27
"BigIndex"
Document
store
Search Search
Search
App
server
App
server
App
server
App
server
App
server
Latency
Write Read
Throughput
Materialized View
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 28
App
server
App
server
App
server
App
server
App
server
Latency
Write Read
Throughput
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
ETL Assist
Write Read
Page 29
Latency
Throughput
Lambda Architecture
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 30
App
server
App
server
App
server
App
server
App
server
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Resources
Page 31
Join the Community!
• hbase.apache.org
– hbase.apache.org/book.html
– blogs.apache.org/hbase/
– hbase.apache.org/mail-lists.html
• IRC: irc.freenode.net #hbase
• JIRA: issues.apache.org/jira/browse/HBASE
• Source: git clone git://git.apache.org/hbase.git
• In person
– HBaseCon, hbasecon.com
– Hadoop Summit, hadoopsummit.org
– Strata / Hadoop World, strataconf.com
– Local meetup near you!
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 32
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
HBase 0.98/1.0
• Hardening
– Stability, Reliability, Availability, Performance
• Horizontal Scalability
– 1000’s of machines
• Availability
– Speed of Recovery (MTTR), Region Replicas
• Improved Multi-tenancy
– RPC priorities/QoS, namespace management
• Cell-level security
• Semantic Versioning
• Client API cleanup
Page 33
Thanks!
Licensed under a Creative Commons
Attribution-ShareAlike 3.0 Unported License.
Page 34
M A N N I N G
Nick Dimiduk
Amandeep Khurana
FOREWORD BY
Michael Stack
hbaseinaction.com
Nick Dimiduk
github.com/ndimiduk
@xefyr
n10k.com
strataeucftw
slideshare.net/xefyr/hbase-for-architects