Your SlideShare is downloading. ×
Apache HBase for Architects
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Apache HBase for Architects

8,328
views

Published on

An overview of HBase's architecture, internal design, and its place in your application.

An overview of HBase's architecture, internal design, and its place in your application.

Published in: Technology, Business

1 Comment
30 Likes
Statistics
Notes
  • Attended this presentation at Strata NYC 2013....was really well done. Some great tips for designing the logical DM and scaling.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
8,328
On Slideshare
0
From Embeds
0
Number of Embeds
13
Actions
Shares
0
Downloads
265
Comments
1
Likes
30
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Apache HBase For Architects Nick Dimiduk, Hortonworks Strata/Hadoop World Barcelona, 2014-11-21 Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 1
  • 2. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 2
  • 3. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Agenda • Background – (how did we get here?) • TL;DR – (don’t waste my time!) • High-level Architecture – (where are we?) • Anatomy of a RegionServer – (how does this thing work?) • By Example – (how do I use it?) • Resources – (where do we go from here?) Page 3
  • 4. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Background Page 4
  • 5. So what is HBase anyway? • BigTable paper from Google, 2006, Dean et al. – “Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” – http://research.google.com/archive/bigtable.html Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. • Key Features: – Distributed storage across cluster of machines – Random, online read and write data access – Schemaless data model (“NoSQL”) – Self-managed data partitions Page 5
  • 6. Apache Hadoop Dependencies • Apache Hadoop Distributed Filesystem (HDFS) – Distributed, fault-tolerant, throughput-optimized data storage – The Google File System, 2003, Ghemawat et al. – http://research.google.com/archive/gfs.html Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. • Apache Zookeeper (ZK) – Distributed, available, reliable coordination system – The Chubby Lock Service …, 2006, Burrows – http://research.google.com/archive/chubby.html • Apache Hadoop MapReduce (MR) – Distributed, fault-tolerant, batch-oriented data processing – MapReduce: …, 2004, Dean and Ghemawat – http://research.google.com/archive/mapreduce.html Page 6
  • 7. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. TL;DR Page 7
  • 8. So what is HBase anyway? Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 8 C1 tree C0 tree Disk Memory Figure 2.1. Schematic picture of an LSM-tree of two components Figure 2.1 reproduced from O’Neil, Patrick, et al. "The log-structured merge-tree (LSM-tree)." Acta Informatica 33.4 (1996): 351-385.
  • 9. So what is HBase anyway? C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 9 DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory
  • 10. primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together. primarily random primarily reads random and writes. reads and In other writes. is also a part is also of the a part workloads, of the workloads, TaskTrackers, Servers can Servers run together. can run together. So what is HBase anyway? DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 DataNode RegionServer DataNode RegionServer DataNode C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 C1 Figure 3.7 HBase Figure RegionServer 3.7 HBase and RegionServer HDFS DataNode and DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 10 C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory DataNode RegionServer DataNode C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory
  • 11. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some So what is HBase anyway? is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. Servers can run together. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions. store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache C1 tree C0 tree can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree is also a part of Disk C1 tree the Memory C0 tree workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random Disk Memory C1 tree C0 tree reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Disk Memory Cache DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory C1 tree C0 tree Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Servers can run together. Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase C1 tree C0 tree RegionServer tree C0 tree and HDFS tree DataNode C0 tree processes are typically collocated on Disk Memory Disk Memory Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache Disk Memory Cache C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree C1 Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 11 primarily random reads and writes. In other deployments, where the MapReduce pro-cessing is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase C1 tree C0 tree RegionServer and HDFS DataNode processes are typically collocated on the same host. Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host.
  • 12. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, and each RegionServer typically hosts multiple regions. RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Nodes and Nodes RegionServers, and Nodes RegionServers, you and can RegionServers, use you the can data use locality you the can data property; use locality the data that property; is, locality can theoretically can theoretically read and can write theoretically read to and write local read to DataNode and write local to as DataNode the primary local as DataNode the You may wonder You may where wonder You the may TaskTrackers where wonder the TaskTrackers where are in the this TaskTrackers are scheme in this of are scheme things. HBase deployments, HBase deployments, the HBase MapReduce deployments, the MapReduce framework the MapReduce framework isn’t deployed framework isn’t at deployed all if the isn’t primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random reads and writes. In other deployments, where the MapReduce pro-cessing primarily random primarily reads random primarily and writes. reads random and In other writes. reads deployments, and In other writes. deployments, In where other the deployments, MapReduce where is also a part is also of the a part workloads, is also of the a part workloads, TaskTrackers, of the workloads, TaskTrackers, DataNodes, TaskTrackers, DataNodes, and HBase Servers can Servers run together. can Servers run together. can run together. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called regions. Figure 3.6 A table consists of multiple smaller chunks called DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes C1 Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some So what is HBase anyway? is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. Servers can run together. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. store/access data on HDFS. The master process does the distribution of regions RegionServers, and each RegionServer typically hosts multiple regions. store/access store/data access on HDFS. data The on master HDFS. process The master does process the distribution does RegionServers, RegionServers, and each RegionServer and each RegionServer typically hosts typically multiple hosts Given that the Given underlying that the underlying data is stored data in is HDFS, stored which in HDFS, is available a single namespace, a single namespace, all RegionServers all RegionServers have access have to the access same to system and system can therefore and can host therefore any region host any (figure region 3.8). (figure By physically 3.8). Nodes and Nodes RegionServers, and RegionServers, you can use you the can data use locality the data property; locality can theoretically can theoretically read and write read to and write local to DataNode local as DataNode the You may wonder You may where wonder the TaskTrackers where the TaskTrackers are in this are scheme HBase deployments, HBase deployments, the MapReduce the MapReduce framework framework isn’t deployed isn’t primarily random primarily reads random and writes. reads and In other writes. deployments, In other deployments, where is also a part is also of the a part workloads, of the workloads, TaskTrackers, TaskTrackers, DataNodes, Servers can Servers run together. can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to all clients as Given that the underlying data is stored in HDFS, which is available to a single namespace, all RegionServers have access to the same persisted files system and can therefore host any region (figure 3.8). By physically collocating Nodes and RegionServers, you can use the data locality property; that is, can theoretically read and write to local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers a single namespace, all RegionServers have access to the same persisted files in the file system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache C1 tree C0 tree can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some can theoretically read and write to local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree is also a part of Disk C1 tree the Memory C0 tree workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Servers can run together. HBase deployments, the MapReduce framework isn’t deployed at all if the workload is C1 tree C0 tree C1 tree C0 tree primarily Disk Memory C1 tree C0 tree random Disk Memory C1 tree C0 tree reads and writes. In other deployments, where the MapReduce pro-cessing Disk Memory Cache Disk Memory Cache DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes Licensed to Nick Dimiduk <ndimiduk@C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also Disk Memory C1 tree C0 tree a part of the workloads, TaskTrackers, DataNodes, and HBase Region- is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Figure C1 tree C0 tree 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Disk Memory C1 tree C0 tree Licensed to Nick Dimiduk <ndimiduk@gmail.com> DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer Disk Memory C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Servers can run together. Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Servers can run together. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure C1 tree C0 tree 3.7 HBase C1 RegionServer and HDFS DataNode processes are typically collocated on Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on Licensed to Nick Dimiduk <ndimiduk@gmail.com> "2011-07-04" Disk Memory C1 tree C0 tree 1368396302 "fourth of July" tree C0 tree Disk Memory C1 tree C0 tree tree C0 tree Disk Memory C1 tree C0 tree tree C0 tree Disk Memory C1 tree C0 tree DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache DataNode RegionServer DataNode RegionServer C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Disk Memory Disk Memory C1 tree C0 tree C1 tree C0 tree C1 tree C0 tree Disk Memory Cache Disk Memory Cache Disk Memory Cache C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically C1 tree C0 tree C1 Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on DataNode RegionServer DataNode RegionServer DataNode RegionServer Disk Memory Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 12 primarily random reads and writes. In other deployments, where the MapReduce pro-cessing is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Figure 3.7 HBase C1 tree C0 tree RegionServer and HDFS DataNode processes are typically collocated on the same host. Disk Memory C1 tree C0 tree Disk Memory tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Licensed to Nick Dimiduk <ndimiduk@gmail.com> Licensed to Nick Dimiduk <ndimiduk@gmail.com> Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. C1 Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same host. a cf1 1368394583 7 1368394261 "hello" "bar" 1368394583 22 1368394925 13.6 1368393847 "world" "foo" cf2 1.0001 1368387684 "almost the loneliest number"
  • 13. High-level Architecture Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 13
  • 14. Logical Data Model 1368394583 7 1368394261 "hello" "bar" 1368394583 22 1368394925 13.6 1368393847 "world" "foo" "2011-07-04" 1368396302 "fourth of July" 1.0001 1368387684 "almost the loneliest number" Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 14 a cf1 cf2 b cf2 "thumb" 1368387247 [3.6 kb png data] Table A rowkey column family column qualifier timestamp value Rows Column Families
  • 15. Logical Architecture Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 15 Table A a b c d e f g h i j k l m n o p Region 1 Region 2 Region 3 Region 4 Region Server 7 Table A, Region 1 Table A, Region 2 Table G, Region 1070 Table L, Region 25 Region Server 86 Table A, Region 3 Table C, Region 30 Table F, Region 160 Table F, Region 776 Region Server 367 Table A, Region 4 Table C, Region 17 Table E, Region 52 Table P, Region 1116
  • 16. Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers and RegionServers, you can use the data locality property; that can theoretically read and write to the local DataNode as the primary You may wonder where the TaskTrackers are in this scheme of HBase deployments, the MapReduce framework isn’t deployed at all primarily random reads and writes. In other deployments, where the is also a part of the workloads, TaskTrackers, DataNodes, and Servers can run together. Nodes Nodes DataNode RegionServer DataNode DataNode RegionServer RegionServer DataNode DataNode RegionServer Figure 3.7 HBase RegionServer Figure 3.7 and HDFS HBase DataNode RegionServer processes and HDFS are typically DataNode collocated processes and RegionServers, you can use the data locality property; can theoretically read and write to the local DataNode You may wonder where the TaskTrackers are in this HBase deployments, the MapReduce framework isn’t deployed primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, DataNodes, Servers can run together. Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically collocated on the same Nodes and RegionServers, you can use the data can theoretically read and write to the local You may wonder where the TaskTrackers HBase deployments, the MapReduce framework primarily random reads and writes. In other deployments, is also a part of the workloads, TaskTrackers, Servers can run together. system and can therefore host any region (figure 3.8). By physically collocating Data- Nodes and RegionServers, you can use the data locality property; that is, RegionServ-ers Physical Architecture can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. HBase deployments, the MapReduce framework isn’t deployed at all if the workload primarily random reads and writes. In other deployments, where the MapReduce is also a part of the workloads, TaskTrackers, DataNodes, and HBase Servers can run together. can theoretically read and write to the local DataNode as the primary DataNode. You may wonder where the TaskTrackers are in this scheme of things. In some HBase deployments, the MapReduce framework isn’t deployed at all if the workload is primarily random reads and writes. In other deployments, where the MapReduce pro-cessing is also a part of the workloads, TaskTrackers, DataNodes, and HBase Region- Zoo Keeper Zoo Keeper DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer DataNode RegionServer Region Region Region Server Server Server Data Data Data Node Node Node Figure Licensed Attribution-3.7 under a ShareAlike HBase Creative Commons 3.0 Unported RegionServer License. and HDFS DataNode processes Page 16 are typically Region Server Data Node ... can theoretically read and write You may wonder where the TaskTrackers HBase deployments, the MapReduce primarily random reads and writes. is also a part of the workloads, Servers can run together. DataNode RegionServer Master Master Figure 3.7 HBase RegionServer and HDFS Servers can run together. DataNode RegionServer DataNode RegionServer DataNode RegionServer Name Node DataNode RegionServer DataNode RegionServer HBase Client Figure 3.7 HBase RegionServer and HDFS DataNode processes are typically Licensed to Nick Dimiduk <ndimiduk@gmail.HDFS HBase
  • 17. User API • {rowkey => {family => {qualifier => {version => value}}}} – Think: nested TreeMap (Java), OrderedDictionary (C#), OrderedDict (Python) • Basic data operations: GET, PUT, DELETE • SCAN over range of key-values – benefit of the sorted rowkey business – this is how you implement any kind of "complex query” * Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. • GET, SCAN support Filters – Push application logic to RegionServers • INCREMENT, APPEND, CheckAnd{Put,Delete} – Server-side, atomic data operations, can be contentious! Page 17 * This is also a foundational component in what we refer to as “schema design” in this “schemaless” database.
  • 18. Anatomy of a RegionServer Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 18
  • 19. So what is HBase anyway? C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory Cache Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 19 DataNode RegionServer C1 tree C0 tree Disk Memory C1 tree C0 tree Disk Memory
  • 20. Storage Machinery Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 20 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ...
  • 21. Storage Machinery Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 21 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ... C1 C0 C1 C0 C1 C0 C1 C0 Cache
  • 22. Storage Machinery: Write Path Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 22 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ... 1 2 3 4 5
  • 23. Storage Machinery: Read Path Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 23 RegionServer (HBase) DataNode (Hadoop DFS) HLog (WAL) HRegion HStore StoreFile HFile StoreFile HFile MemStore ... ... HStore BlockCache HRegion HStore HStore ... ... 1 5 2 3 3 2 4
  • 24. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. By Example Page 24
  • 25. Database Dichotomy Latency Predicate Pushdown Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 25 Row+Column Bloomfilter Lossy WAL Durability Smart Rowkey Design Compression Write Read Throughput Larger Block Size Bulkload HFiles Compression Compression Compression Smaller Block Size Larger BlockCache Increase Blocking Storefiles Increased Scanner Caching Larger Flush Size Smart Rowkey Design Smart Rowkey Design Smart Rowkey Design
  • 26. Web-scale Database Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 26 App server App server App server App server App server Counters Sessions User profiles Social Media Application Data Latency Write Read Throughput
  • 27. Search Search Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. “BigIndex” Page 27 "BigIndex" Document store Search Search Search App server App server App server App server App server Latency Write Read Throughput
  • 28. Materialized View Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 28 App server App server App server App server App server Latency Write Read Throughput
  • 29. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. ETL Assist Write Read Page 29 Latency Throughput
  • 30. Lambda Architecture Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 30 App server App server App server App server App server
  • 31. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Resources Page 31
  • 32. Join the Community! • hbase.apache.org – hbase.apache.org/book.html – blogs.apache.org/hbase/ – hbase.apache.org/mail-lists.html • IRC: irc.freenode.net #hbase • JIRA: issues.apache.org/jira/browse/HBASE • Source: git clone git://git.apache.org/hbase.git • In person – HBaseCon, hbasecon.com – Hadoop Summit, hadoopsummit.org – Strata / Hadoop World, strataconf.com – Local meetup near you! Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 32
  • 33. Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. HBase 0.98/1.0 • Hardening – Stability, Reliability, Availability, Performance • Horizontal Scalability – 1000’s of machines • Availability – Speed of Recovery (MTTR), Region Replicas • Improved Multi-tenancy – RPC priorities/QoS, namespace management • Cell-level security • Semantic Versioning • Client API cleanup Page 33
  • 34. Thanks! Licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Page 34 M A N N I N G Nick Dimiduk Amandeep Khurana FOREWORD BY Michael Stack hbaseinaction.com Nick Dimiduk github.com/ndimiduk @xefyr n10k.com strataeucftw slideshare.net/xefyr/hbase-for-architects

×