SlideShare a Scribd company logo
1 of 102
© 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 1 
Getting Started with HBase Application development 
http://answers.mapr.com/
© 2014 MapR Technologies 2 
Objectives of this session 
• Why do we need NoSQL / HBase? 
• Quick overview of HBase & HBase data model 
• Design considerations when migrating from RDBMS to HBase 
• How to work around transactions that span multiple rows
© 2014 MapR Technologies 3 
Why do we need NoSQL / HBase? 
Relational database model 
Relational Data is typed and structured before stored: 
– Entities map to tables, normalized 
– Structured Query Language 
• Joins tables to bring back data
© 2014 MapR Technologies 4 
Why do we need NoSQL / HBase? 
Relational Model 
• Pros 
– Standard persistence model 
• standard language for data manipulation 
– Transactions handle concurrency , 
consistency 
– efficient and robust structure for storing 
data
© 2014 MapR Technologies 5 
What changed to bring on NoSQL? 
Lots of data & the need to scale 
With the internet, increase in traffic and data, came the need to scale 
• Vertical Scaling is expensive and has limits 
• Not all use cases require ACID transactions 
Vertical scale = big box 
$$
© 2014 MapR Technologies 6 
What changed to bring on NoSQL? 
Lots of data, the need to scale horizontally 
• Horizontal scaling 
– Cheaper than vertical 
– parallel execution 
– high reliability (doesn’t depend on one box) 
• Relational databases were not designed to do this automatically 
Horizonal scale: 
Split table by rows into 
partitions across a cluster 
Key colB colC 
val val val 
xxx val val 
id 1-1000 
id 1000-2000 
id 2000=3000 
Key colB colC 
val val val 
xxx val val 
Key colB colC 
val val val 
xxx val val
© 2014 MapR Technologies 7 
Facebook 2010 
•9000 memcache instances 
•4000 Shards mysql 
Lots of people tried to spread 
databases across a cluster. 
http://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-than-death/
© 2014 MapR Technologies 8 
What changed to bring on NoSQL? Big data 
• Cons of the Relational Model: 
– Does not scale horizontally: 
• Sharding is difficult to manage 
• Distributed join, transactions do not scale across shards 
Horizonal scale : partition or shard tables across cluster 
Distributed Joins, Transactions are Expensive 
bottleneck
© 2014 MapR Technologies 9 
Google File System MapReduce 
• Distributed Storage System 
• Designed to scale 
• Paper published in 2006. 
Big Table 
Distributed File Model Distributed Compute Model
© 2014 MapR Technologies 10 
Google 
Big Table 
Crawlers 
MapReduce 
Indexing Queries 
Indexes 
CF 
colA colB colC 
val val 
val 
Metadata 
MapReduce 
metadata
© 2014 MapR Technologies 11 
Distributed 
File System 
(HDFS) 
Map 
Reduce
© 2014 MapR Technologies 12 
HBase is a Distributed Database 
Data is automatically 
distributed across the cluster. 
• Row is indexed by a row 
key 
• Key range is used for 
horizontal partitioning 
• Table splits happen 
automatically as the data 
grows 
Key 
Range 
xxxx 
xxxx 
CF1 
colA colB colC 
val val 
val 
CF2 
colA colB colC 
val val 
val 
Key 
Range 
xxxx 
xxxx 
CF1 
colA colB colC 
val val 
val 
CF2 
colA colB colC 
val val 
val 
Key 
Range 
xxxx 
xxxx 
CF1 
colA colB colC 
val val 
val 
CF2 
colA colB colC 
val val 
val 
Put, Get by Key
© 2014 MapR Technologies 13 
HBase is a ColumnFamily oriented Database 
Data is accessed and stored together: 
• Row is indexed by a row key 
• Similar Data is grouped & stored in Column Families 
– share common properties: 
• Number of versions 
• Time to Live (TTL) 
• Compression [lz4, lzf, Zlib] 
• In memory option … 
CF1 
colA colB colC 
Val val 
val 
CF2 
colA colB colC 
val val 
val 
RowKey 
axxx 
gxxx 
Customer id Customer Address data Customer order data
© 2014 MapR Technologies 14 
HBase designed for Distribution 
• distributed data stored and accessed together: 
– Key range is used for horizontal partitioning 
• Pros 
– scalable handles data volume and velocity 
– Fast Writes and Reads by Key 
• Cons 
– No joins 
– No dynamic queries 
– Need to know how data will be queried in advance to 
achieve best schema design 
Key 
Range 
axxx 
kxxx 
Key 
Range 
axxx 
kxxx 
Key 
Range 
axxx 
kxxx
© 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 15 
Introduction to Hbase, data model and 
Architecture 
– Understand how the data flows for writes and reads
© 2014 MapR Technologies 16 
HBase Data Model 
Row Keys: identify the rows in an HBase table 
Columns are grouped into column families 
– can contain an arbitrary number of columns. 
Row 
Key 
CF1 CF2 … 
colA colB colC colA colB colC colD 
R1 
axxx val val val val 
… 
gxxx val val val val 
R2 
hxxx val val val val val val val 
… 
jxxx val 
R3 
kxxx val val val val 
… 
rxxx val val val val val val
© 2014 MapR Technologies 17 
HBase Data Storage - Cells 
• Data is stored in Key value format 
• Value for each cell is specified by complete coordinates: 
• (Row key, ColumnFamily, Column Qualifier, timestamp ) => Value 
– RowKey:CF:Col:Version:Value 
– smithj:data:city:1391813876369:nashville 
Cell Coordinates= Key 
Row key Column Family Column Qualifier Timestamp Value 
Smithj data city 1391813876369 nashville 
Column Key 
Value
© 2014 MapR Technologies 18 
Logical Data Model vs Physical Data Storage 
• Data is stored in Key Value format 
• Key Value is stored for each Cell 
• Column families data are stored in 
separate files 
RowK 
ey 
CF1 CF2 
colA colB colA colC 
ra 1 2 
rxxxx 
rxxx 
Logical Model 
Row 
Key 
CF1:Col version value 
ra cf1:cola 1 1 
row 
Key 
CF2:Col version value 
ra cf2:cola 1 2 
Physical Storage 
Key Value Key Value 
Physical Storage
© 2014 MapR Technologies 19 
Sparse Data with Cell Versions 
CF1:colA CF1:colB CF1:colC 
Row1 
Row10 
Row11 
Row2 
@time1: 
value1 
@time5: 
value2 
@time7: 
value3 
@time2: 
value1 
@time3: 
value1 
@time4: 
value1 
@time2: 
value1 
@time4: 
value1 
@time6: 
value2
© 2014 MapR Technologies 20 
Versioned Data 
• Version 
– each put, delete adds new cell, new version 
– A long 
• by default the current time in milliseconds if no version specified 
– Last 3 versions are stored by default 
• Configurable by column family 
– You can delete specific cell versions 
– When a cell exceeds the maximum number of versions, the extra 
records are removed 
Key CF1:Col version value 
ra cf1:cola 3 3 
ra cf1:cola 2 2 
ra cf1:cola 1 1
© 2014 MapR Technologies 21 
Table Physical View 
Physically data is stored per Column family as a sorted map 
• Ordered by row key, column qualifier in ascending order 
• Ordered by timestamp in descending order 
Row 
key 
Column 
qualifier 
Cell 
value 
Timestamp 
(long) 
Row1 CF1:colA value3 time7 
Row1 CF1:colA value2 time5 
Row1 CF1:colA value1 time1 
Row10 CF1:colA value1 time4 
Row 10 CF1:colB value1 time4 
Sorted by 
Row key 
and Column 
Sorted in 
descending 
order
© 2014 MapR Technologies 22 
Logical Data Model vs Physical Data Storage 
Key CF1:Col version value 
ra cf1:ca 1 1 
rb cf1:cb 2 4 
rb cf1:cb 1 3 
rc cf1:ca 1 5 
Row 
Key 
CF1 CF2 
ca cb ca cd 
ra 1 2 
rb 3,4 
rc 5 6,7 8 
Key CF2:Col version value 
ra cf2:ca 1 2 
rc cf2:ca 2 7 
rc cf2:ca 1 6 
rc cf2:cd 1 8 
Physical Storage 
Logical 
Model 
Column families are stored 
separately 
Row keys, Qualifiers are sorted 
lexicographically 
Key Value Key Value
© 2014 MapR Technologies 23 
HBase Table is a Sorted map of maps 
SortedMap<Key, Value> 
Table 
Map of Rows 
Map of CF 
Map of columns 
Map of cells 
Key CF1:Col version value 
ra cf1:ca v1 1 
rb cf1:cb v2 4 
rb cf1:cb v1 3 
rc cf1:ca v1 5 
Key CF2:Col version value 
ra cf2:ca v1 2 
rc cf2:ca v2 7 
rc cf2:ca v1 6 
rc cf2:cd v1 8 
SortedMap<RowKey, 
SortedMap< ColumnFamily, 
SortedMap< ColumnName, 
SortedMap < version, Value> >>>
© 2014 MapR Technologies 24 
HBase Table SortedMap<Key, Value> 
<ra,<cf1, <ca, <v1, 1>> 
<cf2, <ca, <v1, 2>>> 
<rb,<cf1, <cb, <v2, 4> 
<v1, 3>>> 
<rc,<cf1, <ca, <v1, 5>> 
<cf2, <ca, <v2, 7>> 
<ca, <v1, 6>> 
<cd, <v1, 8>>> 
Key CF1:Col version value 
ra cf1:ca v1 1 
rb cf1:cb v2 4 
rb cf1:cb v1 3 
rc cf1:ca v1 5 
Key CF2:Col version value 
ra cf2:ca v1 2 
rc cf2:ca v2 7 
rc cf2:ca v1 6 
rc cf2:cd v1 8
© 2014 MapR Technologies 25 
Basic Table Operations 
• Create Table, define Column Families before data is 
imported 
– but not the rows keys or number/names of columns 
• Low level API, technically more demanding 
• Basic data access operations (CRUD): 
put Inserts data into rows (both add and update) 
get Accesses data from one row 
scan Accesses data from a range of rows 
delete Delete a row or a range of rows or columns
© 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 26 
Hbase Architecture
© 2014 MapR Technologies 27 
What is a Region? 
• Tables are partitioned into key ranges (regions) 
• Region servers serve data for reads and writes 
– For the range of keys it is responsible for 
Region Server 
Client 
Region Region 
HMaster 
zookeepe 
zozoorkoekeepeeprer 
Region Server 
Region Region 
Get 
Key colB colC 
xxx val val 
xxx val val 
Key colB colC 
xxx val val 
xxx val val 
Key colB colC 
xxx val val 
xxx val val 
Key colB colC 
xxx val val 
xxx val val
© 2014 MapR Technologies 28 
Region Server Components 
• WAL: write ahead log on disk (commit log), Used for recovery 
• BlockCache: Read Cache, LRU, Least Recently Used evicted 
• MemStore: Write Cache, sorted map of keyValue updates. 
– 1 memstore per column family per region 
• Hfile=sorted KeyValues on disk 
Region 
Server 
HDFS Data Node 
BlockCache 
memstore 
HFile 
memstore 
HFile 
Region Region 
HFile HFile 
memstore memstore 
WAL
© 2014 MapR Technologies 29 
HBase Write Steps 
Put 
each incoming record written 
to WAL for durability: 
• log on disk 
• updates appended sequentially 
HDFS Data Node 
memstore memstore 
Region Server 
Region 
WAL
© 2014 MapR Technologies 30 
HBase Write Steps 
Put 
Next updates are written to the 
Memstore: 
• write cache 
• in-memory 
• sorted list of KeyValue updates HDFS Data Node 
memstore memstore 
Region Server 
Region 
WAL 
Ack 
Updates quickly sorted in memory are available to queries after put returns
© 2014 MapR Technologies 31 
HBase Memstore 
• in-memory 
• sorted list of Key → Value 
• One per column family 
• Updates quickly sorted in memory 
memstore memstore 
Region 
Key CF1:Col version value 
ra cf1:ca v1 1 
rb cf1:cb v2 4 
rb cf1:cb v1 3 
rc cf1:ca v1 5 
Key CF2:Col version value 
ra cf2:ca v1 2 
rc cf2:ca v2 7 
rc cf2:ca v1 6 
rc cf2:cd v1 8 
Key Value Key Value
© 2014 MapR Technologies 32 
HBase Region Flush 
When 1 Memstore is full: 
• all memstores in region flushed 
to new Hfiles on disk 
• Hfile: sorted list of key → values 
On disk 
HDFS Data Node 
memstore memstore 
Region Server 
Region 
WAL 
HFile HFile 
FLUSH 
HHFFileile
© 2014 MapR Technologies 33 
HBase HFile 
• On disk sorted list of key → values 
• One per column family 
• Flushed quickly to file 
• Sequential write HDFS Data Node 
HFile HFile 
Sequential write 
Key CF1:Col version value 
ra cf1:ca v1 1 
rb cf1:cb v2 4 
rb cf1:cb v1 3 
rc cf1:ca v1 5 
Key CF2:Col version value 
ra cf2:ca v1 2 
rc cf2:ca v2 7 
rc cf2:ca v1 6 
rc cf2:cd v1 8 
Key Value Key Value
© 2014 MapR Technologies 34 
HBase HFile Structure 
• Memstore flushes to an Hfile 
64Kbyte blocks 
are compressed 
Key-value 
pairs are 
stored in 
increasing 
order 
Index points to 
row keys 
location 
B+-tree: leaf 
index , root index 
Key A Value 
… 
… 
… 
Key P Value 
….. 
…. 
… 
Key T Value 
….. 
…. 
… 
Key z Value 
….. 
…. 
… 
Root Index 
Bloom 
Leaf Index 
Leaf Index 
Leaf Index 
Leaf Index 
Leaf Index 
Leaf Index 
Root Index 
Interm Index 
Bloom 
Trailer 
Bloom 
Bloom 
Intermediate Index 
Bloom 
Bloom 
Bloom 
Bloom
© 2014 MapR Technologies 35 
HBase Read Merge from Memory and Files 
• MemStore creates multiple small store files over time when flushing. 
• Read Amplification: When a get/scan comes in, multiple files have to be 
examined 
HDFS Data Node 
BlockCache 
memstore 
Region Server Region 
WAL HFile 
HFile 
HFile 
scanner 
read Get or Scan searches for 
Row Cell KeyValues: 
1. Block Cache ((Memory) 
2. Memstore (Memory) 
3. Load HFiles from Disk 
into Block Cache based on 
indexes and bloomfilters 
1 
2 
3
© 2014 MapR Technologies 36 
HDFS Data Node 
HBase Compaction • minor compaction: 
• merges files into fewer larger ones. 
• Major compaction: 
• merge all Hfiles into one per column 
family. 
• remove cells marked for 
deletion 
Region Server 
Region Region 
memstore memstore 
WAL minor 
compaction 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
HFile 
updates 
HFile 
major 
compaction 
Flush to disk
© 2014 MapR Technologies 37 
HBase Background: Log-Structured Merge Trees 
• Traditional Databases use B+ trees: 
– expensive to update 
• HBase: Log Structured Merge Trees 
– transforms random writes into 
sequential writes 
• Writes go to memory 
– And WAL 
• memstore flushes to disk 
– Reads 
• From memory, index, sorted disk 
predictable disk seeks
© 2014 MapR Technologies 38 
Data Model for Fast Writes, Reads 
• Predictable disk lay out 
• Minimize disk seek 
• Get, Put by row key: primary index, fast random access 
• Scan by row key range: stored sorted, efficient 
sequential access for key range 
Region1 
Key Range 
ra 
rx 
Region 
Region 
Region 
Server 
Key CF1:Col version value 
ra cf1:ca v1 1 
rb cf1:cb v2 4 
rb cf1:cb v1 3 
rc cf1:ca v1 5 
Get key 
Scan start key, 
Stop key 
Minimize disk seek
© 2014 MapR Technologies 39 
Region = contiguous keys 
• Regions fundamental partitioning/sharding object. 
• By default, on table creation 1 region is created that holds the 
entire key range. 
• When region becomes too large, splits into two child regions. 
• Typical region size is a few GB, sometimes even 10G or 20G 
Region 
CF1 
colA colB colC 
val val 
val 
CF2 
colA colB colC 
val val 
val 
Key 
Range 
axxx 
gxxx 
Region
© 2014 MapR Technologies 40 
Region Split 
• The RegionServer splits a 
region 
• daughter regions 
– represent 1/2 of the 
original region 
– each with half of the 
original regions keys. 
– opened in parallel on 
same server 
• reports the split to the Master 
Region 1 Region 2 
Region Server 1 
Key colB colC 
val val 
val 
Key colB colC 
val val 
val 
Region Server 1 
Region 1 
Key 
Range 
axxx 
kxxx 
Key 
Range 
Lxxx 
zxxx 
Key colB colC 
val val 
val
© 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 41 
HBase Use Cases
© 2014 MapR Technologies 43 
3 Main Use Case Categories 
• Capturing Incremental data – Hi Volume, Velocity Writes 
– Time Series Data, Stuff with a Time Stamp 
• Sensor, System Metrics, Events, 
• Stock Ticker, User clicks, 
HBase 
Put 
App 
Server 
App 
read Server 
Put 
Put 
Put 
Event time stamped 
data 
sensor 
OpenTSDB 
Data for real-time monitoring.
© 2014 MapR Technologies 44 
HBase 
Messages 
read 
Put 
App 
Server 
read 
App 
Server 
read 
App 
Server Put 
Put 
Put 
App 
Server 
3 Main Use Case Categories 
• Information Exchange Hi Volume, Velocity Write/Read 
• Messaging on Facebook is backed by Hbase 
– communication coming from email, SMS, Facebook Chat, 
and the Inbox 
– https://www.facebook.com/UsingHbase
© 2014 MapR Technologies 45 
3 Main Use Case Categories 
• Content Serving, Web Application Backend– Hi Volume, Velocity 
Reads 
– Online Catalog: Intuit Merchant, Gap, World Library Catalog. 
– CRM: Salesforce CMS: Lily 
– Search Index: ebay, photobucket 
– Online Pre-Computed View: Groupon, Pinterest 
Hbase 
Processed 
data 
read 
App 
Server 
read 
App 
Server 
read 
App 
Bulk Import Server
© 2014 MapR Technologies 46 
Agenda 
• Introduction to HBase data model and Architecture 
• Using HBase shell to create tables and insert data 
– Demo / Lab using Hbase shell to create tables 
• Java API fundamentals to perform CRUD operations 
– Demo / Lab using Eclipse, HBase Java API & MapR Sandbox 
• Understand how the data flows for writes and reads 
• Schema design concepts including rowkey design 
• Advanced Java APIs to perform scans and do transactions 
– Demo / Lab
© 2014 MapR Technologies 47 
Lab Exercise 
See Lab_Hbase_Shell.pdf 
Start MapR Sandbox and log into the cluster 
[user: mapr, passwd: mapr] 
Use the HBase shell 
>Hbase shell 
hbase> help 
hbase> create ’/user/mapr/mytable’, {NAME =>’cf1’} 
hbase> put ’/user/mapr/mytable’, ’row1’, ’cf1:col1’, ‘datacf1c1v1’ 
hbase> get ’/user/mapr/mytable’, ’row1’ 
hbase> scan ’/user/mapr/mytable’ 
hbase> describe ’/user/mapr/mytable’
© 2014 MapR Technologies 48 
© MapR Technologies, confidential 
HBase 
Java API fundamentals to 
perform CRUD operations
© 2014 MapR Technologies 50 
Shoppingcart Application Requirements 
• Need to create Tables: Shoppingcart & Inventory 
• Perform CRUD operations on these tables 
– Create, Read, Update, and Delete items from these tables
© 2014 MapR Technologies 51 
Inventory & Shoppingcart Tables 
Perform checkout operation for Mike 
Inventory Table Shoppingcart Table 
quantity 
Pens 24 
Notepads 54 
Erasers 15 
Pencils 90 
pens notepads erasers 
Mike 5 5 
John 10 15 4 
Mary 9 10 
Adam 18 7 10 
CF “stock " CF “items"
© 2014 MapR Technologies 52 
Java API Fundamentals 
• CRUD operations 
– Get, Put, Delete, Scan, checkAndPut, checkAndDelete, Increment 
– KeyValue, Result, Scan – ResultScanner, 
– Batch Operations
© 2014 MapR Technologies 53 
CRUD Operations Follow A Pattern (mostly) 
• common pattern 
– Instantiate object for an operation: Put put = new Put(key) 
– Add attributes to specify what to insert: put.add(…) 
– invoke operation with HTable: myTable.put(put) 
// Insert value1 into rowKey in columnFamily:columnName1 
Put put = new Put(rowKey); 
put.add(columnFamily, columnName1, value1); 
myTable.put(put);
© 2014 MapR Technologies 54 
erasers notepads pens 
Mike 2 5 5 
CF “items" 
Shoppingcart Table 
Shopping Cart Table 
Key CF:COL ts value 
Mike items:erasers 1391813876369 2 
Mike items:notepads 1391813876369 5 
Mike items:pens 1391813876369 5 
Physical Storage
© 2014 MapR Technologies 55 
Put Operation 
adding multiple column values to a row 
byte [] tableName = Bytes.toBytes("/path/Shopping"); 
byte [] itemsCF = Bytes.toBytes(“items"); 
byte [] penCol = Bytes.toBytes (“pens”); 
byte [] noteCol = Bytes.toBytes (“notes”); 
byte [] eraserCol = Bytes.toBytes (“erasers”); 
HTableInterface table = new HTable(hbaseConfig, tableName); 
Put put = new Put(“Mike”); 
put.add(itemsCF, penCol, Bytes.toBytes(5l)); 
put.add(itemsCF, noteCol, Bytes.toBytes(5l)); 
put.add(itemsCF, eraserCol, Bytes.toBytes(2l)); 
table.put(put); 
Key CF:COL ts value 
Mike items:erasers 1391813876369 2 
Mike items:notepads 1391813876369 5 
Mike items:pens 1391813876369 5
© 2014 MapR Technologies 56 
Get Example 
byte [] tableName = Bytes.toBytes("/path/Shopping"); 
byte [] itemsCF = Bytes.toBytes(“items"); 
byte [] penCol = Bytes.toBytes (“pens”); 
HTableInterface table = new HTable(hbaseConfig, tableName); 
Get get = new Get(“Mike”); 
get.addColumn(itemsCF, penCol); 
Result result = myTable.get(get); 
byte[] val = result.getValue(itemsCF, penCol); 
System.out.println("Value: " + Bytes.toLong(val)); //prints 5 
Key CF:COL ts value 
Mike items:erasers 1391813876369 2 
Mike items:notepads 1391813876369 5 
Mike items:pens 1391813876369 5
© 2014 MapR Technologies 57 
Result Class 
• A Result instance wraps data from a row returned from a get or a 
scan operation. Result wraps KeyValues 
• Result toString() looks like this : 
keyvalues={Adam/items:erasers/1391813876369/Put/vlen=8/ts=0, 
Adam/items:notepads/1391813876369/Put/vlen=8/ts=0, 
Adam/items:pens/1391813876369/Put/vlen=8/ts=0} 
• The Result object provides methods to return values 
byte[] b = result.getValue(columnFamilyName,columnName1); 
Items:erasers Items:notepads Items:pens 
Adam 10 7 18 
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/Result.html
© 2014 MapR Technologies 58 
KeyValue – The Fundamental HBase Type 
• A KeyValue instance is a cell instance 
– Contains Key (cell coordinates) and the Value (data) 
• Cell coordinates: Row key, Column family, Column qualifier, 
Timestamp 
• KeyValue toString() looks like this : 
Adam/items:erasers/1391813876369/Put/vlen=8/ 
Key =Cell Coordinates 
Row key Column Family Column Qualifier Timestamp Value 
Value 
Adam items erasers 1391813876369 10
© 2014 MapR Technologies 59 
Bytes class 
http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/util/Bytes.html 
• org.apache.hadoop.hbase.util.Bytes 
• Provides methods to convert Java types to and from byte[] arrays 
• Support for 
– String, boolean, short, int, long, double, and float 
byte[] bytesTable = Bytes.toBytes("Shopping"); 
String table = Bytes.toString(bytesTable); 
byte[] amountBytes = Bytes.toBytes(1000l); 
long amount = Bytes.toLong(amount);
© 2014 MapR Technologies 60 
Scan Operation – Example 
byte[] startRow=Bytes.toBytes(“Adam”); 
byte[] stopRow=Bytes.toBytes(“N”); 
Scan s = new Scan(startRow, stopRow); 
scan.addFamily(columnFamily); 
ResultScanner rs = myTable.getScanner(s);
© 2014 MapR Technologies 61 
ResultScanner - Example 
Resultscanner provides iterator-like functionality 
Scan scan = new Scan(); 
scan.addFamily(columnFamily); 
ResultScanner scanner = myTable.getScanner(scan); 
try { 
for (Result res : scanner) { 
System.out.println(res); 
} 
} catch (Exception e) { 
System.out.println(e); 
} finally { 
scanner.close(); 
} 
Calls scanner.next() 
Always put in finally block
© 2014 MapR Technologies 62 
Lab Exercise Program Structure 
• ShoppingCartApp– main class 
• InventoryDAO – A DAO for the Inventory CRUD 
functionality 
• ShoppingCartDAO – A DAO for the Inventory 
CRUD functionality 
• Inventory – A Java object that holds data for a 
single Inventory row 
• ShoppingCart – A Java object that holds data for 
a single Inventory row 
• MockHtable – in memory test hbase table, allows 
to run code, debug without hbase running on a 
cluster or vm.
© 2014 MapR Technologies 63 
Lab Exercise 
• See Lab_3_Java_API.pdf 
– Import the project “lab-exercises-shopping” into Eclipse 
– Setup creates Inventory and Shoppingcart Tables and inserts data 
– Use Get, Put, Scan, and Delete operations
© 2014 MapR Technologies 64 
Lab: Import, build 
• Download the code 
• Import Maven project lab-exercises-shopping 
into Eclipse 
• Build : Run As -> Maven Install
© 2014 MapR Technologies 65 
Lab: run TestInventorySetup JUnit 
• Select Test Class, Then Run As -> JUnit Test 
• Uses MockHTable https://gist.github.com/agaoglu/613217
© 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 66 
Schema Design Guidelines 
• HBase tables ≠ Relational tables! 
• Be careful about assumptions you bring from past experience. 
• Examples: 
– Data normalization not required 
– Different atomicity rules
© 2014 MapR Technologies 67 
Use Case Example: Record Stock Trade Information 
in a Table 
• Trade data: 
Trade 
• timestamp 
• stock symbol 
• price per share 
• volume of trade 
• Example 
– 1381396363000 
(epoch timestamp with 
millisecond granularity) 
– AMZN 
– $304.66 
– 1333 shares
© 2014 MapR Technologies 68 
Intelligent keys 
• Only the row keys are indexed 
• Compose the key with attributes used for searching 
– Composite key : 2 or more identifying attributes 
– Like multi-column index design in RDB 
– Use fixed length, or separators 
Cell Coordinates (Key) 
Granularity 
Row key Column Family Column Name Timestamp Value 
Restrict disk I/O Restrict network traffic
© 2014 MapR Technologies 69 
Composite Keys 
Use composite rowkey to bound scan ranges and provide sub-indexing 
to cell data. 
• Include multiple elements in the rowkey 
– Use a separator or fixed length 
• Example rowkey format: 
– Ex: GOOG_20131012 
• Get operations require complete row key. 
• Scans can use partial keys. 
– Ex: “GOOG” or "GOOG_2014" 
SYMBOL + DATE (YYYYMMDD)
© 2014 MapR Technologies 70 
Consider Access Patterns for Application 
• By date? By hour? By companyId? By PersonId? 
– Rowkey design 
• What if the Date/Timestamp is leftmost ? 
How will data be retrieved? 
Key 
1391813876369_AMZN 
1391813876370_AMZN 
1391813876371_GOOG
© 2014 MapR Technologies 71 
Hot-Spotting and Region Splits 
• If rowkeys are written in sequential order then 
writes go to only one server 
– Split when full 
1900 
1950 … 
1999 
Region Server 1 
Key colB col 
C 
1900 val 
val 
1999 
Region 1 
Key 
Range 
1900 
1999 
Sequential key, like a timestamp 
File 
Server 1
© 2014 MapR Technologies 72 
Hot-Spotting and Region Splits 
• Regions split as the table grows. 
– RegionServer Creates two new regions, each with half of the 
original regions keys. 
• Sequential writes will go to new region 
2040 
2050 
2000 
Region Server 1 
File 
Server 1 Key colB col 
C 
1900 val 
val 
1999 
Region 1 
Key 
Range 
1900 
1950 
Key colB col 
C 
1900 val 
val 
1999 
Region 2 
Key 
Range 
1950 
2050
© 2014 MapR Technologies 73 
Hot-Spotting and Region Splits 
3040 
3000 
Sequential writes will go 
to new region 
Region Server 1 
File 
Server 1 Key colB col 
C 
1900 val 
val 
1999 
Region 1 
Key 
Range 
1900 
1950 
Key colB col 
C 
1900 val 
val 
1999 
Region 2 
Key 
Range 
1950 
3050
© 2014 MapR Technologies 74 
Hot-Spotting and Region Splits 
3045 
Regions split as the table 
grows. 
Sequential writes will go 
to new region 
Region Server 1 
File 
Server 1 Key colB col 
C 
1900 val 
val 
1999 
Region 1 
Key 
Range 
1900 
1950 
Key colB col 
C 
1900 val 
val 
1999 
Region 2 
Key 
Range 
1950 
2050 
Key colB col 
C 
1900 val 
val 
1999 
Region 3 
Key 
Range 
2051 
3050 
3041 
3050
© 2014 MapR Technologies 75 
Random keys 
Key 
Range 
a23148 
3d1a5f 
e0e9b4 
Key 
Range 
Key 
Range 
Key 
Range 
Key 
Range 
Key 
Range 
MD5 
Hash 
rowkey 
Random writes will go to different regions 
If table was pre-split or big enough to have split 
d = MessageDigest.getInstance("MD5"); 
byte[] prefix = d.digest(Bytes.toBytes(s));
© 2014 MapR Technologies 76 
Sequential vs. Random keys 
Random is better for writing , but sequential is better for scanning 
row keys 
Sequential 
Keys 
Performance 
Salted 
Keys 
Promoted 
Field Keys 
Random 
Keys
© 2014 MapR Technologies 77 
Prefix, Promote a field key 
Key 
Range 
Key 
Range 
Key 
Range 
Key 
Range 
Key 
Range 
Key 
Range 
amzn_1999 
amzn_2003 
amzn_2005 
cisc_1998 
cisc_2002 
cisc_2010 
goog_1990 
goog_2020 
goog_2030
© 2014 MapR Technologies 78 
Prefix with a Hashed field key 
Key 
Range 
a23148_2003 
1d1a5f_1999 
e0e9b4_2000 
Key 
Range 
Key 
Range 
Key 
Range 
Key 
Range 
Key 
Range 
MD5 
Hash 
prefix 
rowkey 
• prefix the rowkey with a (shortened) hash: 
byte[] hash = d.digest(Bytes.toBytes(fieldkey)); 
Bytes.putBytes(rowkey, 0, hash, 0, length); 
g0e8b4_2004 
b33148_2006 
3d1a5f_2007
© 2014 MapR Technologies 79 
Consider Access Patterns for Application 
• Which trade data needs fastest access (or most frequent)? 
– Rowkey ordering 
• What if you want to retrieve the stocks by symbol and date? 
• Scan by row key prefix Increasing time: PREFIX_TIMESTAMP 
• What if you usually want to retrieve the most recent? 
How will data be retrieved? 
Key 
AMZN_1391813876369 
AMZN_1391813876370 
GOOG_1391813876371 
SYMBOL + timestamp
© 2014 MapR Technologies 80 
Last In First Out Access: Use Reverse-Timestamp 
• Row keys are sorted in increasing order 
• For fast access to most-recent writes: 
– design composite rowkey with reverse-timestamp that decreases 
over time. 
– Scan by row key prefix Decreasing: [MAXTIME–TIMESTAMP] 
• Ex: Long.MAX_VALUE-date.getTime() 
SYMBOL + Reverse timestamp 
Key 
AMZN_98618600666 
AMZN_98618600777 
GOOG_98618608888
© 2014 MapR Technologies 81 
Consider Access Patterns for Application 
• What are the needs for atomicity of transactions? 
– Column design 
– More Values in a single row 
• Works well to get or update multiple values 
How will data be retrieved?
© 2014 MapR Technologies 82 
Rowkey design influences shape of Tables: Tall or Flat 
Tall Narrow Flat Wide
© 2014 MapR Technologies 83 
Tall Table for Stock Trades 
Rowkey format: 
Ex: AMZN_98618600888 
SYMBOL + Reverse timestamp 
rowkey 
CF: CF1 
CF1:price CF1:vol 
… … … 
AMZN_98618600666 12.34 2000 
AMZN_98618600777 12.41 50 
AMZN_98618600888 12.37 10000 
… … … 
CSCO_98618600777 23.01 1000 
…
© 2014 MapR Technologies 84 
Consider Access Patterns for Application 
• Are Price and Volume data typically accessed together, or are 
they unrelated? 
– Column family structure 
• Column Families 
– group data based on access pattern & with similar attributes together: # 
Min/Max versions, compression, in-memory, Time-To-Live 
• Columns 
– Column names are dynamic, not pre-defined 
– every row does not need to have same columns 
How will data be retrieved?
© 2014 MapR Technologies 85 
Wide Table for Stock Trades 
rowkey 
CF price CF vol 
p:10 p:1000 … p:2000 v:10 v:1000 … v:2000 
AMZN_986186006 12.37 13 12.34 10000 2000 
… 
CSCO_986186070 23.01 1000 
Rowkey format: 
Ex: AMZN_20131020 
SYMBOL + Reverse timestamp rounded to the hour 
• Separate price & volume data into column families 
• Segregate time into buckets: 
– Time rounded to the hour in the rowkey 
– Time in column name represents seconds since the timestamp in the key 
– Column names can be dynamic, every row does not need to have same columns 
– One row stores a bucket of measurements for the hour
© 2014 MapR Technologies 86 
Lesson: Schemas can be very 
flexible and can even 
change on the fly 
Column names can be dynamic, every row does not 
need to have same columns
© 2014 MapR Technologies 87 
Consider Access Patterns for Application 
• Do all trades need to be saved forever? 
– TTL Time to Live , CF can be set to expire cells 
• How many Versions? 
– Max Versions 
– You can have many versions of data in a cell, default is 3 
How will data be retrieved?
© 2014 MapR Technologies 88 
Wide Table for Stock Trades 
rowkey 
CF price CF vol CF stats 
price:00 … price:23 vol:00 … vol:23 Day Hi Day Lo 
AMZN_20131020 12.37 12.34 10000 2000 
… 
CSCO_20130817 23.01 1000 
Rowkey format: 
Ex: AMZN_20131020 
SYMBOL + date YYYYMMDD 
• Separate price & volume data into column families 
• Segregate time into buckets: 
– Date in the rowkey 
– Hour in the column name 
– Set Column Family to store Max Versions, timestamp in the version
© 2014 MapR Technologies 89 
Flat-Wide Vs. Tall-Narrow Tables 
• Tall-Narrow provides better query granularity 
– Finer grained Row Key 
– Works well with scan 
• Flat-Wide supports built-in row atomicity 
– More Values in a single row 
• Works well to update multiple values (row atomicity) 
• Works well to get multiple associated values
© 2014 MapR Technologies 90 
Lesson: Have to know the 
queries to design in 
performance
© 2014 MapR Technologies 91 
Comparing Relational schema to HBase 
• HBase is lower-level than relational tables 
– Design is different 
• Relational design 
– Data centric, focus on entities and relations 
– Query joins 
• New views of data from different tables easily created 
– Does not scale across cluster 
• HBase is designed for clustering: 
– Distributed data is stored and accessed together 
– Query centric, focus on how the data is read 
– Design for the questions 
Key 
Range 
axxx 
kxxx 
Key 
Range 
xxx 
xxx 
Key 
Range 
xxx 
zxxx
© 2014 MapR Technologies 93 
 Goal Normalization: 
– eliminate redundant data 
– Put repeating information in its own table 
Normalized database : 
– Causes joins 
• data has to be retrieved from more tables. 
• queries can take more time 
Normalization
© 2014 MapR Technologies 94 
HBase Nested Entity 
• A one-to-many relationship can be modeled as a single row 
– Embedded, Nested Entity 
• Order one-to-many with Line Items 
– Row key: parent id 
• OrderId 
– column name : child id stored 
• line Item id 
OrderId Data:date item:id1 item:id2 item:id3 
123 20131010 $10 $20 $9.45
© 2014 MapR Technologies 95 
De-Normalization 
Blog and Comments in same table 
De-normalization: 
– store data about an entity and related entities in the same 
table. 
• Reads are faster across a cluster 
– retrieve data about entity and related entities in one read 
Key Data:ts Comment:id1 Comment:id2 Comment:id3 
Blog post Id Blog text interesting boring exciting
© 2014 MapR Technologies 96 
Many to Many Relationship RDBMS 
user 
id (primary key) 
name 
alias 
email 
book 
id (primary key) 
title 
description 
user_book_rating 
id (primary key) 
userId (foreign key) 
bookId (foreign key) 
rating 
1 ∞ ∞ 1 
Online book store 
• Querys 
• Get name for user x 
• Get tiltle for book x 
• Get books and corresponding ratings for userId x 
• Get all userids and corresponding ratings for book y
© 2014 MapR Technologies 97 
Many to Many Relationship HBase 
User table Column family for book ratings by userid for bookids 
Key data:fname … rating:bookid1 rating:bookid2 
userid1 5 4 
Key data:title … rating:userid1 rating:userid2 
bookid1 5 4 
Book table Column family for ratings for bookid by userid 
• Querys 
• Get books and corresponding ratings for userId x 
• Get all userids and corresponding ratings for book y
© 2014 MapR Technologies 98 
Generic data, Event Data, Entity-Attribute-Value 
• Generic table: Entity, Attributes, Values 
– Event Id, Event attributes, Values 
– object-property-value , name-value pairs , schema-less 
patientXYZ-ts1, Temperature , "102" 
patientXYZ-ts2, Coughing, "True" 
patientXYZ-ts3, Heart Rate, "98" 
• This is the advantage of HBase 
– Define columns on the fly, 
• put attribute name in column qualifier 
– group data by column families 
Key event:heartrate event:coughing event:temperature 
Patientxyz-ts1 98 true 102 
Event id=row key Event type name=qualifier Event measurement=value
© 2014 MapR Technologies 99 
Self Join Relationship HBase 
Twitter 
user x follows user y 
user b followed by user a 
• Querys 
• Get all users who Carol follows 
• Get all users following Carol 
Key data:timestamp 
Carol:follows:SteveJobs 
Carol:followedby:BillyBob
© 2014 MapR Technologies 100 
Hierarchical Data 
• tree like structure 
• Use a flat-wide 
– Parents, children in columns 
usa 
TN FL 
Nashville Miami 
Key P:USA P:TN p:FL c:TN C:FL C:Nashvl C:Miami 
USA state state 
TN country city 
FL country city 
Nashville state 
miami state
© 2014 MapR Technologies 101 
Inheritance mapping 
• Online Store Example Product table 
– put sub class type abbreviation in in key prefix for searching 
– Columns do not all have to be the same for different types 
Key price title details model 
Bok+id1 10 Hbase 
Dvd+Id2 15 stones 
Kin+Id3 100 fire
© 2014 MapR Technologies 102 
Agenda 
• Issues with RDMS and how Hbase helps 
• Quick overview of HBase Data Model & API 
• Design considerations when moving from RDBMS to Hbase 
• How to work around transactions that span multiple rows
© 2014 MapR Technologies 103 
Summary 
• Why NoSQL and why HBase 
• Design considerations when migrating from RDBMS to Hbase 
• De-normalize data 
• Column Families & Columns 
• Versions 
• RowKey design 
• Transactions using checkAndPut and Increment for simple use 
cases
© 2014 MapR Technologies 104 
References 
• http://hbase.apache.org/ 
• http://hbase.apache.org/0.94/apidocs/ 
• http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/pack 
age-summary.html 
• http://hbase.apache.org/book/book.html 
• http://doc.mapr.com/display/MapR/MapR+Overview 
• http://doc.mapr.com/display/MapR/M7+-+Native+Storage+for+MapR+Tables 
• http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop 
• http://doc.mapr.com/display/MapR/Migrating+Between+Apache+HBase+Ta 
bles+and+MapR+Tables
© 2014 MapR Technologies 105 
Q & A 
@mapr maprtech 
yourname@mapr.com 
Engage with us! 
MapR 
maprtech 
mapr-technologies

More Related Content

What's hot

Facebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platformsFacebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platformsNitish Upreti
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)NAVER D2
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureDataWorks Summit
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Storesconfluent
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controllerconfluent
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connectconfluent
 

What's hot (20)

HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Facebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platformsFacebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platforms
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Transactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and futureTransactional operations in Apache Hive: present and future
Transactional operations in Apache Hive: present and future
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Scaling HBase for Big Data
Scaling HBase for Big DataScaling HBase for Big Data
Scaling HBase for Big Data
 
Performance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State StoresPerformance Tuning RocksDB for Kafka Streams’ State Stores
Performance Tuning RocksDB for Kafka Streams’ State Stores
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
A Deep Dive into Kafka Controller
A Deep Dive into Kafka ControllerA Deep Dive into Kafka Controller
A Deep Dive into Kafka Controller
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connect
 

Viewers also liked

2 - Trafodion and Hadoop HBase
2 - Trafodion and Hadoop HBase2 - Trafodion and Hadoop HBase
2 - Trafodion and Hadoop HBaseRohit Jain
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
HFile: A Block-Indexed File Format to Store Sorted Key-Value PairsHFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
HFile: A Block-Indexed File Format to Store Sorted Key-Value PairsSchubert Zhang
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanTaro L. Saito
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZDataFactZ
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesData Con LA
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 

Viewers also liked (20)

Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
2 - Trafodion and Hadoop HBase
2 - Trafodion and Hadoop HBase2 - Trafodion and Hadoop HBase
2 - Trafodion and Hadoop HBase
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
HFile: A Block-Indexed File Format to Store Sorted Key-Value PairsHFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs
 
Hbase trabalho final
Hbase trabalho finalHbase trabalho final
Hbase trabalho final
 
Proposta de arquitetura Hadoop
Proposta de arquitetura HadoopProposta de arquitetura Hadoop
Proposta de arquitetura Hadoop
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
 
Introduction to Spark - DataFactZ
Introduction to Spark - DataFactZIntroduction to Spark - DataFactZ
Introduction to Spark - DataFactZ
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 

Similar to Getting Started with HBase

Getting started with HBase
Getting started with HBaseGetting started with HBase
Getting started with HBaseCarol McDonald
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Dchug m7-30 apr2013
Dchug m7-30 apr2013Dchug m7-30 apr2013
Dchug m7-30 apr2013jdfiori
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"Inhacking
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentationMurat Çakal
 
SQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTPSQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTPTony Rogerson
 
Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011mubarakss
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistencyScyllaDB
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series DatabaseDataWorks Summit
 
Xia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLXia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLMLconf
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plansinside-BigData.com
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelAndrey Lomakin
 

Similar to Getting Started with HBase (20)

Getting started with HBase
Getting started with HBaseGetting started with HBase
Getting started with HBase
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Dchug m7-30 apr2013
Dchug m7-30 apr2013Dchug m7-30 apr2013
Dchug m7-30 apr2013
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
SE2016 Java Valerii Moisieienko "Apache HBase Workshop"
 
Valerii Moisieienko Apache hbase workshop
Valerii Moisieienko	Apache hbase workshopValerii Moisieienko	Apache hbase workshop
Valerii Moisieienko Apache hbase workshop
 
Scaling web applications with cassandra presentation
Scaling web applications with cassandra presentationScaling web applications with cassandra presentation
Scaling web applications with cassandra presentation
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
SQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTPSQL Server 2014 In-Memory OLTP
SQL Server 2014 In-Memory OLTP
 
Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011Bay area Cassandra Meetup 2011
Bay area Cassandra Meetup 2011
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Renegotiating the boundary between database latency and consistency
Renegotiating the boundary between database latency  and consistencyRenegotiating the boundary between database latency  and consistency
Renegotiating the boundary between database latency and consistency
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Xia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATLXia Zhu – Intel at MLconf ATL
Xia Zhu – Intel at MLconf ATL
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plans
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data model
 

More from Carol McDonald

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBCarol McDonald
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...Carol McDonald
 
Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningCarol McDonald
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBCarol McDonald
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Carol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Carol McDonald
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningCarol McDonald
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churnCarol McDonald
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Carol McDonald
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient DataCarol McDonald
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesCarol McDonald
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 

More from Carol McDonald (20)

Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
 
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...Analysis of Popular Uber Locations using Apache APIs:  Spark Machine Learning...
Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning...
 
Predicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine LearningPredicting Flight Delays with Spark Machine Learning
Predicting Flight Delays with Spark Machine Learning
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep Learning
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures
 
Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churn
 
Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1Fast Cars, Big Data How Streaming can help Formula 1
Fast Cars, Big Data How Streaming can help Formula 1
 
Applying Machine Learning to Live Patient Data
Applying Machine Learning to  Live Patient DataApplying Machine Learning to  Live Patient Data
Applying Machine Learning to Live Patient Data
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 

Recently uploaded

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Getting Started with HBase

  • 1. © 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 1 Getting Started with HBase Application development http://answers.mapr.com/
  • 2. © 2014 MapR Technologies 2 Objectives of this session • Why do we need NoSQL / HBase? • Quick overview of HBase & HBase data model • Design considerations when migrating from RDBMS to HBase • How to work around transactions that span multiple rows
  • 3. © 2014 MapR Technologies 3 Why do we need NoSQL / HBase? Relational database model Relational Data is typed and structured before stored: – Entities map to tables, normalized – Structured Query Language • Joins tables to bring back data
  • 4. © 2014 MapR Technologies 4 Why do we need NoSQL / HBase? Relational Model • Pros – Standard persistence model • standard language for data manipulation – Transactions handle concurrency , consistency – efficient and robust structure for storing data
  • 5. © 2014 MapR Technologies 5 What changed to bring on NoSQL? Lots of data & the need to scale With the internet, increase in traffic and data, came the need to scale • Vertical Scaling is expensive and has limits • Not all use cases require ACID transactions Vertical scale = big box $$
  • 6. © 2014 MapR Technologies 6 What changed to bring on NoSQL? Lots of data, the need to scale horizontally • Horizontal scaling – Cheaper than vertical – parallel execution – high reliability (doesn’t depend on one box) • Relational databases were not designed to do this automatically Horizonal scale: Split table by rows into partitions across a cluster Key colB colC val val val xxx val val id 1-1000 id 1000-2000 id 2000=3000 Key colB colC val val val xxx val val Key colB colC val val val xxx val val
  • 7. © 2014 MapR Technologies 7 Facebook 2010 •9000 memcache instances •4000 Shards mysql Lots of people tried to spread databases across a cluster. http://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-than-death/
  • 8. © 2014 MapR Technologies 8 What changed to bring on NoSQL? Big data • Cons of the Relational Model: – Does not scale horizontally: • Sharding is difficult to manage • Distributed join, transactions do not scale across shards Horizonal scale : partition or shard tables across cluster Distributed Joins, Transactions are Expensive bottleneck
  • 9. © 2014 MapR Technologies 9 Google File System MapReduce • Distributed Storage System • Designed to scale • Paper published in 2006. Big Table Distributed File Model Distributed Compute Model
  • 10. © 2014 MapR Technologies 10 Google Big Table Crawlers MapReduce Indexing Queries Indexes CF colA colB colC val val val Metadata MapReduce metadata
  • 11. © 2014 MapR Technologies 11 Distributed File System (HDFS) Map Reduce
  • 12. © 2014 MapR Technologies 12 HBase is a Distributed Database Data is automatically distributed across the cluster. • Row is indexed by a row key • Key range is used for horizontal partitioning • Table splits happen automatically as the data grows Key Range xxxx xxxx CF1 colA colB colC val val val CF2 colA colB colC val val val Key Range xxxx xxxx CF1 colA colB colC val val val CF2 colA colB colC val val val Key Range xxxx xxxx CF1 colA colB colC val val val CF2 colA colB colC val val val Put, Get by Key
  • 13. © 2014 MapR Technologies 13 HBase is a ColumnFamily oriented Database Data is accessed and stored together: • Row is indexed by a row key • Similar Data is grouped & stored in Column Families – share common properties: • Number of versions • Time to Live (TTL) • Compression [lz4, lzf, Zlib] • In memory option … CF1 colA colB colC Val val val CF2 colA colB colC val val val RowKey axxx gxxx Customer id Customer Address data Customer order data
  • 14. © 2014 MapR Technologies 14 HBase designed for Distribution • distributed data stored and accessed together: – Key range is used for horizontal partitioning • Pros – scalable handles data volume and velocity – Fast Writes and Reads by Key • Cons – No joins – No dynamic queries – Need to know how data will be queried in advance to achieve best schema design Key Range axxx kxxx Key Range axxx kxxx Key Range axxx kxxx
  • 15. © 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 15 Introduction to Hbase, data model and Architecture – Understand how the data flows for writes and reads
  • 16. © 2014 MapR Technologies 16 HBase Data Model Row Keys: identify the rows in an HBase table Columns are grouped into column families – can contain an arbitrary number of columns. Row Key CF1 CF2 … colA colB colC colA colB colC colD R1 axxx val val val val … gxxx val val val val R2 hxxx val val val val val val val … jxxx val R3 kxxx val val val val … rxxx val val val val val val
  • 17. © 2014 MapR Technologies 17 HBase Data Storage - Cells • Data is stored in Key value format • Value for each cell is specified by complete coordinates: • (Row key, ColumnFamily, Column Qualifier, timestamp ) => Value – RowKey:CF:Col:Version:Value – smithj:data:city:1391813876369:nashville Cell Coordinates= Key Row key Column Family Column Qualifier Timestamp Value Smithj data city 1391813876369 nashville Column Key Value
  • 18. © 2014 MapR Technologies 18 Logical Data Model vs Physical Data Storage • Data is stored in Key Value format • Key Value is stored for each Cell • Column families data are stored in separate files RowK ey CF1 CF2 colA colB colA colC ra 1 2 rxxxx rxxx Logical Model Row Key CF1:Col version value ra cf1:cola 1 1 row Key CF2:Col version value ra cf2:cola 1 2 Physical Storage Key Value Key Value Physical Storage
  • 19. © 2014 MapR Technologies 19 Sparse Data with Cell Versions CF1:colA CF1:colB CF1:colC Row1 Row10 Row11 Row2 @time1: value1 @time5: value2 @time7: value3 @time2: value1 @time3: value1 @time4: value1 @time2: value1 @time4: value1 @time6: value2
  • 20. © 2014 MapR Technologies 20 Versioned Data • Version – each put, delete adds new cell, new version – A long • by default the current time in milliseconds if no version specified – Last 3 versions are stored by default • Configurable by column family – You can delete specific cell versions – When a cell exceeds the maximum number of versions, the extra records are removed Key CF1:Col version value ra cf1:cola 3 3 ra cf1:cola 2 2 ra cf1:cola 1 1
  • 21. © 2014 MapR Technologies 21 Table Physical View Physically data is stored per Column family as a sorted map • Ordered by row key, column qualifier in ascending order • Ordered by timestamp in descending order Row key Column qualifier Cell value Timestamp (long) Row1 CF1:colA value3 time7 Row1 CF1:colA value2 time5 Row1 CF1:colA value1 time1 Row10 CF1:colA value1 time4 Row 10 CF1:colB value1 time4 Sorted by Row key and Column Sorted in descending order
  • 22. © 2014 MapR Technologies 22 Logical Data Model vs Physical Data Storage Key CF1:Col version value ra cf1:ca 1 1 rb cf1:cb 2 4 rb cf1:cb 1 3 rc cf1:ca 1 5 Row Key CF1 CF2 ca cb ca cd ra 1 2 rb 3,4 rc 5 6,7 8 Key CF2:Col version value ra cf2:ca 1 2 rc cf2:ca 2 7 rc cf2:ca 1 6 rc cf2:cd 1 8 Physical Storage Logical Model Column families are stored separately Row keys, Qualifiers are sorted lexicographically Key Value Key Value
  • 23. © 2014 MapR Technologies 23 HBase Table is a Sorted map of maps SortedMap<Key, Value> Table Map of Rows Map of CF Map of columns Map of cells Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8 SortedMap<RowKey, SortedMap< ColumnFamily, SortedMap< ColumnName, SortedMap < version, Value> >>>
  • 24. © 2014 MapR Technologies 24 HBase Table SortedMap<Key, Value> <ra,<cf1, <ca, <v1, 1>> <cf2, <ca, <v1, 2>>> <rb,<cf1, <cb, <v2, 4> <v1, 3>>> <rc,<cf1, <ca, <v1, 5>> <cf2, <ca, <v2, 7>> <ca, <v1, 6>> <cd, <v1, 8>>> Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8
  • 25. © 2014 MapR Technologies 25 Basic Table Operations • Create Table, define Column Families before data is imported – but not the rows keys or number/names of columns • Low level API, technically more demanding • Basic data access operations (CRUD): put Inserts data into rows (both add and update) get Accesses data from one row scan Accesses data from a range of rows delete Delete a row or a range of rows or columns
  • 26. © 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 26 Hbase Architecture
  • 27. © 2014 MapR Technologies 27 What is a Region? • Tables are partitioned into key ranges (regions) • Region servers serve data for reads and writes – For the range of keys it is responsible for Region Server Client Region Region HMaster zookeepe zozoorkoekeepeeprer Region Server Region Region Get Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 28. © 2014 MapR Technologies 28 Region Server Components • WAL: write ahead log on disk (commit log), Used for recovery • BlockCache: Read Cache, LRU, Least Recently Used evicted • MemStore: Write Cache, sorted map of keyValue updates. – 1 memstore per column family per region • Hfile=sorted KeyValues on disk Region Server HDFS Data Node BlockCache memstore HFile memstore HFile Region Region HFile HFile memstore memstore WAL
  • 29. © 2014 MapR Technologies 29 HBase Write Steps Put each incoming record written to WAL for durability: • log on disk • updates appended sequentially HDFS Data Node memstore memstore Region Server Region WAL
  • 30. © 2014 MapR Technologies 30 HBase Write Steps Put Next updates are written to the Memstore: • write cache • in-memory • sorted list of KeyValue updates HDFS Data Node memstore memstore Region Server Region WAL Ack Updates quickly sorted in memory are available to queries after put returns
  • 31. © 2014 MapR Technologies 31 HBase Memstore • in-memory • sorted list of Key → Value • One per column family • Updates quickly sorted in memory memstore memstore Region Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8 Key Value Key Value
  • 32. © 2014 MapR Technologies 32 HBase Region Flush When 1 Memstore is full: • all memstores in region flushed to new Hfiles on disk • Hfile: sorted list of key → values On disk HDFS Data Node memstore memstore Region Server Region WAL HFile HFile FLUSH HHFFileile
  • 33. © 2014 MapR Technologies 33 HBase HFile • On disk sorted list of key → values • One per column family • Flushed quickly to file • Sequential write HDFS Data Node HFile HFile Sequential write Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Key CF2:Col version value ra cf2:ca v1 2 rc cf2:ca v2 7 rc cf2:ca v1 6 rc cf2:cd v1 8 Key Value Key Value
  • 34. © 2014 MapR Technologies 34 HBase HFile Structure • Memstore flushes to an Hfile 64Kbyte blocks are compressed Key-value pairs are stored in increasing order Index points to row keys location B+-tree: leaf index , root index Key A Value … … … Key P Value ….. …. … Key T Value ….. …. … Key z Value ….. …. … Root Index Bloom Leaf Index Leaf Index Leaf Index Leaf Index Leaf Index Leaf Index Root Index Interm Index Bloom Trailer Bloom Bloom Intermediate Index Bloom Bloom Bloom Bloom
  • 35. © 2014 MapR Technologies 35 HBase Read Merge from Memory and Files • MemStore creates multiple small store files over time when flushing. • Read Amplification: When a get/scan comes in, multiple files have to be examined HDFS Data Node BlockCache memstore Region Server Region WAL HFile HFile HFile scanner read Get or Scan searches for Row Cell KeyValues: 1. Block Cache ((Memory) 2. Memstore (Memory) 3. Load HFiles from Disk into Block Cache based on indexes and bloomfilters 1 2 3
  • 36. © 2014 MapR Technologies 36 HDFS Data Node HBase Compaction • minor compaction: • merges files into fewer larger ones. • Major compaction: • merge all Hfiles into one per column family. • remove cells marked for deletion Region Server Region Region memstore memstore WAL minor compaction HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile HFile updates HFile major compaction Flush to disk
  • 37. © 2014 MapR Technologies 37 HBase Background: Log-Structured Merge Trees • Traditional Databases use B+ trees: – expensive to update • HBase: Log Structured Merge Trees – transforms random writes into sequential writes • Writes go to memory – And WAL • memstore flushes to disk – Reads • From memory, index, sorted disk predictable disk seeks
  • 38. © 2014 MapR Technologies 38 Data Model for Fast Writes, Reads • Predictable disk lay out • Minimize disk seek • Get, Put by row key: primary index, fast random access • Scan by row key range: stored sorted, efficient sequential access for key range Region1 Key Range ra rx Region Region Region Server Key CF1:Col version value ra cf1:ca v1 1 rb cf1:cb v2 4 rb cf1:cb v1 3 rc cf1:ca v1 5 Get key Scan start key, Stop key Minimize disk seek
  • 39. © 2014 MapR Technologies 39 Region = contiguous keys • Regions fundamental partitioning/sharding object. • By default, on table creation 1 region is created that holds the entire key range. • When region becomes too large, splits into two child regions. • Typical region size is a few GB, sometimes even 10G or 20G Region CF1 colA colB colC val val val CF2 colA colB colC val val val Key Range axxx gxxx Region
  • 40. © 2014 MapR Technologies 40 Region Split • The RegionServer splits a region • daughter regions – represent 1/2 of the original region – each with half of the original regions keys. – opened in parallel on same server • reports the split to the Master Region 1 Region 2 Region Server 1 Key colB colC val val val Key colB colC val val val Region Server 1 Region 1 Key Range axxx kxxx Key Range Lxxx zxxx Key colB colC val val val
  • 41. © 2014 © 201 M4 aMpaRp RTe Tcehcnhonloogloiegsies 41 HBase Use Cases
  • 42. © 2014 MapR Technologies 43 3 Main Use Case Categories • Capturing Incremental data – Hi Volume, Velocity Writes – Time Series Data, Stuff with a Time Stamp • Sensor, System Metrics, Events, • Stock Ticker, User clicks, HBase Put App Server App read Server Put Put Put Event time stamped data sensor OpenTSDB Data for real-time monitoring.
  • 43. © 2014 MapR Technologies 44 HBase Messages read Put App Server read App Server read App Server Put Put Put App Server 3 Main Use Case Categories • Information Exchange Hi Volume, Velocity Write/Read • Messaging on Facebook is backed by Hbase – communication coming from email, SMS, Facebook Chat, and the Inbox – https://www.facebook.com/UsingHbase
  • 44. © 2014 MapR Technologies 45 3 Main Use Case Categories • Content Serving, Web Application Backend– Hi Volume, Velocity Reads – Online Catalog: Intuit Merchant, Gap, World Library Catalog. – CRM: Salesforce CMS: Lily – Search Index: ebay, photobucket – Online Pre-Computed View: Groupon, Pinterest Hbase Processed data read App Server read App Server read App Bulk Import Server
  • 45. © 2014 MapR Technologies 46 Agenda • Introduction to HBase data model and Architecture • Using HBase shell to create tables and insert data – Demo / Lab using Hbase shell to create tables • Java API fundamentals to perform CRUD operations – Demo / Lab using Eclipse, HBase Java API & MapR Sandbox • Understand how the data flows for writes and reads • Schema design concepts including rowkey design • Advanced Java APIs to perform scans and do transactions – Demo / Lab
  • 46. © 2014 MapR Technologies 47 Lab Exercise See Lab_Hbase_Shell.pdf Start MapR Sandbox and log into the cluster [user: mapr, passwd: mapr] Use the HBase shell >Hbase shell hbase> help hbase> create ’/user/mapr/mytable’, {NAME =>’cf1’} hbase> put ’/user/mapr/mytable’, ’row1’, ’cf1:col1’, ‘datacf1c1v1’ hbase> get ’/user/mapr/mytable’, ’row1’ hbase> scan ’/user/mapr/mytable’ hbase> describe ’/user/mapr/mytable’
  • 47. © 2014 MapR Technologies 48 © MapR Technologies, confidential HBase Java API fundamentals to perform CRUD operations
  • 48. © 2014 MapR Technologies 50 Shoppingcart Application Requirements • Need to create Tables: Shoppingcart & Inventory • Perform CRUD operations on these tables – Create, Read, Update, and Delete items from these tables
  • 49. © 2014 MapR Technologies 51 Inventory & Shoppingcart Tables Perform checkout operation for Mike Inventory Table Shoppingcart Table quantity Pens 24 Notepads 54 Erasers 15 Pencils 90 pens notepads erasers Mike 5 5 John 10 15 4 Mary 9 10 Adam 18 7 10 CF “stock " CF “items"
  • 50. © 2014 MapR Technologies 52 Java API Fundamentals • CRUD operations – Get, Put, Delete, Scan, checkAndPut, checkAndDelete, Increment – KeyValue, Result, Scan – ResultScanner, – Batch Operations
  • 51. © 2014 MapR Technologies 53 CRUD Operations Follow A Pattern (mostly) • common pattern – Instantiate object for an operation: Put put = new Put(key) – Add attributes to specify what to insert: put.add(…) – invoke operation with HTable: myTable.put(put) // Insert value1 into rowKey in columnFamily:columnName1 Put put = new Put(rowKey); put.add(columnFamily, columnName1, value1); myTable.put(put);
  • 52. © 2014 MapR Technologies 54 erasers notepads pens Mike 2 5 5 CF “items" Shoppingcart Table Shopping Cart Table Key CF:COL ts value Mike items:erasers 1391813876369 2 Mike items:notepads 1391813876369 5 Mike items:pens 1391813876369 5 Physical Storage
  • 53. © 2014 MapR Technologies 55 Put Operation adding multiple column values to a row byte [] tableName = Bytes.toBytes("/path/Shopping"); byte [] itemsCF = Bytes.toBytes(“items"); byte [] penCol = Bytes.toBytes (“pens”); byte [] noteCol = Bytes.toBytes (“notes”); byte [] eraserCol = Bytes.toBytes (“erasers”); HTableInterface table = new HTable(hbaseConfig, tableName); Put put = new Put(“Mike”); put.add(itemsCF, penCol, Bytes.toBytes(5l)); put.add(itemsCF, noteCol, Bytes.toBytes(5l)); put.add(itemsCF, eraserCol, Bytes.toBytes(2l)); table.put(put); Key CF:COL ts value Mike items:erasers 1391813876369 2 Mike items:notepads 1391813876369 5 Mike items:pens 1391813876369 5
  • 54. © 2014 MapR Technologies 56 Get Example byte [] tableName = Bytes.toBytes("/path/Shopping"); byte [] itemsCF = Bytes.toBytes(“items"); byte [] penCol = Bytes.toBytes (“pens”); HTableInterface table = new HTable(hbaseConfig, tableName); Get get = new Get(“Mike”); get.addColumn(itemsCF, penCol); Result result = myTable.get(get); byte[] val = result.getValue(itemsCF, penCol); System.out.println("Value: " + Bytes.toLong(val)); //prints 5 Key CF:COL ts value Mike items:erasers 1391813876369 2 Mike items:notepads 1391813876369 5 Mike items:pens 1391813876369 5
  • 55. © 2014 MapR Technologies 57 Result Class • A Result instance wraps data from a row returned from a get or a scan operation. Result wraps KeyValues • Result toString() looks like this : keyvalues={Adam/items:erasers/1391813876369/Put/vlen=8/ts=0, Adam/items:notepads/1391813876369/Put/vlen=8/ts=0, Adam/items:pens/1391813876369/Put/vlen=8/ts=0} • The Result object provides methods to return values byte[] b = result.getValue(columnFamilyName,columnName1); Items:erasers Items:notepads Items:pens Adam 10 7 18 http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/Result.html
  • 56. © 2014 MapR Technologies 58 KeyValue – The Fundamental HBase Type • A KeyValue instance is a cell instance – Contains Key (cell coordinates) and the Value (data) • Cell coordinates: Row key, Column family, Column qualifier, Timestamp • KeyValue toString() looks like this : Adam/items:erasers/1391813876369/Put/vlen=8/ Key =Cell Coordinates Row key Column Family Column Qualifier Timestamp Value Value Adam items erasers 1391813876369 10
  • 57. © 2014 MapR Technologies 59 Bytes class http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/util/Bytes.html • org.apache.hadoop.hbase.util.Bytes • Provides methods to convert Java types to and from byte[] arrays • Support for – String, boolean, short, int, long, double, and float byte[] bytesTable = Bytes.toBytes("Shopping"); String table = Bytes.toString(bytesTable); byte[] amountBytes = Bytes.toBytes(1000l); long amount = Bytes.toLong(amount);
  • 58. © 2014 MapR Technologies 60 Scan Operation – Example byte[] startRow=Bytes.toBytes(“Adam”); byte[] stopRow=Bytes.toBytes(“N”); Scan s = new Scan(startRow, stopRow); scan.addFamily(columnFamily); ResultScanner rs = myTable.getScanner(s);
  • 59. © 2014 MapR Technologies 61 ResultScanner - Example Resultscanner provides iterator-like functionality Scan scan = new Scan(); scan.addFamily(columnFamily); ResultScanner scanner = myTable.getScanner(scan); try { for (Result res : scanner) { System.out.println(res); } } catch (Exception e) { System.out.println(e); } finally { scanner.close(); } Calls scanner.next() Always put in finally block
  • 60. © 2014 MapR Technologies 62 Lab Exercise Program Structure • ShoppingCartApp– main class • InventoryDAO – A DAO for the Inventory CRUD functionality • ShoppingCartDAO – A DAO for the Inventory CRUD functionality • Inventory – A Java object that holds data for a single Inventory row • ShoppingCart – A Java object that holds data for a single Inventory row • MockHtable – in memory test hbase table, allows to run code, debug without hbase running on a cluster or vm.
  • 61. © 2014 MapR Technologies 63 Lab Exercise • See Lab_3_Java_API.pdf – Import the project “lab-exercises-shopping” into Eclipse – Setup creates Inventory and Shoppingcart Tables and inserts data – Use Get, Put, Scan, and Delete operations
  • 62. © 2014 MapR Technologies 64 Lab: Import, build • Download the code • Import Maven project lab-exercises-shopping into Eclipse • Build : Run As -> Maven Install
  • 63. © 2014 MapR Technologies 65 Lab: run TestInventorySetup JUnit • Select Test Class, Then Run As -> JUnit Test • Uses MockHTable https://gist.github.com/agaoglu/613217
  • 64. © 2014 MapR Techno©lo 2g0ie1s4 MapR Technologies 66 Schema Design Guidelines • HBase tables ≠ Relational tables! • Be careful about assumptions you bring from past experience. • Examples: – Data normalization not required – Different atomicity rules
  • 65. © 2014 MapR Technologies 67 Use Case Example: Record Stock Trade Information in a Table • Trade data: Trade • timestamp • stock symbol • price per share • volume of trade • Example – 1381396363000 (epoch timestamp with millisecond granularity) – AMZN – $304.66 – 1333 shares
  • 66. © 2014 MapR Technologies 68 Intelligent keys • Only the row keys are indexed • Compose the key with attributes used for searching – Composite key : 2 or more identifying attributes – Like multi-column index design in RDB – Use fixed length, or separators Cell Coordinates (Key) Granularity Row key Column Family Column Name Timestamp Value Restrict disk I/O Restrict network traffic
  • 67. © 2014 MapR Technologies 69 Composite Keys Use composite rowkey to bound scan ranges and provide sub-indexing to cell data. • Include multiple elements in the rowkey – Use a separator or fixed length • Example rowkey format: – Ex: GOOG_20131012 • Get operations require complete row key. • Scans can use partial keys. – Ex: “GOOG” or "GOOG_2014" SYMBOL + DATE (YYYYMMDD)
  • 68. © 2014 MapR Technologies 70 Consider Access Patterns for Application • By date? By hour? By companyId? By PersonId? – Rowkey design • What if the Date/Timestamp is leftmost ? How will data be retrieved? Key 1391813876369_AMZN 1391813876370_AMZN 1391813876371_GOOG
  • 69. © 2014 MapR Technologies 71 Hot-Spotting and Region Splits • If rowkeys are written in sequential order then writes go to only one server – Split when full 1900 1950 … 1999 Region Server 1 Key colB col C 1900 val val 1999 Region 1 Key Range 1900 1999 Sequential key, like a timestamp File Server 1
  • 70. © 2014 MapR Technologies 72 Hot-Spotting and Region Splits • Regions split as the table grows. – RegionServer Creates two new regions, each with half of the original regions keys. • Sequential writes will go to new region 2040 2050 2000 Region Server 1 File Server 1 Key colB col C 1900 val val 1999 Region 1 Key Range 1900 1950 Key colB col C 1900 val val 1999 Region 2 Key Range 1950 2050
  • 71. © 2014 MapR Technologies 73 Hot-Spotting and Region Splits 3040 3000 Sequential writes will go to new region Region Server 1 File Server 1 Key colB col C 1900 val val 1999 Region 1 Key Range 1900 1950 Key colB col C 1900 val val 1999 Region 2 Key Range 1950 3050
  • 72. © 2014 MapR Technologies 74 Hot-Spotting and Region Splits 3045 Regions split as the table grows. Sequential writes will go to new region Region Server 1 File Server 1 Key colB col C 1900 val val 1999 Region 1 Key Range 1900 1950 Key colB col C 1900 val val 1999 Region 2 Key Range 1950 2050 Key colB col C 1900 val val 1999 Region 3 Key Range 2051 3050 3041 3050
  • 73. © 2014 MapR Technologies 75 Random keys Key Range a23148 3d1a5f e0e9b4 Key Range Key Range Key Range Key Range Key Range MD5 Hash rowkey Random writes will go to different regions If table was pre-split or big enough to have split d = MessageDigest.getInstance("MD5"); byte[] prefix = d.digest(Bytes.toBytes(s));
  • 74. © 2014 MapR Technologies 76 Sequential vs. Random keys Random is better for writing , but sequential is better for scanning row keys Sequential Keys Performance Salted Keys Promoted Field Keys Random Keys
  • 75. © 2014 MapR Technologies 77 Prefix, Promote a field key Key Range Key Range Key Range Key Range Key Range Key Range amzn_1999 amzn_2003 amzn_2005 cisc_1998 cisc_2002 cisc_2010 goog_1990 goog_2020 goog_2030
  • 76. © 2014 MapR Technologies 78 Prefix with a Hashed field key Key Range a23148_2003 1d1a5f_1999 e0e9b4_2000 Key Range Key Range Key Range Key Range Key Range MD5 Hash prefix rowkey • prefix the rowkey with a (shortened) hash: byte[] hash = d.digest(Bytes.toBytes(fieldkey)); Bytes.putBytes(rowkey, 0, hash, 0, length); g0e8b4_2004 b33148_2006 3d1a5f_2007
  • 77. © 2014 MapR Technologies 79 Consider Access Patterns for Application • Which trade data needs fastest access (or most frequent)? – Rowkey ordering • What if you want to retrieve the stocks by symbol and date? • Scan by row key prefix Increasing time: PREFIX_TIMESTAMP • What if you usually want to retrieve the most recent? How will data be retrieved? Key AMZN_1391813876369 AMZN_1391813876370 GOOG_1391813876371 SYMBOL + timestamp
  • 78. © 2014 MapR Technologies 80 Last In First Out Access: Use Reverse-Timestamp • Row keys are sorted in increasing order • For fast access to most-recent writes: – design composite rowkey with reverse-timestamp that decreases over time. – Scan by row key prefix Decreasing: [MAXTIME–TIMESTAMP] • Ex: Long.MAX_VALUE-date.getTime() SYMBOL + Reverse timestamp Key AMZN_98618600666 AMZN_98618600777 GOOG_98618608888
  • 79. © 2014 MapR Technologies 81 Consider Access Patterns for Application • What are the needs for atomicity of transactions? – Column design – More Values in a single row • Works well to get or update multiple values How will data be retrieved?
  • 80. © 2014 MapR Technologies 82 Rowkey design influences shape of Tables: Tall or Flat Tall Narrow Flat Wide
  • 81. © 2014 MapR Technologies 83 Tall Table for Stock Trades Rowkey format: Ex: AMZN_98618600888 SYMBOL + Reverse timestamp rowkey CF: CF1 CF1:price CF1:vol … … … AMZN_98618600666 12.34 2000 AMZN_98618600777 12.41 50 AMZN_98618600888 12.37 10000 … … … CSCO_98618600777 23.01 1000 …
  • 82. © 2014 MapR Technologies 84 Consider Access Patterns for Application • Are Price and Volume data typically accessed together, or are they unrelated? – Column family structure • Column Families – group data based on access pattern & with similar attributes together: # Min/Max versions, compression, in-memory, Time-To-Live • Columns – Column names are dynamic, not pre-defined – every row does not need to have same columns How will data be retrieved?
  • 83. © 2014 MapR Technologies 85 Wide Table for Stock Trades rowkey CF price CF vol p:10 p:1000 … p:2000 v:10 v:1000 … v:2000 AMZN_986186006 12.37 13 12.34 10000 2000 … CSCO_986186070 23.01 1000 Rowkey format: Ex: AMZN_20131020 SYMBOL + Reverse timestamp rounded to the hour • Separate price & volume data into column families • Segregate time into buckets: – Time rounded to the hour in the rowkey – Time in column name represents seconds since the timestamp in the key – Column names can be dynamic, every row does not need to have same columns – One row stores a bucket of measurements for the hour
  • 84. © 2014 MapR Technologies 86 Lesson: Schemas can be very flexible and can even change on the fly Column names can be dynamic, every row does not need to have same columns
  • 85. © 2014 MapR Technologies 87 Consider Access Patterns for Application • Do all trades need to be saved forever? – TTL Time to Live , CF can be set to expire cells • How many Versions? – Max Versions – You can have many versions of data in a cell, default is 3 How will data be retrieved?
  • 86. © 2014 MapR Technologies 88 Wide Table for Stock Trades rowkey CF price CF vol CF stats price:00 … price:23 vol:00 … vol:23 Day Hi Day Lo AMZN_20131020 12.37 12.34 10000 2000 … CSCO_20130817 23.01 1000 Rowkey format: Ex: AMZN_20131020 SYMBOL + date YYYYMMDD • Separate price & volume data into column families • Segregate time into buckets: – Date in the rowkey – Hour in the column name – Set Column Family to store Max Versions, timestamp in the version
  • 87. © 2014 MapR Technologies 89 Flat-Wide Vs. Tall-Narrow Tables • Tall-Narrow provides better query granularity – Finer grained Row Key – Works well with scan • Flat-Wide supports built-in row atomicity – More Values in a single row • Works well to update multiple values (row atomicity) • Works well to get multiple associated values
  • 88. © 2014 MapR Technologies 90 Lesson: Have to know the queries to design in performance
  • 89. © 2014 MapR Technologies 91 Comparing Relational schema to HBase • HBase is lower-level than relational tables – Design is different • Relational design – Data centric, focus on entities and relations – Query joins • New views of data from different tables easily created – Does not scale across cluster • HBase is designed for clustering: – Distributed data is stored and accessed together – Query centric, focus on how the data is read – Design for the questions Key Range axxx kxxx Key Range xxx xxx Key Range xxx zxxx
  • 90. © 2014 MapR Technologies 93  Goal Normalization: – eliminate redundant data – Put repeating information in its own table Normalized database : – Causes joins • data has to be retrieved from more tables. • queries can take more time Normalization
  • 91. © 2014 MapR Technologies 94 HBase Nested Entity • A one-to-many relationship can be modeled as a single row – Embedded, Nested Entity • Order one-to-many with Line Items – Row key: parent id • OrderId – column name : child id stored • line Item id OrderId Data:date item:id1 item:id2 item:id3 123 20131010 $10 $20 $9.45
  • 92. © 2014 MapR Technologies 95 De-Normalization Blog and Comments in same table De-normalization: – store data about an entity and related entities in the same table. • Reads are faster across a cluster – retrieve data about entity and related entities in one read Key Data:ts Comment:id1 Comment:id2 Comment:id3 Blog post Id Blog text interesting boring exciting
  • 93. © 2014 MapR Technologies 96 Many to Many Relationship RDBMS user id (primary key) name alias email book id (primary key) title description user_book_rating id (primary key) userId (foreign key) bookId (foreign key) rating 1 ∞ ∞ 1 Online book store • Querys • Get name for user x • Get tiltle for book x • Get books and corresponding ratings for userId x • Get all userids and corresponding ratings for book y
  • 94. © 2014 MapR Technologies 97 Many to Many Relationship HBase User table Column family for book ratings by userid for bookids Key data:fname … rating:bookid1 rating:bookid2 userid1 5 4 Key data:title … rating:userid1 rating:userid2 bookid1 5 4 Book table Column family for ratings for bookid by userid • Querys • Get books and corresponding ratings for userId x • Get all userids and corresponding ratings for book y
  • 95. © 2014 MapR Technologies 98 Generic data, Event Data, Entity-Attribute-Value • Generic table: Entity, Attributes, Values – Event Id, Event attributes, Values – object-property-value , name-value pairs , schema-less patientXYZ-ts1, Temperature , "102" patientXYZ-ts2, Coughing, "True" patientXYZ-ts3, Heart Rate, "98" • This is the advantage of HBase – Define columns on the fly, • put attribute name in column qualifier – group data by column families Key event:heartrate event:coughing event:temperature Patientxyz-ts1 98 true 102 Event id=row key Event type name=qualifier Event measurement=value
  • 96. © 2014 MapR Technologies 99 Self Join Relationship HBase Twitter user x follows user y user b followed by user a • Querys • Get all users who Carol follows • Get all users following Carol Key data:timestamp Carol:follows:SteveJobs Carol:followedby:BillyBob
  • 97. © 2014 MapR Technologies 100 Hierarchical Data • tree like structure • Use a flat-wide – Parents, children in columns usa TN FL Nashville Miami Key P:USA P:TN p:FL c:TN C:FL C:Nashvl C:Miami USA state state TN country city FL country city Nashville state miami state
  • 98. © 2014 MapR Technologies 101 Inheritance mapping • Online Store Example Product table – put sub class type abbreviation in in key prefix for searching – Columns do not all have to be the same for different types Key price title details model Bok+id1 10 Hbase Dvd+Id2 15 stones Kin+Id3 100 fire
  • 99. © 2014 MapR Technologies 102 Agenda • Issues with RDMS and how Hbase helps • Quick overview of HBase Data Model & API • Design considerations when moving from RDBMS to Hbase • How to work around transactions that span multiple rows
  • 100. © 2014 MapR Technologies 103 Summary • Why NoSQL and why HBase • Design considerations when migrating from RDBMS to Hbase • De-normalize data • Column Families & Columns • Versions • RowKey design • Transactions using checkAndPut and Increment for simple use cases
  • 101. © 2014 MapR Technologies 104 References • http://hbase.apache.org/ • http://hbase.apache.org/0.94/apidocs/ • http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/pack age-summary.html • http://hbase.apache.org/book/book.html • http://doc.mapr.com/display/MapR/MapR+Overview • http://doc.mapr.com/display/MapR/M7+-+Native+Storage+for+MapR+Tables • http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop • http://doc.mapr.com/display/MapR/Migrating+Between+Apache+HBase+Ta bles+and+MapR+Tables
  • 102. © 2014 MapR Technologies 105 Q & A @mapr maprtech yourname@mapr.com Engage with us! MapR maprtech mapr-technologies