HBase Data Modeling and Access Patterns with Kite SDK

1
HBase Data Modeling and Access Patterns with Kite SDK
Adam Warrington
Sr. Manager Customer Ops Tools Team

2
Developing on top of Apache Hadoop
©2014 Cloudera, Inc. All rights reserved.
• Apache Hadoop is an incredibly powerful platform on which to
develop data applications.
• Scale
• Provides the infrastructure needed to process big data at scale.
• Flexibility
• General purpose platform on top of which one can build almost any type of
big data application.
• Diverse Ecosystem
• Multitude of storage engines, tools for ETL, machine learning, analysis, and
data science.
• This comes at a cost…

3
Developing on top of Apache Hadoop: The Cost
• The API is very basic and low level.
• Developers are required to build plumbing and
infrastructure to create even a basic system.
• Repeat process for every system you create.
• Have to understand the quirks of each system.
• The barrier to entry is high for many enterprise Java
developers in the industry.

5
What is Kite SDK?
• Kite SDK aims to solve this problem by building a higher level
API on top of the Hadoop ecosystem
• Kite exists as a client-side library for writing Hadoop Data
Applications
• Modular
• Datasets: standard storage
• Morphlines: ETL as configuration
• Data Management Tools
• Today’s talk will focus on the Datasets Module

6
Kite Datasets
• Motivation
• Focus on your data, not managing it
• Goals
• Think in terms of data, not files
• Describe your data and Kite does the right thing
• Consistency - should work across the platform
• Reliability

7
Kite Datasets
At the heart of the Kite Datasets module is a unified storage
interface.
• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset

8
Kite Partition Strategies
PartitionStrategy defines how to map an entity to
partitions in HDFS or row keys in HBase
PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp").build();
/user/hive/warehouse/events
  /year=2014/month=05/day=05
    /FlumeData.1375659013795
    /FlumeData.1375659013796
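To make the mapping from an entity to a partition concrete, here is a minimal sketch of how a year/month/day strategy could turn a timestamp field into the directory layout shown above. It is not Kite's implementation: the class and method names are hypothetical, and it assumes the timestamp is epoch milliseconds interpreted in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Kite SDK).
final class PartitionPathSketch {

    /** Builds a "year=YYYY/month=MM/day=DD" path for an epoch-millisecond timestamp (UTC assumed). */
    static String partitionPath(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
        return String.format(Locale.ROOT, "year=%04d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }
}
```

Under these assumptions, a timestamp of 1399248000000 (midnight UTC on 5 May 2014) maps to year=2014/month=05/day=05, matching the layout on the slide.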

9
Kite Datasets Example
Event.avsc
{
  "type" : "record",
  "name" : "Event",
  "namespace" : "com.example",
  "fields" : [
    { "name": "id", "type": "long" },
    { "name": "timestamp", "type": "long" },
    { "name": "source", "type": "string" }
  ]
}
Log4j Configuration
log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41415
log4j.appender.flume.DatasetRepositoryUri = repo:hive
log4j.appender.flume.DatasetName = events

10
Kite Datasets Example Continued
Dataset Creation
DatasetRepository repo = DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schema(Event.avsc).build();
repo.create("events", descriptor);
Java Code
Logger logger = Logger.getLogger(...);
Event event = new Event();
event.setId(id);
event.setTimestamp(System.currentTimeMillis());
event.setSource(source);
logger.info(event);

11
Kite Datasets Example Continued
Resulting File Layout
/user
  /hive
    /warehouse
      /events
        /FlumeData.1375659013795   (Avro file)
        /FlumeData.1375659013796   (Avro file)

13
HBase Storage Format
HBase storage concepts are fundamentally different
from file formats on HDFS
• Ordered Rows
• Column Families
• Random Access Operations

14
HBase Storage Format
New concepts added to the Dataset API:
• Composite Keys – support for entity ordering with
composite keys
• Column mapping – define how data is split across
column families and columns in a table
• Random Access Dataset Methods– support for Get,
Put, and Delete operations on the Dataset interface

15
Composite Key Engineering
• Properly engineered row keys are crucial for optimizing
HBase scans.
• HBase tables sort using lexicographical ordering of key
byte arrays
• Composite keys are a common use case, but hard to
get right.

16
Composite Key Engineering With Partition Strategies
• We already have a way to split records across storage buckets with a PartitionStrategy.
• Let’s re-use that concept.
• Example: Define a PartitionStrategy optimized for historical web page scans
Website.avsc
{
  "type" : "record",
  "name" : "Website",
  "fields" : [
    { "name": "url", "type": "string" },
    { "name": "content", "type" : "string" }
  ]
}
Partition Strategy Builder
PartitionStrategy p =
    new PartitionStrategy.Builder()
        .identity("url")
        .identity("timestamp")
        .build();

17
Composite Key Engineering With Partition Strategies
Or with the Partition Strategy JSON format
Website.avsc
{
"type" : "record",
"fields" : [
]
}
WebsitePartitionStrat.json
[
  { "source": "url", "type": "id" },
  { "source": "timestamp", "type": "id" }
]

18
Key Memcmp Encoding
• Encode composite key parts so that the serialized byte array
sorts lexicographically by the key fields, in order.
{ "id": 1, "ts": 100, … }  <  { "id": 2, "ts": 50, … }  <  { "id": 2, "ts": 102, … }

19
Key Memcmp Encoding (Integer and Long)
Standard integer and long serialization sorts negative numbers
after positive ones:
Value   Bytes
 1      0x00000001
 0      0x00000000
-1      0xFFFFFFFF
-2      0xFFFFFFFE
So we flip the sign bit when serializing an integer or long:
Value   Bytes
 1      0x80000001
 0      0x80000000
-1      0x7FFFFFFF
-2      0x7FFFFFFE
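The sign-bit flip can be sketched in a few lines of plain Java. This is an illustration of the technique, not Kite's source; the class and method names are hypothetical, and compareBytes stands in for the unsigned byte-by-byte comparison HBase applies to row keys.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only (not part of Kite SDK).
final class SignFlipEncoding {

    /** Big-endian 4-byte encoding with the sign bit flipped. */
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
    }

    /** Big-endian 8-byte encoding with the sign bit flipped. */
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
    }

    /** Reverses encodeInt by flipping the sign bit back. */
    static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ Integer.MIN_VALUE;
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }
}
```

With this encoding, encodeInt(-1) yields 7F FF FF FF and encodeInt(1) yields 80 00 00 01, so byte order and numeric order agree across the sign boundary.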

20
Key Memcmp Encoding (Variable Length Types)
Binary Avro encoding is length-prefixed. This can sort composite
keys wrong.
Value1   Value2   Bytes
"foo"    "bar"    \x03foo\x03bar
"foo"    "zr"     \x03foo\x02zr
"zo"     "bar"    \x02zo\x03bar
So we end strings with a terminating character instead.
Value1   Value2   Bytes
"foo"    "bar"    foo\x00bar\x00
"foo"    "zr"     foo\x00zr\x00
"zo"     "bar"    zo\x00bar\x00

21
Key Memcmp Encoding (Variable Length Types)
• How do we handle a \x00 byte present in the variable length type?
• Convert each \x00 byte to \x00\x01, and use \x00\x00 as the terminating
sequence.
Value1      Value2   Bytes
"foo"       "bar"    foo\x00\x00bar\x00\x00
"foo\x00"   "aa"     foo\x00\x01\x00\x00aa\x00\x00
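The escape-and-terminate scheme for variable length key parts can be sketched as follows. This is a minimal illustration under the rules stated on the slide (escape 0x00 as 0x00 0x01, terminate with 0x00 0x00); the class and method names are hypothetical and the code is not taken from Kite.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of Kite SDK).
final class TerminatedStringEncoding {

    /** Appends one string key part: UTF-8 bytes, 0x00 escaped as 0x00 0x01, then a 0x00 0x00 terminator. */
    static void appendString(ByteArrayOutputStream out, String value) {
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            out.write(b);
            if (b == 0) {
                out.write(1); // escape an embedded zero byte
            }
        }
        out.write(0);
        out.write(0);
    }

    /** Encodes a composite key made of string parts, in order. */
    static byte[] encodeKey(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : parts) {
            appendString(out, part);
        }
        return out.toByteArray();
    }

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }
}
```

Because the terminator 0x00 0x00 sorts below the escape 0x00 0x01, a key whose first part is "foo" always sorts before one whose first part is "foo" followed by a zero byte, whatever the second part contains.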

22
Column Mappings
Defines how an Avro record’s fields are mapped to an
HBase table row.
Mapping Type   Description
column         Maps a record field value directly to a column
counter        Similar to column, except supports atomic increment
keyAsColumn    Maps key/value field types to a column family where each key
               entry is a column qualifier and value entry is the cell value.
key            Record field's value is part of the composite key
occVersion     Enables optimistic concurrency control on the dataset.

23
Column Mappings: Header Definition
Event.avsc
{
  "type" : "record",
  "name" : "Event",
  "mapping": [
    { "source": "id", "type": "key" },
    { "source": "ts", "type": "key" },
    { "source": "source", "type": "column", "value": "meta:source" },
    { "source": "atts", "type": "keyAsColumn", "value": "atts:" }
  ],
  "fields" : [
    { "name" : "id", "type" : "long" },
    { "name" : "ts", "type" : "long" },
    { "name" : "source", "type" : "string" },
    { "name" : "atts",
      "type": { "type": "map", "values": "string" } }
  ]
}
• Mapping definition attribute
can be added right to the Avro
record schema
• Still a valid Avro schema –
Avro’s schema parser will
ignore unknown attributes in
record header.

24
Column Mappings: Field Definition
Event.avsc
{
  "type" : "record",
  "name" : "Event",
  "fields" : [
    { "name": "id", "type": "long", "mapping": { "type": "key" }},
    { "name": "ts", "type" : "long", "mapping": { "type": "key" }},
    { "name": "source", "type": "string",
      "mapping": { "type": "column", "value": "meta:source" }},
    { "name": "atts",
      "type": { "type": "map", "values": "string" },
      "mapping": { "type": "keyAsColumn", "value": "atts:" }}
  ]
}
• Mapping definition attributes
can be defined directly on the
Avro schema fields.
• Still a valid Avro schema –
Avro’s schema parser will
ignore unknown attributes on
fields.

25
Column Mappings: External Definition
Event.avsc
{
  "type" : "record",
  "name" : "Event",
  "fields" : [
    { "name": "id", "type": "long" },
    { "name": "ts", "type" : "long" },
    { "name": "source", "type": "string" },
    { "name": "atts",
      "type": { "type": "map", "values": "string" }}
  ]
}
• Mapping definition attributes
can be defined in an external
file.
• Perfect if you don’t want to
update existing Avro schemas.
EventMapping.json
[
  { "source": "atts", "type": "keyAsColumn", "value": "atts:" }
]

26
Column Mapping Types: “column”
• Maps a field to a fully qualified column
• Fields are serialized using Avro binary encoding, except…
• Integer serialized as 4 byte int
• Long serialized as 8 byte long
• String serialized as UTF8 bytes
• This allows atomic increment and append on these
types, which length-prefixed and zig-zag encodings
would not.
Row Key                   | Column Family: meta | Column Family: atts
Key Part 1 | Key Part 2   | Qualifier: source   | Qualifier: ip | Qualifier: level
1          | 1396322485   | server1             | 192.168.0.100 | ERROR
Event Instance:
{
  "id": 1,
  "ts": 1396322485,
  "source": "server1",
  "atts": {
    "ip": "192.168.0.100",
    "level": "ERROR"
  }
}
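To see why the "column" mapping avoids Avro's zig-zag encoding for numbers, the sketch below compares the two encodings of a long. The zig-zag routine follows the variable-length scheme described in the Avro specification; the class and method names are hypothetical, and the fixed-width form is simply a big-endian 8-byte long, which is the layout an HBase increment operates on.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only (not part of Kite SDK or Avro).
final class LongEncodings {

    /** Zig-zag variable-length encoding of a long, as in Avro binary encoding. */
    static byte[] zigZagVarint(long value) {
        long n = (value << 1) ^ (value >> 63); // zig-zag: small magnitudes become small codes
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80)); // low 7 bits plus continuation flag
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    /** Fixed-width big-endian 8-byte encoding. */
    static byte[] fixedLong(long value) {
        return ByteBuffer.allocate(8).putLong(value).array();
    }
}
```

With zig-zag encoding, 63 occupies one byte and 64 occupies two, so an in-place increment would have to change the cell's length. The fixed-width form is always 8 bytes, which is what makes a server-side atomic increment straightforward.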

27
Column Mapping Types: “keyAsColumn”
• Allowed for Map and Record types
• Splits apart a Map by its entries, using keys as the
qualifier, and storing values in the cell.
• Splits apart a Record by its fields, using field names
as the qualifier, and storing the values in the cell.
• Fields serialized using Avro’s binary encoding
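• Enables a pattern for atomic updates to individual entries of the
keyAsColumn field.
Row Key                   | Column Family: meta | Column Family: atts
Key Part 1 | Key Part 2   | Qualifier: source   | Qualifier: ip | Qualifier: level
1          | 1396322485   | server1             | 192.168.0.100 | ERROR
Event Instance:
{
  "id": 1,
  "ts": 1396322485,
  "source": "server1",
  "atts": {
    "ip": "192.168.0.100",
    "level": "ERROR"
  }
}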
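The way a keyAsColumn mapping spreads a map across column qualifiers can be sketched like this. It is a simplified illustration, not Kite's code: the class and method names are hypothetical, and the values stay as plain strings here, whereas the slide states that Kite serializes them with Avro's binary encoding.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only (not part of Kite SDK).
final class KeyAsColumnSketch {

    /**
     * Expands map entries into "family:qualifier" -> value cells.
     * familyPrefix is the mapping's "value" attribute, for example "atts:".
     */
    static Map<String, String> toCells(String familyPrefix, Map<String, String> entries) {
        Map<String, String> cells = new TreeMap<>(); // sorted, like qualifiers within a row
        for (Map.Entry<String, String> e : entries.entrySet()) {
            cells.put(familyPrefix + e.getKey(), e.getValue());
        }
        return cells;
    }
}
```

For the event instance above, the atts map would produce two cells, atts:ip and atts:level, each of which can then be written or updated on its own.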

28
Column Mapping Types: “key”
• Allowed for simple types – int, long, float, double,
boolean, string, bytes
• Can be defined on multiple fields to support
multi-part keys
• Rows are ordered lexicographically by key
mapping fields in the order they are defined
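Row Key                   | Column Family: meta | Column Family: atts
Key Part 1 | Key Part 2   | Qualifier: source   | Qualifier: ip | Qualifier: level
1          | 1396322485   | server1             | 192.168.0.100 | ERROR
Event Instance:
{
  "id": 1,
  "ts": 1396322485,
  "source": "server1",
  "atts": {
    "ip": "192.168.0.100",
    "level": "ERROR"
  }
}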

29
RandomAccessDataset
Adds a number of methods to the Dataset interface for
random access operations.
public E get(Key key);
public boolean put(E entity);
public long increment(Key key, String fieldName, long amount);
public void delete(Key key);

30
Random Access Dataset Example
Website.avsc
{
  "type" : "record",
  "fields" : [
    { "name": "size", "type": "int" },
  ]
}
WebsitesPartitionStrat.json
[
  { "source": "url", "type": "id" }
]
WebsiteVersionsPartitionStrat.json
[
  { "source": "url", "type": "id" },
  { "source": "timestamp", "type": "id" }
]
WebsiteColumnMapping.json
[
  { "source": "url", "type": "column",
    "value": "meta:url" },
  { "source": "timestamp", "type": "column",
    "value": "meta:timestamp" },
  { "source": "size", "type": "column",
    "value": "meta:size" },
  { "source": "content", "type": "column",
    "value": "content:content" }
]

31
Random Access Dataset Example
private RandomAccessDataset<Website> websitesDataset = …;
private RandomAccessDataset<Website> websiteVersionsDataset = …;

public void calculateNextFetch(String url) {
    // Random access: fetch the single Website row for this URL.
    Key key = new Key.Builder(websitesDataset).add("url", url).build();
    Website website = websitesDataset.get(key);
    // Scan: read every stored version of this URL.
    DatasetReader<Website> websiteVersionReader =
        websiteVersionsDataset.with("url", url).newReader();
    long ts = computeNextFetchTime(websiteVersionReader);
    website.setNextFetchTime(ts);
    websitesDataset.put(website);
}

32
Kite HBase Module
Advanced Features

33
Concurrency Control
• HBase doesn’t have native support for transactions.
• This gap often trips up newcomers.
• Single-row Puts are atomic, so the best practice is to prefer
denormalizing data into wide rows.
• This doesn’t help for Get-Update-Put operations though…

34
Optimistic Concurrency Control
• Prevents multiple processes performing row updates from colliding
• Enabled with an "occVersion" column mapping type.
{
  "type" : "record",
  "name" : "Event",
  "mapping": [
    { "source": "version", "type": "occVersion" }
  ],
  "fields" : [
    { "name" : "id", "type" : "long" },
    { "name" : "ts", "type" : "long" },
    { "name" : "source", "type" : "string" },
    { "name" : "version", "type" : "long" }
  ]
}

35
Optimistic Concurrency Control Continued…
• The version field is used to track the version in the row.
• Uses checkAndPut under the hood to ensure the row hasn’t been updated.
• Can’t put to an existing row without first fetching it.
• If conflict occurs, put() on RandomAccessDataset will return false.
• Successful put() increments the version.
• Up to the developer how to handle a conflict.
• Enables data protection for long running edits, like shared editing in a web
application.
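The version check behind optimistic concurrency control can be sketched with an in-memory map standing in for the HBase table. This is an assumption-laden illustration, not Kite's or HBase's API: OccSketch, Row, and the put signature are hypothetical, and ConcurrentHashMap.compute plays the role that checkAndPut plays on a real table.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in, for illustration only (not Kite or HBase API).
final class OccSketch {

    /** A stored row: a payload plus the version used for conflict detection. */
    static final class Row {
        final String value;
        final long version;

        Row(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final ConcurrentMap<String, Row> table = new ConcurrentHashMap<>();

    Row get(String key) {
        return table.get(key);
    }

    /**
     * Writes newValue only if the stored version still equals expectedVersion
     * (0 means "row must not exist yet"). Returns false on conflict.
     */
    boolean put(String key, long expectedVersion, String newValue) {
        boolean[] written = {false};
        table.compute(key, (k, current) -> {
            long currentVersion = (current == null) ? 0L : current.version;
            if (currentVersion != expectedVersion) {
                return current; // someone else updated the row: leave it untouched
            }
            written[0] = true;
            return new Row(newValue, currentVersion + 1); // a successful put bumps the version
        });
        return written[0];
    }
}
```

If two clients both read a row at version 1 and both try to write, the first put succeeds and moves the row to version 2, and the second returns false. The losing client then decides whether to re-read and retry or to report the conflict.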

36
Other Notable Advanced Features
• Schema Migrations
• Users have the ability to add or remove fields from the Avro record
schemas.
• Kite SDK keeps the historical set of Avro schemas in a specially designated
HBase table.
• Kite SDK will verify that only valid schema migrations can occur.
• Composite Datasets
• Users can create multiple datasets for a single HBase table.
• This allows developers to atomically Get and Put multiple types of Avro
records to a single row.
• Kite SDK will verify that dataset column mappings don’t clash.
