1
HBase Data Modeling and Access Patterns with Kite SDK
Adam Warrington
Sr. Manager Customer Ops Tools Team
2
Developing on top of Apache Hadoop
©2014 Cloudera, Inc. All rights reserved.
• Apache Hadoop is an incredibly powerful platform on which to
develop data applications.
• Scale
• Provides the infrastructure needed to process big data at scale.
• Flexibility
• General purpose platform on top of which one can build almost any type of
big data application.
• Diverse Ecosystem
• Multitude of storage engines, tools for ETL, machine learning, analysis, and
data science.
• This comes at a cost…
3
Developing on top of Apache Hadoop: The Cost
• The API is very basic and low level.
• Developers are required to build plumbing and
infrastructure to create even a basic system.
• Repeat process for every system you create.
• Have to understand the quirks of each system.
• The barrier to entry is high for many enterprise Java
developers in the industry.
4
What is Kite SDK?
• Kite SDK aims to solve this problem by building a higher level
API on top of the Hadoop ecosystem
• Kite exists as a client-side library for writing Hadoop Data
Applications
• Modular
• Datasets: standard storage
• Morphlines: ETL as configuration
• Data Management Tools
• Today’s talk will focus on the Datasets Module
6
Kite Datasets
• Motivation
• Focus on your data, not managing it
• Goals
• Think in terms of data, not files
• Describe your data and Kite does the right thing
• Consistency - should work across the platform
• Reliability
7
Kite Datasets
At the heart of the Kite Datasets module is a unified storage
interface.
• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset
8
Kite Partition Strategies
PartitionStrategy defines how to map an entity to
partitions in HDFS or row keys in HBase
PartitionStrategy p = new PartitionStrategy.Builder()
.year("timestamp")
.month("timestamp")
.day("timestamp").build();
/user/hive/warehouse/events
/year=2014/month=05/day=05
/FlumeData.1375659013795
/FlumeData.1375659013796
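To make the mapping from entity to partition concrete, here is a minimal sketch of how a year/month/day strategy could turn a timestamp into the directory shown above. It does not call the Kite API: the class DatePartitionPath and the method partitionPath are hypothetical names, and interpreting the timestamp in UTC is an assumption made for this example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper, not part of Kite: derives a year/month/day
// partition directory from an entity's timestamp field.
final class DatePartitionPath {

    // Builds "<root>/year=YYYY/month=MM/day=DD" for a millisecond timestamp.
    // Assumption: the timestamp is interpreted in UTC.
    static String partitionPath(String root, long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "%s/year=%d/month=%02d/day=%02d",
                root, t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        long ts = ZonedDateTime.of(2014, 5, 5, 12, 0, 0, 0, ZoneOffset.UTC)
                .toInstant().toEpochMilli();
        System.out.println(partitionPath("/user/hive/warehouse/events", ts));
    }
}
```

Every entity whose timestamp falls on the same UTC day lands in the same directory, which is what lets a reader skip whole partitions when scanning a date range.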
9
Kite Datasets Example
Event.avsc
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "timestamp", "type": "long" },
    { "name": "source", "type": "string" }
  ]
}
Log4j Configuration
log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41415
log4j.appender.flume.DatasetRepositoryUri = repo:hive
log4j.appender.flume.DatasetName = events
10
Kite Datasets Example Continued
Dataset Creation
DatasetRepository repo = DatasetRepositories.open("repo:hive");
// Load the Avro schema; assumes Event.avsc is available on the classpath
DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
    .schemaUri("resource:Event.avsc").build();
repo.create("events", descriptor);
Java Code
Logger logger = Logger.getLogger(...);
Event event = new Event();
event.setId(id);
event.setTimestamp(System.currentTimeMillis());
event.setSource(source);
logger.info(event);
11
Kite Datasets Example Continued
Resulting File Layout
/user/hive/warehouse/events
    /FlumeData.1375659013795    (Avro file)
    /FlumeData.1375659013796    (Avro file)
12
Kite HBase Module
Overview
13
HBase Storage Format
HBase storage concepts are fundamentally different
from file formats on HDFS
• Ordered Rows
• Column Families
• Random Access Operations
14
HBase Storage Format
New concepts added to the Dataset API:
• Composite Keys – support for entity ordering with
composite keys
• Column mapping – define how data is split across
column families and columns in a table
• Random Access Dataset Methods – support for Get,
Put, and Delete operations on the Dataset interface
15
Composite Key Engineering
• Properly engineered row keys are crucial for optimizing
HBase scans.
• HBase tables sort rows by the lexicographical ordering of their key
byte arrays.
• Composite keys are a common use case, but hard to
get right.
16
Composite Key Engineering With Partition Strategies
• We already have a way to split records across storage buckets with a PartitionStrategy.
• Let’s re-use that concept.
• Example: Define a PartitionStrategy optimized for historical web page scans
Website.avsc
{
  "type": "record",
  "name": "Website",
  "namespace": "com.example",
  "fields": [
    { "name": "url", "type": "string" },
    { "name": "timestamp", "type": "long" },
    { "name": "content", "type": "string" }
  ]
}
Partition Strategy Builder
PartitionStrategy p =
    new PartitionStrategy.Builder()
        .identity("url")
        .identity("timestamp")
        .build();
17
Composite Key Engineering With Partition Strategies
Or with the Partition Strategy JSON format
Website.avsc
{
  "type": "record",
  "name": "Website",
  "namespace": "com.example",
  "fields": [
    { "name": "url", "type": "string" },
    { "name": "timestamp", "type": "long" },
    { "name": "content", "type": "string" }
  ]
}
WebsitePartitionStrat.json
[
  { "source": "url", "type": "id" },
  { "source": "timestamp", "type": "id" }
]
18
Key Memcmp Encoding
• Encode composite key parts so serialized byte array
will sort lexicographically by key fields in order.
{ "id": 1, "ts": 100, … } < { "id": 2, "ts": 50, … } < { "id": 2, "ts": 102, … }
19
Key Memcmp Encoding (Integer and Long)
Standard integer and long serialization sorts negative numbers after positive ones:
Value    Bytes
1        0x00000001
0        0x00000000
-1       0xFFFFFFFF
-2       0xFFFFFFFE
So we flip the sign bit when serializing an integer or long:
Value    Bytes
1        0x80000001
0        0x80000000
-1       0x7FFFFFFF
-2       0x7FFFFFFE
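The sign-bit flip can be sketched in a few lines of plain Java. This is an illustration of the technique rather than Kite's own encoder: the class SignFlipEncoding and its methods are hypothetical names, and compareBytes simply mimics the unsigned, left-to-right byte comparison that HBase applies to row keys.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration, not Kite's encoder.
final class SignFlipEncoding {

    // Serializes an int as 4 big-endian bytes with the sign bit flipped.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value ^ 0x80000000).array();
    }

    // Reverses encodeInt by flipping the sign bit back.
    static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ 0x80000000;
    }

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        int[] values = {-2, -1, 0, 1};
        for (int v : values) {
            // Print each value next to its encoded form in hexadecimal.
            System.out.printf("%d -> 0x%08X%n", v, ByteBuffer.wrap(encodeInt(v)).getInt());
        }
    }
}
```

With the flip applied, comparing the encoded bytes gives the same order as comparing the original numbers, so negative values sort before zero and positive values. The same idea extends to long by flipping the top bit of an 8-byte value.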
20
Key Memcmp Encoding (Variable Length Types)
Binary Avro encoding is length prefixed. This can sort composite keys incorrectly:
Value1   Value2   Bytes
"foo"    "bar"    \x03foo\x03bar
"foo"    "zr"     \x03foo\x02zr
"zo"     "bar"    \x02zo\x03bar
So we end strings with a terminating character instead:
Value1   Value2   Bytes
"foo"    "bar"    foo\x00bar\x00
"foo"    "zr"     foo\x00zr\x00
"zo"     "bar"    zo\x00bar\x00
21
Key Memcmp Encoding (Variable Length Types)
• How do we handle a \x00 byte present in the variable length type?
• Convert each \x00 byte to \x00\x01, and use \x00\x00 as the terminating sequence.
Value1     Value2   Bytes
"foo"      "bar"    foo\x00\x00bar\x00\x00
"foo\x00"  "aa"     foo\x00\x01\x00\x00aa\x00\x00
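The escaping rule above can be sketched as follows. This is a plain-Java illustration under stated assumptions, not Kite's implementation: the class TerminatedKeyEncoding and its methods are hypothetical names, and key parts are assumed to be strings serialized as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Kite's encoder.
final class TerminatedKeyEncoding {

    // Writes one key part: every 0x00 byte becomes 0x00 0x01,
    // and the part is closed with the terminator 0x00 0x00.
    static void writePart(ByteArrayOutputStream out, String part) {
        for (byte b : part.getBytes(StandardCharsets.UTF_8)) {
            out.write(b);
            if (b == 0) {
                out.write(1);
            }
        }
        out.write(0);
        out.write(0);
    }

    // Encodes the parts of a composite key one after another.
    static byte[] encodeKey(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : parts) {
            writePart(out, part);
        }
        return out.toByteArray();
    }

    // Unsigned lexicographic comparison, the order HBase uses for row keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] first = encodeKey("foo", "zr");
        byte[] second = encodeKey("foo\0", "aa");
        // A negative result means the first key sorts before the second.
        System.out.println(compareBytes(first, second));
    }
}
```

Because the terminator 0x00 0x00 is smaller than the escaped form 0x00 0x01 and smaller than any ordinary byte, a shorter part always sorts before a longer part that shares its prefix, and the following key part only matters when the earlier parts are equal.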
22
Column Mappings
Defines how an Avro record’s fields are mapped to an
HBase table row.
Mapping Type    Description
column          Maps a record field value directly to a column.
counter         Similar to column, except it supports atomic increment.
keyAsColumn     Maps key/value field types to a column family, where each key entry is a column qualifier and each value entry is the cell value.
key             The record field’s value is part of the composite key.
occVersion      Enables optimistic concurrency control on the dataset.
23
Column Mappings: Header Definition
Event.avsc
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "mapping": [
    { "source": "id", "type": "key" },
    { "source": "ts", "type": "key" },
    { "source": "source", "type": "column", "value": "meta:source" },
    { "source": "atts", "type": "keyAsColumn", "value": "atts:" }
  ],
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "ts", "type": "long" },
    { "name": "source", "type": "string" },
    { "name": "atts", "type": { "type": "map", "values": "string" } }
  ]
}
• Mapping definition attribute
can be added right to the Avro
record schema
• Still a valid Avro schema –
Avro’s schema parser will
ignore unknown attributes in
record header.
24
Column Mappings: Field Definition
Event.avsc
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    { "name": "id", "type": "long", "mapping": { "type": "key" } },
    { "name": "ts", "type": "long", "mapping": { "type": "key" } },
    { "name": "source", "type": "string",
      "mapping": { "type": "column", "value": "meta:source" } },
    { "name": "atts",
      "type": { "type": "map", "values": "string" },
      "mapping": { "type": "keyAsColumn", "value": "atts:" } }
  ]
}
• Mapping definition attributes
can be defined directly on the
Avro schema fields.
• Still a valid Avro schema –
Avro’s schema parser will
ignore unknown attributes on
fields.
25
Column Mappings: External Definition
Event.avsc
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "ts", "type": "long" },
    { "name": "source", "type": "string" },
    { "name": "atts", "type": { "type": "map", "values": "string" } }
  ]
}
• Mapping definition attributes
can be defined in an external
file.
• Perfect if you don’t want to
update existing Avro schemas.
EventMapping.json
[
  { "source": "id", "type": "key" },
  { "source": "ts", "type": "key" },
  { "source": "source", "type": "column", "value": "meta:source" },
  { "source": "atts", "type": "keyAsColumn", "value": "atts:" }
]
26
Column Mapping Types: “column”
• Maps a field to a fully qualified column
• Fields serialized using Avro binary encoding except…
• Integer serialized as 4 byte int
• Long serialized as 8 byte long
• String serialized as UTF8 bytes
• Allows atomic increment and append on these
types, which length-prefixed and zig-zag encodings
would not allow.
Row key: Key Part 1 = 1, Key Part 2 = 1396322485
Column family meta: qualifier source = server1
Column family atts: qualifier ip = 192.168.0.100, qualifier level = ERROR
Event Instance:
{
  "id": 1,
  "ts": 1396322485,
  "source": "server1",
  "atts": {
    "ip": "192.168.0.100",
    "level": "ERROR"
  }
}
27
Column Mapping Types: “keyAsColumn”
• Allowed for Map and Record types
• Splits apart a Map by its entries, using keys as the
qualifier, and storing values in the cell.
• Splits apart a Record by its fields, using field names
as the qualifier, and storing the values in the cell.
• Fields serialized using Avro’s binary encoding
• Allows pattern for atomic updates to the
keyAsColumn field.
Row key: Key Part 1 = 1, Key Part 2 = 1396322485
Column family meta: qualifier source = server1
Column family atts: qualifier ip = 192.168.0.100, qualifier level = ERROR
Event Instance:
{
  "id": 1,
  "ts": 1396322485,
  "source": "server1",
  "atts": {
    "ip": "192.168.0.100",
    "level": "ERROR"
  }
}
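The way keyAsColumn spreads a map across a column family can be sketched as below. This is only an illustration of the idea: the class KeyAsColumnSketch and the method toCells are hypothetical names, and cell values are kept as plain strings here, whereas the slide states that Kite serializes them with Avro's binary encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Kite's mapper.
final class KeyAsColumnSketch {

    // Turns each map entry into one cell addressed as "family:qualifier".
    // Assumption: values stay as strings instead of Avro binary.
    static Map<String, String> toCells(String family, Map<String, String> field) {
        Map<String, String> cells = new TreeMap<>();
        for (Map.Entry<String, String> entry : field.entrySet()) {
            cells.put(family + ":" + entry.getKey(), entry.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> atts = new LinkedHashMap<>();
        atts.put("ip", "192.168.0.100");
        atts.put("level", "ERROR");
        // Each entry of the "atts" field becomes its own column in family "atts".
        System.out.println(toCells("atts", atts));
    }
}
```

Since every entry lives in its own cell, a single entry can be written or replaced without reading and rewriting the whole map, which is the update pattern the slide refers to.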
28
Column Mapping Types: “key”
• Allowed for simple types – int, long, float, double,
boolean, string, bytes
• Can be defined on multiple fields to support
multi-part keys
• Rows are ordered lexicographically by key
mapping fields in the order they are defined
Row key: Key Part 1 = 1, Key Part 2 = 1396322485
Column family meta: qualifier source = server1
Column family atts: qualifier ip = 192.168.0.100, qualifier level = ERROR
Event Instance:
{
  "id": 1,
  "ts": 1396322485,
  "source": "server1",
  "atts": {
    "ip": "192.168.0.100",
    "level": "ERROR"
  }
}
29
RandomAccessDataset
Adds a number of methods to the Dataset interface for
random access operations.
public E get(Key key);
public boolean put(E entity);
public long increment(Key key, String fieldName, long amount);
public void delete(Key key);
30
Random Access Dataset Example
Website.avsc
{
  "type": "record",
  "name": "Website",
  "namespace": "com.example",
  "fields": [
    { "name": "url", "type": "string" },
    { "name": "timestamp", "type": "long" },
    { "name": "size", "type": "int" },
    { "name": "content", "type": "string" }
  ]
}
WebsitesPartitionStrat.json
[
  { "source": "url", "type": "id" }
]
WebsiteVersionsPartitionStrat.json
[
  { "source": "url", "type": "id" },
  { "source": "timestamp", "type": "id" }
]
WebsiteColumnMapping.json
[
  { "source": "url", "type": "column", "value": "meta:url" },
  { "source": "timestamp", "type": "column", "value": "meta:timestamp" },
  { "source": "size", "type": "column", "value": "meta:size" },
  { "source": "content", "type": "column", "value": "content:content" }
]
31
Random Access Dataset Example
private RandomAccessDataset<Website> websitesDataset = …;
private RandomAccessDataset<Website> websiteVersionsDataset = …;

public void calculateNextFetch(String url) {
    // Random-access lookup of the current record for this URL
    Key key = new Key.Builder(websitesDataset).add("url", url).build();
    Website website = websitesDataset.get(key);
    // Scan every stored version of this URL
    DatasetReader<Website> websiteVersionReader =
        websiteVersionsDataset.with("url", url).newReader();
    long ts = computeNextFetchTime(websiteVersionReader);
    website.setNextFetchTime(ts);
    websitesDataset.put(website);
}
32
Kite HBase Module
Advanced Features
33
Concurrency Control
• HBase doesn’t have native support for transactions.
• This missing feature can be problematic for newcomers.
• Single-row Puts are atomic, so best practice is to prefer
denormalizing data into wide rows.
• This doesn’t help for Get-Update-Put operations though…
34
Optimistic Concurrency Control
• Prevents multiple processes performing row updates from colliding
• Enabled with an “occVersion” column mapping type.
{
  "type": "record",
  "name": "Event",
  "namespace": "com.example",
  "mapping": [
    { "source": "id", "type": "key" },
    { "source": "ts", "type": "key" },
    { "source": "source", "type": "column", "value": "meta:source" },
    { "source": "version", "type": "occVersion" }
  ],
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "ts", "type": "long" },
    { "name": "source", "type": "string" },
    { "name": "version", "type": "long" }
  ]
}
35
Optimistic Concurrency Control Continued…
• The version field is used to track the version in the row.
• Uses checkAndPut under the hood to ensure the row hasn’t been updated.
• Can’t put to an existing row without first fetching it.
• If conflict occurs, put() on RandomAccessDataset will return false.
• Successful put() increments the version.
• Up to the developer how to handle a conflict.
• Enables data protection for long running edits, like shared editing in a web
application.
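The version check described above can be sketched with an in-memory map standing in for the HBase table. This is not Kite or HBase code: the class OccStoreSketch, its Row type, and the convention that a new row is written with expected version 0 and stored as version 1 are all assumptions made for illustration. The atomic replace call plays the role that checkAndPut plays in HBase.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a versioned HBase row store.
final class OccStoreSketch {

    // An immutable row value together with the version it was stored at.
    static final class Row {
        final String value;
        final long version;

        Row(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final ConcurrentMap<String, Row> rows = new ConcurrentHashMap<>();

    Row get(String key) {
        return rows.get(key);
    }

    // Writes only if the stored version still equals expectedVersion.
    // Returns false on a conflict, mirroring put() returning false.
    boolean put(String key, String value, long expectedVersion) {
        Row current = rows.get(key);
        if (current == null) {
            // Assumption: a brand-new row is written with expected version 0.
            return expectedVersion == 0
                    && rows.putIfAbsent(key, new Row(value, 1)) == null;
        }
        if (current.version != expectedVersion) {
            return false;
        }
        // Atomic compare-and-swap on the exact row instance that was read.
        return rows.replace(key, current, new Row(value, expectedVersion + 1));
    }

    public static void main(String[] args) {
        OccStoreSketch store = new OccStoreSketch();
        store.put("row1", "first", 0);
        Row fetched = store.get("row1");
        // A writer holding a stale version is rejected instead of overwriting.
        System.out.println(store.put("row1", "second", fetched.version));
        System.out.println(store.put("row1", "third", fetched.version));
    }
}
```

A caller that receives false would typically re-fetch the row, reapply its change, and retry, or report the conflict to the user, which is the decision the slide leaves to the developer.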
36
Other Notable Advanced Features
• Schema Migrations
• Users have the ability to add or remove fields from the Avro record
schemas.
• Kite SDK keeps the historical set of Avro schemas in a specially designated
HBase table.
• Kite SDK will verify that only valid schema migrations can occur.
• Composite Datasets
• Users can create multiple datasets for a single HBase table.
• This allows developers to atomically Get and Put multiple types of Avro
records to a single row.
• Kite SDK will verify that dataset column mappings don’t clash.
37
Adam Warrington
@adamwar

More Related Content

What's hot

Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerHBaseCon
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementDataWorks Summit
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestCloudera, Inc.
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationSchubert Zhang
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster Cloudera, Inc.
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHBaseCon
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...Cloudera, Inc.
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightHBaseCon
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaCloudera, Inc.
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini Cloudera, Inc.
 
Digital Library Collection Management using HBase
Digital Library Collection Management using HBaseDigital Library Collection Management using HBase
Digital Library Collection Management using HBaseHBaseCon
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon
 

What's hot (20)

Apache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at CernerApache HBase in the Enterprise Data Hub at Cerner
Apache HBase in the Enterprise Data Hub at Cerner
 
Rigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance MeasurementRigorous and Multi-tenant HBase Performance Measurement
Rigorous and Multi-tenant HBase Performance Measurement
 
HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
 
HBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance EvaluationHBase 0.20.0 Performance Evaluation
HBase 0.20.0 Performance Evaluation
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
 
HBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and SparkHBaseCon 2015: HBase and Spark
HBaseCon 2015: HBase and Spark
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
 
NoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBaseNoSQL: Cassadra vs. HBase
NoSQL: Cassadra vs. HBase
 
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini HBaseCon 2012 | HBase, the Use Case in eBay Cassini
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
 
Digital Library Collection Management using HBase
Digital Library Collection Management using HBaseDigital Library Collection Management using HBase
Digital Library Collection Management using HBase
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 

Viewers also liked

Real-time HBase: Lessons from the Cloud
Real-time HBase: Lessons from the CloudReal-time HBase: Lessons from the Cloud
Real-time HBase: Lessons from the CloudHBaseCon
 
Kite SDK: Working with Datasets
Kite SDK: Working with DatasetsKite SDK: Working with Datasets
Kite SDK: Working with DatasetsCloudera, Inc.
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Apache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at XiaomiApache HBase Improvements and Practices at Xiaomi
Apache HBase Improvements and Practices at XiaomiHBaseCon
 
Apache HBase at Airbnb
Apache HBase at Airbnb Apache HBase at Airbnb
Apache HBase at Airbnb HBaseCon
 
HBase: Just the Basics
HBase: Just the BasicsHBase: Just the Basics
HBase: Just the BasicsHBaseCon
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme MakeoverHBaseCon
 
Breaking the Sound Barrier with Persistent Memory
Breaking the Sound Barrier with Persistent Memory Breaking the Sound Barrier with Persistent Memory
Breaking the Sound Barrier with Persistent Memory HBaseCon
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseHBaseCon
 
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving
HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web ArchivingHBaseCon
 
HBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon 2015: HBase Operations in a Flurry
HBaseCon 2015: HBase Operations in a FlurryHBaseCon
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...HBaseCon
 
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
 
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBaseHBaseCon 2015: Blackbird Collections - In-situ  Stream Processing in HBase
HBaseCon 2015: Blackbird Collections - In-situ Stream Processing in HBaseHBaseCon
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiHBaseCon
 
HBase at Xiaomi
HBase at XiaomiHBase at Xiaomi
HBase at XiaomiHBaseCon
 
Rolling Out Apache HBase for Mobile Offerings at Visa
Rolling Out Apache HBase for Mobile Offerings at Visa Rolling Out Apache HBase for Mobile Offerings at Visa
Rolling Out Apache HBase for Mobile Offerings at Visa HBaseCon
 
HBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon 2015: Solving HBase Performance Problems with Apache HTrace
HBaseCon 2015: Solving HBase Performance Problems with Apache HTraceHBaseCon
 
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems HBaseCon 2013: Real-Time Model Scoring in Recommender Systems
HBaseCon 2013: Real-Time Model Scoring in Recommender Systems Cloudera, Inc.
 

Viewers also liked (20)

Real-time HBase: Lessons from the Cloud
Real-time HBase: Lessons from the CloudReal-time HBase: Lessons from the Cloud
Real-time HBase: Lessons from the Cloud
 
Kite SDK: Working with Datasets
Kite SDK: Working with DatasetsKite SDK: Working with Datasets
Kite SDK: Working with Datasets
 
HBase Data Modeling and Access Patterns with Kite SDK

  • 1. HBase Data Modeling and Access Patterns with Kite SDK
    • Adam Warrington, Sr. Manager, Customer Ops Tools Team
  • 2. Developing on top of Apache Hadoop (©2014 Cloudera, Inc. All rights reserved.)
    • Apache Hadoop is an incredibly powerful platform on which to develop data applications.
      • Scale: it provides the infrastructure needed to process big data at scale.
      • Flexibility: a general-purpose platform on top of which one can build almost any type of big data application.
      • Diverse ecosystem: a multitude of storage engines and tools for ETL, machine learning, analysis, and data science.
    • This comes at a cost…
  • 3. Developing on top of Apache Hadoop: The Cost
    • The API is very basic and low level.
    • Developers are required to build plumbing and infrastructure to create even a basic system.
    • The process has to be repeated for every system you create.
    • You have to understand the quirks of each system.
    • The barrier to entry is high for many enterprise Java developers in the industry.
  • 5. What is Kite SDK?
    • Kite SDK aims to solve this problem by building a higher-level API on top of the Hadoop ecosystem.
    • Kite exists as a client-side library for writing Hadoop data applications.
    • Modular:
      • Datasets: standard storage
      • Morphlines: ETL as configuration
      • Data management tools
    • Today's talk will focus on the Datasets module.
  • 6. Kite Datasets
    • Motivation
      • Focus on your data, not managing it
    • Goals
      • Think in terms of data, not files
      • Describe your data and Kite does the right thing
      • Consistency: should work across the platform
      • Reliability
  • 7. Kite Datasets
    • At the heart of the Kite Datasets module is a unified storage interface.
      • Dataset: a collection of entities
      • DatasetRepository: physical storage location for datasets
      • DatasetDescriptor: holds dataset metadata (schema, format)
      • DatasetWriter: write entities to a dataset in a stream
      • DatasetReader: read entities from a dataset
  • 8. Kite Partition Strategies
    • PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase.
        PartitionStrategy p = new PartitionStrategy.Builder()
            .year("timestamp")
            .month("timestamp")
            .day("timestamp").build();
    • Resulting layout:
        /user/hive/warehouse/events
          /year=2014/month=05/day=05
            /FlumeData.1375659013795
            /FlumeData.1375659013796
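To make the year/month/day mapping on this slide concrete, here is a minimal sketch in plain Java of how a timestamp field could be turned into the partition directory shown above. It does not use Kite: `PartitionPathSketch` and `partitionPath` are hypothetical names, and interpreting the timestamp in UTC is an assumption made for the example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper, not part of Kite: imitates a year/month/day
// partition strategy applied to an entity's "timestamp" field.
final class PartitionPathSketch {

    // Builds "year=YYYY/month=MM/day=DD" from epoch milliseconds.
    // Assumption: the timestamp is interpreted in UTC.
    static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%04d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        // 1399248000000L is 2014-05-05T00:00:00Z.
        System.out.println(partitionPath(1399248000000L));
    }
}
```

Every entity whose timestamp falls on the same UTC day lands in the same directory, which is what lets a reader skip whole partitions when it only needs a date range.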
  • 9. Kite Datasets Example
    • Event.avsc
        {
          "type" : "record",
          "name" : "Event",
          "namespace" : "com.example",
          "fields" : [
            { "name": "id", "type": "long" },
            { "name": "timestamp", "type": "long" },
            { "name": "source", "type": "string" }
          ]
        }
    • Log4j configuration
        log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
        log4j.appender.flume.Hostname = localhost
        log4j.appender.flume.Port = 41415
        log4j.appender.flume.DatasetRepositoryUri = repo:hive
        log4j.appender.flume.DatasetName = events
  • 10. Kite Datasets Example Continued
    • Dataset creation
        DatasetRepository repo = DatasetRepositories.open("repo:hive");
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schema(Event.avsc).build();
        repo.create("events", descriptor);
    • Java code
        Logger logger = Logger.getLogger(...);
        Event event = new Event();
        event.setId(id);
        event.setTimestamp(System.currentTimeMillis());
        event.setSource(source);
        logger.info(event);
  • 11. Kite Datasets Example Continued
    • Resulting file layout (Avro files):
        /user
          /hive
            /warehouse
              /events
                /FlumeData.1375659013795
                /FlumeData.1375659013796
  • 13. HBase Storage Format
    • HBase storage concepts are fundamentally different from file formats on HDFS:
      • Ordered rows
      • Column families
      • Random access operations
  • 14. HBase Storage Format
    • New concepts added to the Dataset API:
      • Composite keys: support for entity ordering with composite keys
      • Column mapping: define how data is split across column families and columns in a table
      • Random access dataset methods: support for Get, Put, and Delete operations on the Dataset interface
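The combination of ordered rows and random access can be pictured with an in-memory sorted map. The sketch below is only a stand-in for the idea: it is neither the HBase client nor the Kite Dataset API, `OrderedTableSketch` is a hypothetical name, and the `url|timestamp` key layout is an illustrative assumption.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory stand-in (not HBase, not Kite): rows are kept sorted by key,
// so point operations and range scans both work on the same structure.
final class OrderedTableSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    void put(String key, String value) { rows.put(key, value); }

    String get(String key) { return rows.get(key); }

    void delete(String key) { rows.remove(key); }

    // Range scan over [startKey, stopKey); efficient because rows are ordered.
    SortedMap<String, String> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, stopKey);
    }

    public static void main(String[] args) {
        OrderedTableSketch table = new OrderedTableSketch();
        table.put("a.com|1000", "first crawl");
        table.put("a.com|2000", "second crawl");
        table.put("b.com|1000", "other site");
        // '}' is the character right after '|', so this range covers every a.com row.
        System.out.println(table.scan("a.com|", "a.com}").keySet());
    }
}
```

Because the rows are sorted, all versions of one URL sit next to each other and can be read with a single scan, which is the access pattern the composite key slides that follow are designed around.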
  • 15. Composite Key Engineering
    • Properly engineered row keys are crucial for optimizing HBase scans.
    • HBase tables sort rows by the lexicographical ordering of the key byte arrays.
    • Composite keys are a common use case, but they are hard to get right.
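A short sketch of why lexicographic byte ordering is easy to get wrong. The comparison below walks two keys byte by byte, treating each byte as unsigned, which is the ordering the slide describes. The class name and the `user<id>` key layouts are hypothetical and chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration of lexicographic (unsigned, byte-by-byte) key ordering.
final class RowKeyOrderSketch {

    // Negative if a sorts before b, positive if after, zero if equal.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // a shorter key that is a prefix sorts first
    }

    // Naive key: the decimal id is appended as text, e.g. "user10".
    static byte[] naiveKey(long id) {
        return ("user" + id).getBytes(StandardCharsets.UTF_8);
    }

    // Zero-padded key, e.g. "user0000000010"; assumes a non-negative id.
    static byte[] paddedKey(long id) {
        return String.format(Locale.ROOT, "user%010d", id).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Naive text keys put id 10 before id 9, because '1' < '9'.
        System.out.println(compareKeys(naiveKey(10), naiveKey(9)) < 0);
        // Fixed-width keys restore numeric order.
        System.out.println(compareKeys(paddedKey(9), paddedKey(10)) < 0);
    }
}
```

Zero padding is just one way to make byte order agree with logical order; the memcmp encoding described in the later slides addresses the same problem for binary integers and variable-length strings.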
  • 16. Composite Key Engineering With Partition Strategies • We already have a way to split records across storage buckets with a PartitionStrategy. • Let’s re-use that concept. • Example: Define a PartitionStrategy optimized for historical web page scans
    Website.avsc: { "type": "record", "name": "Website", "namespace": "com.example", "fields": [ { "name": "url", "type": "string" }, { "name": "timestamp", "type": "long" }, { "name": "content", "type": "string" } ] }
    Partition Strategy Builder:
      PartitionStrategy p = new PartitionStrategy.Builder()
          .identity("url")
          .identity("timestamp")
          .build();
  • 17. Composite Key Engineering With Partition Strategies: Or with the Partition Strategy JSON format
    Website.avsc: { "type": "record", "name": "Website", "namespace": "com.example", "fields": [ { "name": "url", "type": "string" }, { "name": "timestamp", "type": "long" }, { "name": "content", "type": "string" } ] }
    WebsitePartitionStrat.json: [ { "source": "url", "type": "id" }, { "source": "timestamp", "type": "id" } ]
  • 18. Key Memcmp Encoding • Encode composite key parts so that the serialized byte array sorts lexicographically by the key fields, in order: { "id": 1, "ts": 100, … } < { "id": 2, "ts": 50, … } < { "id": 2, "ts": 102, … }
  • 19. Key Memcmp Encoding (Integer and Long)
    Standard integer and long serialization sorts negative numbers after positive ones when the bytes are compared lexicographically:
      1 → 0x00000001, 0 → 0x00000000, -1 → 0xFFFFFFFF, -2 → 0xFFFFFFFE
    So we flip the sign bit when serializing an integer or long:
      1 → 0x80000001, 0 → 0x80000000, -1 → 0x7FFFFFFF, -2 → 0x7FFFFFFE
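The sign-bit flip on this slide can be sketched in plain Java. This is only an illustration of the technique, not Kite's actual serializer; the class and method names (`SignFlipKeyEncoding`, `encodeLong`, `decodeLong`) are made up for the example, and `Arrays.compareUnsigned` (Java 9+) stands in for HBase's byte-wise key comparison.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SignFlipKeyEncoding {
    // Serialize a long as 8 big-endian bytes with the sign bit flipped,
    // so that unsigned byte-wise comparison matches numeric order.
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    // Reverse the encoding by flipping the sign bit back.
    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong() ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] values = {1L, 0L, -1L, -2L};
        byte[][] keys = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            keys[i] = encodeLong(values[i]);
        }
        // Sort the way a memcmp-ordered store would: unsigned, byte by byte.
        Arrays.sort(keys, (a, b) -> Arrays.compareUnsigned(a, b));
        for (byte[] key : keys) {
            System.out.println(decodeLong(key)); // prints -2, -1, 0, 1 on separate lines
        }
    }
}
```

XOR with `Long.MIN_VALUE` toggles only the top bit, so the same operation encodes and decodes.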
  • 20. Key Memcmp Encoding (Variable Length Types)
    Binary Avro encoding is length prefixed. This can sort composite keys wrong:
      ("foo", "bar") → \x03foo\x03bar
      ("foo", "zr") → \x03foo\x02zr
      ("zo", "bar") → \x02zo\x03bar
    So we end strings with a terminating character instead:
      ("foo", "bar") → foo\x00bar\x00
      ("foo", "zr") → foo\x00zr\x00
      ("zo", "bar") → zo\x00bar\x00
  • 21. Key Memcmp Encoding (Variable Length Types) • How do we handle a \x00 byte present in the variable length type? • Convert each \x00 byte to \x00\x01, and use \x00\x00 as the terminator.
      ("foo", "bar") → foo\x00\x00bar\x00\x00
      ("foo\x00", "aa") → foo\x00\x01\x00\x00aa\x00\x00
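A minimal sketch of the escape-and-terminate scheme from this slide, written in plain Java for illustration. It is not Kite's implementation; the names `TerminatedStringKeyEncoding`, `writePart`, and `encodeKey` are hypothetical, and strings are assumed to be serialized as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TerminatedStringKeyEncoding {
    // Append one key part: escape every 0x00 byte as 0x00 0x01,
    // then close the part with the 0x00 0x00 terminator.
    static void writePart(ByteArrayOutputStream out, String part) {
        for (byte b : part.getBytes(StandardCharsets.UTF_8)) {
            out.write(b);
            if (b == 0) {
                out.write(1);
            }
        }
        out.write(0);
        out.write(0);
    }

    // Concatenate the encoded parts into a single composite row key.
    static byte[] encodeKey(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : parts) {
            writePart(out, part);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[][] keys = {
            encodeKey("zo", "bar"),
            encodeKey("foo", "zr"),
            encodeKey("foo", "bar"),
        };
        // Unsigned byte-wise sort, standing in for HBase's row ordering.
        Arrays.sort(keys, (a, b) -> Arrays.compareUnsigned(a, b));
        for (byte[] key : keys) {
            System.out.println(Arrays.toString(key));
        }
    }
}
```

Because the terminator is the smallest possible two-byte sequence, a shorter string always sorts before any longer string that starts with it, which is what the length-prefixed form fails to guarantee.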
  • 22. Column Mappings: Defines how an Avro record’s fields are mapped to an HBase table row.
    Mapping Type: Description
      column: Maps a record field value directly to a column.
      counter: Similar to column, except that it supports atomic increment.
      keyAsColumn: Maps key/value field types to a column family, where each key entry is a column qualifier and each value entry is the cell value.
      key: The record field’s value is part of the composite key.
      occVersion: Enables optimistic concurrency control on the dataset.
  • 23. Column Mappings: Header Definition
    Event.avsc: { "type": "record", "name": "Event", "namespace": "com.example", "mapping": [ { "source": "id", "type": "key" }, { "source": "ts", "type": "key" }, { "source": "source", "type": "column", "value": "meta:source" }, { "source": "atts", "type": "keyAsColumn", "value": "atts:" } ], "fields": [ { "name": "id", "type": "long" }, { "name": "ts", "type": "long" }, { "name": "source", "type": "string" }, { "name": "atts", "type": { "type": "map", "value": "string" } } ] }
    • The mapping definition attribute can be added directly to the Avro record schema. • Still a valid Avro schema: Avro’s schema parser will ignore unknown attributes in the record header.
  • 24. Column Mappings: Field Definition
    Event.avsc: { "type": "record", "name": "Event", "namespace": "com.example", "fields": [ { "name": "id", "type": "long", "mapping": { "type": "key" } }, { "name": "ts", "type": "long", "mapping": { "type": "key" } }, { "name": "source", "type": "string", "mapping": { "type": "column", "value": "meta:source" } }, { "name": "atts", "type": { "type": "map", "value": "string" }, "mapping": { "type": "keyAsColumn", "value": "atts:" } } ] }
    • Mapping definition attributes can be defined directly on the Avro schema fields. • Still a valid Avro schema: Avro’s schema parser will ignore unknown attributes on fields.
  • 25. Column Mappings: External Definition
    Event.avsc: { "type": "record", "name": "Event", "namespace": "com.example", "fields": [ { "name": "id", "type": "long" }, { "name": "ts", "type": "long" }, { "name": "source", "type": "string" }, { "name": "atts", "type": { "type": "map", "value": "string" } } ] }
    EventMapping.json: [ { "source": "id", "type": "key" }, { "source": "ts", "type": "key" }, { "source": "source", "type": "column", "value": "meta:source" }, { "source": "atts", "type": "keyAsColumn", "value": "atts:" } ]
    • Mapping definition attributes can be defined in an external file. • Perfect if you don’t want to update existing Avro schemas.
  • 26. Column Mapping Types: "column" • Maps a field to a fully qualified column • Fields are serialized using Avro binary encoding, except: • Integer is serialized as a 4 byte int • Long is serialized as an 8 byte long • String is serialized as UTF8 bytes • This allows atomic increment and append on these types, which length-prefixed and zig-zag encoding would not.
    Event Instance: { "id": 1, "ts": 1396322485, "source": "server1", "atts": { "ip": "192.168.0.100", "level": "ERROR" } }
    Resulting row: Row Key (Key Part 1 = 1, Key Part 2 = 1396322485); Column Family meta (Qualifier source = server1); Column Family atts (Qualifier ip = 192.168.0.100, Qualifier level = ERROR)
  • 27. Column Mapping Types: "keyAsColumn" • Allowed for Map and Record types • Splits apart a Map by its entries, using keys as the qualifier, and storing values in the cell. • Splits apart a Record by its fields, using field names as the qualifier, and storing the values in the cell. • Fields are serialized using Avro’s binary encoding • Allows a pattern for atomic updates to the keyAsColumn field.
    Event Instance: { "id": 1, "ts": 1396322485, "source": "server1", "atts": { "ip": "192.168.0.100", "level": "ERROR" } }
    Resulting row: Row Key (Key Part 1 = 1, Key Part 2 = 1396322485); Column Family meta (Qualifier source = server1); Column Family atts (Qualifier ip = 192.168.0.100, Qualifier level = ERROR)
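To make the Map case concrete, here is a rough sketch of how a keyAsColumn field could fan out into one cell per map entry. It only models the idea: the class and method names (`KeyAsColumnSketch`, `toCells`) are invented for the example, and cell values are kept as plain strings for readability, whereas the slide says the real values are Avro binary encoded.

```java
import java.util.Map;
import java.util.TreeMap;

public class KeyAsColumnSketch {
    // Turn each map entry into a cell addressed by "<family>:<qualifier>".
    // columnPrefix is the mapping's "value" attribute, e.g. "atts:".
    static Map<String, String> toCells(String columnPrefix, Map<String, String> field) {
        Map<String, String> cells = new TreeMap<>();
        for (Map.Entry<String, String> entry : field.entrySet()) {
            cells.put(columnPrefix + entry.getKey(), entry.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> atts = Map.of("ip", "192.168.0.100", "level", "ERROR");
        // prints {atts:ip=192.168.0.100, atts:level=ERROR}
        System.out.println(toCells("atts:", atts));
    }
}
```

Since every entry lands in its own cell, a single entry can be written without rewriting the whole map, which is presumably the atomic-update pattern the slide refers to.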
  • 28. Column Mapping Types: "key" • Allowed for simple types: int, long, float, double, boolean, string, bytes • Can be defined on multiple fields to support multi-part keys • Rows are ordered lexicographically by the key mapping fields, in the order they are defined
    Event Instance: { "id": 1, "ts": 1396322485, "source": "server1", "atts": { "ip": "192.168.0.100", "level": "ERROR" } }
    Resulting row: Row Key (Key Part 1 = 1, Key Part 2 = 1396322485); Column Family meta (Qualifier source = server1); Column Family atts (Qualifier ip = 192.168.0.100, Qualifier level = ERROR)
  • 29. RandomAccessDataset: Adds a number of methods to the Dataset interface for random access operations.
      public E get(Key key);
      public boolean put(E entity);
      public long increment(Key key, String fieldName, long amount);
      public void delete(Key key);
  • 30. Random Access Dataset Example
    Website.avsc: { "type": "record", "name": "Website", "namespace": "com.example", "fields": [ { "name": "url", "type": "string" }, { "name": "timestamp", "type": "long" }, { "name": "size", "type": "int" }, { "name": "content", "type": "string" } ] }
    WebsitesPartitionStrat.json: [ { "source": "url", "type": "id" } ]
    WebsiteVersionsPartitionStrat.json: [ { "source": "url", "type": "id" }, { "source": "timestamp", "type": "id" } ]
    WebsiteColumnMapping.json: [ { "source": "url", "type": "column", "value": "meta:url" }, { "source": "timestamp", "type": "column", "value": "meta:timestamp" }, { "source": "size", "type": "column", "value": "meta:size" }, { "source": "content", "type": "column", "value": "content:content" } ]
  • 31. Random Access Dataset Example
      private RandomAccessDataset<Website> websitesDataset = …;
      private RandomAccessDataset<Website> websiteVersionsDataset = …;
      public void calculateNextFetch(String url) {
        // Random access read of the current record for this URL
        Key key = new Key.Builder(websitesDataset).add("url", url).build();
        Website website = websitesDataset.get(key);
        // Scan every stored version of the same URL
        DatasetReader<Website> websiteVersionReader = websiteVersionsDataset.with("url", url).newReader();
        long ts = computeNextFetchTime(websiteVersionReader);
        website.setNextFetchTime(ts);
        websitesDataset.put(website);
      }
  • 33. Concurrency Control • HBase doesn’t have native support for transactions. • This missing feature can be a problem for newbies. • Single-row Puts are atomic, so the best practice is to prefer denormalizing data into wide rows. • This doesn’t help for Get-Update-Put operations though…
  • 34. Optimistic Concurrency Control • Prevents multiple processes performing row updates from colliding • Enabled with an "occVersion" column mapping type.
    { "type": "record", "name": "Event", "namespace": "com.example", "mapping": [ { "source": "id", "type": "key" }, { "source": "ts", "type": "key" }, { "source": "source", "type": "column", "value": "meta:source" }, { "source": "version", "type": "occVersion" } ], "fields": [ { "name": "id", "type": "long" }, { "name": "ts", "type": "long" }, { "name": "source", "type": "string" }, { "name": "version", "type": "long" } ] }
  • 35. Optimistic Concurrency Control Continued… • The version field is used to track the version in the row. • Uses checkAndPut under the hood to ensure the row hasn’t been updated. • Can’t put to an existing row without first fetching it. • If a conflict occurs, put() on RandomAccessDataset will return false. • A successful put() increments the version. • It is up to the developer how to handle a conflict. • Enables data protection for long-running edits, like shared editing in a web application.
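The version check described on this slide can be modelled with a small in-memory store. This is a sketch of the general optimistic concurrency pattern only: it does not use Kite or HBase, the names `OccSketch` and `Row` are hypothetical, and a synchronized method stands in for HBase's atomic checkAndPut.

```java
import java.util.HashMap;
import java.util.Map;

public class OccSketch {
    // A stored row: one data column plus the version used for conflict detection.
    static final class Row {
        final String source;
        final long version;

        Row(String source, long version) {
            this.source = source;
            this.version = version;
        }
    }

    private final Map<String, Row> table = new HashMap<>();

    synchronized Row get(String key) {
        return table.get(key);
    }

    // Succeeds only if the caller's version matches the stored one
    // (0 means "row does not exist yet"); returns false on conflict.
    synchronized boolean put(String key, Row edited) {
        Row stored = table.get(key);
        long storedVersion = (stored == null) ? 0L : stored.version;
        if (storedVersion != edited.version) {
            return false;
        }
        table.put(key, new Row(edited.source, edited.version + 1));
        return true;
    }

    public static void main(String[] args) {
        OccSketch store = new OccSketch();
        System.out.println(store.put("row1", new Row("server1", 0L))); // true: new row

        Row editorA = store.get("row1"); // both editors fetch version 1
        Row editorB = store.get("row1");
        System.out.println(store.put("row1", new Row("serverA", editorA.version))); // true
        System.out.println(store.put("row1", new Row("serverB", editorB.version))); // false: stale version
    }
}
```

When `put` returns false, the losing editor would typically fetch the row again and decide whether to retry, merge, or give up; as the slide notes, that policy is left to the developer.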
  • 36. Other Notable Advanced Features • Schema Migrations • Users have the ability to add or remove fields from the Avro record schemas. • Kite SDK keeps the historical set of Avro schemas in a specially designated HBase table. • Kite SDK will verify that only valid schema migrations can occur. • Composite Datasets • Users can create multiple datasets for a single HBase table. • This allows developers to atomically Get and Put multiple types of Avro records to a single row. • Kite SDK will verify that dataset column mappings don’t clash.
  • 37. Adam Warrington @adamwar