Data Evolution on HBase with Kiji

Data changes over time, often requiring carefully planned changes to database tables and application code. KijiSchema integrates best practices for serialization, schema design and evolution, and metadata management that are common in NoSQL storage solutions. In particular, KijiSchema provides strong guarantees for schema evolution and validates the reads and writes issued by application code.

We'll be looking at how you can take advantage of KijiSchema in your HBase applications, especially if you're new to HBase.

The Kiji Project is a modular, open-source framework that enables developers and analysts to collect, analyze and use data in real-time applications.

Developers are using Kiji to build:

• Product and content recommendation systems
• Risk analysis and fraud monitoring
• Customer profile and segmentation applications
• Energy usage analytics & reporting

Published in: Technology

Data Evolution on HBase (with Kiji)

  1. Data Evolution on HBase with Kiji
  2. Who am I?
  3. How do we store data in HBase?
     • HBase provides us with a single value type: byte[]
     • In an application it’s necessary to store various data types in a cell, e.g. Java primitives, Java objects…
     • The description of the data we store in an HBase cell is the schema.
     • Write application code or a library to convert our data to/from byte[].
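
To make the last bullet concrete, here is a minimal sketch of a hand-written converter. The `Pet` layout (an 8-byte id followed by a length-prefixed UTF-8 name) and the function names are assumptions made up for illustration; they are not taken from the talk or from any HBase API.

```python
import struct

# Hypothetical layout "v1": pet_id (8-byte big-endian signed int),
# name length (2-byte unsigned int), then the UTF-8 bytes of the name.
HEADER_V1 = ">qH"

def encode_pet_v1(pet_id, name):
    """Turn a pet into the raw bytes we would store in a cell."""
    raw_name = name.encode("utf-8")
    return struct.pack(HEADER_V1, pet_id, len(raw_name)) + raw_name

def decode_pet_v1(cell):
    """Turn cell bytes back into (pet_id, name), assuming layout v1."""
    pet_id, name_len = struct.unpack_from(HEADER_V1, cell, 0)
    offset = struct.calcsize(HEADER_V1)
    name = cell[offset:offset + name_len].decode("utf-8")
    return pet_id, name

cell = encode_pet_v1(42, "Ms. Kitty")
print(decode_pet_v1(cell))
```

Note that the bytes carry no hint of which layout produced them; the schema lives only in the application code.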
  4. What about when we want to get our data back from HBase?
     • HBase is unaware of what we put in a cell.
     • What’s in the byte[]?
       \x0010312985\x00 column=B:B:D, timestamp=1381493621000, value=\x0B\x00\xB0\x9A\xA4\x9B\xB3P\x02\x80\xF6\xC4\xD5\xE6'\x00\x00<<b>
     • Already wrote a library to serialize/deserialize this data, so everything’s great, right?
  5. Sometime soon…
     • The data structure has changed.
     • Ok, so we update our library with the changes.
     • What happens when we try to read back old data?
       • Raise an exception?
       • Write a bunch of if (…) else if (…) code to determine the correct format?
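
A small sketch of why reading old data with updated code is dangerous. The two layouts below are hypothetical (v1 is an id plus a name, v2 inserts a 4-byte `age` before the name). In this example the updated decoder does not even raise an exception on an old cell; it quietly returns wrong values.

```python
import struct

def encode_pet_v1(pet_id, name):
    """Old writer: 8-byte id, 2-byte name length, UTF-8 name."""
    raw = name.encode("utf-8")
    return struct.pack(">qH", pet_id, len(raw)) + raw

def decode_pet_v2(cell):
    """Updated reader: expects id, a new 4-byte age, name length, name."""
    pet_id, age, name_len = struct.unpack_from(">qiH", cell, 0)
    offset = struct.calcsize(">qiH")
    name = cell[offset:offset + name_len].decode("utf-8")
    return pet_id, age, name

old_cell = encode_pet_v1(42, "Ms. Kitty")

# The id survives, but the "age" and "name" are built from the wrong bytes.
print(decode_pet_v2(old_cell))
```

Telling the two layouts apart would need a version marker or format sniffing in every reader, which is the if/else chain the slide warns about.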
  6. Instead, use a serialization library with evolvable records
     • Examples:
       • Avro
       • Thrift
       • Protocol Buffers
     • Have some notion of compatible changes to help us avoid common pitfalls.
  7. A little bit about Avro
     • Datum structure defined by schema
     • Rules for compatible and incompatible schema changes
     • Backward-compatibility, Forward-compatibility
     • Assumes a linear evolution of schema
     • Reality: Schema evolution is more complicated.
  8. Ideal vs. Reality
     [Figure: schema version graphs with nodes labeled Schema v1, v1, v2, v2a, v3, v3a, v4, v2b, v4aa, v4ab]
  9. A little (more) about Avro
     • Schemas can be defined in JSON or IDL format.
     • Specific & Generic API
     • Avro Schema Example (IDL):
       record Pet {
         string name;
         int age;
         string owner_name;
       }
  10. (even more) about Avro
      • You don't have to use compiled record classes.
      • Can use GenericReader API to deserialize records with a specified schema.
      • Makes it easier to migrate data when you do have to make an incompatible schema change. (sleep on it)
  11. Adding New Fields
      • Old Schema:
        record Pet {
          string name; // Ms. Kitty
        }
      • New Schema:
        record Pet {
          string name;
          // kitten
          string kind = "animal";
        }
  12. Remove Fields
      • Old Schema:
        record Pet {
          string name;
          string kind;
        }
      • New Schema:
        record Pet {
          string name;
          // string kind; (removed)
        }
      What happens when an old reader reads this new record?
      Can’t find kind
  13. Remove Fields
      • Old Schema:
        record Pet {
          string name = "";
          string kind = "animal";
        }
      • New Schema:
        record Pet {
          string name = "";
          // string kind = "animal"; (removed)
        }
      When an ‘old’ reader encounters a new record, the default value will be used.
      Protip: Always provide default field values so you don’t kill your kittens.
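
The add and remove cases above both come down to one resolution rule: a reader field that the writer did not supply is filled from the reader's default, and the read fails if there is no default. The sketch below models that rule with plain dictionaries. It is an illustration of the idea only, not Avro's implementation; `resolve` and `NO_DEFAULT` are made-up names.

```python
NO_DEFAULT = object()  # marks a reader field declared without a default

def resolve(datum, reader_fields):
    """Project a decoded writer datum onto the reader's fields.

    reader_fields maps field name -> default value (or NO_DEFAULT).
    Writer fields the reader does not know about are ignored.
    """
    result = {}
    for name, default in reader_fields.items():
        if name in datum:
            result[name] = datum[name]
        elif default is not NO_DEFAULT:
            result[name] = default
        else:
            raise ValueError(f"reader field {name!r} is missing and has no default")
    return result

old_reader_with_defaults = {"name": "", "kind": "animal"}
old_reader_without_defaults = {"name": NO_DEFAULT, "kind": NO_DEFAULT}

new_record = {"name": "Ms. Kitty"}  # written after 'kind' was removed

print(resolve(new_record, old_reader_with_defaults))
try:
    resolve(new_record, old_reader_without_defaults)
except ValueError as err:
    print("read failed:", err)
```

The same function covers adding a field: a new reader with `kind = "animal"` reading an old record that only has `name` gets the default.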
  14. Type Promotion
      • Old Schema:
        record PetOwner {
          int ownerId;
          string name;
        }
      • New Schema:
        record PetOwner {
          long ownerId;
          string name;
        }
      Detailed specification: http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution
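
Promotion only works in one direction: data written as `int` can be read as `long`, but not the reverse. A minimal sketch of the numeric promotion rules from the Avro schema resolution section follows; the table and the helper name `can_read` are illustrative and cover primitive numeric types only.

```python
# Writer type -> reader types it may be promoted to (numeric rules only).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}

def can_read(reader_type, writer_type):
    """True if data written as writer_type can be read as reader_type."""
    if reader_type == writer_type:
        return True
    return reader_type in PROMOTIONS.get(writer_type, set())

print(can_read("long", "int"))   # new reader (long), old data (int)
print(can_read("int", "long"))   # old reader (int), new data (long)
```

So after widening `ownerId` to `long`, new readers can still read old records, while readers still on the old schema cannot read newly written ones.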
  15. Enter KijiSchema
      • Human-friendly table layouts
      • Uses Avro for serialization
      • Provides schema validation & audit trail
      • Supports primitive types and complex records
      • Schema is stored as part of the datum
  16. KijiTables >> Plain HBase tables
      • Layout defined using JSON or DDL
      • Formatted row keys
      • Schemas stored in metadata table
      • Schema validation on read & write
      • Basically, an enhanced HBase table
  17. Example Table Definition
      CREATE TABLE foo WITH DESCRIPTION 'some data'
      ROW KEY FORMAT (pet_id LONG)
      WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
        MAXVERSIONS = INFINITY,
        TTL = FOREVER,
        INMEMORY = false,
        COMPRESSED WITH SNAPPY,
        FAMILY info WITH DESCRIPTION 'basic information' (
          metadata CLASS com.mycompany.avro.Pet
        ));
      • Visit www.kiji.org for more details
  18. Beyond Client-side Validation
      • Server-side schema validation
      • Ensure your reader is able to read stored records.
      • Ensure you make compatible schema changes.
      • Ensure you don’t accidentally introduce a new schema.
  19. Schema Validation
      • Three available modes:
        • Developer
        • Strict
        • None (Not recommended in most cases, but still possible.)
  20. Developer Mode
      • Don’t need an ALTER statement to write with a new schema
      • New schemas are automatically registered on write
      • Incompatible writers are rejected at run-time, so still safe
      • Convenience when developing
  21. Strict Mode
      • New schemas must be registered with an ALTER statement.
      • Incompatible readers and writers are rejected at registration time.
      • Production-safe
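
The difference between the modes can be pictured with a toy model. Everything below is hypothetical: the `Column` class, its methods, and the use of primitive type names as schemas are stand-ins for illustration, not KijiSchema's API or implementation.

```python
PROMOTIONS = {"int": {"long", "float", "double"},
              "long": {"float", "double"},
              "float": {"double"}}

def can_read(reader, writer):
    return reader == writer or reader in PROMOTIONS.get(writer, set())

class Column:
    """Toy column that tracks registered reader and writer schemas."""

    def __init__(self, mode, readers):
        self.mode = mode            # "DEVELOPER", "STRICT" or "NONE"
        self.readers = set(readers)
        self.writers = set()

    def add_writer(self, writer):
        """Explicit registration (the analogue of an ALTER statement)."""
        if self.mode != "NONE":
            for reader in sorted(self.readers):
                if not can_read(reader, writer):
                    raise ValueError(
                        f"reader schema {reader!r} is incompatible "
                        f"with writer schema {writer!r}")
        self.writers.add(writer)

    def write(self, writer):
        """Called on every write with the schema the writer is using."""
        if self.mode == "NONE" or writer in self.writers:
            return
        if self.mode == "STRICT":
            raise PermissionError(f"writer schema {writer!r} is not registered")
        self.add_writer(writer)     # DEVELOPER: check, then register on write

dev = Column("DEVELOPER", readers=["long"])
dev.write("int")                    # compatible, registered automatically

strict = Column("STRICT", readers=["long"])
try:
    strict.write("int")             # not registered yet, so rejected
except PermissionError as err:
    print(err)
strict.add_writer("int")            # explicit registration, checked up front
strict.write("int")
```

In this model both modes run the same compatibility check; they differ only in when it runs and whether an unregistered schema may trigger it implicitly.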
  22. ALTER Examples
      • ALTER TABLE t SET VALIDATION = STRICT;
      • ALTER TABLE t ADD WRITER SCHEMA "long" FOR COLUMN info:foo;
        In column: 'info:foo' Reader schema: "int" is incompatible with writer schema: "long".
  23. Avoid Common Pitfalls w/ KijiSchema Validation
      1. Record has string field with default value
      2. Field removed (compatible)
      3. New field with same name added but different type (compatible from perspective of 2)
      4. Incompatible between 1 and 3!
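
The pitfall is that compatibility is not transitive, so checking a new schema only against its immediate predecessor is not enough. A small sketch of the four steps, using a simplified record model (field name mapped to type and a has-default flag); the helper names are made up and type promotion is ignored for brevity.

```python
def can_read(reader, writer):
    """Schemas map field name -> (type, has_default)."""
    for name, (reader_type, has_default) in reader.items():
        if name in writer:
            if writer[name][0] != reader_type:
                return False        # same name, different type
        elif not has_default:
            return False            # missing field and nothing to fall back on
    return True

def compatible(a, b):
    """Each schema can read data written with the other."""
    return can_read(a, b) and can_read(b, a)

v1 = {"name": ("string", True), "kind": ("string", True)}  # step 1
v2 = {"name": ("string", True)}                            # step 2: kind removed
v3 = {"name": ("string", True), "kind": ("int", True)}     # step 3: kind re-added as int

print(compatible(v1, v2), compatible(v2, v3), compatible(v1, v3))
```

Each neighboring pair passes, yet v1 and v3 cannot read each other's data, which is why validating against every schema registered for the column matters.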
  24. • Apache v2 Licensed Open Source
      • Includes KijiSchema as well as components for writing
        • MapReduce
        • Hive Adapter
        • Scalding flows for data science
        • REST API supporting on-demand computation for real-time web applications
  25. BentoBox
      • Complete development environment for Kiji & HBase
      • Single process Hadoop & HBase cluster
      • We accept community contributions
      • Try it today: www.kiji.org
      • User Mailing List
      • Developer Mailing List
  26. Questions?
      Adam Kunicki
      adam@wibidata.com
      @ramblingpolak on Twitter