Data Evolution on HBase (with Kiji)

Data changes over time, often requiring carefully planned changes to database tables and application code. KijiSchema integrates best practices for serialization, schema design & evolution, and metadata management, concerns common to NoSQL storage solutions. In particular, KijiSchema provides strong guarantees about schema evolution and validates the reads and writes issued by application code.

We'll be looking at how you can take advantage of KijiSchema in your HBase applications, especially if you're new to HBase.

The Kiji Project is a modular, open-source framework that enables developers and analysts to collect, analyze and use data in real-time applications.

Developers are using Kiji to build:

• Product and content recommendation systems
• Risk analysis and fraud monitoring
• Customer profile and segmentation applications
• Energy usage analytics & reporting


  1. Data Evolution on HBase with Kiji
  2. Who am I?
  3. How do we store data in HBase?
     • HBase provides us with a single value type: byte[]
     • In an application it's necessary to store various data types in a cell: e.g. Java primitives, Java objects…
     • The description of the data we store in an HBase cell is the schema.
     • Write application code or a library to convert our data to/from byte[].
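     To make that last bullet concrete, here is a minimal sketch of the kind of hand-rolled converter an application ends up maintaining. The Pet field layout is invented purely for illustration; the point is that the byte layout (the de facto schema) lives only in this code, and the byte[] it produces is what goes into an HBase Put and comes back in a Result.

       import java.nio.ByteBuffer;
       import java.nio.charset.StandardCharsets;

       // A hand-rolled "schema": the byte layout exists only in this class.
       public final class PetCodec {
         // Layout: [int nameLength][name bytes, UTF-8][int age]
         public static byte[] encode(String name, int age) {
           byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
           ByteBuffer buf = ByteBuffer.allocate(4 + nameBytes.length + 4);
           buf.putInt(nameBytes.length);
           buf.put(nameBytes);
           buf.putInt(age);
           return buf.array();  // the value stored in the HBase cell
         }

         public static String decodeName(byte[] cellValue) {
           ByteBuffer buf = ByteBuffer.wrap(cellValue);
           byte[] nameBytes = new byte[buf.getInt()];
           buf.get(nameBytes);
           return new String(nameBytes, StandardCharsets.UTF_8);
         }
       }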
  4. What about when we want to get our data back from HBase?
     • HBase is unaware of what we put in a cell.
     • What's in the byte[]?
       \x0010312985\x00 column=B:B:D, timestamp=1381493621000, value=\x0B\x00\xB0\x9A\xA4\x9B\xB3P\x02\x80\xF6\xC4\xD5\xE6'\x00\x00
     • Already wrote a library to serialize/deserialize this data, so everything's great, right?
  5. Sometime soon…
     • The data structure has changed
     • Ok, so we update our library with the changes.
     • What happens when we try to read back old data?
       • Raise an exception?
       • Write a bunch of if (…) else if (…) code to determine the correct format?
  6. Instead, use a serialization library with evolvable records
     • Examples: Avro, Thrift, Protocol Buffers
     • These have some notion of compatible changes to help us avoid common pitfalls.
  7. A little bit about Avro
     • Datum structure defined by a schema
     • Rules for compatible and incompatible schema changes
     • Backward-compatibility, forward-compatibility
     • Assumes a linear evolution of schemas
     • Reality: schema evolution is more complicated.
  8. Ideal vs. Reality
     • Ideal (diagram): a linear chain of versions, Schema v1 → v2 → v3 → v4
     • Reality (diagram): versions branch, Schema v1 → v2a and v2b, v2a → v3a, v3a → v4aa and v4ab
  9. A little (more) about Avro
     • Schemas can be defined in JSON or IDL format.
     • Specific & Generic APIs
     • Avro Schema Example (IDL):
         record Pet {
           string name;
           int age;
           string owner_name;
         }
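     As a rough illustration of the generic API mentioned above, the sketch below builds a Pet record against the JSON equivalent of that IDL schema and serializes it to the bytes you would store in a cell; the field values are made up.

       import java.io.ByteArrayOutputStream;
       import org.apache.avro.Schema;
       import org.apache.avro.generic.GenericData;
       import org.apache.avro.generic.GenericDatumWriter;
       import org.apache.avro.generic.GenericRecord;
       import org.apache.avro.io.BinaryEncoder;
       import org.apache.avro.io.EncoderFactory;

       public final class AvroWriteExample {
         // JSON form of the Pet record from the IDL example above.
         static final Schema PET_SCHEMA = new Schema.Parser().parse(
             "{\"type\":\"record\",\"name\":\"Pet\",\"fields\":["
             + "{\"name\":\"name\",\"type\":\"string\"},"
             + "{\"name\":\"age\",\"type\":\"int\"},"
             + "{\"name\":\"owner_name\",\"type\":\"string\"}]}");

         public static byte[] encode() throws Exception {
           GenericRecord pet = new GenericData.Record(PET_SCHEMA);
           pet.put("name", "Ms. Kitty");
           pet.put("age", 3);
           pet.put("owner_name", "Adam");

           ByteArrayOutputStream out = new ByteArrayOutputStream();
           BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
           new GenericDatumWriter<GenericRecord>(PET_SCHEMA).write(pet, encoder);
           encoder.flush();
           return out.toByteArray();  // the byte[] that ends up in the HBase cell
         }
       }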
  10. (even more) about Avro
     • You don't have to use compiled record classes.
     • Can use the GenericReader API to deserialize records with a specified schema.
     • Makes it easier to migrate data when you do have to make an incompatible schema change. (sleep on it)
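     A minimal sketch of that generic read path, using Avro's GenericDatumReader (presumably the API the slide refers to): you hand it the schema the data was written with and the schema you want to read into, and it resolves between them without any compiled classes.

       import org.apache.avro.Schema;
       import org.apache.avro.generic.GenericDatumReader;
       import org.apache.avro.generic.GenericRecord;
       import org.apache.avro.io.BinaryDecoder;
       import org.apache.avro.io.DecoderFactory;

       public final class AvroReadExample {
         // Decodes a cell value using the schema it was written with (writerSchema)
         // while presenting it through the schema the application wants (readerSchema).
         public static GenericRecord decode(
             byte[] cellValue, Schema writerSchema, Schema readerSchema) throws Exception {
           GenericDatumReader<GenericRecord> reader =
               new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
           BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(cellValue, null);
           return reader.read(null, decoder);
         }
       }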
  11. Adding New Fields
     • Old Schema:
         record Pet {
           string name;             // Ms. Kitty
         }
     • New Schema:
         record Pet {
           string name;             // kitten
           string kind = "animal";
         }
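     To see why the default matters, here is a small sketch using the generic API from the previous slides (the schema strings are JSON equivalents of the records above): a record written with the old schema, which has no kind field, is read back through the new schema, and the reader fills in the default "animal".

       import java.io.ByteArrayOutputStream;
       import org.apache.avro.Schema;
       import org.apache.avro.generic.GenericData;
       import org.apache.avro.generic.GenericDatumReader;
       import org.apache.avro.generic.GenericDatumWriter;
       import org.apache.avro.generic.GenericRecord;
       import org.apache.avro.io.BinaryEncoder;
       import org.apache.avro.io.DecoderFactory;
       import org.apache.avro.io.EncoderFactory;

       public final class DefaultFillExample {
         static final Schema OLD = new Schema.Parser().parse(
             "{\"type\":\"record\",\"name\":\"Pet\",\"fields\":["
             + "{\"name\":\"name\",\"type\":\"string\"}]}");
         static final Schema NEW = new Schema.Parser().parse(
             "{\"type\":\"record\",\"name\":\"Pet\",\"fields\":["
             + "{\"name\":\"name\",\"type\":\"string\"},"
             + "{\"name\":\"kind\",\"type\":\"string\",\"default\":\"animal\"}]}");

         public static void main(String[] args) throws Exception {
           // Write a record with the old schema (no 'kind' field).
           GenericRecord oldPet = new GenericData.Record(OLD);
           oldPet.put("name", "Ms. Kitty");
           ByteArrayOutputStream out = new ByteArrayOutputStream();
           BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
           new GenericDatumWriter<GenericRecord>(OLD).write(oldPet, enc);
           enc.flush();

           // Read it back with the new schema: writer = OLD, reader = NEW.
           GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(OLD, NEW);
           GenericRecord newPet =
               reader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
           System.out.println(newPet.get("kind"));  // "animal" - the default fills the gap
         }
       }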
  12. Remove Fields
     • Old Schema:
         record Pet {
           string name;
           string kind;
         }
     • New Schema:
         record Pet {
           string name;
         }
     What happens when an old reader reads this new record? Can't find kind!
  13. Remove Fields
     • Old Schema:
         record Pet {
           string name = "";
           string kind = "animal";
         }
     • New Schema:
         record Pet {
           string name = "";
         }
     When an 'old' reader encounters a new record, the default value will be used.
     Protip: Always provide default field values so you don't kill your kittens.
  14. Type Promotion
     • Old Schema:
         record PetOwner {
           int ownerId;
           string name;
         }
     • New Schema:
         record PetOwner {
           long ownerId;
           string name;
         }
     Detailed specification: http://avro.apache.org/docs/1.7.5/spec.html#Schema+Resolution
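     A brief sketch of the same resolution machinery handling promotion (the schema strings are invented to mirror the records above): data written with an int ownerId can be read through a schema that declares long ownerId, because Avro promotes int to long during schema resolution.

       import java.io.ByteArrayOutputStream;
       import org.apache.avro.Schema;
       import org.apache.avro.generic.GenericData;
       import org.apache.avro.generic.GenericDatumReader;
       import org.apache.avro.generic.GenericDatumWriter;
       import org.apache.avro.generic.GenericRecord;
       import org.apache.avro.io.BinaryEncoder;
       import org.apache.avro.io.DecoderFactory;
       import org.apache.avro.io.EncoderFactory;

       public final class PromotionExample {
         static final Schema OLD = new Schema.Parser().parse(
             "{\"type\":\"record\",\"name\":\"PetOwner\",\"fields\":["
             + "{\"name\":\"ownerId\",\"type\":\"int\"},"
             + "{\"name\":\"name\",\"type\":\"string\"}]}");
         static final Schema NEW = new Schema.Parser().parse(
             "{\"type\":\"record\",\"name\":\"PetOwner\",\"fields\":["
             + "{\"name\":\"ownerId\",\"type\":\"long\"},"
             + "{\"name\":\"name\",\"type\":\"string\"}]}");

         public static void main(String[] args) throws Exception {
           GenericRecord owner = new GenericData.Record(OLD);
           owner.put("ownerId", 42);          // written as an int
           owner.put("name", "Adam");
           ByteArrayOutputStream out = new ByteArrayOutputStream();
           BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
           new GenericDatumWriter<GenericRecord>(OLD).write(owner, enc);
           enc.flush();

           // Writer schema = OLD (int), reader schema = NEW (long): int is promoted to long.
           GenericRecord promoted = new GenericDatumReader<GenericRecord>(OLD, NEW)
               .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
           System.out.println(promoted.get("ownerId"));  // 42, now a Long
         }
       }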
  15. Enter KijiSchema
     • Provides human-friendly table layouts
     • Uses Avro for serialization
     • Supports primitive types and complex records
     • Schema is stored as part of the datum
     • Schema validation & audit trail
  16. KijiTables >> Plain HBase tables
     • Layout defined using JSON or DDL
     • Formatted row keys
     • Schemas stored in metadata table
     • Schema validation on read & write
     • Basically, an enhanced HBase table
  17. Example Table Definition
       CREATE TABLE foo WITH DESCRIPTION 'some data'
       ROW KEY FORMAT (pet_id LONG)
       WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
         MAXVERSIONS = INFINITY,
         TTL = FOREVER,
         INMEMORY = false,
         COMPRESSED WITH SNAPPY,
         FAMILY info WITH DESCRIPTION 'basic information' (
           metadata CLASS com.mycompany.avro.Pet
         )
       );
     • Visit www.kiji.org for more details
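     For context, a rough sketch of what reading and writing that info:metadata column might look like from application code. The class and method names below (Kiji, KijiTable, KijiTableWriter, KijiDataRequest, and so on), the URI format, and the use of the generated com.mycompany.avro.Pet builder are recalled from the KijiSchema 1.x API and should be treated as assumptions rather than verbatim signatures; see www.kiji.org for the authoritative API.

       import com.mycompany.avro.Pet;
       import org.kiji.schema.Kiji;
       import org.kiji.schema.KijiDataRequest;
       import org.kiji.schema.KijiRowData;
       import org.kiji.schema.KijiTable;
       import org.kiji.schema.KijiTableReader;
       import org.kiji.schema.KijiTableWriter;
       import org.kiji.schema.KijiURI;

       public final class KijiExample {
         public static void main(String[] args) throws Exception {
           // Open the 'foo' table defined by the DDL above (URI format is an assumption).
           Kiji kiji = Kiji.Factory.open(KijiURI.newBuilder("kiji://.env/default/foo").build());
           KijiTable table = kiji.openTable("foo");

           // Write an Avro Pet record; KijiSchema handles serialization and validation.
           KijiTableWriter writer = table.openTableWriter();
           writer.put(table.getEntityId(10312985L), "info", "metadata",
               Pet.newBuilder().setName("Ms. Kitty").setAge(3).setOwnerName("Adam").build());
           writer.close();

           // Read it back; the schema registered for the column is applied on the way out.
           KijiTableReader reader = table.openTableReader();
           KijiRowData row = reader.get(table.getEntityId(10312985L),
               KijiDataRequest.create("info", "metadata"));
           Pet pet = row.getMostRecentValue("info", "metadata");
           reader.close();
           table.release();
           kiji.release();
         }
       }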
  18. Beyond Client-side Validation
     • Server-side schema validation
     • Ensure your reader is able to read stored records
     • Ensure you make compatible schema changes
     • Ensure you don't accidentally introduce a new schema
  19. Schema Validation
     • Three available modes:
       • Developer
       • Strict
       • None (not recommended in most cases, but still possible)
  20. Developer Mode
     • Don't need an ALTER statement to write with a new schema
     • New schemas are automatically registered on write
     • Incompatible writers are rejected at run-time, so still safe
     • Convenience when developing
  21. Strict Mode
     • New schemas must be registered with an ALTER statement.
     • Incompatible readers and writers are rejected at registration time.
     • Production-safe
  22. ALTER Examples
     • ALTER TABLE t SET VALIDATION = STRICT;
     • ALTER TABLE t ADD WRITER SCHEMA "long" FOR COLUMN info:foo;
       In column: 'info:foo' Reader schema: "int" is incompatible with writer schema: "long".
  23. Avoid Common Pitfalls w/ KijiSchema Validation
     1. Record has a string field with a default value
     2. Field removed (compatible)
     3. New field with same name added but different type (compatible from the perspective of 2)
     4. Incompatible between 1 and 3!
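     A sketch of that pitfall using Avro's own pairwise compatibility check (org.apache.avro.SchemaCompatibility, which I believe ships with Avro 1.7.2 and later; the schema strings are invented to match steps 1-3 above): each adjacent pair looks fine, but checking step 1's reader against step 3's writer exposes the incompatibility that one-step-at-a-time validation would miss.

       import org.apache.avro.Schema;
       import org.apache.avro.SchemaCompatibility;

       public final class PitfallExample {
         static Schema parse(String fields) {
           return new Schema.Parser().parse(
               "{\"type\":\"record\",\"name\":\"Pet\",\"fields\":[" + fields + "]}");
         }

         public static void main(String[] args) {
           // 1. string field with a default value
           Schema v1 = parse("{\"name\":\"name\",\"type\":\"string\"},"
               + "{\"name\":\"kind\",\"type\":\"string\",\"default\":\"animal\"}");
           // 2. field removed
           Schema v2 = parse("{\"name\":\"name\",\"type\":\"string\"}");
           // 3. field re-added with the same name but a different type
           Schema v3 = parse("{\"name\":\"name\",\"type\":\"string\"},"
               + "{\"name\":\"kind\",\"type\":\"int\",\"default\":0}");

           // Each single step looks compatible...
           System.out.println(SchemaCompatibility.checkReaderWriterCompatibility(v1, v2).getType());
           System.out.println(SchemaCompatibility.checkReaderWriterCompatibility(v3, v2).getType());
           // ...but a v1 reader cannot handle data written with v3 (string vs. int).
           System.out.println(SchemaCompatibility.checkReaderWriterCompatibility(v1, v3).getType());
         }
       }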
  24. • Apache v2 Licensed Open Source
      • Includes KijiSchema as well as components for:
        • Writing MapReduce jobs
        • Hive Adapter
        • Scalding flows for data science
        • REST API supporting on-demand computation for real-time web applications
  25. BentoBox
     • Complete development environment for Kiji & HBase
     • Single-process Hadoop & HBase cluster
     • We accept community contributions
     • Try it today: www.kiji.org
     • User Mailing List
     • Developer Mailing List
  26. Questions?
      Adam Kunicki
      adam@wibidata.com
      @ramblingpolak on Twitter
