Kite SDK: HBase Datasets
Ryan Blue, Software Engineer
©2014 Cloudera, Inc. All rights reserved.
Kite SDK: Working with Datasets

The Kite SDK is an open source set of libraries, tools, examples, and documentation focused on helping developers build systems on top of the Apache Hadoop ecosystem. Learn (via examples) how Kite makes it easier to work with data in HDFS and Apache HBase as records and datasets, just as you would with a relational database.

  1. Kite SDK: HBase Datasets
     Ryan Blue, Software Engineer
  2. What problem is Kite solving?
     • Accessibility
     • Hadoop is flexible, but low level
     • Should be easy to use, without being an expert
  3. Kite SDK
     • A set of off-the-shelf tools
     • Based on experience and best practices
     • Lets you focus on your problem
     • Helps you solve new challenges
  4. Kite Datasets: Motivation
     Focus on using your data, not managing it
     • You shouldn’t have to maintain data files
     • This is the first thing you need
  5. Kite Datasets: Motivation
     [Diagram labels: Application, Database, Data files; annotations: Your code, Provided, Maintained by the database]
  6. Kite Datasets: Motivation
     [Diagram labels: Application (x2), Database, Data files (x2), HBase; annotation: Your code]
  7. Kite Datasets: Motivation
     [Diagram labels: Application (x3), Database, Data files (x3), Kite Data, HBase (x2); annotation: Maintained by Kite]
  8. Kite Datasets: Goals
     • Think in terms of data, not files
     • Describe your data and Kite does the right thing
     • Should work consistently across the platform
     • Reliable
  9. Kite Datasets: Compatibility
     Project      HDFS (avro)   HDFS (parquet)   HBase
     Flume Sink   1.0           1.0              1.0
     MapReduce    1.0           1.0              1.0
     Crunch       1.0           1.0              1.0
     Hive         1.0           1.0              1.1
     Impala       1.0           1.0              *
     * depends on common HBase encoding format
  10. Kite Datasets: What is it?
      • A high-level API for data management
      • Work with records and datasets
      • Not files, directories, or byte arrays
      • Standard descriptions for records and storage
      • Schemas describe records
      • Partition strategies describe layout
      • Opinionated
  11. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
  12. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
      2. Describe your layout
         dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
  13. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
      2. Describe your layout
         dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
      3. Create a dataset
         dataset create ratings --schema rating.avsc --partition-by ymd.json
  14. Kite Datasets: Example
      datasets/
      └── ratings/
          ├── year=1997/
          │   ├── month=09/
          │   │   ├── day=20/
          │   │   ├── ...
          │   │   └── day=30/
          │   ├── month=10/
          │   │   ├── day=01/
          │   │   ├── ...
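
The directory layout on slide 14 follows from applying the year, month, and day partitioners to each record's timestamp. The sketch below is a plain-JDK illustration of that idea, not Kite's implementation: the class and method names are hypothetical, and it assumes the timestamp is epoch milliseconds interpreted in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

public class PartitionPathSketch {
    // Builds a year=/month=/day= directory path for a timestamp.
    // Assumption: epochMillis is milliseconds since the epoch, read as UTC.
    static String partitionPath(String dataset, long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "datasets/%s/year=%d/month=%02d/day=%02d",
                dataset, t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        // 874713600000L is 1997-09-20T00:00:00Z
        System.out.println(partitionPath("ratings", 874713600000L));
        // prints datasets/ratings/year=1997/month=09/day=20
    }
}
```

Every record from the same day maps to the same directory, which is what lets readers skip whole partitions.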
  15. Kite SDK: HBase Datasets
      Ryan Blue, Software Engineer
  16. Kite HBase: Background
      [Diagram labels: Application (x3), Database, Data files (x3), Kite Data, HBase (x2); annotation: Maintained by Kite]
  17. Kite HBase: Background
      • Rows identified by keys, managed by HBase
      • Columns are organized as cells
      • Cells are identified by column family, qualifier
      • The catch: everything is a byte array
      family           name                ...
      row key          last        first   ...
      buzz@pixar.com   Lightyear   Buzz    ...
  18. Kite HBase
      • Uniform interaction with HBase and HDFS datasets
      • Need to make keys from records
      • Need configuration to map fields to cells
  19. Kite HBase: Partitioning
      • Use partition strategy to define unique keys
      • Kite builds the key from each record
      • Kite translates keys to HBase row id bytes
  20. Kite HBase: Partitioning
      • Partition strategy produces a storage key
      • HDFS partitioning uses a group key
          1403028411014 => (2014, 6, 17)
      • HBase partitioning uses a unique key
      • Grouping is done dynamically by HBase
          1403028411014 => (1403028411014)
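
The two mappings on slide 20 can be reproduced with a few lines of plain Java. This is a sketch under stated assumptions, not Kite code: the class and method names are hypothetical, and the timestamp is read as epoch milliseconds in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StorageKeySketch {
    // HDFS-style group key: many records share one (year, month, day) tuple.
    static List<Integer> groupKey(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return Arrays.asList(t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    // HBase-style unique key: the value is carried through unchanged,
    // so each distinct timestamp identifies its own row.
    static List<Long> uniqueKey(long epochMillis) {
        return Collections.singletonList(epochMillis);
    }

    public static void main(String[] args) {
        System.out.println(groupKey(1403028411014L));  // prints [2014, 6, 17]
        System.out.println(uniqueKey(1403028411014L)); // prints [1403028411014]
    }
}
```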
  21. Kite HBase: Example partitioning
      • Define key format from data
      $ ./dataset partition-config --schema user.avsc email:copy
  22. Kite HBase: Example partitioning
      • Define key format from data
      $ ./dataset partition-config --schema user.avsc email:copy
      [ {
        "source" : "email",
        "type" : "identity",
        "name" : "email_copy"
      } ]
  23. Kite HBase: Example partitioning
      $ ./dataset partition-config --schema user.avsc email:hash[16] email:copy
  24. Kite HBase: Example partitioning
      $ ./dataset partition-config --schema user.avsc email:hash[16] email:copy
      [ {
        "source" : "email",
        "type" : "hash",
        "buckets" : 16,
        "name" : "email_hash"
      }, {
        "source" : "email",
        "type" : "identity",
        "name" : "email_copy"
      } ]
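
The `hash[16]` plus `copy` strategy on slide 24 yields a two-part key: a bucket number followed by the original value. The sketch below illustrates the shape of such a key in plain Java. It is an assumption-laden illustration: `String.hashCode` stands in for whatever hash function Kite actually uses, the `/` separator is hypothetical, and the real row key is a byte encoding that is not shown here.

```java
public class HashedKeySketch {
    // Maps a value to one of `buckets` buckets. floorMod keeps the result
    // non-negative even when hashCode() is negative.
    // Assumption: String.hashCode is a stand-in, not Kite's hash function.
    static int bucket(String value, int buckets) {
        return Math.floorMod(value.hashCode(), buckets);
    }

    // Composite key: hash bucket first, then the identity copy of the field.
    // The "/" separator is for display only.
    static String storageKey(String email, int buckets) {
        return bucket(email, buckets) + "/" + email;
    }

    public static void main(String[] args) {
        System.out.println(storageKey("buzz@pixar.com", 16));
    }
}
```

Putting the bucket first spreads consecutive emails across the key space, while the identity part keeps each key unique.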
  25. Kite HBase: Partitioning
      • Use partition strategy to define unique keys
      • Kite builds the key from each record
      • Kite translates keys to HBase row id bytes
      • Some operations require keys
  26. Kite HBase: Field mapping
      • Configure the column family and qualifier for a field
      { "email": "buzz@pixar.com", "firstName": "Buzz", ... }
      family           name                ...
      row key          last        first   ...
      buzz@pixar.com   Lightyear   Buzz    ...
  27. Kite HBase: Basic column mapping
      column
      { "source": "firstName", "type": "column",
        "family": "name", "qualifier": "first" }
  28. Kite HBase: Counter mapping
      column
      { "source": "firstName", "type": "column",
        "family": "name", "qualifier": "first" }
      counter (can be incremented)
      { "source": "visits", "type": "counter",
        "family": "counts", "qualifier": "visits" }
  29. Kite HBase: Key mapping
      key (stored in the row key using identity)
      { "source": "email", "type": "key" }
  30. Kite HBase: Example
      {
        "type" : "record",
        "name" : "User",
        "fields" : [ {
          "name" : "email",
          "type" : "string"
        }, ... ]
      }
  31. Kite HBase: Example
      {
        "type" : "record",
        "name" : "User",
        "fields" : [ {
          "name" : "email",
          "type" : "string"
        }, ... ]
      }
      [ { "source": "email", "type": "key" }, ... ]
  32. Kite HBase: Example
      {
        "type" : "record",
        "name" : "User",
        "fields" : [ {
          "name" : "email",
          "type" : "string"
        }, ... ]
      }
      family           name                counts   prefs
      row key          last        first   visits   flash
      buzz@pixar.com   Lightyear   Buzz    315      true
      [ { "source": "email", "type": "key" }, ... ]
  33. Kite HBase: Example
      {
        "type" : "record",
        "name" : "User",
        "fields" : [ {
          "name" : "lastName",
          "type" : "string"
        }, ... ]
      }
  34. Kite HBase: Example
      {
        "type" : "record",
        "name" : "User",
        "fields" : [ {
          "name" : "lastName",
          "type" : "string"
        }, ... ]
      }
      [ { "source": "lastName", "type": "column",
          "family": "name", "qualifier": "last" }, ... ]
  35. Kite HBase: Example
      {
        "type" : "record",
        "name" : "User",
        "fields" : [ {
          "name" : "lastName",
          "type" : "string"
        }, ... ]
      }
      family           name                counts   prefs
      row key          last        first   visits   flash
      buzz@pixar.com   Lightyear   Buzz    315      true
      [ { "source": "lastName", "type": "column",
          "family": "name", "qualifier": "last" }, ... ]
  36. Kite HBase: Example
      {
        "type" : "record",
        "name" : "User",
        "fields" : [ {
          "name" : "visits",
          "type" : "long"
        }, ... ]
      }
      family           name                counts   prefs
      row key          last        first   visits   flash
      buzz@pixar.com   Lightyear   Buzz    315      true
      [ { "source": "visits", "type": "counter",
          "family": "counts", "qualifier": "visits" }, ... ]
  37. Kite HBase: Interaction
      • Working with a dataset in HBase does not change
      • Readers / writers are backed by scans
      • CLI tools work:
          dataset csv-import pixar_users.csv users --use-hbase
      • Additional methods on RandomAccessDataset
        • get, put, delete, increment
  38. Kite HBase: Interaction using keys
      RandomAccessDataset<User> users = ...;
      Key buzzEmailKey = new Key.Builder()
          .add("email", "buzz@pixar.com")
          .build();
      User buzz = users.get(buzzEmailKey);
      buzz.addPreference("flash", true);
      users.put(buzz);
  39. Kite HBase: More features
      • Versioning and concurrency
        • Additional occVersion type, like a counter
        • Rejects a put if the record has changed
      • Key-as-column mapping
        • Stores maps or records in a column family
        • Uses the key or field name as the qualifier
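
The "rejects a put if the record has changed" behaviour on slide 39 is optimistic concurrency control. The sketch below shows the general technique with an in-memory map. It is not Kite's `occVersion` implementation: the class, the `Versioned` holder, and the convention that a missing row has version 0 are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class OccSketch {
    // A stored value together with the version it was written at.
    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    synchronized Versioned get(String key) {
        return rows.get(key);
    }

    // Accepts the write only if the stored version still equals the version
    // the caller read. Assumption: a row that does not exist has version 0.
    synchronized boolean put(String key, String value, long readVersion) {
        Versioned current = rows.get(key);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != readVersion) {
            return false; // someone else changed the record since it was read
        }
        rows.put(key, new Versioned(value, currentVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        OccSketch store = new OccSketch();
        System.out.println(store.put("buzz@pixar.com", "v1", 0L)); // prints true
        System.out.println(store.put("buzz@pixar.com", "v2", 0L)); // prints false (stale version)
        System.out.println(store.put("buzz@pixar.com", "v2", 1L)); // prints true
    }
}
```

A caller whose put is rejected would typically re-read the record, reapply its change, and try again.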
  40. Kite HBase: Conclusion
      • Translation between objects and byte arrays in Kite
        • Configuration to define key format
        • Configuration to define how fields are stored
      • Decreases the code and time required to experiment
        • Key format and column mappings are hard
        • Try out configurations to find the right one
  41. Questions
      Ryan Blue: blue@cloudera.com
      Kite mailing list: cdk-dev@cloudera.org
