Kite SDK: Working with Datasets
 


The Kite SDK is an open source set of libraries, tools, examples, and documentation focused on helping developers build systems on top of the Apache Hadoop ecosystem. Learn (via examples) how Kite makes it easier to work with data in HDFS and Apache HBase as records and datasets, just as you would with a relational database.


Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved


Kite SDK: Working with Datasets Presentation Transcript

  • 1. Kite SDK: HBase Datasets Ryan Blue, Software Engineer
  • 2. What problem is Kite solving? ©2014 Cloudera, Inc. All rights reserved.
      • Accessibility
      • Hadoop is flexible, but low level
      • Should be easy to use, without being an expert
  • 3. Kite SDK
      • A set of off-the-shelf tools
      • Based on experience and best practices
      • Lets you focus on your problem
      • Helps you solve new challenges
  • 4. Kite Datasets: Motivation
      Focus on using your data, not managing it
      • You shouldn’t have to maintain data files
      • This is the first thing you need
  • 5. Kite Datasets: Motivation
      [Diagram; labels: Application, Database, Data files, "Your code", "Provided", "Maintained by the database"]
  • 6. Kite Datasets: Motivation
      [Diagram; labels: Application, Application, Database, Data files, Data files, HBase, "Your code"]
  • 7. Kite Datasets: Motivation
      [Diagram; labels: Application, Application, Application, Database, Data files, Data files, Kite Data, HBase, Data files, HBase, "Maintained by the Kite"]
  • 8. Kite Datasets: Goals
      • Think in terms of data, not files
      • Describe your data and Kite does the right thing
      • Should work consistently across the platform
      • Reliable
  • 9. Kite Datasets: Compatibility
      Project       HDFS (avro)   HDFS (parquet)   HBase
      Flume Sink    1.0           1.0              1.0
      MapReduce     1.0           1.0              1.0
      Crunch        1.0           1.0              1.0
      Hive          1.0           1.0              1.1
      Impala        1.0           1.0              *
      * depends on common HBase encoding format
  • 10. Kite Datasets: What is it?
      • A high-level API for data management
      • Work with records and datasets
      • Not files, directories, or byte arrays
      • Standard descriptions for records and storage
      • Schemas describe records
      • Partition strategies describe layout
      • Opinionated
  • 11. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
  • 12. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
      2. Describe your layout
         dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
  • 13. Kite Datasets: Example
      1. Describe your data
         dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
      2. Describe your layout
         dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
      3. Create a dataset
         dataset create ratings --schema rating.avsc --partition-by ymd.json
  • 14. Kite Datasets: Example
      datasets/
      └── ratings/
          ├── year=1997/
          │   ├── month=09/
          │   │   ├── day=20/
          │   │   ├── ...
          │   │   └── day=30/
          │   ├── month=10/
          │   │   ├── day=01/
          │   │   ├── ...
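The year/month/day layout on slide 14 can be illustrated with a minimal sketch that derives a partition directory from a record timestamp. This is not Kite's code: the class `PartitionPathSketch` and the method `partitionPath` are hypothetical names, and the sketch assumes the `ts` field holds epoch milliseconds interpreted in UTC.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Hypothetical helper, not part of Kite: maps a timestamp to a year/month/day directory. */
final class PartitionPathSketch {

    /** Builds a relative path in the style of slide 14. Assumes epoch milliseconds, UTC. */
    static String partitionPath(long epochMillis) {
        ZonedDateTime ts = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Month and day are zero-padded to two digits, as in the directory tree.
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                ts.getYear(), ts.getMonthValue(), ts.getDayOfMonth());
    }

    public static void main(String[] args) {
        // Midnight UTC on 1997-09-20, the first day shown in the tree.
        long ts = LocalDate.of(1997, 9, 20)
                .atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        System.out.println("datasets/ratings/" + partitionPath(ts));
    }
}
```

Every record whose timestamp falls on the same UTC day maps to the same directory, which is what lets a reader think in terms of a dataset instead of individual files.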
  • 15. Kite SDK: HBase Datasets Ryan Blue, Software Engineer
  • 16. Kite HBase: Background
      [Diagram; labels: Application, Application, Application, Database, Data files, Data files, Kite Data, HBase, Data files, HBase, "Maintained by the Kite"]
  • 17. Kite HBase: Background
      • Rows identified by keys, managed by HBase
      • Columns are organized as cells
      • Cells are identified by column family, qualifier
      • The catch: everything is a byte array
      family            name                  ...
      row key           last        first     ...
      buzz@pixar.com    Lightyear   Buzz      ...
  • 18. Kite HBase
      • Uniform interaction with HBase and HDFS datasets
      • Need to make keys from records
      • Need configuration to map fields to cells
  • 19. Kite HBase: Partitioning
      • Use partition strategy to define unique keys
      • Kite builds the key from each record
      • Kite translates keys to HBase row id bytes
  • 20. Kite HBase: Partitioning
      • Partition strategy produces a storage key
      • HDFS partitioning uses a group key
          1403028411014 => (2014, 6, 17)
      • HBase partitioning uses a unique key
      • Grouping is done dynamically by HBase
          1403028411014 => (1403028411014)
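The difference between a group key and a unique key on slide 20 can be sketched as follows. The class and method names are hypothetical and this is not Kite's implementation; it assumes the timestamp is epoch milliseconds read in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of the two storage key styles; not Kite code. */
final class StorageKeySketch {

    /** HDFS-style group key: every timestamp on the same UTC day yields the same key. */
    static List<Integer> groupKey(long epochMillis) {
        ZonedDateTime ts = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return Arrays.asList(ts.getYear(), ts.getMonthValue(), ts.getDayOfMonth());
    }

    /** HBase-style unique key: the value is carried through unchanged. */
    static List<Long> uniqueKey(long epochMillis) {
        return Collections.singletonList(epochMillis);
    }

    public static void main(String[] args) {
        long first = 1403028411014L;  // timestamp from the slide
        long second = first + 1000L;  // one second later, same day
        System.out.println(groupKey(first) + " / " + groupKey(second));   // same group
        System.out.println(uniqueKey(first) + " / " + uniqueKey(second)); // distinct rows
    }
}
```

Two records one second apart land in the same HDFS partition but would occupy two different HBase rows.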
  • 21. Kite HBase: Example partitioning
      • Define key format from data
      $ ./dataset partition-config --schema user.avsc email:copy
  • 22. Kite HBase: Example partitioning
      • Define key format from data
      $ ./dataset partition-config --schema user.avsc email:copy
      [ {
        "source" : "email",
        "type" : "identity",
        "name" : "email_copy"
      } ]
  • 23. Kite HBase: Example partitioning
      $ ./dataset partition-config --schema user.avsc email:hash[16] email:copy
  • 24. Kite HBase: Example partitioning
      $ ./dataset partition-config --schema user.avsc email:hash[16] email:copy
      [ {
        "source" : "email",
        "type" : "hash",
        "buckets" : 16,
        "name" : "email_hash"
      }, {
        "source" : "email",
        "type" : "identity",
        "name" : "email_copy"
      } ]
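A rough sketch of what the `email:hash[16] email:copy` strategy on slide 24 describes: a key made of a hash bucket followed by the original value. The names below are hypothetical, and `String.hashCode` is only a stand-in, since the slides do not say which hash function Kite applies.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of a hash-then-identity key; not Kite code. */
final class HashedKeySketch {

    /** Maps a value to a bucket in [0, buckets). The hash function is a stand-in. */
    static int bucket(String value, int buckets) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(value.hashCode(), buckets);
    }

    /** Builds the two-part key: (email_hash, email_copy). */
    static List<Object> key(String email, int buckets) {
        return Arrays.<Object>asList(bucket(email, buckets), email);
    }

    public static void main(String[] args) {
        System.out.println(key("buzz@pixar.com", 16));
    }
}
```

Because the identity part still carries the full email, the key stays unique per user, while the leading bucket spreads otherwise adjacent keys across 16 prefixes.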
  • 25. Kite HBase: Partitioning
      • Use partition strategy to define unique keys
      • Kite builds the key from each record
      • Kite translates keys to HBase row id bytes
      • Some operations require keys
  • 26. Kite HBase: Field mapping
      • Configure the column family and qualifier for a field
      { "email": "buzz@pixar.com", "firstName": "Buzz", ... }
      family            name                  ...
      row key           last        first     ...
      buzz@pixar.com    Lightyear   Buzz      ...
  • 27. Kite HBase: Basic column mapping
      column
      { "source": "firstName", "type": "column", "family": "name", "qualifier": "first" }
  • 28. Kite HBase: Counter mapping
      column
      { "source": "firstName", "type": "column", "family": "name", "qualifier": "first" }
      counter (can be incremented)
      { "source": "visits", "type": "counter", "family": "counts", "qualifier": "visits" }
  • 29. Kite HBase: Key mapping
      key (stored in the row key using identity)
      { "source": "email", "type": "key" }
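To make the mapping types on slides 27 to 29 concrete, here is a minimal sketch of how a mapping list could turn a record into a row key plus cells of bytes. Everything in it is hypothetical (`ColumnMappingSketch`, `Mapping`, `Row`, `toRow`); it assumes string-valued fields encoded as UTF-8 and leaves out counters, which would need a numeric encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of field-to-cell mapping; not Kite code. */
final class ColumnMappingSketch {

    /** One mapping entry, mirroring the JSON fields on the slides. */
    static final class Mapping {
        final String source, type, family, qualifier;

        Mapping(String source, String type, String family, String qualifier) {
            this.source = source;
            this.type = type;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** What would be written for one record: a row key and its cells. */
    static final class Row {
        byte[] rowKey;
        // "family:qualifier" -> value bytes, in mapping order
        final Map<String, byte[]> cells = new LinkedHashMap<>();
    }

    /** Applies each mapping; assumes every mapped field is present in the record. */
    static Row toRow(Map<String, String> record, Mapping... mappings) {
        Row row = new Row();
        for (Mapping m : mappings) {
            byte[] value = record.get(m.source).getBytes(StandardCharsets.UTF_8);
            if ("key".equals(m.type)) {
                row.rowKey = value;                                  // stored in the row key
            } else {
                row.cells.put(m.family + ":" + m.qualifier, value);  // stored as a cell
            }
        }
        return row;
    }
}
```

Calling `toRow` with a `key` mapping for `email` and a `column` mapping for `firstName` (family `name`, qualifier `first`) yields the row key bytes for the email and one cell under `name:first`, matching the table on slide 26.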
  • 30. Kite HBase: Example
      { "type" : "record", "name" : "User", "fields" : [ { "name" : "email", "type" : "string" }, ... ] }
  • 31. Kite HBase: Example
      { "type" : "record", "name" : "User", "fields" : [ { "name" : "email", "type" : "string" }, ... ] }
      [ { "source": "email", "type": "key" }, ... ]
  • 32. Kite HBase: Example
      { "type" : "record", "name" : "User", "fields" : [ { "name" : "email", "type" : "string" }, ... ] }
      [ { "source": "email", "type": "key" }, ... ]
      family            name                  counts   prefs
      row key           last        first     visits   flash
      buzz@pixar.com    Lightyear   Buzz      315      true
  • 33. Kite HBase: Example
      { "type" : "record", "name" : "User", "fields" : [ { "name" : "lastName", "type" : "string" }, ... ] }
  • 34. Kite HBase: Example
      { "type" : "record", "name" : "User", "fields" : [ { "name" : "lastName", "type" : "string" }, ... ] }
      [ { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" }, ... ]
  • 35. Kite HBase: Example
      { "type" : "record", "name" : "User", "fields" : [ { "name" : "lastName", "type" : "string" }, ... ] }
      [ { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" }, ... ]
      family            name                  counts   prefs
      row key           last        first     visits   flash
      buzz@pixar.com    Lightyear   Buzz      315      true
  • 36. Kite HBase: Example
      { "type" : "record", "name" : "User", "fields" : [ { "name" : "visits", "type" : "long" }, ... ] }
      [ { "source": "visits", "type": "counter", "family": "counts", "qualifier": "visits" }, ... ]
      family            name                  counts   prefs
      row key           last        first     visits   flash
      buzz@pixar.com    Lightyear   Buzz      315      true
  • 37. Kite HBase: Interaction
      • Working with a dataset in HBase does not change
      • Readers / writers are backed by scans
      • CLI tools work:
          dataset csv-import pixar_users.csv users --use-hbase
      • Additional methods on RandomAccessDataset
          • get, put, delete, increment
  • 38. Kite HBase: Interaction using keys
      RandomAccessDataset<User> users = ...;
      Key buzzEmailKey = new Key.Builder()
          .add("email", "buzz@pixar.com")
          .build();
      User buzz = users.get(buzzEmailKey);
      buzz.addPreference("flash", true);
      users.put(buzz);
  • 39. Kite HBase: More features
      • Versioning and concurrency
          • Additional occVersion type, like a counter
          • Rejects a put if the record has changed
      • Key-as-column mapping
          • Stores maps or records in a column family
          • Uses the key or field name as the qualifier
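The "rejects a put if the record has changed" behavior on slide 39 is a form of optimistic concurrency control. The sketch below shows the general idea with an in-memory map; it is not Kite's occVersion implementation, and the class `OccStoreSketch`, its methods, and the convention that version 0 means "no row yet" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a versioned store; not Kite code. */
final class OccStoreSketch {

    /** A stored value together with the version it was written at. */
    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> rows = new HashMap<>();

    /** Returns the current value and version, or null if the key is absent. */
    synchronized Versioned get(String key) {
        return rows.get(key);
    }

    /**
     * Writes the value only if the stored version still equals expectedVersion
     * (0 means the caller expects no existing row). Returns false when the
     * record changed since the caller read it, leaving the stored row untouched.
     */
    synchronized boolean put(String key, String value, long expectedVersion) {
        Versioned current = rows.get(key);
        long currentVersion = (current == null) ? 0L : current.version;
        if (currentVersion != expectedVersion) {
            return false;  // someone else wrote first: reject the put
        }
        rows.put(key, new Versioned(value, currentVersion + 1));
        return true;
    }
}
```

A caller that gets a `false` result would typically re-read the record, reapply its change, and retry with the new version.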
  • 40. Kite HBase: Conclusion
      • Translation between objects and byte arrays in Kite
      • Configuration to define key format
      • Configuration to define how fields are stored
      • Decreases the code and time required to experiment
      • Key format and column mappings are hard
      • Try out configurations to find the right one
  • 41. Questions
      Ryan Blue: blue@cloudera.com
      Kite mailing list: cdk-dev@cloudera.org