Kite SDK introduction for Portland Big Data

Kite SDK is a set of tools for building big data applications on Hadoop.

Published in: Data & Analytics

  1. Kite SDK: It’s for developers
     Ryan Blue, Software Engineer
  2. Resources
     ©2014 Cloudera, Inc. All rights reserved.
       • Kite guide: http://tiny.cloudera.com/KiteGuide
       • Dataset overview and intro: http://tiny.cloudera.com/Datasets
       • Command-line tutorial: http://tiny.cloudera.com/KiteCLI
       • Kite repository and examples:
           • https://github.com/kite-sdk/kite
           • https://github.com/kite-sdk/kite-examples
  3. Agenda
       • Kite background
       • Kite data
  4. What problem does Kite solve?
       • Accessibility for getting started
           • Easy to get started, without being an expert
           • Use before understanding
       • Save time for experienced developers
           • Off-the-shelf tools for common tasks
           • Quickly iterate and test configurations
  5. Kite Datasets: Motivation
       • Focus on using data, not managing files
       • Developers shouldn’t have to maintain data files
       • Use through configuration, not code
       • Need consistency across the platform
  6. Kite Datasets: Motivation
     [Diagram. Labels: Application, Database, Data files. Legend: User code; Provided; Maintained by the database]
  7. Kite Datasets: Motivation
     [Diagram. Labels: Application, Application, Database, Data files, Data files, HBase. Legend: User code]
  8. Kite Datasets: Motivation
     [Diagram. Labels: Application (three), Database, Data files (three), Kite Data, HBase (two). Legend: Maintained by Kite]
  9. Kite Datasets: Goals
       • Think in terms of data: datasets, views, records
       • Describe the data and its layout, and Kite does the right thing
       • Should work consistently across the platform
       • Reliable
  10. Kite Datasets: Compatibility
      Project      HDFS (avro)   HDFS (parquet)   HBase
      Kite         1.0           1.0              1.0
      Flume Sink   1.0           1.0              1.0
      MapReduce    1.0           1.0              1.0
      Crunch       1.0           1.0              1.0
      Hive         1.0           1.0              1.1
      Impala       1.0           1.0              *
      * depends on common HBase encoding format
  11. Current compatibility (0.15.0)
      Project      HDFS (avro)   HDFS (parquet)   HBase
      Kite         1.0           1.0              1.0
      Flume Sink   1.0           1.0              1.0
      MapReduce    1.0           1.0              1.0
      Crunch       1.0           1.0              1.0
      Hive         1.0           1.0              1.1
      Impala       1.0           1.0              *
      * depends on common HBase encoding format
  12. Agenda
        • Kite background
        • Kite data
      [Diagram. Labels: Application, Kite Data, Data files, HBase. Legend: Maintained by Kite]
  13. Datasets
        • A collection of records or entities
        • Like a Hive or relational table
        • Generic, reflected, or generated objects
        • Identified by URI
            • dataset:hdfs:/data/ratings
            • dataset:hive:/data/ratings
            • dataset:hbase:zk1/ratings
          ratings = Datasets.load("dataset:hive:/data/ratings")
  14. Dataset configuration, JSON
        • Schema (Avro)
            • Record fields, like a table definition
  15. Dataset configuration, JSON
        • Schema (Avro)
            • Record fields, like a table definition
        • Partition strategy
            • Layout or key definition from record fields
  16. Configuring partitioning
        • Partition strategy
          [
            { "source" : "timestamp", "type" : "year" },
            { "source" : "timestamp", "type" : "month" },
            { "source" : "timestamp", "type" : "day" }
          ]
          datasets/
          └── ratings/
              ├── year=1997/
              │   ├── month=09/
              │   │   ├── day=20/
              │   │   ├── ...
              │   │   └── day=30/
              │   ├── month=10/
              │   │   ├── day=01/
              │   │   ├── ...
  17. Configuring key building
        • Partition strategy for HBase
          [
            { "source" : "email", "type" : "hash", "buckets": 32 },
            { "source" : "email", "type" : "identity" }
          ]
          (22, "buzz@pixar.com")
          \x80\x00\x00\x16buzz@pixar.com\x00\x00
  18. Dataset configuration, JSON
        • Schema (Avro)
            • Record fields, like a table definition
        • Partition strategy
            • Layout or key definition from record fields
        • Column mapping (HBase)
            • Where to store record fields
  19. Mapping example
          {
            "type" : "record",
            "name" : "User",
            "fields" : [
              { "name" : "email", "type" : "string" },
              ...
            ]
          }
          family            name                counts   prefs
          row key           last        first   visits   flash
          buzz@pixar.com    Lightyear   Buzz    315      true
          [
            { "source": "email", "type": "key" },
            ...
          ]
  20. Mapping example
          {
            "type" : "record",
            "name" : "User",
            "fields" : [
              { "name" : "lastName", "type" : "string" },
              ...
            ]
          }
          family            name                counts   prefs
          row key           last        first   visits   flash
          buzz@pixar.com    Lightyear   Buzz    315      true
          [
            { "source": "lastName", "type": "column",
              "family": "name", "qualifier": "last" },
            ...
          ]
  21. Command-line demo?
        1. Describe your data
           dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
        2. Describe your layout
           dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
        3. Create a dataset
           dataset create ratings --schema rating.avsc --partition-by ymd.json
  22. Command-line tool
        • Executable jar download
        • Inspects the environment
        • Must be used on-cluster
        • Classpath for HBase, Hive, etc.
        • Debugging: debug=true ./dataset -v <command>
        • Requires MAPRED_HOME variable on CDH5
  23. Resources
        • Kite guide: http://tiny.cloudera.com/KiteGuide
        • Dataset overview and intro: http://tiny.cloudera.com/Datasets
        • Command-line tutorial: http://tiny.cloudera.com/KiteCLI
        • Kite repository and examples:
            • https://github.com/kite-sdk/kite
            • https://github.com/kite-sdk/kite-examples
  24. Questions
      Ryan Blue: blue@cloudera.com
      Kite mailing list: cdk-dev@cloudera.org
  25. Maven parent POM
        • Automatic Kite and Hadoop dependencies
        • Inherit from kite-app-parent-cdh4
        • CDH4 only, CDH5 support in 0.16.0
          <parent>
            <groupId>org.kitesdk</groupId>
            <artifactId>kite-app-parent-cdh4</artifactId>
            <version>0.15.0</version>
          </parent>
  26. Maven Plugin
        • Maven plugin manages datasets for an application
        • Configured by app-parent POM
        • Handles create, update, etc. in Maven goals
  27. MapReduce
        • DatasetKeyInputFormat
        • DatasetKeyOutputFormat
        • Values are always null
          View eventsBeforeToday = Datasets
              .load("dataset:hive:/data/events")
              .toBefore("timestamp", startOfToday());
          DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);
  28. Crunch
        • CrunchDatasets.asSource
        • CrunchDatasets.asTarget
          // The slide omits the variable name and the closing parenthesis;
          // "events" is an assumed name.
          PCollection<Event> events = getPipeline().read(
              CrunchDatasets.asSource(eventsBeforeToday));
        • Handle-existing support in 0.16.0
        • Configure dependencies with Kite parent POM
  29. DatasetSink
        • Write to HDFS Avro and HBase
        • http://tiny.cloudera.com/DatasetSink
        • Proxy user support
        • Automatic partitioning
          agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
          agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
          agent.sinks.name.kite.dataset.name = events
          agent.sinks.name.auth.proxyUser = cloudera
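
As a rough illustration of the year/month/day partition strategy on slide 16, the following self-contained Java sketch maps a record timestamp to the directory path it would fall under. It does not use the Kite API: the class and method names are hypothetical, and it assumes the timestamp is in epoch milliseconds and is interpreted in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper, not part of Kite: mimics the year=/month=/day=
// directory layout shown on slide 16 using only the JDK.
final class PartitionPathSketch {

    // Assumption: the timestamp is epoch milliseconds, read in UTC.
    static String partitionPath(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        // Zero-pad month and day to two digits, as in month=09/day=20.
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        // 874724710 seconds after the epoch falls on 1997-09-20 (UTC).
        long ratingTimestamp = 874724710L * 1000;
        // prints datasets/ratings/year=1997/month=09/day=20
        System.out.println("datasets/ratings/" + partitionPath(ratingTimestamp));
    }
}
```

With a strategy like this, records whose timestamps fall on the same day end up under the same directory, which is what lets a time-bounded view read only the matching partitions.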
