Kite (Big Data Applications Meetup @ Cask)

© Cloudera, Inc. All rights reserved.
Kite SDK: Helping Hadoop projects work together
Ryan Blue, 23 June 2015

Quick poll
●Who has seen the movie Fear and Loathing in Las Vegas?

Oh, no. What did we do?
●Last thing I remember, we were at a NoSQL party
●I don’t remember much . . .
●Did we build a database?

Dinosaur tails and tape recorders
●Dinosaur tail: some tools work with tables, some with files
●Tape recorder: tables came later, and it wasn’t too bad to deal with it
●Result: table formats are reimplemented everywhere, and jobs commonly drop files into folders that back a database table

●Dinosaur tail: if you can dream up a file format, someone is using it in Hadoop
●Tape recorder: unstructured data was part of the appeal
●Result: it is easy to choose a format with lurking application problems

●Dinosaur tail: the de-facto table format mixes metadata into directory names
●Tape recorder: this format was intended to be simple and be a coarse index
●Result: needs an elaborate locking scheme to guarantee safety, which would cause low-latency queries to be slow
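
To make the "metadata in directory names" point concrete: in the Hive-style layout, partition values live in `key=value` path segments rather than inside the data files. The following is a minimal Python sketch of reading them back. The path and the helper name `partition_values` are hypothetical, chosen only for illustration.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Extract key=value partition segments from a Hive-style file path."""
    values = {}
    # Only directory segments carry partition metadata, so skip the file name.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Values come back as strings: the path records no type information.
            values[key] = value
    return values


# Hypothetical example path, not taken from the talk.
example = "/warehouse/events/year=2015/month=06/part-00000.avro"
print(partition_values(example))
```

Because the values are recovered from strings in the path, every reader has to agree on how to interpret them, which is one way the coarse index becomes a compatibility problem.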

●Dinosaur tail: schemas are missing key features
●Tape recorder: schema on read? I honestly don’t remember
●Result: schema evolution, data types, and behavior vary, and table schemas are sometimes missing
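
As an illustration of what consistent schema evolution can look like, the sketch below projects a record written with an older schema onto a newer reader schema, filling in declared defaults. It is a simplified stand-in for Avro-style schema resolution, not Avro's implementation: it ignores type promotion, aliases, and nested records. The schema, field names, and the helper `resolve` are hypothetical.

```python
# Hypothetical reader schema (version 2): adds an optional "source" field.
READER_V2 = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "message", "type": "string"},
        {"name": "source", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto a reader schema, filling defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the data was written: use the declared default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present in the record but absent from the reader schema are dropped.
    return resolved


old_record = {"id": 1, "message": "hello"}  # written before "source" existed
print(resolve(old_record, READER_V2))
```

The point of the sketch is that adding a field is safe only when the new field declares a default. When tools disagree on rules like this one, the same files can read differently in different engines.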

Building Hadoop applications is hard
●Early choices have big consequences for performance and compatibility
●Components and formats work slightly differently
●Table support is still done manually in most projects
●SQL engines can’t trust the files in a table
●Types are missing

How can we fix it?
●Collaborate on (strict) data storage specs and consistent schemas
●Implement table-level access everywhere, not file-level access
●Include partition handling for storage and retrieval
●Build a standard API so that storage can be versioned and evolved
●Build a common set of tools
●Improve the table format

What is Kite?
●A table-level API that allows storage to be versioned and evolved
●A common set of tools built around that API
●Datasets are identified by URI
●Defined by an Avro schema and partition configuration
●Compatible with Hive and Impala
●Provides an API for table-level access in MapReduce (MR) and Spark
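
The three pieces named above (a dataset URI, an Avro schema, and a partition configuration) can be sketched as plain data. This is an illustration, not Kite's API: the URI shape, the JSON layout of the partition configuration, the record fields, and the helper `missing_sources` are all assumptions and should be checked against the Kite documentation.

```python
import json

# Hypothetical dataset URI; the "dataset:hive:<namespace>/<name>" shape is an assumption.
DATASET_URI = "dataset:hive:default/events"

# Hypothetical Avro record schema (Avro schemas are plain JSON).
SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "timestamp", "type": "long"},
        {"name": "message", "type": "string"},
    ],
}

# Hypothetical partition configuration: partition by year and month of "timestamp".
PARTITIONS = [
    {"type": "year", "source": "timestamp"},
    {"type": "month", "source": "timestamp"},
]


def missing_sources(schema, partitions):
    """Return partition source fields that the schema does not define."""
    field_names = {field["name"] for field in schema["fields"]}
    return [p["source"] for p in partitions if p["source"] not in field_names]


print(DATASET_URI)
print(json.dumps(SCHEMA, indent=2))
print(missing_sources(SCHEMA, PARTITIONS))
```

Keeping the schema and partition configuration as declared metadata, rather than as conventions buried in job code, is what lets a table-level API check them against each other.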

How does Kite differ from Cask?
●Kite is focused on storage
● How should objects be serialized?
● Provides compatibility across the ecosystem
●Cask is focused on application patterns
. . . yes, there is some overlap

Current efforts
●Date, time, and timestamp standardization in Avro and Parquet
●A new table format with snapshot isolation
●An HBase encoding specification for portability
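
For the date and timestamp item, the sketch below shows one unambiguous encoding: a date as an integer count of days since the Unix epoch, and a timestamp as a long count of milliseconds since the epoch in UTC. These match the `date` and `timestamp-millis` logical types as later published in the Avro specification. The talk does not give the details, so treat this as an assumption about where the standardization landed. The helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_days_since_epoch(d):
    """Encode a calendar date as an int: days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def to_millis_since_epoch(dt):
    """Encode a timezone-aware datetime as an int: milliseconds since the epoch, UTC."""
    if dt.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; attach a timezone first")
    return (dt - EPOCH_UTC) // timedelta(milliseconds=1)


print(to_days_since_epoch(date(2015, 6, 23)))
print(to_millis_since_epoch(datetime(2015, 6, 23, tzinfo=timezone.utc)))
```

Rejecting naive datetimes is a deliberate choice in this sketch: a shared encoding only helps if every writer agrees on the reference time zone.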

Demo!

Thank you
blue@cloudera.com
http://ingest.tips/
cdk-dev@cloudera.org
