Cloudera Developer Kit:
Hadoop Application Development Made Easier
E. Sammer | Engineering Manager
Big Data Gurus - 2013/09/16 - @esammer
“[I]t’s not enough to just build a scalable
and stable system; the system also has
to be easy enough for thousands of
internal developers of all types and all
skill levels to use.”
http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/
Hadoop is incredibly powerful
Hadoop is incredibly flexible
Hadoop is incredibly low-level
Hadoop is incredibly complex
A typical system (zoom 100:1)
A typical system (zoom 10:1)
A typical system (zoom 5:1)
What you actually care about
Getting data from A to B
Using it later
Infrastructure details
Serialization, file formats, and compression
Metadata capture and maintenance
Dataset organization and partitioning
Durability and delivery guarantees
Well-defined failure semantics
Performance and health instrumentation
Cloudera Development Kit
Make Hadoop accessible to the enterprise developer
Codify expert patterns and practices
Make the “right thing” easy and obvious
Address the most common cases
Let developers focus on business logic, not infrastructure
Cloudera Development Kit
An open source set of libraries, guides, and examples for
building data-oriented systems and applications
Provides higher level APIs atop existing components of CDH
Supports piecemeal adoption via loosely coupled modules
CDK Data Module
High level APIs for interacting with datasets in HDFS
Configuration-based format and schema management
Consistent data model and serialization semantics
Metadata system integration and support
Automatic dataset partitioning and file management
// Open a dataset repository rooted at /data in HDFS.
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .fileSystem(FileSystem.get(new Configuration()))
    .directory(new Path("/data"))
    .get();

// Create the "events" dataset: Avro schema read from event.avsc,
// hash-partitioned into 53 buckets by userId.
Dataset events = repo.create("events",
    new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder().hash("userId", 53).get()
        ).get()
);

// Write one generic Avro record; the writer routes it to the right
// partition and data file. (The schema variable was implicit on the
// slide; here it is taken from the dataset's descriptor.)
Schema schema = events.getDescriptor().getSchema();
DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
    new GenericRecordBuilder(schema)
        .set("userId", 1)
        .set("timeStamp", System.currentTimeMillis())
        .build()
);
writer.close();
Resulting data layout in HDFS:

/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /userId=0
      /10000000.avro
      /10000001.avro
    /userId=1
      /20000000.avro
    /userId=2
      /30000000.avro
CDK Morphlines Module
Pluggable, configuration-driven data transform library
Born out of Cloudera Search, but general purpose
Configure chains of record transform commands that run inside a container library
Use the library in Flume, MapReduce jobs, Storm, and other Java applications (see the embedding sketch below)
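Embedding a morphline looks roughly like the sketch below. This is an illustrative sketch, not code from the deck: the config contents, the file names (morphline.conf, events.log), and the entry points used (MorphlineContext, Compiler, Record, Fields, Notifications from the CDK morphline packages, later org.kitesdk.morphline.*) are assumptions about the morphlines API of that era.

// Assumed morphline.conf (HOCON), defining one transform pipeline:
//   morphlines : [{
//     id : morphline1
//     importCommands : ["com.cloudera.**"]
//     commands : [
//       { readLine { charset : UTF-8 } }
//       { logInfo { format : "output record: {}", args : ["@{}"] } }
//     ]
//   }]

// Build a context and compile the named morphline from the config file.
// The final child command receives the fully transformed records.
MorphlineContext context = new MorphlineContext.Builder().build();
Command collector = new Command() {
  public Command getParent() { return null; }
  public void notify(Record notification) { }
  public boolean process(Record record) {
    System.out.println("got: " + record);  // hand records to your own sink here
    return true;
  }
};
Command morphline = new Compiler()
    .compile(new File("morphline.conf"), "morphline1", context, collector);

// Feed a record whose body is an input stream, as a Flume sink or an
// MR driver would, and run it through the pipeline.
Record record = new Record();
record.put(Fields.ATTACHMENT_BODY, new FileInputStream("events.log"));
Notifications.notifyStartSession(morphline);
boolean success = morphline.process(record);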
Other Modules
Maven plugin
Package, deploy, and execute “apps”
Execute dataset operations
Examples
POJO, generic, and generated entity ingest
Dataset administrative operations
Crunch and MR integration
...
Future
HBase
Extending data APIs to support random access
Same automatic serialization, schema management, etc.
Higher-order data management
Common tasks
Think background compaction, conversion, etc.
Integration with existing middleware frameworks
Give us all your good ideas (and code)!
Getting started
CDK code repo: github.com/cloudera/cdk
CDK example repo: github.com/cloudera/cdk-examples
Binary artifacts available from Cloudera’s Maven repository
Community forums: community.cloudera.com
Mailing list: groups.google.com/a/cloudera.org/d/forum/cdk-dev
JIRA: issues.cloudera.org/browse/CDK
Questions?
I also wrote a book.
We’re going to give a few copies away.