Cloudera Developer Kit:
Hadoop Application Development Made Easier
E. Sammer | Engineering Manager
Big Data Gurus - 2013/09/16 - @esammer
“[I]t’s not enough to just build a scalable
and stable system; the system also has
to be easy enough for thousands of
internal developers of all types and all
skill levels to use.”
http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/
Hadoop is incredibly powerful
Hadoop is incredibly flexible
Hadoop is incredibly low-level
Hadoop is incredibly complex
A typical system (zoom 100:1)
A typical system (zoom 10:1)
A typical system (zoom 5:1)
What you actually care about
Getting data from A to B
Using it later
Infrastructure details
Serialization, file formats, and compression
Metadata capture and maintenance
Dataset organization and partitioning
Durability and delivery guarantees
Well-defined failure semantics
Performance and health instrumentation
Cloudera Development Kit
Make Hadoop accessible to the enterprise developer
Codify expert patterns and practices
Make the “right thing” easy and obvious
Address the most common cases
Let developers focus on business logic, not infrastructure
Cloudera Development Kit
An open source set of libraries, guides, and examples for
building data-oriented systems and applications
Provides higher level APIs atop existing components of CDH
Supports piecemeal adoption via loosely coupled modules
CDK Data Module
High level APIs for interacting with datasets in HDFS
Configuration-based format and schema management
Consistent data model and serialization semantics
Metadata system integration and support
Automatic dataset partitioning and file management
// Open a dataset repository rooted at /data in HDFS.
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .fileSystem(FileSystem.get(new Configuration()))
    .directory(new Path("/data"))
    .get();

// Create the "events" dataset: Avro schema read from event.avsc,
// hash-partitioned into 53 buckets by userId.
Dataset events = repo.create("events",
    new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder().hash("userId", 53).get()
        ).get()
);

// Write one generic Avro record; the writer routes it to the right
// partition and data file. (The schema variable was implicit on the
// slide; here it is taken from the dataset's descriptor.)
Schema schema = events.getDescriptor().getSchema();
DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
    new GenericRecordBuilder(schema)
        .set("userId", 1)
        .set("timeStamp", System.currentTimeMillis())
        .build()
);
writer.close();
Resulting data layout in HDFS:

/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /userId=0
      /10000000.avro
      /10000001.avro
    /userId=1
      /20000000.avro
    /userId=2
      /30000000.avro
CDK Morphlines Module
Pluggable, configuration-driven data transform library
Born out of Cloudera Search, but general purpose
Configure chains of record transform commands that run inside a container library
Use the library in Flume, MapReduce jobs, Storm, and other Java applications (see the embedding sketch below)
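Embedding a morphline looks roughly like the sketch below. This is an illustrative sketch, not code from the deck: the config contents, the file names (morphline.conf, events.log), and the entry points used (MorphlineContext, Compiler, Record, Fields, Notifications from the CDK morphline packages, later org.kitesdk.morphline.*) are assumptions about the morphlines API of that era.

// Assumed morphline.conf (HOCON), defining one transform pipeline:
//   morphlines : [{
//     id : morphline1
//     importCommands : ["com.cloudera.**"]
//     commands : [
//       { readLine { charset : UTF-8 } }
//       { logInfo { format : "output record: {}", args : ["@{}"] } }
//     ]
//   }]

// Build a context and compile the named morphline from the config file.
// The final child command receives the fully transformed records.
MorphlineContext context = new MorphlineContext.Builder().build();
Command collector = new Command() {
  public Command getParent() { return null; }
  public void notify(Record notification) { }
  public boolean process(Record record) {
    System.out.println("got: " + record);  // hand records to your own sink here
    return true;
  }
};
Command morphline = new Compiler()
    .compile(new File("morphline.conf"), "morphline1", context, collector);

// Feed a record whose body is an input stream, as a Flume sink or an
// MR driver would, and run it through the pipeline.
Record record = new Record();
record.put(Fields.ATTACHMENT_BODY, new FileInputStream("events.log"));
Notifications.notifyStartSession(morphline);
boolean success = morphline.process(record);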
Other Modules
Maven plugin
Package, deploy, and execute “apps”
Execute dataset operations
Examples
POJO, generic, and generated entity ingest
Dataset administrative operations
Crunch and MR integration
...
Future
HBase
Extending data APIs to support random access
Same automatic serialization, schema management, etc.
Higher-order data management
Common tasks
Think background compaction, conversion, etc.
Integration with existing middleware frameworks
Give us all your good ideas (and code)!
Getting started
CDK code repo: github.com/cloudera/cdk
CDK example repo: github.com/cloudera/cdk-examples
Binary artifacts available from Cloudera’s Maven repository
Community forums: community.cloudera.com
Mailing list: groups.google.com/a/cloudera.org/d/forum/cdk-dev
JIRA: issues.cloudera.org/browse/CDK
Questions?
I also wrote a book.
We’re going to give a few copies away.