Cloudera Development Kit (CDK): Hadoop Application Development Made Easier



A set of best practices has emerged for building applications on top of Hadoop, thanks to the broad adoption of Apache Hadoop across various industries. However, for many developers, particularly those who are relatively new to Hadoop, it's a challenge to learn what those best practices are and how to apply them.

Cloudera has created a new open source project, the Cloudera Development Kit (CDK), to help these developers get new projects off the ground more easily. The CDK is both a framework and a long-term initiative for documenting proven development practices and providing helpful docs and APIs that make Hadoop application development as easy as possible.

This on-demand webinar will teach you:
- About the current CDK release and its targeted use cases
- How the CDK will be managed and extended over time
- Why the CDK will have a long-term impact on Hadoop adoption

Published in: Technology


  1. Cloudera Development Kit: Hadoop Application Development Made Easier. E. Sammer | Engineering Manager. May 2013
  2. “[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”
  3. Hadoop is incredibly powerful
  4. Hadoop is incredibly flexible
  5. Hadoop is incredibly low-level
  6. Hadoop is incredibly complex
  7. A typical system (zoom 100:1)
  8. A typical system (zoom 10:1)
  9. A typical system (zoom 5:1)
  10. What you actually care about
      - Getting data from A to B
      - Using it later
  11. Infrastructure details
      - Serialization, file formats, and compression
      - Metadata capture and maintenance
      - Dataset organization and partitioning
      - Durability and delivery guarantees
      - Well-defined failure semantics
      - Performance and health instrumentation
  12. Cloudera Development Kit
      - Make Hadoop accessible to the enterprise developer
      - Codify expert patterns and practices
      - Make the “right thing” easy and obvious
      - Address the most common cases
      - Let developers focus on business logic, not infrastructure
  13. Cloudera Development Kit
      - An open source set of libraries, guides, and examples for building data-oriented systems and applications
      - Provides higher level APIs atop existing components of CDH
      - Supports piecemeal adoption via loosely coupled modules
  14. CDK Data Module
      - High level APIs for interacting with datasets in HDFS
      - Configuration-based format and schema management
      - Consistent data model and serialization semantics
      - Metadata system integration and support
      - Automatic dataset partitioning and file management
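  15. Code example:

          import java.io.File;
          import org.apache.avro.Schema;
          import org.apache.avro.generic.GenericRecord;
          import org.apache.avro.generic.GenericRecordBuilder;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          // DatasetRepository, Dataset, DatasetDescriptor, PartitionStrategy and
          // DatasetWriter come from the CDK data module.

          // Open a dataset repository rooted at /data on the default file system.
          DatasetRepository repo = new FileSystemDatasetRepository.Builder()
              .fileSystem(FileSystem.get(new Configuration()))
              .directory(new Path("/data"))
              .get();

          // Create the "events" dataset: Avro schema from event.avsc,
          // hash-partitioned on userId into 53 buckets.
          Dataset events = repo.create("events",
              new DatasetDescriptor.Builder()
                  .schema(new File("event.avsc"))
                  .partitionStrategy(
                      new PartitionStrategy.Builder().hash("userId", 53).get())
                  .get());

          // The slide uses `schema` without defining it; here it is parsed
          // from the same file with the Avro parser.
          Schema schema = new Schema.Parser().parse(new File("event.avsc"));

          // Write a single record, then close the writer.
          DatasetWriter<GenericRecord> writer = events.getWriter();
          writer.write(new GenericRecordBuilder(schema)
              .set("userId", 1)
              .set("timeStamp", System.currentTimeMillis())
              .build());
          writer.close();

      /data/events/.metadata/schema.avsc/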
  16. Under development
      - Configuration-based record transformation and filtering engine
      - Data pipeline deployment, discovery, and management
      - Working with customers, partners, and the community on new modules and features
  17. Getting started
      - CDK code repo:
      - Example repo:
      - Artifacts available from Cloudera’s Maven repository
      - Mailing list:
  18. Thank you for attending!
      - Submit questions in the Q&A panel
      - Watch this webinar on-demand at
      - Follow Cloudera @Cloudera
      - Follow Cloudera Engineering @ClouderaEng
      - Learn more about the CDK on GitHub
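
The code slide partitions the events dataset with `hash("userId", 53)`. As a rough illustration of what hash partitioning into a fixed number of buckets means, here is a minimal, self-contained Java sketch. It is not the CDK's implementation: the class `HashPartitionSketch`, its methods, the hashing rule, and the `field=bucket` directory naming are all assumptions made for this example.

```java
/** Illustrative only: a hypothetical model of hash partitioning, not CDK code. */
public class HashPartitionSketch {

    /** Maps a field value to a bucket number in the range [0, buckets). */
    public static int bucketFor(Object fieldValue, int buckets) {
        // Clear the sign bit so negative hash codes still yield a non-negative bucket.
        return (fieldValue.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    /** Assumed directory layout: root/dataset/field=bucket. */
    public static String partitionDir(String root, String dataset,
                                      String field, Object value, int buckets) {
        return root + "/" + dataset + "/" + field + "=" + bucketFor(value, buckets);
    }

    public static void main(String[] args) {
        int buckets = 53; // same bucket count as the slide's hash("userId", 53)
        for (int userId : new int[] {1, 53, 54, 1000}) {
            System.out.println(userId + " -> "
                + partitionDir("/data", "events", "userId", userId, buckets));
        }
    }
}
```

The point of the sketch is that records with the same `userId` always land in the same bucket, and the number of partitions stays fixed at 53 no matter how many distinct users appear. The developer declares this once in the dataset descriptor rather than managing files by hand.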