Building Applications using Apache Hadoop

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/16WDJ8b.

Eli Collins overviews how to build new applications with Hadoop and how to integrate Hadoop with existing applications, providing an update on the state of the Hadoop ecosystem, frameworks and APIs. Filmed at qconnewyork.com.

Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop and member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.


  1. New tools for building applications on Apache Hadoop
     Eli Collins
     Software Engineer, Cloudera
     @elicollins
  2. InfoQ.com: News & Community Site
     • 750,000 unique visitors/month
     • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
     • Post content from our QCon conferences
     • News 15-20 / week
     • Articles 3-4 / week
     • Presentations (videos) 12-15 / week
     • Interviews 2-3 / week
     • Books 1 / month
     Watch the video with slide synchronization on InfoQ.com!
     http://www.infoq.com/presentations/hadoop-frameworks-api
  3. Presented at QCon New York (www.qconnewyork.com)
     Purpose of QCon
     - to empower software development by facilitating the spread of knowledge and innovation
     Strategy
     - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
     - speakers and topics driving the evolution and innovation
     - connecting and catalyzing the influencers and innovators
     Highlights
     - attended by more than 12,000 delegates since 2007
     - held in 9 cities worldwide
  4. Agenda
     • Context – building better products w/ data
     • Analytics-driven development
     • Diverse data sources & formats
     • Tools that make it easier to build apps on Hadoop
       • Apache Avro
       • Apache Crunch
       • Cloudera ML
       • Cloudera CDK
  5. Serialization & formats w/ Apache Avro
     • Expressive
       • Records, arrays, unions, enums
     • Efficient
       • Compact binary, compressed, splittable
     • Interoperable
       • Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
       • Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc
     • Dynamic
       • Can read & write w/o generating code first
     • Evolvable
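     The record, array, union and enum features above all show up in an ordinary Avro schema declaration. The following is a hypothetical example schema (the field names are illustrative, not taken from the talk); the ["null", "string"] union with a default is the idiom that makes a field optional and keeps the schema evolvable:

     ```json
     {
       "type": "record",
       "name": "Event",
       "fields": [
         {"name": "userId", "type": "long"},
         {"name": "timeStamp", "type": "long"},
         {"name": "kind", "type": {"type": "enum", "name": "Kind",
                                   "symbols": ["CLICK", "VIEW", "PURCHASE"]}},
         {"name": "tags", "type": {"type": "array", "items": "string"}},
         {"name": "referrer", "type": ["null", "string"], "default": null}
       ]
     }
     ```

     Because Avro schemas are plain JSON and travel with the data, readers can resolve old data against a newer schema (e.g. one that adds a defaulted field) without regenerating any code.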
  6. Complex pipelines w/ Apache Crunch
     • Not all data formats are a natural fit for Pig & Hive
       • Workaround – large, custom UDFs (or MR)
     • Crunch
       • API for MapReduce in Java (& Scala)
       • Based on Google’s FlumeJava paper
       • Combines a small # of primitives & light-weight UDFs
  7. Crunch – advantages
     • It’s just Java
       • Full programming language
       • No need to learn or switch between languages
       • Natural type system
     • Hadoop writables & Avro native support
     • Modular library for reuse
       • Create glue code for data transformation that can be combined with an ML algorithm into a single MR job
  8. Crunch – core concepts
     • PCollection: a distributed, unordered collection of elements with a parallelDo operator.
     • PTable: a sub-interface of PCollection; a distributed, sorted map. Also has a groupBy operator to aggregate values by key.
     • Pipeline: coordinates the building and execution of the underlying MapReduce jobs.
  9. Crunch – word count

     public class WordCount {
       public static void main(String[] args) throws Exception {
         Pipeline pipeline = new MRPipeline(WordCount.class);
         PCollection<String> lines = pipeline.readTextFile(args[0]);
         PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
           public void process(String line, Emitter<String> emitter) {
             for (String word : line.split("\\s+")) {
               emitter.emit(word);
             }
           }
         }, Writables.strings());
         PTable<String, Long> counts = Aggregate.count(words);
         pipeline.writeTextFile(counts, args[1]);
         pipeline.run();
       }
     }
 10. Scrunch – Scala wrapper

     class WordCountExample {
       val pipeline = new Pipeline[WordCountExample]
       def wordCount(fileName: String) = {
         pipeline.read(from.textFile(fileName))
           .flatMap(_.toLowerCase.split("\\W+"))
           .filter(!_.isEmpty())
           .count
       }
     }

     Based on Google’s Cascade project
 11. Cloudera ML
     • Open source libraries and tools to help data scientists perform common tasks
       • Data preparation
       • Model evaluation
     • Built-in commands
       • summarize, sample, normalize, pivot, etc
     • K-means clustering on Hadoop
       • Scalable k-means++ by Bahmani et al
       • Other implementations as well
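     Cloudera ML runs the scalable k-means++ variant of Bahmani et al. on Hadoop; for reference, the classic single-machine k-means++ seeding step that variant parallelizes can be sketched as follows. This is standalone illustrative code, not Cloudera ML's API: each new center is sampled with probability proportional to its squared distance from the nearest center already chosen, which spreads the initial centers out.

     ```java
     import java.util.ArrayList;
     import java.util.List;
     import java.util.Random;

     public class KMeansPlusPlusSeed {

         // Choose k initial centers: the first uniformly at random, each
         // subsequent one with probability proportional to its squared
         // distance from the nearest center selected so far.
         public static List<double[]> chooseCenters(List<double[]> points, int k, Random rng) {
             List<double[]> centers = new ArrayList<>();
             centers.add(points.get(rng.nextInt(points.size())));
             while (centers.size() < k) {
                 double[] d2 = new double[points.size()];
                 double total = 0.0;
                 for (int i = 0; i < points.size(); i++) {
                     d2[i] = nearestSquaredDistance(points.get(i), centers);
                     total += d2[i];
                 }
                 // Weighted sampling: walk the cumulative distances until
                 // the random threshold is crossed.
                 double r = rng.nextDouble() * total;
                 int chosen = 0;
                 for (int i = 0; i < points.size(); i++) {
                     r -= d2[i];
                     if (r <= 0) { chosen = i; break; }
                 }
                 centers.add(points.get(chosen));
             }
             return centers;
         }

         // Squared Euclidean distance to the closest center picked so far.
         static double nearestSquaredDistance(double[] p, List<double[]> centers) {
             double best = Double.MAX_VALUE;
             for (double[] c : centers) {
                 double d = 0.0;
                 for (int j = 0; j < p.length; j++) {
                     double diff = p[j] - c[j];
                     d += diff * diff;
                 }
                 best = Math.min(best, d);
             }
             return best;
         }
     }
     ```

     The distance-weighted sampling loop is the part Bahmani et al. rework for MapReduce: instead of one center per pass, their variant oversamples many candidates per pass and re-clusters them, which is what makes the algorithm practical on Hadoop-scale data.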
 12. Cloudera ML (cont)
     • Built using Crunch
     • Vector format – leverages Mahout’s Vector interface & classes
     • Record format – thin wrapper on Avro’s GenericRecord/Schema and HCatRecord/Schema interfaces
     • More at github.com/cloudera/ml
 13. Cloudera Development Kit (CDK)
     • Open source libraries, tools & docs that make building systems on Hadoop easier
     • Provides higher-level APIs atop existing CDH components
     • Codifies patterns for common use cases
     • Doing the right thing should be easy & obvious
 14. CDK – loosely coupled modules
     CDK is prescriptive, but:
     • Modules can be used independently, as needed
     • Doesn’t force you into a programming paradigm
     • Doesn’t make you adopt a ton of dependencies
 15. CDK – data module
     • Easier to work with data sets on Hadoop file systems
     • Automatic serialization/de-serialization of Java POJOs and Avro records
     • Automatic compression, file & directory layout
     • Automatic partitioning
     • Metadata plugin provider (Hive/HCatalog)
 16. CDK – example data module usage

     DatasetRepository repo = new FileSystemDatasetRepository.Builder()
         .fileSystem(FileSystem.get(new Configuration()))
         .directory(new Path("/data")).get();

     DatasetDescriptor desc = new DatasetDescriptor.Builder()
         .schema(new File("event.avsc"))
         .partitionStrategy(
             new PartitionStrategy.Builder().hash("userId", 53)).get();

     Dataset events = repo.create("events", desc);

     DatasetWriter<GenericRecord> writer = events.getWriter();
     writer.open();
     writer.write(
         new GenericRecordBuilder(desc.getSchema())
             .set("userId", 1)
             .set("timeStamp", System.currentTimeMillis()).build());
     writer.close();

     repo.drop("events");
 17. CDK – example directory contents

     /data                          (the dataset repository)
       /events                      (a dataset)
         /.metadata                 (per-dataset metadata provider)
           /schema.avsc
           /descriptor.properties
         /userId-0                  (partitioned dataset)
           /xxxx.avro               (“entities”: Snappy-compressed Avro data files containing individual records)
           /xxxx.avro
         /userId-1
           /xxxx.avro
         /userId-2
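     The /userId-N directories come from the hash("userId", 53) partition strategy in the preceding code example: every written record is routed to one of 53 buckets by hashing its userId. Conceptually the bucket choice looks like the following simplified sketch (not the CDK's actual hashing code; the class and method names are hypothetical):

     ```java
     public class PartitionSketch {
         // Map a record's userId to one of numBuckets hash partitions,
         // mirroring the /userId-0 .. /userId-52 directory layout above.
         public static int partition(long userId, int numBuckets) {
             // floorMod keeps the bucket non-negative even when the hash is negative
             return Math.floorMod(Long.hashCode(userId), numBuckets);
         }
     }
     ```

     Because the partition is a pure function of the record, readers that filter on userId can prune to a single directory instead of scanning the whole dataset.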
 18. CDK – what’s new & coming
     • Log application events to a dataset w/ the log4j API & Flume as the transport
     • Datasets exposed as Crunch sources and targets
     • Date partitioning (year/month/day/hour/min)
     • More examples
     • Morphlines (library for record transformation)
     • More dataset repositories & languages
 19. CDK – more info
     • github.com/cloudera/cdk
     • github.com/cloudera/cdk-examples
     • Binary artifacts in Cloudera’s maven repo
     • Mailing list: cdk-dev@cloudera.org
 20. A guide to Python frameworks for Hadoop
     • Uri Laserson, data scientist @ Cloudera
     • Streaming, mrjob, dumbo, hadoopy, pydoop & more
     • Thursday, June 13th, 7pm @ Foursquare (NYC HUG)
     Interested in more topics like this? Follow @ClouderaEng
 21. Thank You!
     Eli Collins
     @elicollins
 22. Watch the video with slide synchronization on InfoQ.com!
     http://www.infoq.com/presentations/hadoop-frameworks-api
