
Building Applications using Apache Hadoop


Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/16WDJ8b.

Eli Collins gives an overview of how to build new applications with Hadoop and how to integrate Hadoop with existing applications, providing an update on the state of the Hadoop ecosystem, frameworks and APIs. Filmed at qconnewyork.com.

Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop, and a member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.

Transcript

  • 1. New tools for building applications on Apache Hadoop. Eli Collins, Software Engineer, Cloudera (@elicollins)
  • 2. InfoQ.com: News & Community Site
    • 750,000 unique visitors/month
    • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
    • Post content from our QCon conferences
    • News 15-20 / week
    • Articles 3-4 / week
    • Presentations (videos) 12-15 / week
    • Interviews 2-3 / week
    • Books 1 / month
    Watch the video with slide synchronization on InfoQ.com!
    http://www.infoq.com/presentations/hadoop-frameworks-api
  • 3. Presented at QCon New York (www.qconnewyork.com)
    Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation.
    Strategy: a practitioner-driven conference designed for YOU, the influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators.
    Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide.
  • 4. Agenda
    • Context – building better products w/ data
    • Analytics-driven development
    • Diverse data sources & formats
    • Tools that make it easier to build apps on Hadoop:
      • Apache Avro
      • Apache Crunch
      • Cloudera ML
      • Cloudera CDK
  • 5. Serialization & formats w/ Apache Avro
    • Expressive: records, arrays, unions, enums
    • Efficient: compact binary, compressed, splittable
    • Interoperable
      • Languages: C, C++, C#, Java, Perl, Python, Ruby, PHP
      • Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc.
    • Dynamic: can read & write w/o generating code first
    • Evolvable
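    [Editor's note] To make the "dynamic" bullet concrete, here is a minimal sketch (not from the talk) of writing and reading Avro data through the GenericRecord API with no code generation; the Event schema and file name are illustrative:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroGenericExample {
      public static void main(String[] args) throws Exception {
        // Parse a schema at runtime; no generated classes required.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"userId\",\"type\":\"long\"},"
            + "{\"name\":\"ts\",\"type\":\"long\"}]}");

        // Build and write a record to an Avro container file.
        GenericRecord event = new GenericData.Record(schema);
        event.put("userId", 1L);
        event.put("ts", System.currentTimeMillis());

        File file = new File("events.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
          writer.create(schema, file);
          writer.append(event);
        }

        // Read it back; the writer's schema travels with the file,
        // which is what makes schema evolution possible.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
          for (GenericRecord r : reader) {
            System.out.println(r.get("userId") + " @ " + r.get("ts"));
          }
        }
      }
    }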
  • 6. Complex pipelines w/ Apache Crunch
    • Not all data formats are a natural fit for Pig & Hive
      • Workaround: large, custom UDFs (or MR)
    • Crunch
      • API for MapReduce in Java (& Scala)
      • Based on Google's FlumeJava paper
      • Combine a small # of primitives & light-weight UDFs
  • 7. Crunch – advantages
    • It's just Java
      • Full programming language
      • No need to learn or switch between languages
    • Natural type system
      • Hadoop Writables & Avro supported natively
    • Modular library for reuse
      • Create glue code for data transformation that can be combined with an ML algorithm into a single MR job
  • 8. Crunch – core concepts
    • PCollection: a distributed, unordered collection of elements with a parallelDo operator.
    • PTable: a sub-interface of PCollection; a distributed, sorted map. Also has a groupByKey operator to aggregate values by key.
    • Pipeline: coordinates the building and execution of the underlying MapReduce jobs.
  • 9. Crunch – word count
    public class WordCount {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(WordCount.class);
        PCollection<String> lines = pipeline.readTextFile(args[0]);
        PCollection<String> words = lines.parallelDo("my splitter",
            new DoFn<String, String>() {
              public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                  emitter.emit(word);
                }
              }
            }, Writables.strings());
        PTable<String, Long> counts = Aggregate.count(words);
        pipeline.writeTextFile(counts, args[1]);
        pipeline.run();
      }
    }
  • 10. Scrunch – Scala wrapper
    class WordCountExample {
      val pipeline = new Pipeline[WordCountExample]
      def wordCount(fileName: String) = {
        pipeline.read(from.textFile(fileName))
          .flatMap(_.toLowerCase.split("\\W+"))
          .filter(!_.isEmpty())
          .count
      }
    }
    Based on Google's Cascade project.
  • 11. Cloudera ML
    • Open source libraries and tools to help data scientists perform common tasks
      • Data preparation
      • Model evaluation
    • Built-in commands: summarize, sample, normalize, pivot, etc.
    • K-means clustering on Hadoop
      • Scalable k-means++ by Bahmani et al.
      • Other implementations as well
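    [Editor's note] The deck cites Bahmani et al. but shows no code. As a local, in-memory sketch of the core idea, here is classic k-means++ seeding (D^2 sampling), which the scalable variant parallelizes by oversampling many candidates per pass. This is not Cloudera ML code; all names are illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class KMeansPlusPlusSeed {

      // Squared Euclidean distance between two points.
      static double sqDist(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
          double t = a[i] - b[i];
          d += t * t;
        }
        return d;
      }

      static List<double[]> seed(double[][] points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<double[]>();
        // First center: chosen uniformly at random.
        centers.add(points[rnd.nextInt(points.length)]);
        while (centers.size() < k) {
          // D^2 weights: squared distance from each point to its nearest center.
          double[] d2 = new double[points.length];
          double sum = 0;
          for (int i = 0; i < points.length; i++) {
            double best = Double.MAX_VALUE;
            for (double[] c : centers) {
              best = Math.min(best, sqDist(points[i], c));
            }
            d2[i] = best;
            sum += best;
          }
          // Next center: sampled with probability proportional to D^2,
          // so far-away (poorly covered) points are favored.
          double r = rnd.nextDouble() * sum;
          double acc = 0;
          int chosen = 0;
          for (int i = 0; i < points.length; i++) {
            acc += d2[i];
            if (acc >= r) { chosen = i; break; }
          }
          centers.add(points[chosen]);
        }
        return centers;
      }
    }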
  • 12. Cloudera ML (cont.)
    • Built using Crunch
    • Vector format: leverages Mahout's Vector interface & classes
    • Record format: thin wrapper on Avro's GenericRecord/Schema and HCatalog's HCatRecord/Schema interfaces
    • More at github.com/cloudera/ml
  • 13. Cloudera Development Kit (CDK)
    • Open source libraries, tools & docs that make building systems on Hadoop easier
    • Provides higher-level APIs atop existing CDH components
    • Codify patterns for common use cases
    • Doing the right thing should be easy & obvious
  • 14. CDK – loosely coupled modules
    CDK is prescriptive, but:
    • Modules can be used independently, as needed
    • It doesn't force you into a programming paradigm
    • It doesn't make you adopt a ton of dependencies
  • 15. CDK – data module
    • Makes it easier to work with data sets on Hadoop file systems
    • Automatic serialization/deserialization of Java POJOs and Avro records
    • Automatic compression, file & directory layout
    • Automatic partitioning
    • Pluggable metadata providers (Hive/HCatalog)
  • 16. CDK – example data module usage
    DatasetRepository repo = new FileSystemDatasetRepository.Builder()
        .fileSystem(FileSystem.get(new Configuration()))
        .directory(new Path("/data")).get();

    DatasetDescriptor desc = new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder().hash("userId", 53).get())
        .get();

    Dataset events = repo.create("events", desc);
    DatasetWriter<GenericRecord> writer = events.getWriter();
    writer.open();
    writer.write(new GenericRecordBuilder(desc.getSchema())
        .set("userId", 1)
        .set("timeStamp", System.currentTimeMillis()).build());
    writer.close();
    repo.drop("events");
  • 17. CDK – example directory contents
    /data                            <- the dataset repository
      /events                        <- a dataset
        /.metadata                   <- per-dataset metadata provider
          /schema.avsc
          /descriptor.properties
        /userId-0                    <- partitioned dataset
          /xxxx.avro                 <- "entities": Snappy-compressed Avro
          /xxxx.avro                    data files containing individual records
        /userId-1
          /xxxx.avro
        /userId-2
  • 18. CDK – what's new & coming
    • Log application events to a dataset w/ the log4j API & Flume as the transport
    • Datasets exposed as Crunch sources and targets
    • Date partitioning (year/month/day/hour/min)
    • More examples
    • Morphlines (a library for record transformation)
    • More dataset repositories & languages
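    [Editor's note] The date-partitioning item is not shown with code in the deck. Purely as a hypothetical sketch, by analogy with the hash() call in slide 16, a calendar-partitioned descriptor might look something like this; the year()/month()/day() builder methods and the resulting directory layout are assumptions, not confirmed API:

    DatasetDescriptor desc = new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder()  // builder as in slide 16
                .year("timeStamp")           // assumed method, by analogy with hash()
                .month("timeStamp")          // assumed
                .day("timeStamp")            // assumed
                .get())
        .get();
    // Writes would then land under a calendar layout such as
    // /data/events/year=2013/month=06/day=12/xxxx.avro (illustrative only).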
  • 19. CDK – more info
    • github.com/cloudera/cdk
    • github.com/cloudera/cdk-examples
    • Binary artifacts in Cloudera's Maven repo
    • Mailing list: cdk-dev@cloudera.org
  • 20. A guide to Python frameworks for Hadoop
    • Uri Laserson, data scientist @ Cloudera
    • Streaming, mrjob, dumbo, hadoopy, pydoop & more
    • Thursday, June 13th, 7pm @ Foursquare (NYC HUG)
    Interested in more topics like this? Follow @ClouderaEng
  • 21. Thank You! Eli Collins @elicollins
  • 22. Watch the video with slide synchronization on InfoQ.com!
    http://www.infoq.com/presentations/hadoop-frameworks-api