New tools for building applications
on Apache Hadoop
Eli Collins
Software Engineer, Cloudera
@elicollins
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/hadoop-frameworks-api
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Context – building better products w/ data
• Analytics-driven development
• Diverse data sources & formats
• Tools that make it easier to build apps on Hadoop
• Apache Avro
• Apache Crunch
• Cloudera ML
• Cloudera CDK
Serialization & formats w/ Apache Avro
• Expressive
• Records, arrays, unions, enums
• Efficient
• Compact binary, compressed, splittable
• Interoperable
• Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
• Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc
• Dynamic
• Can read & write w/o generating code first (sketch below)
• Evolvable
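A minimal sketch of the dynamic usage, assuming Avro's GenericRecord and container-file classes; the inline User schema and file name are illustrative, not from the talk:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class AvroDynamicExample {
  public static void main(String[] args) throws Exception {
    // Parse the schema at runtime -- no code generation step
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    // Write a record into a compact, splittable container file
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 42);
    File file = new File("users.avro");
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
    fileWriter.create(schema, file);
    fileWriter.append(user);
    fileWriter.close();

    // Read it back -- the writer's schema travels with the file
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(file, datumReader);
    for (GenericRecord rec : fileReader) {
      System.out.println(rec.get("name") + " " + rec.get("age"));
    }
    fileReader.close();
  }
}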
Complex pipelines w/ Apache Crunch
• Not all data formats are a natural fit for Pig & Hive
• Workaround – large, custom UDFs (or MR)
• Crunch
• API for MapReduce in Java (& Scala)
• Based on Google’s FlumeJava paper
• Combines a small # of primitives & lightweight UDFs
Crunch – advantages
• It’s just Java
• Full programming language
• No need to learn or switch between languages
• Natural type system
• Hadoop writables & Avro native support
• Modular library for reuse
• Create glue code for data transformation
that can be combined with a ML algorithm
into a single MR job
Crunch – core concepts
PCollection: distributed, unordered collection of
elements w/ parallelDo operator.
PTable: sub-interface of PCollection. Distributed, sorted
map. Also has a groupBy operator to aggregate values by key.
Pipeline: coordinates the building and execution of
underlying MapReduce jobs.
Crunch – word count
import org.apache.crunch.*;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        // Split each line on whitespace and emit the individual words
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
Scrunch – Scala wrapper
class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
Based on Google’s Cascade project
Cloudera ML
• Open source libraries and tools to help data
scientists perform common tasks
• Data preparation
• Model evaluation
• Built-in commands
• summarize, sample, normalize, pivot, etc
• K-means clustering on Hadoop
• Scalable k-means++ by Bahmani et al. (seeding sketch below)
• Other implementations as well
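For intuition, a minimal sketch of the classic sequential k-means++ seeding step; this is not Cloudera ML's distributed implementation, and the class and method names are illustrative. The scalable variant of Bahmani et al. oversamples many candidate centers per pass so that only a few passes over the data are needed, which is what makes it fit MapReduce:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSeeding {
  // Squared Euclidean distance between two points
  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  // Pick k initial centers: each new center is sampled with probability
  // proportional to its squared distance from the nearest existing center
  static List<double[]> seed(double[][] points, int k, Random rnd) {
    List<double[]> centers = new ArrayList<double[]>();
    centers.add(points[rnd.nextInt(points.length)]); // first center: uniform
    while (centers.size() < k) {
      double[] cost = new double[points.length];
      double total = 0;
      for (int i = 0; i < points.length; i++) {
        double best = Double.MAX_VALUE;
        for (double[] c : centers) {
          best = Math.min(best, sqDist(points[i], c));
        }
        cost[i] = best;
        total += best;
      }
      // Sample the next center index proportionally to cost[i]
      double r = rnd.nextDouble() * total;
      int chosen = 0;
      double acc = cost[0];
      while (acc < r && chosen < points.length - 1) {
        chosen++;
        acc += cost[chosen];
      }
      centers.add(points[chosen]);
    }
    return centers;
  }
}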
Cloudera ML (cont)
• Built using Crunch
• Vector format – leverages Mahout’s Vector interface
& classes
• Record format – thin wrapper on Avro’s
GenericRecord/Schema and HCatRecord/Schema
interfaces
• More at github.com/cloudera/ml
Cloudera Development Kit (CDK)
• Open source libraries, tools & docs that make
building systems on Hadoop easier
• Provides higher-level APIs atop existing CDH components
• Codify patterns for common use cases
• Doing the right thing should be easy & obvious
CDK – loosely coupled modules
CDK is prescriptive, but…
• Modules can be used independently, as needed
• Doesn't force you into a programming paradigm
• Doesn't make you adopt a ton of dependencies
CDK – data module
• Easier to work with data sets on Hadoop file systems
• Automatic serialization/de-serialization of Java POJOs
and Avro records
• Automatic compression, file & directory layout
• Automatic partitioning
• Metadata plugin provider (Hive/HCatalog)
CDK – example data module usage
// Repository of datasets rooted at /data on the cluster's file system
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
  .fileSystem(FileSystem.get(new Configuration()))
  .directory(new Path("/data")).get();

// Descriptor: Avro schema plus hash partitioning on userId (53 buckets)
DatasetDescriptor desc = new DatasetDescriptor.Builder()
  .schema(new File("event.avsc"))
  .partitionStrategy(
    new PartitionStrategy.Builder().hash("userId", 53).get()).get();

Dataset events = repo.create("events", desc);

DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
  new GenericRecordBuilder(desc.getSchema())
    .set("userId", 1)
    .set("timeStamp", System.currentTimeMillis()).build());
writer.close();

repo.drop("events");
CDK – example directory contents
/data                         <- the dataset repository
  /events                     <- a dataset
    /.metadata                <- per-dataset metadata provider
      /schema.avsc
      /descriptor.properties
    /userId-0                 <- partitioned dataset "entities":
      /xxxx.avro                 Snappy-compressed Avro data files
      /xxxx.avro                 containing individual records
    /userId-1
      /xxxx.avro
    /userId-2
CDK – what’s new & coming
• Log application events to a dataset w/ the log4j API &
Flume as the transport (config sketch below)
• Datasets exposed as Crunch sources and targets
• Date partitioning (year/month/day/hour/min)
• More examples
• Morphlines (a library for record transformation)
• More dataset repositories & languages
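A hedged sketch of the log4j side of that first item, assuming Flume's stock Log4jAppender as the transport; the hostname, port, and logger category are placeholders, and the CDK-specific appender wiring may differ:

# log4j.properties: ship application events to a local Flume agent,
# which delivers them into the CDK dataset
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41414
log4j.logger.events=INFO, flume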
CDK – more info
• github.com/cloudera/cdk
• github.com/cloudera/cdk-examples
• Binary artifacts in Cloudera's Maven repo
• Mailing list: cdk-dev@cloudera.org
A guide to Python frameworks for Hadoop
• Uri Laserson, data scientist @ Cloudera
• Streaming, mrjob, dumbo, hadoopy, pydoop & more
• Thursday June 13th 7pm @ Foursquare (NYC HUG)
Interested in more topics like this?
follow @ClouderaEng
Thank You!
Eli Collins
@elicollins
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/hadoop-frameworks-api
