New tools for building applications
on Apache Hadoop
Eli Collins
Software Engineer, Cloudera
@elicollins
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/hadoop-frameworks-api
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Agenda
• Context – building better products w/ data
• Analytics-driven development
• Diverse data sources & formats
• Tools that make it easier to build apps on Hadoop
• Apache Avro
• Apache Crunch
• Cloudera ML
• Cloudera CDK
Serialization & formats w/ Apache Avro
• Expressive
• Records, arrays, unions, enums
• Efficient
• Compact binary, compressed, splittable
• Interoperable
• Langs: C, C++, C#, Java, Perl, Python, Ruby, PHP
• Tools: MR, Pig, Hive, Crunch, Flume, Sqoop, etc
• Dynamic
• Can read & write w/o generating code first (sketch below)
• Evolvable
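A minimal sketch of the dynamic usage, assuming Avro's GenericRecord and container-file classes; the inline User schema and file name are illustrative, not from the talk:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

public class AvroDynamicExample {
  public static void main(String[] args) throws Exception {
    // Parse the schema at runtime -- no code generation step
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    // Write a record into a compact, splittable container file
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");
    user.put("age", 42);
    File file = new File("users.avro");
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
    fileWriter.create(schema, file);
    fileWriter.append(user);
    fileWriter.close();

    // Read it back -- the writer's schema travels with the file
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
    DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(file, datumReader);
    for (GenericRecord rec : fileReader) {
      System.out.println(rec.get("name") + " " + rec.get("age"));
    }
    fileReader.close();
  }
}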
Complex pipelines w/ Apache Crunch
• Not all data formats are a natural fit for Pig & Hive
• Workaround – large, custom UDFs (or MR)
• Crunch
• API for MapReduce in Java (& Scala)
• Based on Google’s FlumeJava paper
• Combines a small # of primitives & lightweight UDFs
Crunch – advantages
• It’s just Java
• Full programming language
• No need to learn or switch between languages
• Natural type system
• Hadoop writables & Avro native support
• Modular library for reuse
• Create glue code for data transformation
that can be combined with a ML algorithm
into a single MR job
Crunch – core concepts
PCollection: distributed, unordered collection of
elements w/ parallelDo operator.
PTable: sub-interface of PCollection. Distributed, sorted
map. Also has a groupBy operator to aggregate values by key.
Pipeline: coordinates the building and execution of
underlying MapReduce jobs.
Crunch – word count
import org.apache.crunch.*;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Aggregate;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        // Split each line on whitespace and emit the individual words
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    pipeline.run();
  }
}
Scrunch – Scala wrapper
class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
Based on Google’s Cascade project
Cloudera ML
• Open source libraries and tools to help data
scientists perform common tasks
• Data preparation
• Model evaluation
• Built-in commands
• summarize, sample, normalize, pivot, etc
• K-means clustering on Hadoop
• Scalable k-means++ by Bahmani et al. (seeding sketch below)
• Other implementations as well
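For intuition, a minimal sketch of the classic sequential k-means++ seeding step; this is not Cloudera ML's distributed implementation, and the class and method names are illustrative. The scalable variant of Bahmani et al. oversamples many candidate centers per pass so that only a few passes over the data are needed, which is what makes it fit MapReduce:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSeeding {
  // Squared Euclidean distance between two points
  static double sqDist(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  // Pick k initial centers: each new center is sampled with probability
  // proportional to its squared distance from the nearest existing center
  static List<double[]> seed(double[][] points, int k, Random rnd) {
    List<double[]> centers = new ArrayList<double[]>();
    centers.add(points[rnd.nextInt(points.length)]); // first center: uniform
    while (centers.size() < k) {
      double[] cost = new double[points.length];
      double total = 0;
      for (int i = 0; i < points.length; i++) {
        double best = Double.MAX_VALUE;
        for (double[] c : centers) {
          best = Math.min(best, sqDist(points[i], c));
        }
        cost[i] = best;
        total += best;
      }
      // Sample the next center index proportionally to cost[i]
      double r = rnd.nextDouble() * total;
      int chosen = 0;
      double acc = cost[0];
      while (acc < r && chosen < points.length - 1) {
        chosen++;
        acc += cost[chosen];
      }
      centers.add(points[chosen]);
    }
    return centers;
  }
}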
Cloudera ML (cont)
• Built using Crunch
• Vector format – leverages Mahout’s Vector interface
& classes
• Record format – thin wrapper on Avro’s
GenericRecord/Schema and HCatRecord/Schema
interfaces
• More at github.com/cloudera/ml
Cloudera Development Kit (CDK)
• Open source libraries, tools & docs that make
building systems on Hadoop easier
• Provides higher-level APIs atop existing CDH components
• Codify patterns for common use cases
• Doing the right thing should be easy & obvious
CDK – loosely coupled modules
CDK is prescriptive, but…
• Modules can be used independently, as needed
• Doesn't force you into a programming paradigm
• Doesn't make you adopt a ton of dependencies
CDK – data module
• Easier to work with data sets on Hadoop file systems
• Automatic serialization/de-serialization of Java POJOs
and Avro records
• Automatic compression, file & directory layout
• Automatic partitioning
• Metadata plugin provider (Hive/HCatalog)
CDK – example data module usage
// Repository of datasets rooted at /data on the cluster's file system
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
  .fileSystem(FileSystem.get(new Configuration()))
  .directory(new Path("/data")).get();

// Descriptor: Avro schema plus hash partitioning on userId (53 buckets)
DatasetDescriptor desc = new DatasetDescriptor.Builder()
  .schema(new File("event.avsc"))
  .partitionStrategy(
    new PartitionStrategy.Builder().hash("userId", 53).get()).get();

Dataset events = repo.create("events", desc);

DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
  new GenericRecordBuilder(desc.getSchema())
    .set("userId", 1)
    .set("timeStamp", System.currentTimeMillis()).build());
writer.close();

repo.drop("events");
CDK – example directory contents
/data                         <- the dataset repository
  /events                     <- a dataset
    /.metadata                <- per-dataset metadata provider
      /schema.avsc
      /descriptor.properties
    /userId-0                 <- partitioned dataset "entities":
      /xxxx.avro                 Snappy-compressed Avro data files
      /xxxx.avro                 containing individual records
    /userId-1
      /xxxx.avro
    /userId-2
CDK – what’s new & coming
• Log application events to a dataset w/ the log4j API &
Flume as the transport (config sketch below)
• Datasets exposed as Crunch sources and targets
• Date partitioning (year/month/day/hour/min)
• More examples
• Morphlines (a library for record transformation)
• More dataset repositories & languages
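A hedged sketch of the log4j side of that first item, assuming Flume's stock Log4jAppender as the transport; the hostname, port, and logger category are placeholders, and the CDK-specific appender wiring may differ:

# log4j.properties: ship application events to a local Flume agent,
# which delivers them into the CDK dataset
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41414
log4j.logger.events=INFO, flume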
CDK – more info
• github.com/cloudera/cdk
• github.com/cloudera/cdk-examples
• Binary artifacts in Cloudera's Maven repo
• Mailing list: cdk-dev@cloudera.org
A guide to Python frameworks for Hadoop
• Uri Laserson, data scientist @ Cloudera
• Streaming, mrjob, dumbo, hadoopy, pydoop & more
• Thursday June 13th 7pm @ Foursquare (NYC HUG)
Interested in more topics like this?
follow @ClouderaEng
Thank You!
Eli Collins
@elicollins
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/hadoop-frameworks-api
