Cascading at the Lyon Hadoop User Group

Cascading
Hadoop User Group Lyon
2015-02-06
Arnaud Cogoluègnes - Zenika

Content
Cascading: what, why, how?
Hadoop basics along the way
No pre-requisites to follow

Cascading, what is it?
Java framework
Apache License, Version 2.0
To build data-oriented applications
e.g. ETL-like applications

Cascading key features
Java API
Mature (runs on MapReduce for years)
Testability
Re-usability
Built-in features (filter, join, aggregator, etc)

Cascading simple flow
Fields usersFields = new Fields(
"name","country","gender"
);
Pipe users = new Pipe("users");
users = new Unique(users,new Fields("name"));
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
input file

Connecting flow to source and sink
Fields usersFields = new Fields("name","country","gender");
Tap usersIn = ... // file’s location and structure abstraction
Tap usersOut = ...
FlowDef flowDef = FlowDef.flowDef()
.addSource(users, usersIn)
.addTailSink(users, usersOut);

Taps and schemes
Tap usersIn = new Hfs(
new TextDelimited(usersFields,false,"t"), // structure
"/in" // location
);
Tap usersOut = new Hfs(
new TextDelimited(usersFields, false, "t"),"/out"
);

Executing a MapReduce flow
new Hadoop2MR1FlowConnector().connect(flowDef).complete();

My first MapReduce flow
Tap usersIn = new Hfs(...);
Tap usersOut = new Hfs(...);

Changing the output
Tap usersIn = new Hfs(...);
Tap usersOut = new Hfs( new SequenceFile(usersFields),"/out");

Hadoop 2
HDFS
YARN
MapReduce
Your
app

Blocks, datanodes, namenode
file.csv B1 B2 B3 file is made of 3 blocks (default block size is 128 MB)
B1 B2 B1 B3
B1 B2 B2 B3
DN 1 DN 2
DN 4DN 3
datanodes store files blocks
(here block 3 is under-replicated)
B1 : 1, 2, 3 B2 : 1, 3, 4
B3 : 2, 4
Namenode
namenode handles files metadata and enforces
replication

MapReduce
file.csv B1 B2 B3
Mapper
Mapper
Mapper
B1
B2
B3
Reducer
Reducer
k1,v1
k1,v2
k1 [v1,v2]

Code goes to data
file.csv B1 B2 B3
Mapper
Mapper
Mapper
B1
B2
B3
Reducer
Reducer
k1,v1
k1,v2
k1 [v1,v2]
B1 B2 B1 B3
B1 B2 B2 B3
DN 1 DN 2
DN 4DN 3
DN 1
DN 3
DN 4

Local MapReduce in a test
Not bad

Local connector for testing
Tap usersIn = new FileTap(new TextDelimited(usersFields,false,"t"),"in.txt");
Tap usersOut = new FileTap(
new TextDelimited(usersFields, false, "t"), "out.txt"
);
new LocalFlowConnector().connect(flowDef).complete();

Users by countries
users = new GroupBy(users,new Fields("country"));
users = new Every(users,new Count(new Fields("count")));
Tap usersOut = new FileTap(
new TextDelimited(new Fields("country","count"), false, "t"),"/out.txt"
);
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
FR 1
RU 1
GB 2
US 3

Usage by countries?
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
jason login
mike newcontract
cynthia login
anna logout
jason newcontract
jason logout
...
logs users

Join logs and users
Fields logsFields = new Fields("username","action");
Pipe logs = new Pipe("logs");
Pipe logsUsers = new CoGroup(
logs,new Fields("username"),
users,new Fields("name")
);

Join logs and users
Pipe logsUsers = new CoGroup(
logs,new Fields("username"),
users,new Fields("name")
);
jason login
mike newcontract
cynthia login
anna logout
jason newcontract
jason logout
...
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
anna RU F logout
cynthia US F login
jason US M login
jason US M newcontract
jason US M logout
mike US M newcontract

Usage by country
logsUsers = new GroupBy(logsUsers,new Fields("country"));
logsUsers = new Every(logsUsers,new Count(new Fields("count")));

Usage by countries
Tap usersIn = new FileTap(new TextDelimited(usersFields,false,"t"),"users.txt");
Tap logsIn = new FileTap(new TextDelimited(logsFields,false,"t"),"logs.txt");
Tap usageOut = new FileTap(
new TextDelimited(new Fields("country","count"), false, "t"),
"usage.txt"
);
.addSource(logs,logsIn)
.addTailSink(logsUsers, usageOut);
RU 1
US 5

Repartition join
M
M
M
R
R
jdoe,US
pmartin,FR
jdoe,/products
pmartin,/checkout
jdoe,/account
jdoe,US
jdoe,/products
jdoe,/account
jdoe,/products
jdoe,US
jdoe,/account
jdoe,/products,US
jdoe,/account,US
in-memory
cartesian product

Repartition join optimization
M
M
M
R
R
jdoe,US
pmartin,FR
jdoe,/products
pmartin,/checkout
jdoe,/account
jdoe,US
jdoe,/products
jdoe,/account
jdoe,US
jdoe,/products
jdoe,/account
jdoe,/products,US
jdoe,/account,US
only “users” in memory
(thanks to dataset indicator sorting,
i.e. “secondary sort”)

Optimization in Cascading CoGroup
“During co-grouping, for any given unique grouping key, all of the rightmost
pipes will accumulate the current grouping values into memory so they
may be iterated across for every value in the left hand side pipe.
(...)
There is no accumulation for the left hand side pipe, only for those to the
"right".
Thus, for the pipe that has the largest number of values per unique key
grouping, on average, it should be made the "left hand side" pipe (lhs).”

Replicated/asymmetrical join
M
M
M
jdoe,US
pmartin,FR
jdoe,/products
pmartin,/checkout
jdoe,/account
jdoe,/products,US
jdoe,US
pmartin,FR
jdoe,US
pmartin,FR
jdoe,/account,US
pmartin,/checkout,FR
Loaded in distributed cache
(hence “replicated”)

Function
users = new Each(
users,
new Fields("country"), // argument
new CountryFullnameFunction(new Fields("countryFullname")), // function output
new Fields("name","countryFullname","gender") // what we keep
);
jason United States M
arnaud France M
cynthia United States F
mike United States M
paul United Kingdom M
anna Russia F
clare United Kingdom F

Function (naive) implementation
public static class CountryFullnameFunction extends BaseOperation implements Function {
public CountryFullnameFunction(Fields fields) {
super(fields);
}
@Override
public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
String country = functionCall.getArguments().getString(0);
Locale locale = new Locale("",country);
Tuple tuple = new Tuple();
tuple.add(locale.getDisplayCountry(Locale.ENGLISH));
functionCall.getOutputCollector().add(tuple);
}
}

Functions
super(fields);
}
@Override
// this is executed remotely
// tips: initialize (small) caches, re-use objects, etc.
// functions have callbacks for this
}
}

Re-using objects in a function
Tuple tuple = new Tuple();
super(fields);
}
@Override
String country = functionCall.getArguments().getString(0);
Locale locale = new Locale("",country);
tuple.clear();
tuple.add(locale.getDisplayCountry(Locale.ENGLISH));
functionCall.getOutputCollector().add(tuple);
}
}

Using Avro with Cascading
// Avro is splittable, supports compression,
// and has schemas
Schema schema = new Schema.Parser().parse(schemaAsJson);
AvroScheme avroScheme = new AvroScheme(schema);
Tap tap = new Hfs(avroScheme,"/out");

Using Parquet files
// Parquet is column-oriented
// it supports splits and compression
MessageType type = ... // ~ the schema
Scheme parquetScheme = new ParquetTupleScheme(
fields, // fields to read
fields, // fields to write
type.toString()
);
Tap tap = new Hfs(
parquetScheme,
"/out"
);

Other dialects
Cascalog (Clojure)
Scalding (Scala)
...

Testing with plunger
Data corpus = new DataBuilder(usersFields)
.addTuple("jason","US","M")
(...)
.addTuple("cynthia", "US", "F")
.build();
Pipe users = plunger.newNamedPipe("users", corpus);
users = new GroupBy(users,new Fields("country"));
users = new Every(users,new Count(new Fields("count")));
Plunger plunger = new Plunger();
Bucket bucket = plunger.newBucket(new Fields("country", "count"), users);
Assert.assertEquals(bucket.result().asTupleList().size(),4);

Flow visualization
Flow flow = new LocalFlowConnector().connect(flowDef);
flow.writeDOT("cascading-flow.dot");
digraph G {
1 [label = "Every('users')[Count[decl:[{1}:'count']]]"];
2 [label = "FileTap['TextDelimited[['country', 'count']]']['/tmp/junit1462026100615315705/junit2286442878134169792.tmp']"];
3 [label = "GroupBy('users')[by:['country']]"];
4 [label = "FileTap['TextDelimited[['name', 'country', 'gender']]']['/home/acogoluegnes/prog/hadoop-dev/.
/src/test/resources/cascading/users.txt']"];
5 [label = "[head]n2.6.2nlocal:2.6.2:Concurrent, Inc."];
6 [label = "[tail]"];
1 -> 2 [label = "[{2}:'country', 'count']n[{3}:'name', 'country', 'gender']"];
3 -> 1 [label = "users[{1}:'country']n[{3}:'name', 'country', 'gender']"];
5 -> 4 [label = ""];
2 -> 6 [label = "[{2}:'country', 'count']n[{2}:'country', 'count']"];
4 -> 3 [label = "[{3}:'name', 'country', 'gender']n[{3}:'name', 'country', 'gender']"];
}

Typical processing
Receiving data (bulk or streams)
Processing in batch mode
Feed to real-time systems (RDBMs, NoSQL)

Use cases
Parsing, processing, aggregating data
“Diff-ing” 2 datasets
Joining data

Join generated and reference data
Hadoop
Processing
(join, transformation)
Generated data
Reporting
Reference data

Data handling
Raw data Parsed data
Processing and
insertion
Archives View on data Transformations
Avro, GZIP
Keep it for forever
Parquet, Snappy
Keep 2 years of data
Processing (Cascading)
HDFS Real time DB

Flow handling with Spring Batch
Archiving
Processing Processing Processing
Cleaning
Java, API HDFS
Cascading
MapReduce

Lambda architecture wish list
● Fault-tolerant
● Low latency
● Scalable
● General
● Extensible
● Ad hoc queries
● Minimal maintenance
● Debuggable

Layers
Speed layer
Serving layer
Batch layer

Batch layer
Speed layer
Serving layer
Batch layer
Dataset storage.
Views computation.

Serving layer
Speed layer
Serving layer
Batch layer
Random access to batch views.

Speed layer
Speed layer
Serving layer
Batch layer
Low latency access.

Batch layer
Speed layer
Serving layer
Batch layer
Hadoop (MapReduce, HDFS).
Thrift, Cascalog (i.e. Cascading).

Serving layer
Speed layer
Serving layer
Batch layer
ElephantDB, BerkeleyDB.

Speed layer
Speed layer
Serving layer
Batch layer
Cassandra, Storm, Kafka.

Hive, Pig, Cascading
UDF : User Defined Function
Hive
+
SQL (non-standard)
Low learning curve
Extensible with UDF
-
So-so testability
So-so reusability
No flow control
Spread logic (script, java, shell)
Programming with UDF
Pig
+
Pig Latin
Low learning curve
Extensible with UDF
-
So-so testability
So-so reusability
Spread logic (script, java, shell)
Programming with UDF
Cascading
+
API Java
Unit testable
Flow control (if, try/catch, etc)
Good re-usability
-
Programming needed

SQL on Cascading: Lingual
Pure Cascading underneath
ANSI/ISO standard SQL-99
JDBC Driver
Query any system…
… with an available Cascading Tap

Management & monitoring: Driven
Commercial
Analyze Cascading flows
SaaS and on-site deployment
Image: http://cascading.io/driven/

Image: http://cascading.io/driven/

Future: Cascading 3.0
Major rewriting
Better extensibility
MapReduce planner optimization
Tez and Storm support

Cascading at the Lyon Hadoop User Group

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Cascading at the Lyon Hadoop User Group

Similar to Cascading at the Lyon Hadoop User Group (20)

More from acogoluegnes

More from acogoluegnes (7)

Recently uploaded

Recently uploaded (20)

Cascading at the Lyon Hadoop User Group