6. Data flows
Think of an ETL: Extract-Transform-Load
In simple terms, take data from a source, change it
somehow, and stick the result into something (a “sink”)
Data source -> Extract -> Transformation(s) -> Load -> Data sink
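As a toy illustration of that shape in plain Java (not Cascading; the file names and class name here are made up), an ETL is just extract, transform, load in sequence:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class ToyEtl {
    // Transform: here just uppercasing each line; in real flows this is
    // aggregation, filtering, joining, etc.
    static List<String> transform(List<String> lines) {
        return lines.stream()
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) throws Exception {
        // Extract: read everything from the source
        List<String> extracted = Files.readAllLines(Paths.get("source.txt"));
        // Load: write the transformed result into the sink
        Files.write(Paths.get("sink.txt"), transform(extracted));
    }
}
```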
7. Data flow implementation
Pretty much everything we do is some flavor of this
Sources: Games, Hadoop, Hive/MySQL, Couchbase,
web service
Transformations: Aggregations, group-bys, combined
fields, filtering, etc.
Sinks: Hadoop, Hive/MySQL, Couchbase
8. Cascading 101 (Part Deux)
JVM data flow framework
Models data flows as abstractions:
Separates details of where and how we get data
from what we do with it
Implements transform operations as SQL or
MapReduce or whatever
11. Cascading terminology
Flow: A path for data with some number of inputs,
some operations, and some outputs
Cascade: A series of connected flows
12. More terminology
Operation: A function applied to data, yielding new
data
Pipe: Moves data from someplace to some other place
Tap: Feeds data from outside the flow into it
and writes data from inside the flow out of it
13. Simplest possible flow
// create the source tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// create the sink tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// specify a pipe to connect the taps
Pipe copyPipe = new Pipe("copy");

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
.addSource(copyPipe, inTap)
.addTailSink(copyPipe, outTap);

// run the flow
flowConnector.connect(flowDef).complete();
15. Actually…
Runs entirely in the cluster
Works fine on megabytes, gigabytes, terabytes or
petabytes; i.e., IT SCALES
Completely testable outside of the cluster
Who gets shell access to a namenode to run the bash
or python equivalent?
16. Reliability is
ESSENTIAL
if we, and our system, are to
be taken srsly.
Reliability is a feature,
not a goal.
18. Real world use case:
Word counting
Read a simple file format
Count the occurrence of every word in the file
Output a list of all words and their counts
19. doc_id text
doc01 A rain shadow is a dry area on the lee back side
doc02 This sinking, dry air produces a rain shadow, or
doc03 A rain shadow is an area of dry land that lies on
doc04 This is known as the rain shadow effect and is the
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
Newline-delimited entries
ID and text fields, separated by tabs
Plan: Split each line into words, then count occurrences across all lines
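Before wiring this into Cascading, the plan can be sketched as regular Java (a standalone sketch using the same delimiter regex the flow will use; the class and method names are made up):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class WordCountSketch {
    // same delimiter set the Cascading splitter uses: space, brackets,
    // parentheses, comma, period
    static final Pattern DELIM = Pattern.compile("[ \\[\\]\\(\\),.]");

    static Map<String, Integer> count(List<String> texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            // split each line into tokens on the delimiters...
            for (String token : DELIM.split(text)) {
                // ...and tally each non-empty token across all lines
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The Cascading version on the next slides computes exactly this, but as a flow that runs on the cluster instead of in one JVM's memory.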
20. Flow I/O
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath);
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath);
No surprises here:
docTap reads a file from HDFS
wcTap will write the results to a different HDFS file
21. File parsing
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter =
new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
Fields are names for the tuple elements
RegexSplitGenerator splits the “text” field on the regex
(the delimiter characters) and emits one tuple per token
under the “token” field
docPipe applies the splitter to each input tuple and
outputs the resulting tokens
22. Count the tokens (words)
Pipe wcPipe = new Pipe("wc", docPipe);
wcPipe = new GroupBy(wcPipe, token);
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);
wcPipe connects to docPipe, using it for input
Attach a GroupBy pipe to wcPipe, grouping tuples by the
token field (the actual words)
For every group (each distinct word), count the
occurrences and output the result
23. Create and run the flow
FlowDef flowDef = FlowDef.flowDef()
.setName("wc")
.addSource(docPipe, docTap)
.addTailSink(wcPipe, wcTap);
Flow wcFlow = flowConnector.connect(flowDef);
wcFlow.complete();
Define a new flow with name “wc”
Feed the docTap (the original text file) into the
docPipe
Send the output of wcPipe (the word counts) into the
wcTap
Connect to the flowConnector (Hadoop) and go!
24. Cascading flow
100% Java
Databases and processing
are behind class
abstractions
Automatically scalable
Easily testable
26. Testing
Create flows entirely in code on a local machine
Write tests for controlled sample data sets
Run tests as regular old Java without needing access
to actual Hadoopery or databases
Local machine and CI testing are easy!
27. Reusability
Pipe assemblies are designed for reuse
Once created and tested, use them in other flows
Write logic to do something only once
This is *essential* for data integrity as well as
good programming
28. Common code base
The infrastructure team writes MR-style jobs in Cascading;
the warehouse team writes its data manipulations in Cascading
Everybody uses the same terms and same tech
Teams understand each other’s code
Can be modified by anyone, not just tool experts
29. Simpler stack
Cascading creates a DAG of dependent jobs for us
Removes most of the need for Oozie (ew)
Keeps track of where a flow fails and can rerun from
that point on failure
31. Some bad news
JVM, which means Java (or Scala (or CLOJURE :) :)
Argument: Java is the platform for big data, so we
can’t avoid embracing it.
PyCascading uses Jython, which kinda sucks
32. Some other bad news
Doesn’t have a job scheduler
Can figure out dependency graph for jobs, but
nothing to run them on a regular interval
We still need Jenkins or Quartz
Concurrent is doing proprietary products (read: $)
for this kind of thing, but they’re months away
33. Other bad news
No real built-in monitoring
Easy to have a flow report what it has done;
hard to watch it in progress
We’d have to roll our own (but we’d have to do that
anyway, so whatevs)
35. Yes, we should try it.
It’s not everything we need, but it’s a lot
Possibly replace MapReduce and Sqoop
Proven tech; this isn’t bleeding edge work
We need an ETL framework and we don’t have time
to write one from scratch.
36. Let’s prototype a couple of jobs and
see what people other than me think.