Cascading
Hadoop User Group Lyon
2015-02-06
Arnaud Cogoluègnes - Zenika
Content
Cascading: what, why, how?
Hadoop basics along the way
No prerequisites needed to follow
Cascading, what is it?
Java framework
Apache License, Version 2.0
To build data-oriented applications
e.g. ETL-like applications
Cascading key features
Java API
Mature (has run on MapReduce for years)
Testability
Re-usability
Built-in features (filter, join, aggregator, etc)
Cascading simple flow
Fields usersFields = new Fields(
"name","country","gender"
);
Pipe users = new Pipe("users");
users = new Unique(users,new Fields("name"));
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
input file
Connecting flow to source and sink
Fields usersFields = new Fields("name","country","gender");
Pipe users = new Pipe("users");
users = new Unique(users,new Fields("name"));
Tap usersIn = ... // file’s location and structure abstraction
Tap usersOut = ...
FlowDef flowDef = FlowDef.flowDef()
.addSource(users, usersIn)
.addTailSink(users, usersOut);
Taps and schemes
Fields usersFields = new Fields("name","country","gender");
Pipe users = new Pipe("users");
users = new Unique(users,new Fields("name"));
Tap usersIn = new Hfs(
new TextDelimited(usersFields,false,"\t"), // structure
"/in" // location
);
Tap usersOut = new Hfs(
new TextDelimited(usersFields, false, "\t"),"/out"
);
Executing a MapReduce flow
FlowDef flowDef = FlowDef.flowDef()
.addSource(users, usersIn)
.addTailSink(users, usersOut);
new Hadoop2MR1FlowConnector().connect(flowDef).complete();
My first MapReduce flow
Fields usersFields = new Fields("name","country","gender");
Pipe users = new Pipe("users");
users = new Unique(users,new Fields("name"));
Tap usersIn = new Hfs(...);
Tap usersOut = new Hfs(...);
FlowDef flowDef = FlowDef.flowDef()
.addSource(users, usersIn)
.addTailSink(users, usersOut);
new Hadoop2MR1FlowConnector().connect(flowDef).complete();
Changing the output
Fields usersFields = new Fields("name","country","gender");
Pipe users = new Pipe("users");
users = new Unique(users,new Fields("name"));
Tap usersIn = new Hfs(...);
Tap usersOut = new Hfs( new SequenceFile(usersFields),"/out");
FlowDef flowDef = FlowDef.flowDef()
.addSource(users, usersIn)
.addTailSink(users, usersOut);
new Hadoop2MR1FlowConnector().connect(flowDef).complete();
Hadoop 2
HDFS
YARN
MapReduce
Your
app
Blocks, datanodes, namenode
file.csv B1 B2 B3 file is made of 3 blocks (default block size is 128 MB)
B1 B2 B1 B3
B1 B2 B2 B3
DN 1 DN 2
DN 3 DN 4
datanodes store file blocks
(here block 3 is under-replicated)
B1 : 1, 2, 3 B2 : 1, 3, 4
B3 : 2, 4
Namenode
namenode handles file metadata and enforces
replication
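The namenode bookkeeping drawn on this slide can be sketched in plain Java. This is an illustration only, not the HDFS API: the class `ReplicationCheck` and its methods are hypothetical names, and a target replication factor of 3 is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not the HDFS namenode API.
class ReplicationCheck {

    // Returns the blocks stored on fewer datanodes than the target replication factor.
    static List<String> underReplicated(Map<String, List<Integer>> blockLocations, int replication) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : blockLocations.entrySet()) {
            if (entry.getValue().size() < replication) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Block-to-datanode metadata as drawn on the slide.
    static Map<String, List<Integer>> slideMetadata() {
        Map<String, List<Integer>> locations = new LinkedHashMap<>();
        locations.put("B1", Arrays.asList(1, 2, 3));
        locations.put("B2", Arrays.asList(1, 3, 4));
        locations.put("B3", Arrays.asList(2, 4));
        return locations;
    }

    public static void main(String[] args) {
        // Assumed replication factor of 3
        System.out.println(underReplicated(slideMetadata(), 3)); // prints [B3]
    }
}
```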
MapReduce
file.csv B1 B2 B3
Mapper
Mapper
Mapper
B1
B2
B3
Reducer
Reducer
k1,v1
k1,v2
k1 [v1,v2]
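The three phases on this slide can be imitated in memory with plain Java. This is a sketch under assumptions, not the Hadoop API: `MiniMapReduce` is a hypothetical class, the input is the tab-separated users file from the earlier slides split into three arbitrary blocks, and the job counts users per country.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of map / shuffle / reduce; not the Hadoop API.
class MiniMapReduce {

    // Map phase: one call per block, each line yields the pair (country, name).
    static List<String[]> map(List<String> block) {
        List<String[]> pairs = new ArrayList<>();
        for (String line : block) {
            String[] columns = line.split("\t");
            pairs.add(new String[] {columns[1], columns[0]});
        }
        return pairs;
    }

    // Shuffle: gathers the values of all mappers by key, i.e. k1 -> [v1, v2].
    static Map<String, List<String>> shuffle(List<List<String[]>> mapOutputs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (List<String[]> output : mapOutputs) {
            for (String[] pair : output) {
                grouped.computeIfAbsent(pair[0], key -> new ArrayList<>()).add(pair[1]);
            }
        }
        return grouped;
    }

    // Reduce phase: one call per key, here a simple count of the grouped values.
    static int reduce(String key, List<String> values) {
        return values.size();
    }

    static Map<String, Integer> run(List<List<String>> blocks) {
        List<List<String[]>> mapOutputs = new ArrayList<>();
        for (List<String> block : blocks) {
            mapOutputs.add(map(block));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : shuffle(mapOutputs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    // The users file, split into three blocks chosen arbitrarily for the example.
    static List<List<String>> sampleBlocks() {
        return Arrays.asList(
            Arrays.asList("jason\tUS\tM", "arnaud\tFR\tM", "cynthia\tUS\tF"),
            Arrays.asList("mike\tUS\tM", "paul\tGB\tM"),
            Arrays.asList("anna\tRU\tF", "clare\tGB\tF"));
    }

    public static void main(String[] args) {
        System.out.println(run(sampleBlocks())); // prints {FR=1, GB=2, RU=1, US=3}
    }
}
```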
Code goes to data
file.csv B1 B2 B3
Mapper
Mapper
Mapper
B1
B2
B3
Reducer
Reducer
k1,v1
k1,v2
k1 [v1,v2]
B1 B2 B1 B3
B1 B2 B2 B3
DN 1 DN 2
DN 3 DN 4
DN 1
DN 3
DN 4
Local MapReduce in a test
Not bad
Local connector
Better
Local connector for testing
Fields usersFields = new Fields("name","country","gender");
Pipe users = new Pipe("users");
users = new Unique(users,new Fields("name"));
Tap usersIn = new FileTap(new TextDelimited(usersFields,false,"\t"),"in.txt");
Tap usersOut = new FileTap(
new TextDelimited(usersFields, false, "\t"), "out.txt"
);
FlowDef flowDef = FlowDef.flowDef()
.addSource(users, usersIn)
.addTailSink(users, usersOut);
new LocalFlowConnector().connect(flowDef).complete();
Users by country
Fields usersFields = new Fields("name","country","gender");
Pipe users = new Pipe("users");
users = new GroupBy(users,new Fields("country"));
users = new Every(users,new Count(new Fields("count")));
Tap usersOut = new FileTap(
new TextDelimited(new Fields("country","count"), false, "\t"),"/out.txt"
);
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
FR 1
RU 1
GB 2
US 3
Usage by country?
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
jason login
mike newcontract
cynthia login
anna logout
jason newcontract
jason logout
...
logs users
Join logs and users
Fields usersFields = new Fields("name","country","gender");
Fields logsFields = new Fields("username","action");
Pipe users = new Pipe("users");
Pipe logs = new Pipe("logs");
Pipe logsUsers = new CoGroup(
logs,new Fields("username"),
users,new Fields("name")
);
Join logs and users
Pipe logsUsers = new CoGroup(
logs,new Fields("username"),
users,new Fields("name")
);
jason login
mike newcontract
cynthia login
anna logout
jason newcontract
jason logout
...
jason US M
arnaud FR M
cynthia US F
mike US M
paul GB M
anna RU F
clare GB F
anna RU F logout
cynthia US F login
jason US M login
jason US M newcontract
jason US M logout
mike US M newcontract
Usage by country
logsUsers = new GroupBy(logsUsers,new Fields("country"));
logsUsers = new Every(logsUsers,new Count(new Fields("count")));
Usage by country
Tap usersIn = new FileTap(new TextDelimited(usersFields,false,"\t"),"users.txt");
Tap logsIn = new FileTap(new TextDelimited(logsFields,false,"\t"),"logs.txt");
Tap usageOut = new FileTap(
new TextDelimited(new Fields("country","count"), false, "\t"),
"usage.txt"
);
FlowDef flowDef = FlowDef.flowDef()
.addSource(users, usersIn)
.addSource(logs,logsIn)
.addTailSink(logsUsers, usageOut);
RU 1
US 5
Repartition join
M
M
M
R
R
jdoe,US
pmartin,FR
jdoe,/products
pmartin,/checkout
jdoe,/account
jdoe,US
jdoe,/products
jdoe,/account
jdoe,/products
jdoe,US
jdoe,/account
jdoe,/products,US
jdoe,/account,US
in-memory
cartesian product
Repartition join optimization
M
M
M
R
R
jdoe,US
pmartin,FR
jdoe,/products
pmartin,/checkout
jdoe,/account
jdoe,US
jdoe,/products
jdoe,/account
jdoe,US
jdoe,/products
jdoe,/account
jdoe,/products,US
jdoe,/account,US
only “users” in memory
(thanks to dataset indicator sorting,
i.e. “secondary sort”)
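The optimization on this slide can be sketched in plain Java. This is an in-memory illustration, not Hadoop or Cascading code: `TaggedJoin`, `Tagged` and the two dataset constants are hypothetical names. Each record carries a dataset indicator, records are sorted by key and then by indicator, so for a given key the "users" records arrive first and are the only ones buffered while the logs are streamed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// In-memory illustration of a reduce-side join with a dataset indicator.
class TaggedJoin {

    static final int USERS = 0; // sorts first, so users reach the reducer before logs
    static final int LOGS = 1;

    // A record as it would reach the reducer: join key, dataset indicator, payload.
    static final class Tagged {
        final String key;
        final int dataset;
        final String value;

        Tagged(String key, int dataset, String value) {
            this.key = key;
            this.dataset = dataset;
            this.value = value;
        }
    }

    static List<String> join(List<Tagged> records) {
        List<Tagged> sorted = new ArrayList<>(records);
        // "Secondary sort": by join key, then by dataset indicator (stable sort)
        sorted.sort(Comparator.comparing((Tagged t) -> t.key).thenComparingInt(t -> t.dataset));

        List<String> joined = new ArrayList<>();
        List<String> bufferedUsers = new ArrayList<>(); // only the "users" side is kept in memory
        String currentKey = null;
        for (Tagged record : sorted) {
            if (!record.key.equals(currentKey)) {
                currentKey = record.key;
                bufferedUsers.clear();
            }
            if (record.dataset == USERS) {
                bufferedUsers.add(record.value);
            } else {
                // Log records are streamed and combined with the buffered users
                for (String user : bufferedUsers) {
                    joined.add(record.key + "," + record.value + "," + user);
                }
            }
        }
        return joined;
    }

    // Data from the slide, deliberately given in mixed order.
    static List<Tagged> sample() {
        return Arrays.asList(
            new Tagged("jdoe", LOGS, "/products"),
            new Tagged("jdoe", USERS, "US"),
            new Tagged("pmartin", LOGS, "/checkout"),
            new Tagged("pmartin", USERS, "FR"),
            new Tagged("jdoe", LOGS, "/account"));
    }

    public static void main(String[] args) {
        // prints [jdoe,/products,US, jdoe,/account,US, pmartin,/checkout,FR]
        System.out.println(join(sample()));
    }
}
```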
Optimization in Cascading CoGroup
“During co-grouping, for any given unique grouping key, all of the rightmost
pipes will accumulate the current grouping values into memory so they
may be iterated across for every value in the left hand side pipe.
(...)
There is no accumulation for the left hand side pipe, only for those to the
"right".
Thus, for the pipe that has the largest number of values per unique key
grouping, on average, it should be made the "left hand side" pipe (lhs).”
Replicated/asymmetrical join
M
M
M
jdoe,US
pmartin,FR
jdoe,/products
pmartin,/checkout
jdoe,/account
jdoe,/products,US
jdoe,US
pmartin,FR
jdoe,US
pmartin,FR
jdoe,/account,US
pmartin,/checkout,FR
Loaded in distributed cache
(hence “replicated”)
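What each mapper does in a replicated join can be sketched in plain Java. This is an illustration only, not Hadoop or Cascading code: `ReplicatedJoin` and `mapSideJoin` are hypothetical names. The small dataset (users) is assumed to fit in memory as a hash map, the large one (logs) is streamed, and no reducer is needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of a map-side (replicated) join.
class ReplicatedJoin {

    // The small dataset is fully loaded in memory; log lines are processed one by one.
    static List<String> mapSideJoin(Map<String, String> countryByUser, List<String> logLines) {
        List<String> joined = new ArrayList<>();
        for (String line : logLines) {
            String user = line.substring(0, line.indexOf(','));
            String country = countryByUser.get(user);
            if (country != null) { // inner join: log lines without a known user are dropped
                joined.add(line + "," + country);
            }
        }
        return joined;
    }

    // The "users" dataset from the slide, as each mapper would hold it.
    static Map<String, String> sampleUsers() {
        Map<String, String> users = new HashMap<>();
        users.put("jdoe", "US");
        users.put("pmartin", "FR");
        return users;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList("jdoe,/products", "pmartin,/checkout", "jdoe,/account");
        // prints [jdoe,/products,US, pmartin,/checkout,FR, jdoe,/account,US]
        System.out.println(mapSideJoin(sampleUsers(), logs));
    }
}
```

The trade-off is memory: this only works while the replicated side stays small enough to be held by every mapper.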
Function
users = new Each(
users,
new Fields("country"), // argument
new CountryFullnameFunction(new Fields("countryFullname")), // function output
new Fields("name","countryFullname","gender") // what we keep
);
jason United States M
arnaud France M
cynthia United States F
mike United States M
paul United Kingdom M
anna Russia F
clare United Kingdom F
Function (naive) implementation
public static class CountryFullnameFunction extends BaseOperation implements Function {
public CountryFullnameFunction(Fields fields) {
super(fields);
}
@Override
public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
String country = functionCall.getArguments().getString(0);
Locale locale = new Locale("",country);
Tuple tuple = new Tuple();
tuple.add(locale.getDisplayCountry(Locale.ENGLISH));
functionCall.getOutputCollector().add(tuple);
}
}
Functions
public static class CountryFullnameFunction extends BaseOperation implements Function {
public CountryFullnameFunction(Fields fields) {
super(fields);
}
@Override
public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
// this is executed remotely
// tips: initialize (small) caches, re-use objects, etc.
// functions have callbacks for this
}
}
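The "small cache" tip can be sketched in plain Java, outside of Cascading. `CountryNameCache` is a hypothetical helper, not part of the talk or of the Cascading API: it resolves each distinct country code once and serves later calls from a map. In a real function, such a cache would presumably be created in one of the callbacks mentioned above rather than per tuple.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: caches the English display name of each country code.
class CountryNameCache {

    private final Map<String, String> namesByCode = new HashMap<>();
    private int lookups = 0; // number of times a Locale was actually built

    String fullName(String countryCode) {
        return namesByCode.computeIfAbsent(countryCode, code -> {
            lookups++;
            // Same call as in the naive function implementation
            return new Locale("", code).getDisplayCountry(Locale.ENGLISH);
        });
    }

    int lookups() {
        return lookups;
    }

    public static void main(String[] args) {
        CountryNameCache cache = new CountryNameCache();
        for (String code : new String[] {"US", "FR", "US", "US", "GB"}) {
            System.out.println(code + " -> " + cache.fullName(code));
        }
        System.out.println("Locale lookups: " + cache.lookups()); // 3 lookups for 5 calls
    }
}
```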
Re-using objects in a function
public static class CountryFullnameFunction extends BaseOperation implements Function {
Tuple tuple = new Tuple();
public CountryFullnameFunction(Fields fields) {
super(fields);
}
@Override
public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
String country = functionCall.getArguments().getString(0);
Locale locale = new Locale("",country);
tuple.clear();
tuple.add(locale.getDisplayCountry(Locale.ENGLISH));
functionCall.getOutputCollector().add(tuple);
}
}
Using Avro with Cascading
// Avro is splittable, supports compression,
// and has schemas
Schema schema = new Schema.Parser().parse(schemaAsJson);
AvroScheme avroScheme = new AvroScheme(schema);
Tap tap = new Hfs(avroScheme,"/out");
Using Parquet files
// Parquet is column-oriented
// it supports splits and compression
MessageType type = ... // ~ the schema
Scheme parquetScheme = new ParquetTupleScheme(
fields, // fields to read
fields, // fields to write
type.toString()
);
Tap tap = new Hfs(
parquetScheme,
"/out"
);
Other dialects
Cascalog (Clojure)
Scalding (Scala)
...
Testing with plunger
Fields usersFields = new Fields("name","country","gender");
Data corpus = new DataBuilder(usersFields)
.addTuple("jason","US","M")
(...)
.addTuple("cynthia", "US", "F")
.build();
Plunger plunger = new Plunger();
Pipe users = plunger.newNamedPipe("users", corpus);
users = new GroupBy(users,new Fields("country"));
users = new Every(users,new Count(new Fields("count")));
Bucket bucket = plunger.newBucket(new Fields("country", "count"), users);
Assert.assertEquals(4, bucket.result().asTupleList().size());
Flow visualization
Flow flow = new LocalFlowConnector().connect(flowDef);
flow.writeDOT("cascading-flow.dot");
digraph G {
1 [label = "Every('users')[Count[decl:[{1}:'count']]]"];
2 [label = "FileTap['TextDelimited[['country', 'count']]']['/tmp/junit1462026100615315705/junit2286442878134169792.tmp']"];
3 [label = "GroupBy('users')[by:['country']]"];
4 [label = "FileTap['TextDelimited[['name', 'country', 'gender']]']['/home/acogoluegnes/prog/hadoop-dev/./src/test/resources/cascading/users.txt']"];
5 [label = "[head]\n2.6.2\nlocal:2.6.2:Concurrent, Inc."];
6 [label = "[tail]"];
1 -> 2 [label = "[{2}:'country', 'count']\n[{3}:'name', 'country', 'gender']"];
3 -> 1 [label = "users[{1}:'country']\n[{3}:'name', 'country', 'gender']"];
5 -> 4 [label = ""];
2 -> 6 [label = "[{2}:'country', 'count']\n[{2}:'country', 'count']"];
4 -> 3 [label = "[{3}:'name', 'country', 'gender']\n[{3}:'name', 'country', 'gender']"];
}
Typical processing
Receiving data (bulk or streams)
Processing in batch mode
Feed to real-time systems (RDBMS, NoSQL)
Use cases
Parsing, processing, aggregating data
“Diff-ing” 2 datasets
Joining data
Join generated and reference data
Hadoop
Processing
(join, transformation)
Generated data
Reporting
Reference data
Data handling
Raw data Parsed data
Processing and
insertion
Archives View on data Transformations
Avro, GZIP
Keep it forever
Parquet, Snappy
Keep 2 years of data
Processing (Cascading)
HDFS Real time DB
Flow handling with Spring Batch
Archiving
Processing Processing Processing
Cleaning
Java, API HDFS
Cascading
MapReduce
Lambda architecture
Lambda architecture wish list
● Fault-tolerant
● Low latency
● Scalable
● General
● Extensible
● Ad hoc queries
● Minimal maintenance
● Debuggable
Layers
Speed layer
Serving layer
Batch layer
Batch layer
Speed layer
Serving layer
Batch layer
Dataset storage.
Views computation.
Serving layer
Speed layer
Serving layer
Batch layer
Random access to batch views.
Speed layer
Speed layer
Serving layer
Batch layer
Low latency access.
Batch layer
Speed layer
Serving layer
Batch layer
Hadoop (MapReduce, HDFS).
Thrift, Cascalog (i.e. Cascading).
Serving layer
Speed layer
Serving layer
Batch layer
ElephantDB, BerkeleyDB.
Speed layer
Speed layer
Serving layer
Batch layer
Cassandra, Storm, Kafka.
Hive, Pig, Cascading
UDF : User Defined Function
Hive
+
SQL (non-standard)
Low learning curve
Extensible with UDF
-
So-so testability
So-so reusability
No flow control
Spread logic (script, java, shell)
Programming with UDF
Pig
+
Pig Latin
Low learning curve
Extensible with UDF
-
So-so testability
So-so reusability
Spread logic (script, java, shell)
Programming with UDF
Cascading
+
Java API
Unit testable
Flow control (if, try/catch, etc)
Good re-usability
-
Programming needed
SQL on Cascading: Lingual
Pure Cascading underneath
ANSI/ISO standard SQL-99
JDBC Driver
Query any system…
… with an available Cascading Tap
Management & monitoring: Driven
Commercial
Analyze Cascading flows
SaaS and on-site deployment
Image: http://cascading.io/driven/
Future: Cascading 3.0
Major rewrite
Better extensibility
MapReduce planner optimization
Tez and Storm support
Thank you!

SQL Database Design For Developers at php[tek] 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Cascading at the Lyon Hadoop User Group

  • 1. Cascading Hadoop User Group Lyon 2015-02-06 Arnaud Cogoluègnes - Zenika
  • 2. Content Cascading: what, why, how? Hadoop basics along the way No pre-requisites to follow
  • 3. Cascading, what is it? Java framework Apache License, Version 2.0 To build data-oriented applications e.g. ETL-like applications
  • 4. Cascading key features Java API Mature (runs on MapReduce for years) Testability Re-usability Built-in features (filter, join, aggregator, etc)
  • 5. Cascading simple flow Fields usersFields = new Fields( "name","country","gender" ); Pipe users = new Pipe("users"); users = new Unique(users,new Fields("name")); jason US M arnaud FR M cynthia US F mike US M paul GB M anna RU F clare GB F input file
  • 6. Connecting flow to source and sink Fields usersFields = new Fields("name","country","gender"); Pipe users = new Pipe("users"); users = new Unique(users,new Fields("name")); Tap usersIn = ... // file’s location and structure abstraction Tap usersOut = ... FlowDef flowDef = FlowDef.flowDef() .addSource(users, usersIn) .addTailSink(users, usersOut);
  • 7. Taps and schemes Fields usersFields = new Fields("name","country","gender"); Pipe users = new Pipe("users"); users = new Unique(users,new Fields("name")); Tap usersIn = new Hfs( new TextDelimited(usersFields,false,"\t"), // structure "/in" // location ); Tap usersOut = new Hfs( new TextDelimited(usersFields, false, "\t"),"/out" );
  • 8. Executing a MapReduce flow FlowDef flowDef = FlowDef.flowDef() .addSource(users, usersIn) .addTailSink(users, usersOut); new Hadoop2MR1FlowConnector().connect(flowDef).complete();
  • 9. My first MapReduce flow Fields usersFields = new Fields("name","country","gender"); Pipe users = new Pipe("users"); users = new Unique(users,new Fields("name")); Tap usersIn = new Hfs(...); Tap usersOut = new Hfs(...); FlowDef flowDef = FlowDef.flowDef() .addSource(users, usersIn) .addTailSink(users, usersOut); new Hadoop2MR1FlowConnector().connect(flowDef).complete();
  • 10. Changing the output Fields usersFields = new Fields("name","country","gender"); Pipe users = new Pipe("users"); users = new Unique(users,new Fields("name")); Tap usersIn = new Hfs(...); Tap usersOut = new Hfs( new SequenceFile(usersFields),"/out"); FlowDef flowDef = FlowDef.flowDef() .addSource(users, usersIn) .addTailSink(users, usersOut); new Hadoop2MR1FlowConnector().connect(flowDef).complete();
  • 12. Blocks, datanodes, namenode. file.csv is made of 3 blocks: B1 B2 B3 (default block size is 128 MB). Datanodes store file blocks (here block 3 is under-replicated): DN 1: B1 B2; DN 2: B1 B3; DN 3: B1 B2; DN 4: B2 B3. The namenode handles file metadata and enforces replication: B1: 1, 2, 3; B2: 1, 3, 4; B3: 2, 4.
  • 13. MapReduce file.csv B1 B2 B3 Mapper Mapper Mapper B1 B2 B3 Reducer Reducer k1,v1 k1,v2 k1 [v1,v2]
  • 14. Code goes to data file.csv B1 B2 B3 Mapper Mapper Mapper B1 B2 B3 Reducer Reducer k1,v1 k1,v2 k1 [v1,v2] B1 B2 B1 B3 B1 B2 B2 B3 DN 1 DN 2 DN 3 DN 4 DN 1 DN 3 DN 4
  • 15. Local MapReduce in a test Not bad
  • 17. Local connector for testing Fields usersFields = new Fields("name","country","gender"); Pipe users = new Pipe("users"); users = new Unique(users,new Fields("name")); Tap usersIn = new FileTap(new TextDelimited(usersFields,false,"\t"),"in.txt"); Tap usersOut = new FileTap( new TextDelimited(usersFields, false, "\t"), "out.txt" ); FlowDef flowDef = FlowDef.flowDef() .addSource(users, usersIn) .addTailSink(users, usersOut); new LocalFlowConnector().connect(flowDef).complete();
  • 18. Users by countries Fields usersFields = new Fields("name","country","gender"); Pipe users = new Pipe("users"); users = new GroupBy(users,new Fields("country")); users = new Every(users,new Count(new Fields("count"))); Tap usersOut = new FileTap( new TextDelimited(new Fields("country","count"), false, "\t"),"/out.txt" ); jason US M arnaud FR M cynthia US F mike US M paul GB M anna RU F clare GB F FR 1 RU 1 GB 2 US 3
  • 19. Usage by countries? jason US M arnaud FR M cynthia US F mike US M paul GB M anna RU F clare GB F jason login mike newcontract cynthia login anna logout jason newcontract jason logout ... logs users
  • 20. Join logs and users Fields usersFields = new Fields("name","country","gender"); Fields logsFields = new Fields("username","action"); Pipe users = new Pipe("users"); Pipe logs = new Pipe("logs"); Pipe logsUsers = new CoGroup( logs,new Fields("username"), users,new Fields("name") );
  • 21. Join logs and users Pipe logsUsers = new CoGroup( logs,new Fields("username"), users,new Fields("name") ); jason login mike newcontract cynthia login anna logout jason newcontract jason logout ... jason US M arnaud FR M cynthia US F mike US M paul GB M anna RU F clare GB F anna RU F logout cynthia US F login jason US M login jason US M newcontract jason US M logout mike US M newcontract
  • 22. Usage by country logsUsers = new GroupBy(logsUsers,new Fields("country")); logsUsers = new Every(logsUsers,new Count(new Fields("count")));
  • 23. Usage by countries Tap usersIn = new FileTap(new TextDelimited(usersFields,false,"\t"),"users.txt"); Tap logsIn = new FileTap(new TextDelimited(logsFields,false,"\t"),"logs.txt"); Tap usageOut = new FileTap( new TextDelimited(new Fields("country","count"), false, "\t"), "usage.txt" ); FlowDef flowDef = FlowDef.flowDef() .addSource(users, usersIn) .addSource(logs,logsIn) .addTailSink(logsUsers, usageOut); RU 1 US 5
  • 26. Optimization in Cascading CoGroup “During co-grouping, for any given unique grouping key, all of the rightmost pipes will accumulate the current grouping values into memory so they may be iterated across for every value in the left hand side pipe. (...) There is no accumulation for the left hand side pipe, only for those to the "right". Thus, for the pipe that has the largest number of values per unique key grouping, on average, it should be made the "left hand side" pipe (lhs).”
  • 28. Function users = new Each( users, new Fields("country"), // argument new CountryFullnameFunction(new Fields("countryFullname")), // function output new Fields("name","countryFullname","gender") // what we keep ); jason United States M arnaud France M cynthia United States F mike United States M paul United Kingdom M anna Russia F clare United Kingdom F
  • 29. Function (naive) implementation public static class CountryFullnameFunction extends BaseOperation implements Function { public CountryFullnameFunction(Fields fields) { super(fields); } @Override public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String country = functionCall.getArguments().getString(0); Locale locale = new Locale("",country); Tuple tuple = new Tuple(); tuple.add(locale.getDisplayCountry(Locale.ENGLISH)); functionCall.getOutputCollector().add(tuple); } }
  • 30. Functions public static class CountryFullnameFunction extends BaseOperation implements Function { public CountryFullnameFunction(Fields fields) { super(fields); } @Override public void operate(FlowProcess flowProcess, FunctionCall functionCall) { // this is executed remotely // tips: initialize (small) caches, re-use objects, etc. // functions have callbacks for this } }
  • 31. Re-using objects in a function public static class CountryFullnameFunction extends BaseOperation implements Function { Tuple tuple = new Tuple(); public CountryFullnameFunction(Fields fields) { super(fields); } @Override public void operate(FlowProcess flowProcess, FunctionCall functionCall) { String country = functionCall.getArguments().getString(0); Locale locale = new Locale("",country); tuple.clear(); tuple.add(locale.getDisplayCountry(Locale.ENGLISH)); functionCall.getOutputCollector().add(tuple); } }
  • 32. Using Avro with Cascading // Avro is splittable, supports compression, // and has schemas Schema schema = new Schema.Parser().parse(schemaAsJson); AvroScheme avroScheme = new AvroScheme(schema); Tap tap = new Hfs(avroScheme,"/out");
  • 33. Using Parquet files // Parquet is column-oriented // it supports splits and compression MessageType type = ... // ~ the schema Scheme parquetScheme = new ParquetTupleScheme( fields, // fields to read fields, // fields to write type.toString() ); Tap tap = new Hfs( parquetScheme, "/out" );
  • 35. Testing with plunger Fields usersFields = new Fields("name","country","gender"); Data corpus = new DataBuilder(usersFields) .addTuple("jason","US","M") (...) .addTuple("cynthia", "US", "F") .build(); Plunger plunger = new Plunger(); Pipe users = plunger.newNamedPipe("users", corpus); users = new GroupBy(users,new Fields("country")); users = new Every(users,new Count(new Fields("count"))); Bucket bucket = plunger.newBucket(new Fields("country", "count"), users); Assert.assertEquals(4, bucket.result().asTupleList().size());
  • 36. Flow visualization Flow flow = new LocalFlowConnector().connect(flowDef); flow.writeDOT("cascading-flow.dot"); digraph G { 1 [label = "Every('users')[Count[decl:[{1}:'count']]]"]; 2 [label = "FileTap['TextDelimited[['country', 'count']]']['/tmp/junit1462026100615315705/junit2286442878134169792.tmp']"]; 3 [label = "GroupBy('users')[by:['country']]"]; 4 [label = "FileTap['TextDelimited[['name', 'country', 'gender']]']['/home/acogoluegnes/prog/hadoop-dev/. /src/test/resources/cascading/users.txt']"]; 5 [label = "[head]\n2.6.2\nlocal:2.6.2:Concurrent, Inc."]; 6 [label = "[tail]"]; 1 -> 2 [label = "[{2}:'country', 'count']\n[{3}:'name', 'country', 'gender']"]; 3 -> 1 [label = "users[{1}:'country']\n[{3}:'name', 'country', 'gender']"]; 5 -> 4 [label = ""]; 2 -> 6 [label = "[{2}:'country', 'count']\n[{2}:'country', 'count']"]; 4 -> 3 [label = "[{3}:'name', 'country', 'gender']\n[{3}:'name', 'country', 'gender']"]; }
  • 38. Typical processing Receiving data (bulk or streams) Processing in batch mode Feed to real-time systems (RDBMS, NoSQL)
  • 39. Use cases Parsing, processing, aggregating data “Diff-ing” 2 datasets Joining data
  • 40. Join generated and reference data Hadoop Processing (join, transformation) Generated data Reporting Reference data
  • 41. Data handling Raw data Parsed data Processing and insertion Archives View on data Transformations Avro, GZIP Keep it forever Parquet, Snappy Keep 2 years of data Processing (Cascading) HDFS Real time DB
  • 42. Flow handling with Spring Batch Archiving Processing Processing Processing Cleaning Java, API HDFS Cascading MapReduce
  • 44. Lambda architecture wish list ● Fault-tolerant ● Low latency ● Scalable ● General ● Extensible ● Ad hoc queries ● Minimal maintenance ● Debuggable
  • 46. Batch layer Speed layer Serving layer Batch layer Dataset storage. Views computation.
  • 47. Serving layer Speed layer Serving layer Batch layer Random access to batch views.
  • 48. Speed layer Speed layer Serving layer Batch layer Low latency access.
  • 49. Batch layer Speed layer Serving layer Batch layer Hadoop (MapReduce, HDFS). Thrift, Cascalog (i.e. Cascading).
  • 50. Serving layer Speed layer Serving layer Batch layer ElephantDB, BerkeleyDB.
  • 51. Speed layer Speed layer Serving layer Batch layer Cassandra, Storm, Kafka.
  • 52. Hive, Pig, Cascading (UDF: User Defined Function)
      Hive. Pros: SQL (non-standard); low learning curve; extensible with UDF. Cons: so-so testability; so-so reusability; no flow control; spread logic (script, java, shell); programming with UDF.
      Pig. Pros: Pig Latin; low learning curve; extensible with UDF. Cons: so-so testability; so-so reusability; spread logic (script, java, shell); programming with UDF.
      Cascading. Pros: Java API; unit testable; flow control (if, try/catch, etc); good re-usability. Cons: programming needed.
  • 53. SQL on Cascading: Lingual Pure Cascading underneath ANSI/ISO standard SQL-99 JDBC Driver Query any system… … with an available Cascading Tap
  • 54. Management & monitoring: Driven Commercial Analyze Cascading flows SaaS and on-site deployment Image: http://cascading.io/driven/
  • 56. Future: Cascading 3.0 Major rewriting Better extensibility MapReduce planner optimization Tez and Storm support
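
As a companion to slides 19 to 23, here is a minimal plain-Java sketch of what the "usage by country" flow computes: join each log line to its user, then count the joined rows per country. It does not use Cascading at all; the class name UsageByCountrySketch and the method usageByCountry are hypothetical names chosen for illustration. It assumes an inner join (log lines without a matching user are dropped) and uses only the six log lines visible on the slides.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the join + count semantics (not the Cascading API).
public class UsageByCountrySketch {

    // name -> country, from the users sample shown on the slides
    static final Map<String, String> USERS = new LinkedHashMap<>();
    static {
        USERS.put("jason", "US");
        USERS.put("arnaud", "FR");
        USERS.put("cynthia", "US");
        USERS.put("mike", "US");
        USERS.put("paul", "GB");
        USERS.put("anna", "RU");
        USERS.put("clare", "GB");
    }

    // {username, action}, the six log lines visible on the slides
    static final List<String[]> LOGS = Arrays.asList(
        new String[] {"jason", "login"},
        new String[] {"mike", "newcontract"},
        new String[] {"cynthia", "login"},
        new String[] {"anna", "logout"},
        new String[] {"jason", "newcontract"},
        new String[] {"jason", "logout"}
    );

    // Joins logs to users on username = name, then counts joined rows per country.
    static Map<String, Long> usageByCountry(Map<String, String> users, List<String[]> logs) {
        Map<String, Long> counts = new TreeMap<>(); // sorted by country code
        for (String[] log : logs) {
            String country = users.get(log[0]);
            if (country == null) {
                continue; // assumption: inner join, unmatched log lines are dropped
            }
            counts.merge(country, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one tab-separated "country count" line per country
        usageByCountry(USERS, LOGS).forEach((c, n) -> System.out.println(c + "\t" + n));
    }
}
```

With this sample data the result is RU 1 and US 5, matching the output shown on slide 23. Countries whose users produced no log lines (FR, GB) do not appear, which is what an inner join implies.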