Chicago Hadoop Users Group: Enterprise Data Workflows
1. “Enterprise Data Workflows with Cascading”
Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid
zest.to/event63_77
2013-02-12 Copyright ©2013, Concurrent, Inc.
Tuesday, 12 February 13 1
You may not have heard about us much, but you use our API in lots of places:
your bank, your airline, your hospital, your mobile device, your social network, etc.
2. Unstructured Data meets Enterprise Scale
• an example considered
• system integration: tearing down silos
• code samples
• data science perspectives: how we got here
• the workflow abstraction: many aspects of an app
• developer, analyst, scientist
• summary, references
Background: I’m a data scientist, an engineering director,
spent the past decade building/leading Data teams which created large-scale apps.
This talk is about using Cascading and related DSLs to build Enterprise Data Workflows.
Our emphasis is on leveraging the workflow abstraction for system integration, for mitigating complexity, and for producing simple, robust apps at scale.
We’ll show a little something for the developers, the analysts, and the scientists in the room.
3. Enterprise Data Workflows
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
an example considered
Let’s consider the matter of handling Big Data
from the perspective of building and maintaining Enterprise apps…
4. Enterprise Data Workflows
an example…
[architecture diagram: Customers → Web App (logs) → Cache, Logs, Support; source, sink, and trap taps; Data Modeling (PMML) Workflow on a Hadoop Cluster; Analytics Cubes, Customer Prefs, customer profile DBs, Reporting]
Apache Hadoop rarely, if ever, gets used in isolation
5. Enterprise Data Workflows
an example… the front end
[architecture diagram, as on the previous slide, with the front end highlighted: Customers, Web App, Cache, Logs]
LOB use cases drive the demand for Big Data apps
6. Enterprise Data Workflows
an example… the back office
[architecture diagram, as on the previous slide, with the back office highlighted]
Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
7. Enterprise Data Workflows
an example… the heavy lifting!
[architecture diagram, as on the previous slide, with the Hadoop Cluster highlighted]
“Main Street” firms have invested in Hadoop to address Big Data needs,
off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
8. Enterprise Data Workflows
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
system integration:
tearing down silos
the process of building Enterprise apps is largely about
system integration and business process, meeting in the middle
9. Cascading – definitions
• a pattern language for Enterprise Data Workflows
• simple to build, easy to test, robust in production
• design principles ⟹ ensure best practices at scale
[architecture diagram, as on earlier slides]
A pattern language ensures that best practices are followed by an implementation.
In this case, parallelization of deterministic query plans for reliable, Enterprise-scale workflows on Hadoop, etc.
10. Cascading – usage
• Java API, Scala DSL (Scalding), Clojure DSL (Cascalog)
• ASL 2 license, GitHub src, http://conjars.org
• 5+ yrs production use, multiple Enterprise verticals
[architecture diagram, as on earlier slides]
More than 5 years of large-scale Enterprise deployments
DSLs in Scala, Clojure, Jython, JRuby, Groovy, etc.
Maven repo for third-party contribs
11. quotes…
“Cascading gives Java developers the ability to build Big Data
applications on Hadoop using their existing skillset … Management
can really go out and build a team around folks that are already
very experienced with Java. Switching over to this is really a very
short exercise.”
CIO, Thor Olavsrud, 2012-06-06
cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading

“Masks the complexity of MapReduce, simplifies the programming,
and speeds you on your journey toward actionable analytics … A vast
improvement over native MapReduce functions or Pig UDFs.”
2012 BOSSIE Awards, James Borck, 2012-09-18
infoworld.com/slideshow/65089
Industry analysts are picking up on the staffing costs related to Hadoop, “no free lunch”
12. Cascading – deployments
• case studies: Twitter, Etsy, Climate Corp, Nokia, Factual, Williams-Sonoma, uSwitch, Airbnb, Square, Harvard, etc.
• partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
• OSS frameworks built atop by: Twitter, Etsy, eBay, Climate Corp, uSwitch, YieldBot, etc.
• use cases: ETL, anti-fraud, advertising, recommenders, retail pricing, eCRM, marketing funnel, search analytics, genomics, climatology, etc.
[architecture diagram, as on earlier slides]
Several published case studies about Cascading, Scalding, Cascalog, etc.
Wide range of use cases.
Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading.
Partnerships with all Hadoop vendors.
13. case studies…
(Williams-Sonoma, Neiman Marcus)
concurrentinc.com/case-studies/upstream/
upstreamsoftware.com/blog/bid/86333/
(revenue team, publisher analytics)
concurrentinc.com/case-studies/twitter/
github.com/twitter/scalding/wiki
(infrastructure team)
concurrentinc.com/case-studies/airbnb/
gigaom.com/data/meet-the-combo-behind-etsy-airbnb-and-climate-corp-hadoop-jobs/
Several customers using Cascading / Scalding / Cascalog have published case studies.
Here are a few.
14. Cascading – taps
• taps integrate other data frameworks, as tuple streams
• these are “plumbing” endpoints in the pattern language
• sources (inputs), sinks (outputs), traps (exceptions)
• where schema and provenance get determined
• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
• data serialization: Avro, Thrift, Kryo, JSON, etc.
• extend in ~4 lines of Java
[architecture diagram, as on earlier slides]
Speaking of system integration,
taps provide the simplest approach for integrating different frameworks.
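The source/sink/trap semantics can be sketched in miniature. This is plain Python, not the Cascading API, and the names are illustrative: good tuples flow from a source through an operation to a sink, while records that raise “data exceptions” divert to a trap instead of failing the whole job.

```python
def run_flow(source, operation):
    """Toy tap semantics: apply an operation to each record from a source;
    failures divert to a trap instead of aborting the flow."""
    sink, trap = [], []
    for record in source:
        try:
            sink.append(operation(record))
        except Exception:
            trap.append(record)  # a "data exception" -- kept for Ops/QA review
    return sink, trap

# usage: parse ints from a log-like source; the bad record lands in the trap
sink, trap = run_flow(["17", "42", "oops", "3"], int)
# sink == [17, 42, 3], trap == ["oops"]
```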
15. Cascading – topologies
• topologies execute workflows on clusters
• flow planner is much like a compiler for queries
• abstraction layers reduce training costs
• Hadoop (MapReduce jobs)
• local mode (dev/test or special config)
• in-memory data grids (real-time)
• flow planner can be extended to support other topologies
• blend flows from different topologies into one app
[architecture diagram, as on earlier slides]
Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
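The “same app, different topology” idea can be sketched as one flow definition handed to interchangeable planners. This is a toy model in plain Python, not Cascading’s flow planner; `local_mode` and `parallel_mode` are hypothetical names standing in for local mode vs. a cluster topology.

```python
from multiprocessing.dummy import Pool  # thread pool stands in for a "cluster"

def word_count(lines):
    """The flow definition: tokenize, then count -- no mention of topology."""
    counts = {}
    for line in lines:
        for token in line.split():
            counts[token] = counts.get(token, 0) + 1
    return counts

def _merge(parts):
    merged = {}
    for part in parts:
        for k, v in part.items():
            merged[k] = merged.get(k, 0) + v
    return merged

def local_mode(flow, partitions):
    """Dev/test planner: run the flow serially over all partitions."""
    return _merge(map(flow, partitions))

def parallel_mode(flow, partitions):
    """'Cluster' planner: the same flow, partitions processed concurrently."""
    with Pool(4) as pool:
        return _merge(pool.map(flow, partitions))

partitions = [["a b a"], ["b b"]]
# both planners agree on {'a': 2, 'b': 3}
assert local_mode(word_count, partitions) == parallel_mode(word_count, partitions)
```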
16. example topologies…
Here are some examples of topologies for distributed computing --
Apache Hadoop being the first supported by Cascading,
followed by local mode, and now a tuple space (IMDG) flow planner in the works.
Several other widely used platforms would also be likely suspects for Cascading flow planners.
17. Cascading – ANSI SQL
• ANSI SQL parser/optimizer atop Cascading flow planner
• JDBC driver to integrate into existing tools and app servers
• surface a relational catalog over a collection of unstructured data
• launch a SQL shell prompt to run queries
• enable the analysts without retraining on Hadoop, etc.
• transparency for Support, Ops, Finance, et al.
• combine SQL flows with Scalding, Cascalog, etc.
• based on collab with Optiq – industry-proven code base
• keep the DBAs happy, and go home a hero!
[architecture diagram, as on earlier slides]
Quite a number of projects have started out with Hadoop, then grafted a SQL-like syntax onto it. Somewhere.
We started out with a query planner used in Enterprise, then partnered with Optiq -- the team behind an Enterprise-proven code base for an ANSI SQL parser/optimizer.
In the sense that Splunk handles “machine data”, this SQL implementation provides “machine code”, as the lingua franca of Enterprise system integration.
18. how to query…
abstraction     RDBMS                          JVM Cluster
parser          ANSI SQL compliant parser      ANSI SQL compliant parser
optimizer       logical plan, optimized        logical plan, optimized
                based on stats                 based on stats
planner         physical plan                  API “plumbing”
machine data    query history, table stats     app history, tuple stats
topology        b-trees, etc.                  heterogeneous, distributed:
                                               Hadoop, in-memory, etc.
visualization   ERD                            flow diagram
schema          table schema                   tuple schema
catalog         relational catalog             tap usage DB
provenance      (manual audit)                 data set producers/consumers
When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters
19. Cascading – machine learning
• export predictive models as PMML
• Cascading compiles to JVM classes for parallelization
• migrate workloads: SAS, Microstrategy, Teradata, etc.
• great OSS tools: R, Weka, KNIME, RapidMiner, etc.
• run multiple models in parallel as customer experiments
• Random Forest, Logistic Regression, GLM, Assoc Rules, Decision Trees, K-Means, Hierarchical Clustering, etc.
• 2 lines of code required for integration
• integrate with other libraries: Matrix API, Algebird, etc.
• combine with other flows into one app: Java for ETL, Scala for data services, SQL for reporting, etc.
[architecture diagram, as on earlier slides]
PMML has been around for a while, and export is supported by virtually every analytics platform,
covering a wide variety of predictive modeling algorithms.
Cascading reads PMML, building out workflows under the hood which run efficiently in parallel.
Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;)
Five companies are collaborating on this open source project, https://github.com/Cascading/cascading.pattern
20. PMML support…
Here are just a few of the tools that people use to create predictive models for export as PMML
21. Cascading – test-driven development
• assert patterns (regex) on the tuple streams
• trap edge cases as “data exceptions”
• adjust assert levels, like log4j levels
• TDD at scale:
  1. start from raw inputs in the flow graph
  2. define stream assertions for each stage of transforms
  3. verify exceptions, code to eliminate them
  4. rinse, lather, repeat…
  5. when impl is complete, app has full test coverage
• TDD follows from Cascalog’s composable subqueries
• redirect traps in production to Ops, QA, Support, Audit, etc.
[architecture diagram, as on earlier slides]
TDD is not usually high on the list when people start discussing Big Data apps.
Chris Wensel introduced into Cascading the notion of a “data exception”, and how to set stream assertion levels as part of the business logic of an application.
Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
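A toy sketch of stream assertions, in plain Python rather than Cascading’s assertion API (the function and field names here are illustrative): a regex pattern is asserted on one field of each tuple, and violations become trapped “data exceptions” instead of job failures.

```python
import re

def assert_stream(tuples, field, pattern):
    """Assert a regex on one field of each tuple; split the stream into
    passing tuples and trapped 'data exceptions'."""
    passed, trapped = [], []
    regex = re.compile(pattern)
    for t in tuples:
        (passed if regex.fullmatch(t[field]) else trapped).append(t)
    return passed, trapped

# usage: validate zip codes; the malformed record gets trapped for QA review
stream = [{"zip": "60601"}, {"zip": "n/a"}, {"zip": "94103"}]
passed, trapped = assert_stream(stream, "zip", r"\d{5}")
# passed holds the two valid zips; trapped holds {"zip": "n/a"}
```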
22. Cascading – API design principles
• specify what is required, not how it must be achieved
• provide the “glue” for system integration
• no surprises
• same JAR, any scale
• plan far ahead (before consuming cluster resources)
• fail the same way twice
Closely related to “functional relational programming”
paradigm from Moseley & Marks 2006
http://goo.gl/SKspn
Overview of the design principles embodied by Cascading as a pattern language…
Some aspects (Cascalog in particular) are closely related to “FRP” from Moseley/Marks 2006
23. Enterprise Data Workflows
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
code samples:
Word Count
Let’s make this real, show some code…
24. the ubiquitous word count
definition:
count how often each word appears in a collection of text documents
this simple program provides an excellent test case for
parallel processing, since it:
‣ requires a minimal amount of code
‣ demonstrates use of both symbolic and numeric values
‣ shows a dependency graph of tuples as an abstraction
‣ is not many steps away from useful search indexing
‣ serves as a “Hello World” for Hadoop apps
any distributed computing framework which can run Word Count
efficiently in parallel at scale can handle much larger and
more interesting compute problems
25. word count – pseudocode
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator partial_counts):
  int count = 0;
  for each pc in partial_counts:
    count += Int(pc);
  emit(word, String(count));
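The pseudocode above can be exercised in-memory with a short sketch (plain Python, no Hadoop; the shuffle between map and reduce is simulated with a dict, and `split()` stands in for `segment()`):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # emit (word, "1") for each token, per the map() pseudocode
    return [(w, "1") for w in text.split()]

def reduce_phase(word, partial_counts):
    # sum the partial counts, per the reduce() pseudocode
    return (word, str(sum(int(pc) for pc in partial_counts)))

def word_count(docs):
    shuffle = defaultdict(list)          # groups map output by key
    for doc_id, text in docs.items():
        for word, one in map_phase(doc_id, text):
            shuffle[word].append(one)
    return dict(reduce_phase(w, pcs) for w, pcs in shuffle.items())

word_count({"doc1": "a b a", "doc2": "b"})
# → {'a': '2', 'b': '2'}
```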
26. word count – flow diagram
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
cascading.org/category/impatient
gist.github.com/3900702
1 map, 1 reduce, 18 lines of code
27. word count – Cascading app
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
  new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
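The split regex in the app can be sanity-checked quickly with plain Python’s `re` module, using the same character class (space, brackets, parens, comma, period):

```python
import re

# same character class as the RegexSplitGenerator above
SPLIT = re.compile(r"[ \[\]\(\),.]")

def tokenize(line):
    # drop the empty strings produced between adjacent delimiters
    return [t for t in SPLIT.split(line) if t]

tokenize("rain shadow (rain, shadow).")
# → ['rain', 'shadow', 'rain', 'shadow']
```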
28. word count – flow plan
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']
[{2}:'doc_id', 'text']
[{2}:'doc_id', 'text']
map
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
[{1}:'token']
[{1}:'token']
GroupBy('wc')[by:['token']]
wc[{1}:'token']
[{1}:'token']
reduce
Every('wc')[Count[decl:'count']]
[{2}:'token', 'count']
[{1}:'token']
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
[{2}:'token', 'count']
[{2}:'token', 'count']
[tail]
29. word count – Scalding / Scala
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"),
      ('doc_id, 'text),
      skipHeader = true)
    .read
    .flatMap('text -> 'token) {
      text : String => text.split("[ \\[\\]\\(\\),.]")
    }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}
30. word count – Scalding / Scala
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
github.com/twitter/scalding/wiki
‣ extends the Scala collections API, distributed lists become “pipes” backed by Cascading
‣ code is compact, easy to understand – very close to conceptual flow diagram
‣ functional programming is great for expressing complex workflows in MapReduce, etc.
‣ large-scale, complex problems can be handled in just a few lines of code
‣ significant investments by Twitter, Etsy, eBay, etc., in this open source project
‣ extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., “Matrix API”
‣ several large-scale apps in production deployments
‣ IMHO, especially great for data services at scale
Using a functional programming language to build flows works even better than trying to represent functional programming constructs within Java…
31. word count – Cascalog / Clojure
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient
32. word count – Cascalog / Clojure
[flow diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
github.com/nathanmarz/cascalog/wiki
‣ implements Datalog in Clojure, with predicates backed by Cascading
‣ a truly declarative language – whereas Scalding lacks that aspect of functional programming
‣ run ad-hoc queries from the Clojure REPL, approx. 10:1 code reduction compared with SQL
‣ composable subqueries, for test-driven development (TDD) at scale
‣ fault-tolerant workflows which are simple to follow
‣ same framework used from discovery through to production apps
‣ FRP mitigates the s/w engineering costs of Accidental Complexity
‣ focus on the process of structuring data; not un/structured
‣ Leiningen build: simple, no surprises, in Clojure itself
‣ has a learning curve, limited number of Clojure developers
‣ aggregators are the magic, those take effort to learn
33. Enterprise Data Workflows
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
data science perspectives:
how we got here
Let’s examine an evolution of Data Science practice, subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes, and commercialized Big Data
34. circa 1996: pre- inflection point
[process diagram: Stakeholders ↔ BI Analysts (Excel pivot tables, PowerPoint slide decks, strategy); Product requirements → Engineering (optimized code) → Web App; SQL Query result sets and transactions against an RDBMS; Customers]
Ah, teh olde days - Perl and C++ for CGI :)
Feedback loops shown in red represent data innovations at the time…
Characterized by slow, manual processes:
data modeling / business intelligence; “throw it over the wall”…
this thinking led to impossible silos
35. circa 2001: post- big ecommerce successes
[process diagram: Stakeholders (dashboards), Product UX, Engineering; Algorithmic Modeling (models, recommenders, classifiers) → servlets in Middleware + Web Apps; aggregation of event history and customer transactions; Logs → ETL → DW and RDBMS; SQL Query result sets; Customers]
Q3 1997: Greg Linden @ Amazon, Randy Shoup @ eBay -- independent teams arrived at the same conclusion:
parallelize workloads onto clusters of commodity servers (Intel/Linux) to scale-out horizontally.
Google and Inktomi (YHOO Search) were working along the same lines.
MapReduce grew directly out of this effort. LinkedIn, Facebook, Twitter, Apple, etc., follow.
Algorithmic modeling, which leveraged machine data, allowed for Big Data to become monetized.
REALLY monetized :)
Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization)
MapReduce came from work in 2002. Google is now three generations beyond that -- while the Global 1000 struggles to rationalize Hadoop practices.
Google gets upset when people try to “open the kimono”; however, Twitter is in SF where that’s a national pastime :) To get an idea of what powers Google internally, check the open source
projects: Scalding, Matrix API, Algebird, etc.
36. circa 2013: clusters everywhere
[diagram: Data Products for Customers; Domain Expert (business process), Data Scientist (discovery, modeling), App Dev, Ops; Prod Workflow with dashboard metrics and data science history; Web Apps, Mobile, etc. (s/w dev); social interactions, optimized transactions, capacity, content; use cases across topologies – Hadoop etc. (batch), Log Events, In-Memory Data Grid (near time), DW; Planner and Cluster Scheduler driven by taps and app history; introduced capability vs. existing SDLC; RDBMS]
Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams.
Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric.
Not unlike a practice at LLL, where 4x more data gets collected about the machine than about the experiment.
37. asymptotically…
• smarter, more robust clusters
• increased leverage of machine data for automation and optimization
• DSLs focused on scalability, testability, reducing s/w engineering complexity
• increased use of “machine code”, who writes SQL directly?
• workflows incorporating more “moving parts”
• less about “bigness” of data, more about complexity of process
• greater instrumentation ⟹ even more machine data, increased feedback
[stack diagram: DSL → Planner/Optimizer → Workflow → Cluster, with App History feeding a Cluster Scheduler]
Enterprise Data Workflows: more about “complex” process than about “big” data
38. references…
by Leo Breiman
Statistical Modeling: The Two Cultures
Statistical Science, 2001
bit.ly/eUTh9L
also check out RStudio:
rstudio.org/
rpubs.com/
for a really great discussion about the fundamentals of Data Science and process for algorithmic modeling (analyzing the 1997 inflection point), refer back to Breiman 2001.
39. references…
by DJ Patil
Data Jujitsu
O’Reilly, 2012
amazon.com/dp/B008HMN5BE
Building Data Science Teams
O’Reilly, 2011
amazon.com/dp/B005O4U3ZE
in terms of building data products, see DJ Patil's mini-books on O'Reilly:
Building Data Science Teams
Data Jujitsu
40. Enterprise Data Workflows
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
the workflow abstraction:
many aspects of an app
The workflow abstraction helps make Hadoop accessible to a broader audience of developers.
Let’s take a look at how organizations can leverage it in other important ways…
41. the workflow abstraction
Tuple Flows, Pipes, Taps, Filters, Joins, Traps, etc.
…in other words, “plumbing” as a pattern language
for managing the complexity of Big Data in Enterprise apps
on many levels
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
The workflow abstraction,
a pattern language for building robust, scalable Enterprise apps,
which works on many levels across an organization…
42. rather than arguing SQL vs. NoSQL…
this kind of work focuses on the process of structuring data,
which must occur long before work on large-scale joins,
visualizations, predictive models, etc.
so the process of structuring data is what we examine here:
i.e., how to build workflows for Big Data
thank you Dr. Codd
“A relational model of data for large shared data banks”
dl.acm.org/citation.cfm?id=362685
instead, in Data Science work we must focus on *the process of structuring data*
that must happen before the large-scale joins, predictive models, visualizations, etc.
the process of structuring data is what i will show here
how to build workflows from Big Data
thank you Dr. Codd
43. workflow – abstraction layer
• Cascading initially grew from interaction with the Nutch project, before Hadoop had a name; API author Chris Wensel recognized that MapReduce would be too complex for substantial work in an Enterprise context
• 5+ years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts
• the pattern language provides a structured method for solving large, complex design problems where the syntax of the language promotes use of best practices – which addresses staffing issues
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
First and foremost, the workflow represents an abstraction layer
to mitigate the complexity and costs of coding large apps directly in MapReduce.
44. workflow – literate collaboration
• provides an intuitive visual representation for apps: flow diagrams
• flow diagrams are quite valuable for cross-team collaboration
• this approach leverages literate programming methodology, especially in DSLs written in functional programming languages
• example: nearly 1:1 correspondence between function calls and flow diagram elements in Scalding
• example: expert developers on cascading-users email list use flow diagrams to help troubleshoot issues remotely
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming.
Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling -- the expert developers
generally ask a novice to provide a flow diagram first
45. workflow – business process
• imposes a separation of concerns between the capture of business process requirements, and the implementation details (Hadoop, etc.)
• workflow orchestration evokes the notion of business process management for Enterprise apps (think BPM/BPEL)
• Cascalog leverages Datalog features to make business process executable: “specify what you require, not how to achieve it”
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
Business Stakeholder POV:
business process management for workflow orchestration (think BPM/BPEL)
46. workflow – data architect
• represents a physical plan for large-scale data flow management
• tap schemes and tuple streams determine the relevant schema
• a producer/consumer graph of tap identifier URIs provides a view of data provenance
• cluster utilization vs. producer/consumer graph surfaces ROI for Hadoop-based data products
[flow diagram: Document Collection → Scrub/Tokenize (M) → HashJoin Left with Regex token and Stop Word List (RHS) → GroupBy token (R) → Count → Word Count]
Data Architect POV:
a physical plan for large-scale data flow management
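The producer/consumer view of data provenance can be sketched as a tiny graph built from tap URIs. This is plain Python, not the Cascading API, and the flow names and URIs are hypothetical:

```python
def provenance(flows):
    """Build producer/consumer maps from each flow's source and sink tap URIs."""
    producers, consumers = {}, {}
    for flow, (sources, sinks) in flows.items():
        for uri in sources:
            consumers.setdefault(uri, []).append(flow)
        for uri in sinks:
            producers.setdefault(uri, []).append(flow)
    return producers, consumers

# hypothetical workflow: an ETL flow feeds a modeling flow
flows = {
    "etl":   (["hdfs:/logs/raw"], ["hdfs:/data/clean"]),
    "model": (["hdfs:/data/clean"], ["jdbc:profile_db"]),
}
producers, consumers = provenance(flows)
# "hdfs:/data/clean" is produced by "etl" and consumed by "model"
```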