Chicago Hadoop Users Group: Enterprise Data Workflows
 

presentation for CHUG, http://www.meetup.com/Chicago-area-Hadoop-User-Group-CHUG/events/95464182/


Usage Rights: CC Attribution-ShareAlike License

Presentation Transcript

    • “Enterprise Data Workflows with Cascading”
      Paco Nathan, Concurrent, Inc., San Francisco, CA – @pacoid
      zest.to/event63_77 – 2013-02-12 – Copyright @2013, Concurrent, Inc.
      Speaker notes: You may not have heard about us much, but you use our API in lots of places: your bank, your airline, your hospital, your mobile device, your social network, etc.
    • Unstructured Data meets Enterprise Scale
      • an example considered
      • system integration: tearing down silos
      • code samples
      • data science perspectives: how we got here
      • the workflow abstraction: many aspects of an app – developer, analyst, scientist
      • summary, references
      Speaker notes: Background: I’m a data scientist, an engineering director, who spent the past decade building/leading Data teams which created large-scale apps. This talk is about using Cascading and related DSLs to build Enterprise Data Workflows. Our emphasis is on leveraging the workflow abstraction for system integration, for mitigating complexity, and for producing simple, robust apps at scale. We’ll show a little something for the developers, the analysts, and the scientists in the room.
    • Enterprise Data Workflows – an example considered
      [flow diagram: Word Count with stop-word filtering – Document Collection → Scrub/Tokenize → HashJoin (left) against a Stop Word List (RHS) → GroupBy token → Count → Word Count]
      Speaker notes: Let’s consider the matter of handling Big Data from the perspective of building and maintaining Enterprise apps…
    • Enterprise Data Workflows – an example…
      [architecture diagram: Customers, Web App, logs, Cache, Support, with source/sink/trap taps feeding a Data Workflow; Modeling via PMML, Analytics Cubes, Customer Prefs, customer profile DBs, Hadoop Cluster, Reporting]
      Speaker notes: Apache Hadoop rarely ever gets used in isolation.
    • Enterprise Data Workflows – an example… the front end
      [same architecture diagram, front-end portion highlighted]
      Speaker notes: LOB use cases drive the demand for Big Data apps.
    • Enterprise Data Workflows – an example… the back office
      [same architecture diagram, back-office portion highlighted]
      Speaker notes: Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes.
    • Enterprise Data Workflows – an example… the heavy lifting!
      [same architecture diagram, Hadoop cluster highlighted]
      Speaker notes: “Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
    • Enterprise Data Workflows – system integration: tearing down silos
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: The process of building Enterprise apps is largely about system integration and business process, meeting in the middle.
    • Cascading – definitions
      • a pattern language for Enterprise Data Workflows
      • simple to build, easy to test, robust in production
      • design principles ⟹ ensure best practices at scale
      [architecture diagram: example Enterprise data workflow]
      Speaker notes: A pattern language ensures that best practices are followed by an implementation. In this case, parallelization of deterministic query plans for reliable, Enterprise-scale workflows on Hadoop, etc.
    • Cascading – usage
      • Java API, Scala DSL Scalding, Clojure DSL Cascalog
      • ASL 2 license, GitHub src, http://conjars.org
      • 5+ yrs production use, multiple Enterprise verticals
      Speaker notes: More than 5 year history of large-scale Enterprise deployments. DSLs in Scala, Clojure, Jython, JRuby, Groovy, etc. Maven repo for third-party contribs.
    • quotes…
      “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.”
      – CIO, Thor Olavsrud, 2012-06-06, cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
      “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.”
      – 2012 BOSSIE Awards, James Borck, 2012-09-18, infoworld.com/slideshow/65089
      Speaker notes: Industry analysts are picking up on the staffing costs related to Hadoop – “no free lunch”.
    • Cascading – deployments
      • case studies: Twitter, Etsy, Climate Corp, Nokia, Factual, Williams-Sonoma, uSwitch, Airbnb, Square, Harvard, etc.
      • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
      • OSS frameworks built atop by: Twitter, Etsy, eBay, Climate Corp, uSwitch, YieldBot, etc.
      • use cases: ETL, anti-fraud, advertising, recommenders, retail pricing, eCRM, marketing funnel, search analytics, genomics, climatology, etc.
      Speaker notes: Several published case studies about Cascading, Scalding, Cascalog, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with all Hadoop vendors.
    • case studies…
      (Williams-Sonoma, Neiman Marcus) concurrentinc.com/case-studies/upstream/ – upstreamsoftware.com/blog/bid/86333/
      (revenue team, publisher analytics) concurrentinc.com/case-studies/twitter/ – github.com/twitter/scalding/wiki
      (infrastructure team) concurrentinc.com/case-studies/airbnb/ – gigaom.com/data/meet-the-combo-behind-etsy-airbnb-and-climate-corp-hadoop-jobs/
      Speaker notes: Several customers using Cascading / Scalding / Cascalog have published case studies. Here are a few.
    • Cascading – taps
      • taps integrate other data frameworks, as tuple streams
      • these are “plumbing” endpoints in the pattern language
      • sources (inputs), sinks (outputs), traps (exceptions)
      • where schema and provenance get determined
      • text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.
      • data serialization: Avro, Thrift, Kryo, JSON, etc.
      • extend in ~4 lines of Java – see the sketch below
      Speaker notes: Speaking of system integration, taps provide the simplest approach for integrating different frameworks.
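      As a minimal sketch of how taps get declared, here is the “plumbing” from the Word Count example later in this deck, extended with a trap tap; the file paths and the tab delimiter are illustrative assumptions, not from the original slide:

          import cascading.scheme.hadoop.TextDelimited;
          import cascading.tap.Tap;
          import cascading.tap.hadoop.Hfs;

          public class TapSketch {
            public static void main(String[] args) {
              // source tap: reads tab-delimited text (with a header) from HDFS
              Tap docTap = new Hfs(new TextDelimited(true, "\t"), "input/docs.tsv");
              // sink tap: writes results back out as tab-delimited text
              Tap wcTap = new Hfs(new TextDelimited(true, "\t"), "output/wc");
              // trap tap: captures tuples which fail, as "data exceptions"
              Tap trapTap = new Hfs(new TextDelimited(true, "\t"), "output/trap");
              // these taps then get bound to pipes in a FlowDef, as shown in
              // the Word Count and PMML examples later in this deck
            }
          }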
    • Cascading – topologies
      • topologies execute workflows on clusters
      • flow planner is much like a compiler for queries
      • abstraction layers reduce training costs
      • Hadoop (MapReduce jobs)
      • local mode (dev/test or special config)
      • in-memory data grids (real-time)
      • flow planner can be extended to support other topologies
      • blend flows from different topologies into one app
      Speaker notes: Another kind of integration involves apps which run partly on a Hadoop cluster, and partly somewhere else.
    • example topologies…
      Speaker notes: Here are some examples of topologies for distributed computing – Apache Hadoop being the first supported by Cascading, followed by local mode, and now a tuple space (IMDG) flow planner in the works. Several other widely used platforms would also be likely suspects for Cascading flow planners.
    • Cascading – ANSI SQL
      • ANSI SQL parser/optimizer atop the Cascading flow planner
      • JDBC driver to integrate into existing tools and app servers
      • surface a relational catalog over a collection of unstructured data
      • launch a SQL shell prompt to run queries
      • enable the analysts without retraining on Hadoop, etc.
      • transparency for Support, Ops, Finance, et al.
      • combine SQL flows with Scalding, Cascalog, etc.
      • based on collab with Optiq – industry-proven code base
      • keep the DBAs happy, and go home a hero!
      Speaker notes: Quite a number of projects have started out with Hadoop, then grafted a SQL-like syntax onto it. Somewhere. We started out with a query planner used in Enterprise, then partnered with Optiq – the team behind an Enterprise-proven code base for an ANSI SQL parser/optimizer. In the sense that Splunk handles “machine data”, this SQL implementation provides “machine code”, as the lingua franca of Enterprise system integration.
    • how to query…

          abstraction   | RDBMS                                  | JVM Cluster
          --------------+----------------------------------------+---------------------------------------------------
          parser        | ANSI SQL compliant parser              | ANSI SQL compliant parser
          optimizer     | logical plan, optimized based on stats | logical plan, optimized based on stats
          planner       | physical plan                          | API “plumbing”
          machine data  | query history, table stats             | app history, tuple stats
          topology      | b-trees, etc.                          | heterogenous, distributed: Hadoop, in-memory, etc.
          visualization | ERD                                    | flow diagram
          schema        | table schema                           | tuple schema
          catalog       | relational catalog                     | tap usage DB
          provenance    | (manual audit)                         | data set producers/consumers

      Speaker notes: When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters.
    • Cascading – machine learning
      • export predictive models as PMML
      • Cascading compiles to JVM classes for parallelization
      • migrate workloads: SAS, Microstrategy, Teradata, etc.
      • great OSS tools: R, Weka, KNIME, RapidMiner, etc.
      • run multiple models in parallel as customer experiments
      • Random Forest, Logistic Regression, GLM, Assoc Rules, Decision Trees, K-Means, Hierarchical Clustering, etc.
      • 2 lines of code required for PMML integration
      • integrate with other libraries: Matrix API, Algebird, etc.
      • combine with other flows into one app: Java for ETL, Scala for data services, SQL for reporting, etc.
      Speaker notes: PMML has been around for a while, and export is supported by virtually every analytics platform, covering a wide variety of predictive modeling algorithms. Cascading reads PMML, building out workflows under the hood which run efficiently in parallel. Much cheaper than buying a SAS license for your 2000-node Hadoop cluster ;) Five companies are collaborating on this open source project: https://github.com/Cascading/cascading.pattern
    • PMML support…
      Speaker notes: Here are just a few of the tools that people use to create predictive models for export as PMML.
    • Cascading – test-driven development
      • assert patterns (regex) on the tuple streams
      • trap edge cases as “data exceptions”
      • adjust assert levels, like log4j levels – see the sketch below
      • TDD at scale:
        1. start from raw inputs in the flow graph
        2. define stream assertions for each stage of transforms
        3. verify exceptions, code to eliminate them
        4. rinse, lather, repeat…
        5. when impl is complete, app has full test coverage
      • TDD follows from Cascalog’s composable subqueries
      • redirect traps in production to Ops, QA, Support, Audit, etc.
      Speaker notes: TDD is not usually high on the list when people start discussing Big Data apps. Chris Wensel introduced into Cascading the notion of a “data exception”, and how to set stream assertion levels as part of the business logic of an application. Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., arguably uses TDD as its methodology, in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
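      As a hedged sketch of the assertion mechanics: the following shows how stream assertions attach to a pipe and how an assertion level gets planned in or out, like log4j levels. The class and method names are from the Cascading 2.x API as best recalled (verify against current docs), and the regex is an illustrative assumption:

          import java.util.Properties;

          import cascading.flow.FlowConnectorProps;
          import cascading.operation.AssertionLevel;
          import cascading.operation.assertion.AssertMatches;
          import cascading.operation.assertion.AssertNotNull;
          import cascading.pipe.Each;
          import cascading.pipe.Pipe;

          public class AssertionSketch {
            public static Pipe addAssertions(Pipe pipe) {
              // STRICT assertions run during dev/test, then get planned out
              // of production flows
              pipe = new Each(pipe, AssertionLevel.STRICT, new AssertNotNull());
              // VALID assertions stay on in production; failing tuples become
              // "data exceptions" routed to the trap tap
              pipe = new Each(pipe, AssertionLevel.VALID, new AssertMatches("^\\S+$"));
              return pipe;
            }

            public static void configure(Properties properties) {
              // like setting a log4j level: the planner keeps only assertions
              // at or above this level
              FlowConnectorProps.setAssertionLevel(properties, AssertionLevel.VALID);
            }
          }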
    • Cascading – API design principles
      • specify what is required, not how it must be achieved
      • provide the “glue” for system integration
      • no surprises
      • same JAR, any scale
      • plan far ahead (before consuming cluster resources)
      • fail the same way twice
      Closely related to the “functional relational programming” paradigm from Moseley & Marks 2006: http://goo.gl/SKspn
      Speaker notes: Overview of the design principles embodied by Cascading as a pattern language… Some aspects (Cascalog in particular) are closely related to “FRP” from Moseley/Marks 2006.
    • Enterprise Data Workflows – code samples: Word Count
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: Let’s make this real, show some code…
    • the ubiquitous word count
      definition: count how often each word appears in a collection of text documents
      this simple program provides an excellent test case for parallel processing, since it:
      ‣ requires a minimal amount of code
      ‣ demonstrates use of both symbolic and numeric values
      ‣ shows a dependency graph of tuples as an abstraction
      ‣ is not many steps away from useful search indexing
      ‣ serves as a “Hello World” for Hadoop apps
      any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
    • word count – pseudocode

          void map (String doc_id, String text):
            for each word w in segment(text):
              emit(w, "1");

          void reduce (String word, Iterator partial_counts):
            int count = 0;
            for each pc in partial_counts:
              count += Int(pc);
            emit(word, String(count));
    • word count – flow diagram
      [flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count]
      cascading.org/category/impatient – gist.github.com/3900702
      1 map, 1 reduce, 18 lines of code
    • word count – Cascading app

          String docPath = args[ 0 ];
          String wcPath = args[ 1 ];
          Properties properties = new Properties();
          AppProps.setApplicationJarClass( properties, Main.class );
          HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

          // create source and sink taps
          Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
          Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

          // specify a regex to split "document" text lines into a token stream
          Fields token = new Fields( "token" );
          Fields text = new Fields( "text" );
          RegexSplitGenerator splitter =
            new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
          // only returns "token"
          Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

          // determine the word counts
          Pipe wcPipe = new Pipe( "wc", docPipe );
          wcPipe = new GroupBy( wcPipe, token );
          wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

          // connect the taps, pipes, etc., into a flow
          FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
            .addSource( docPipe, docTap )
            .addTailSink( wcPipe, wcTap );

          // write a DOT file and run the flow
          Flow wcFlow = flowConnector.connect( flowDef );
          wcFlow.writeDOT( "dot/wc.dot" );
          wcFlow.complete();
    • word count – flow plan

          [head]
          Hfs[TextDelimited[[doc_id, text]->[ALL]]][data/rain.txt]
            [{2}:doc_id, text]
          map: Each(token)[RegexSplitGenerator[decl:token][args:1]]
            [{1}:token]
          GroupBy(wc)[by:[token]]
            wc[{1}:token]
          reduce: Every(wc)[Count[decl:count]]
            [{2}:token, count]
          Hfs[TextDelimited[[UNKNOWN]->[token, count]]][output/wc]
            [{2}:token, count]
          [tail]
    • word count – Scalding / Scala

          import com.twitter.scalding._

          class WordCount(args : Args) extends Job(args) {
            Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
              .read
              .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
              .groupBy('token) { _.size('count) }
              .write(Tsv(args("wc"), writeHeader = true))
          }
    • word count – Scalding / Scala
      github.com/twitter/scalding/wiki
      ‣ extends the Scala collections API; distributed lists become “pipes” backed by Cascading
      ‣ code is compact, easy to understand – very close to the conceptual flow diagram
      ‣ functional programming is great for expressing complex workflows in MapReduce, etc.
      ‣ large-scale, complex problems can be handled in just a few lines of code
      ‣ significant investments by Twitter, Etsy, eBay, etc., in this open source project
      ‣ extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., “Matrix API”
      ‣ several large-scale apps in production deployments
      ‣ IMHO, especially great for data services at scale
      Speaker notes: Using a functional programming language to build flows works even better than trying to represent functional programming constructs within Java…
    • word count – Cascalog / Clojure

          (ns impatient.core
            (:use [cascalog.api]
                  [cascalog.more-taps :only (hfs-delimited)])
            (:require [clojure.string :as s]
                      [cascalog.ops :as c])
            (:gen-class))

          (defmapcatop split [line]
            "reads in a line of string and splits it by regex"
            (s/split line #"[\[\](),.)\s]+"))

          (defn -main [in out & args]
            (?<- (hfs-delimited out)
                 [?word ?count]
                 ((hfs-delimited in :skip-header? true) _ ?line)
                 (split ?line :> ?word)
                 (c/count ?count)))

          ; Paul Lam
          ; github.com/Quantisan/Impatient
    • word count – Cascalog / Clojure
      github.com/nathanmarz/cascalog/wiki
      ‣ implements Datalog in Clojure, with predicates backed by Cascading
      ‣ a truly declarative language – whereas Scalding lacks that aspect of functional programming
      ‣ run ad-hoc queries from the Clojure REPL, approx. 10:1 code reduction compared with SQL
      ‣ composable subqueries, for test-driven development (TDD) at scale
      ‣ fault-tolerant workflows which are simple to follow
      ‣ same framework used from discovery through to production apps
      ‣ FRP mitigates the s/w engineering costs of Accidental Complexity
      ‣ focus on the process of structuring data; not un/structured
      ‣ Leiningen build: simple, no surprises, in Clojure itself
      ‣ has a learning curve, limited number of Clojure developers
      ‣ aggregators are the magic, those take effort to learn
    • Enterprise Data Workflows – data science perspectives: how we got here
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: Let’s examine an evolution of Data Science practice, subsequent to the 1997 Q3 inflection point which enabled huge ecommerce successes, and commercialized Big Data.
    • circa 1996: pre-inflection point
      [diagram: Stakeholder → Product → Engineering → Web App, with BI/Analysts running SQL queries against the transactional RDBMS, producing Excel pivot tables and PowerPoint slide decks for strategy]
      Speaker notes: Ah, the olde days – Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time… Characterized by slow, manual processes: data modeling / business intelligence; “throw it over the wall”… this thinking led to impossible silos.
    • circa 2001: post-big ecommerce successes
      [diagram: Stakeholder, Product, UX, and Engineering around Algorithmic Modeling – recommenders and classifiers in web app middleware, fed by event history aggregated from logs via ETL into a DW, alongside the transactional RDBMS, with dashboards for stakeholders]
      Speaker notes: Q3 1997: Greg Linden @ Amazon, Randy Shoup @ eBay – independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers (Intel/Linux) to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines. MapReduce grew directly out of this effort. LinkedIn, Facebook, Twitter, Apple, etc., follow. Algorithmic modeling, which leveraged machine data, allowed for Big Data to become monetized. REALLY monetized :) Leo Breiman wrote an excellent paper in 2001, “Two Cultures”, chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization). MapReduce came from work in 2002. Google is now three generations beyond that – while the Global 1000 struggles to rationalize Hadoop practices. Google gets upset when people try to “open the kimono”; however, Twitter is in SF where that’s a national pastime :) To get an idea of what powers Google internally, check the open source projects: Scalding, Matrix API, Algebird, etc.
    • circa 2013: clusters everywhere
      [diagram: Data Products and Web Apps / Mobile services backed by workflows across topologies – Hadoop for batch, in-memory data grid for near time – with Domain Experts, Data Scientists, App Dev, and Ops collaborating; app history (machine data) drives a planner and cluster scheduler; traditional SDLC and RDBMS capabilities remain alongside the introduced ones]
      Speaker notes: Here’s what our more savvy customers are using for architecture and process today: traditional SDLC, but also Data Science inter-disciplinary teams. Also, machine data (app history) driving planners and schedulers for advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where 4x more data gets collected about the machine than about the experiment.
    • asymptotically…
      • smarter, more robust clusters
      • increased leverage of machine data for automation and optimization
      • DSLs focused on scalability, testability, reducing s/w engineering complexity
      • increased use of “machine code” – who writes SQL directly?
      • workflows incorporating more “moving parts”
      • less about “bigness” of data, more about complexity of process
      • greater instrumentation ⟹ even more machine data, increased feedback
      Speaker notes: Enterprise Data Workflows: more about “complex” process than about “big” data.
    • references…
      Statistical Modeling: The Two Cultures, by Leo Breiman, Statistical Science, 2001 – bit.ly/eUTh9L
      also check out RStudio: rstudio.org/ – rpubs.com/
      Speaker notes: For a really great discussion about the fundamentals of Data Science and process for algorithmic modeling (analyzing the 1997 inflection point), refer back to Breiman 2001.
    • references…
      Data Jujitsu, by DJ Patil, O’Reilly, 2012 – amazon.com/dp/B008HMN5BE
      Building Data Science Teams, by DJ Patil, O’Reilly, 2011 – amazon.com/dp/B005O4U3ZE
      Speaker notes: In terms of building data products, see DJ Patil’s mini-books on O’Reilly: Building Data Science Teams, and Data Jujitsu.
    • Enterprise Data Workflows – the workflow abstraction: many aspects of an app
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: The workflow abstraction helps make Hadoop accessible to a broader audience of developers. Let’s take a look at how organizations can leverage it in other important ways…
    • the workflow abstraction
      Tuple Flows, Pipes, Taps, Filters, Joins, Traps, etc. …in other words, “plumbing” as a pattern language for managing the complexity of Big Data in Enterprise apps on many levels
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: The workflow abstraction: a pattern language for building robust, scalable Enterprise apps, which works on many levels across an organization…
    • rather than arguing SQL vs. NoSQL…
      this kind of work focuses on the process of structuring data, which must occur long before work on large-scale joins, visualizations, predictive models, etc.
      so the process of structuring data is what we examine here: i.e., how to build workflows for Big Data
      thank you Dr. Codd – “A relational model of data for large shared data banks” – dl.acm.org/citation.cfm?id=362685
      Speaker notes: Instead, in Data Science work we must focus on *the process of structuring data* that must happen before the large-scale joins, predictive models, visualizations, etc. The process of structuring data is what I will show here: how to build workflows from Big Data. Thank you, Dr. Codd.
    • workflow – abstraction layer
      • Cascading initially grew from interaction with the Nutch project, before Hadoop had a name; API author Chris Wensel recognized that MapReduce would be too complex for substantial work in an Enterprise context
      • 5+ years later, Enterprise app deployments on Hadoop are limited by staffing issues: difficulty of retraining staff, scarcity of Hadoop experts
      • the pattern language provides a structured method for solving large, complex design problems, where the syntax of the language promotes use of best practices – which addresses staffing issues
      Speaker notes: First and foremost, the workflow represents an abstraction layer to mitigate the complexity and costs of coding large apps directly in MapReduce.
    • workflow – literate collaboration
      • provides an intuitive visual representation for apps: flow diagrams
      • flow diagrams are quite valuable for cross-team collaboration
      • this approach leverages literate programming methodology, especially in DSLs written in functional programming languages
      • example: nearly 1:1 correspondence between function calls and flow diagram elements in Scalding
      • example: expert developers on the cascading-users email list use flow diagrams to help troubleshoot issues remotely
      Speaker notes: Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the “cascading-users” email list is most telling – the expert developers generally ask a novice to provide a flow diagram first.
    • workflow – business process
      • imposes a separation of concerns between the capture of business process requirements, and the implementation details (Hadoop, etc.)
      • workflow orchestration evokes the notion of business process management for Enterprise apps (think BPM/BPEL)
      • Cascalog leverages Datalog features to make business process executable: “specify what you require, not how to achieve it”
      Speaker notes: Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
    • workflow – data architect
      • represents a physical plan for large-scale data flow management
      • tap schemes and tuple streams determine the relevant schema
      • a producer/consumer graph of tap identifier URIs provides a view of data provenance
      • cluster utilization vs. the producer/consumer graph surfaces ROI for Hadoop-based data products
      Speaker notes: Data Architect POV: a physical plan for large-scale data flow management.
    • workflow – system integration
      • Cascading apps incorporate much more than just Hadoop jobs
      • integration of other heterogenous data frameworks (taps) and compute platforms (topologies)
      • integration of other paradigms via DSLs, ANSI SQL, PMML, etc.
      • ultimately reduced/encapsulated as a single JAR file per app
      Speaker notes: Systems Integrator POV: system integration of heterogenous data sources and compute platforms.
    • workflow – data scientist
      • represents a directed, acyclic graph (DAG) on which we can apply some useful math…
      • Amdahl’s Law, etc., to quantify the extent of parallelization – see the sketch below
      • query optimizer can leverage app history (machine data) to select alternative algorithms based on the shape of the data
      • predictive modeling to estimate expected run times of a given app for a given size and shape of input data
      Speaker notes: Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl’s Law, etc.
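      For readers who want the arithmetic, here is a tiny illustrative sketch (not from the original slides) of Amdahl’s Law, which bounds the speedup of a workflow when only a fraction of it parallelizes:

          public class Amdahl {
            // speedup achievable with n workers when a fraction p of the
            // work can be parallelized: 1 / ((1 - p) + p / n)
            static double speedup(double p, int n) {
              return 1.0 / ((1.0 - p) + p / n);
            }

            public static void main(String[] args) {
              // even with 95% parallelizable work, 100 workers yield only ~16.8x
              System.out.println(speedup(0.95, 100));
            }
          }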
    • workflow – developer
      • a physical plan for a query avoids non-deterministic behavior – expensive when troubleshooting
      • otherwise, edge cases can become nightmares on a large cluster
      • “plan far ahead”: potential problems can be inferred at compile time or at the flow planner stage… long before large, expensive resources start getting consumed… or worse, before the wrong results get propagated downstream
      Speaker notes: As an engineering manager who has tasked data scientists and Hadoop developers to build large-scale apps, a typical issue is that everything works great on a laptop with a small set of data, but then the app falls over on the staging cluster – causing staff to spend weeks debugging edge cases. Non-deterministic behavior of Big Data frameworks creates enormous costs at scale.
    • workflow – operations
      • “same JAR, any scale” allows for continuous integration practices – no need to change code or recompile a JAR
      • flow diagrams annotated with metrics allow systems engineers to identify performance bottlenecks, model utilization rates, perform capacity planning, determine ROI on cluster infrastructure, etc.
      Speaker notes: Systems Engineer POV: workflow as a JAR file which has passed CI and is available in a Maven repo; metrics on the DAG for Ops utilization analysis, capacity planning, ROI on infrastructure.
    • workflow – transparency
      • fully connected context for compiler optimization, exception handling, debug, config, scheduling, notifications, provenance, etc.
      • this practice is in stark contrast to Big Data frameworks where developers cross multiple language boundaries to troubleshoot large-scale apps
      • again, complexity is more the issue than “bigness” … and lack of transparency is what makes complexity so expensive in Big Data apps
      Speaker notes: An entire app compiles into a single JAR: fully connected context for compiler optimization, exception handling, debug, config, scheduling, notifications, provenance, etc.
    • Enterprise Data Workflows – for the analyst on your shopping list…
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: For the analysts in the audience – also useful for Ops, Customer Support, Finance, etc.
    • ANSI SQL – integration
      Analysts: using SQL data warehouses (MPP) today to run large-scale queries, but need to start leveraging Hadoop
      Developers: Cascading users building Enterprise data workflows today on Hadoop
      capabilities
      • ANSI standard SQL parser and optimizer built on top of Cascading
      • relational catalog view into large-scale unstructured data
      • SQL shell to test and submit queries into Hadoop
      • JDBC driver to integrate into existing tools and application servers
      benefits
      • use standard SQL to run queries over unstructured data on Hadoop
      • integrate SQL queries into Enterprise data workflows
      • don’t have to learn new syntax or modify queries
      Speaker notes: Cascading worked with another team which had substantial experience building an ANSI SQL compliant parser/optimizer, ready for integration with popular Enterprise relational frameworks.
    • ANSI SQL – CSV data in local file system
      Speaker notes: The test database for MySQL is available for download from https://launchpad.net/test-db/ – here we have a bunch o’ CSV flat files in a directory in the local file system.
    • ANSI SQL – simple DDL overlay
      Speaker notes: Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
    • ANSI SQL – shell prompt, catalog
      Speaker notes: Use the “lingual” SQL shell prompt to run SQL queries interactively, show the catalog, etc.
    • ANSI SQL – queries
      Speaker notes: Here’s an example SQL query on that “employee” test database from MySQL.
    • ANSI SQL – JDBC driver

          public void run() throws ClassNotFoundException, SQLException {
            Class.forName( "cascading.lingual.jdbc.Driver" );
            Connection connection = DriverManager.getConnection(
              "jdbc:lingual:local;schemas=src/main/resources/data/example" );
            Statement statement = connection.createStatement();

            ResultSet resultSet = statement.executeQuery(
              "select *\n"
              + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
              + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
              + "on e.\"EMPID\" = s.\"CUST_ID\"" );

            while( resultSet.next() ) {
              int n = resultSet.getMetaData().getColumnCount();
              StringBuilder builder = new StringBuilder();

              for( int i = 1; i <= n; i++ ) {
                builder.append( ( i > 1 ? "; " : "" )
                  + resultSet.getMetaData().getColumnLabel( i )
                  + "=" + resultSet.getObject( i ) );
              }
              System.out.println( builder );
            }

            resultSet.close();
            statement.close();
            connection.close();
          }

      Speaker notes: Note that in this example the schema for the DDL has been derived directly from the CSV files. In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
    • ANSI SQL – JDBC result set

          $ gradle clean jar
          $ hadoop jar build/libs/lingual-examples–1.0.0-wip-dev.jar

          CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
          CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian

      Caveat: if you absolutely positively must have sub-second SQL query response for Pb-scale data on a 1000+ node cluster… Good luck with that!! Call the MPP vendors. This ANSI SQL library is primarily intended for batch workflows – high throughput, not low-latency – for many under-represented use cases in Enterprise IT.
      Speaker notes: Success.
    • Enterprise Data Workflows – for the scientist on your shopping list…
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: For the data scientists in the audience…
    • example model integration
      • use customer order history as the training data set
      • train a risk classifier for orders, using Random Forest
      • export the model from R to PMML
      • build a Cascading app to execute the PMML model
      • generate a pipeline from the PMML description
      • planner builds a flow for a topology (Hadoop)
      • compile the app to a JAR file
      • deploy the app at scale
      [architecture diagram: an analyst’s laptop does data prep and training on customer transactions; Cascading apps run the exported PMML model – Hadoop for batch workloads (predict model costs, score new orders, detect fraudsters) and an IMDG for real-time workloads (anomaly detection, segment customers, velocity metrics) – with ETL into a DW for chargebacks, partner data, etc.]
      cascading.org/pattern
    • example model integration
      [same architecture diagram]
    • model creation in R

          ## train a RandomForest model
          f <- as.formula("as.factor(label) ~ .")
          fit <- randomForest(f, data_train, ntree=50)

          ## test the model on the holdout test set
          print(fit$importance)
          print(fit)

          predicted <- predict(fit, data)
          data$predicted <- predicted
          confuse <- table(pred = predicted, true = data[,1])
          print(confuse)

          ## export predicted labels to TSV
          write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
            quote=FALSE, sep="\t", row.names=FALSE)

          ## export RF model to PMML
          saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
    • model exported as PMML

          <?xml version="1.0"?>
          <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
           <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
            <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
            <Application name="Rattle/PMML" version="1.2.30"/>
            <Timestamp>2012-10-22 19:39:28</Timestamp>
           </Header>
           <DataDictionary numberOfFields="4">
            <DataField name="label" optype="categorical" dataType="string">
             <Value value="0"/>
             <Value value="1"/>
            </DataField>
            <DataField name="var0" optype="continuous" dataType="double"/>
            <DataField name="var1" optype="continuous" dataType="double"/>
            <DataField name="var2" optype="continuous" dataType="double"/>
           </DataDictionary>
           <MiningModel modelName="randomForest_Model" functionName="classification">
            <MiningSchema>
             <MiningField name="label" usageType="predicted"/>
             <MiningField name="var0" usageType="active"/>
             <MiningField name="var1" usageType="active"/>
             <MiningField name="var2" usageType="active"/>
            </MiningSchema>
            <Segmentation multipleModelMethod="majorityVote">
             <Segment id="1">
              <True/>
              <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
               <MiningSchema>
                <MiningField name="label" usageType="predicted"/>
                <MiningField name="var0" usageType="active"/>
                <MiningField name="var1" usageType="active"/>
                <MiningField name="var2" usageType="active"/>
               </MiningSchema>
          ...
    • model run at scale as a Cascading app
      [flow diagram: Customer Orders → Classify (PMML Model) → Assert → GroupBy token → Count → Scored Orders, with Failure Traps and a Confusion Matrix output]
      cascading.org/pattern
    • model run at scale as a Cascading app

          public class Main {
            public static void main( String[] args ) {
              String pmmlPath = args[ 0 ];
              String ordersPath = args[ 1 ];
              String classifyPath = args[ 2 ];
              String trapPath = args[ 3 ];

              Properties properties = new Properties();
              AppProps.setApplicationJarClass( properties, Main.class );
              HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

              // create source and sink taps
              Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
              Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
              Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

              // define a "Classifier" model from PMML to evaluate the orders
              ClassifierFunction classFunc =
                new ClassifierFunction( new Fields( "score" ), pmmlPath );
              Pipe classifyPipe =
                new Each( new Pipe( "classify" ), classFunc.getInputFields(), classFunc, Fields.ALL );

              // connect the taps, pipes, etc., into a flow
              FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
                .addSource( classifyPipe, ordersTap )
                .addTrap( classifyPipe, trapTap )
                .addSink( classifyPipe, classifyTap );

              // write a DOT file and run the flow
              Flow classifyFlow = flowConnector.connect( flowDef );
              classifyFlow.writeDOT( "dot/classify.dot" );
              classifyFlow.complete();
            }
          }
    • app deployed in the AWS cloud

          # replace with your S3 bucket
          BUCKET=temp.cascading.org/pattern
          SINK=out
          PMML=sample.rf.xml
          DATA=sample.tsv

          # clear previous output (required by Apache Hadoop)
          s3cmd del -r s3://$BUCKET/$SINK

          # load built JAR + input data
          s3cmd put build/libs/pattern.jar s3://$BUCKET/
          s3cmd put data/$PMML s3://$BUCKET/
          s3cmd put data/$DATA s3://$BUCKET/

          # launch cluster and run
          elastic-mapreduce --create --name "RF" \
            --debug --enable-debugging --log-uri s3n://$BUCKET/logs \
            --jar s3n://$BUCKET/pattern.jar \
            --arg s3n://$BUCKET/$PMML \
            --arg s3n://$BUCKET/$DATA \
            --arg s3n://$BUCKET/$SINK/classify \
            --arg s3n://$BUCKET/$SINK/trap
    • results

          bash-3.2$ head output/classify/part-00000
          label  var0  var1  var2  order_id  predicted  score
          1      0     1     0     6f8e1014  1          1
          0      0     0     1     6f8ea22e  0          0
          1      0     1     0     6f8ea435  1          1
          0      0     0     1     6f8ea5e1  0          0
          1      0     1     0     6f8ea785  1          1
          1      0     1     0     6f8ea91e  1          1
          0      1     0     0     6f8eaaba  0          0
          1      0     1     0     6f8eac54  1          1
          0      1     1     0     6f8eade3  1          1

      cascading.org/pattern
    • Enterprise Data Workflows – for the developer on your shopping list…
      [flow diagram: Word Count with stop-word filtering]
      Speaker notes: For the J2EE / Scala / Clojure developers in the audience…
    • Cascalog app for a recommender system
      [flow diagram: GIS export parsed by regex into tree, road, and park data, with failure traps; tree species scrubbed and heights estimated via metadata joins, then geohashed; road segments filtered, joined with traffic and albedo metadata; shade estimated from a sum of moments of tree height × distance; GPS logs aggregated by geohash for visit counts and recency; all joined into a reco output]
      github.com/Cascading/CoPA – gist.github.com/ceteri/4641263
      Speaker notes: A conceptual flow diagram for the entire batch workflow.
    • City of Palo Alto
      recently began an Open Data initiative to give the local community greater visibility into how city government operates
      intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good
      paloalto.opendata.junar.com/dashboards/7576/geographic-information/
      • trees, parks
      • creek levels
      • roads
      • bike paths
      • zoning
      • library visits
      • utility usage
      • street sweeping
      Speaker notes: The City of Palo Alto has recently begun to support Open Data to give the local community greater visibility into how their city government functions. This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good.
    • process:
      1. unstructured data about municipal infrastructure (GIS data: trees, roads, parks) ✚
      2. unstructured data about where people like to walk (smartphone GPS logs) ✚
      3. a wee bit o’ curated metadata ⟹
      4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.”
      [flow diagram: CoPA batch workflow, as above]
      Speaker notes: We merge unstructured geo data about municipal infrastructure (GIS data: trees, roads, parks) + unstructured data about where people like to walk (smartphone GPS logs) + a little curated metadata ⟹ personalized recommendations.
    • GIS export – raw, unstructured data

          "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","
          Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203
          Tree Site: 2 Species: Celtis australis Source: davey tree Protected:
          Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40
          Active Numeric: 1 Location Feature ID: 13872 Provisional:
          Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"

          "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20
          Street_Name: Wilkie Way From Street PMMS: West Meadow Drive
          To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto)
          From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950
          Traffic Count: 596 Traffic Index: residential local Traffic Class:
          local residential Traffic Date: 08/24/90 Paving Length: 208
          Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete
          Surface Thickness: 2.0 Base Type Pvmt: crusher run base
          Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type:
          Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1
          District Number: 18 Land Use PMMS: 1 Overlay Year: 1990
          Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6
          Surface Treatment Year: Surface Treatment Type: Alligator Severity: none
          Alligator Extent: 0 Block Severity: none Block Extent: 0
          Longitude and Transverse Severity: none Longitude and Transverse Extent: 0
          Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none
          Trench Severity: none Trench Extent: 0 Rutting Severity: none
          Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0
          Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority:
          Pavement Condition: excellent Street Cut Fee per SqFt: 10.00
          Source Date: 6/10/2009 User Modified By: mnicols
          Identifier System: 21410
          ","-122.1249640794,37.4155803115645,0.0
          -122.124661859039,37.4154224594993,0.0
          -122.124587720719,37.4153758330704,0.0
          -122.12451895942,37.4153242300888,0.0
          -122.124456098457,37.4152680432944,0.0

      Speaker notes: Here’s what we have to work with – raw GIS export as CSV, with plenty o’ errors too for good measure. This illustrates the quintessence of “unstructured data”. Alligator Severity! Rutting Extent!
    • Leiningen REPL prompt – ad-hoc queries, logical predicates, Cascading flows
      Speaker notes: First we load `lein repl` to get an interactive prompt for Clojure… bring the Cascalog libraries into Clojure… define functions to use… and execute queries. Then we convert the queries into composable, logical propositions.
    • curate valuable metadata
      Speaker notes: Since we can find species and geolocation for each tree, let’s add some metadata to infer other valuable data results, e.g., tree height – based on Wikipedia.org, Calflora.org, USDA.gov, etc.
    • Cascalog – an example

          (defn get-trees [src trap tree_meta]
            "subquery to parse/filter the tree data"
            (<- [?blurb ?tree_id ?situs ?tree_site
                 ?species ?wikipedia ?calflora ?avg_height
                 ?tree_lat ?tree_lng ?tree_alt ?geohash]
                (src ?blurb ?misc ?geo ?kind)
                (re-matches #"^\s+Private.*Tree ID.*" ?misc)
                (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
                ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
                (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
                (avg ?min_height ?max_height :> ?avg_height)
                (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
                (read-string ?tree_lat :> ?lat)
                (read-string ?tree_lng :> ?lng)
                (geohash ?lat ?lng :> ?geohash)
                (:trap (hfs-textline trap))))

      Speaker notes: Let’s use Cascalog to begin our process of structuring that data. Since the GIS export is vaguely in CSV format, here’s a simple way to clean up the data. Referring back to DJ Patil’s “Data Jujitsu”, that clean-up usually accounts for 80% of project costs.
    • GIS export ⟹ “trees” data product

          ?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
          ?tree_id     412
          ?situs       115
          ?tree_site   1
          ?species     liquidambar styraciflua
          ?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
          ?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
          ?avg_height  27.5
          ?tree_lat    37.446001565119
          ?tree_lng    -122.167713417554
          ?tree_alt    0.0
          ?geohash     9q9jh0

      [flow diagram: GIS export → Regex parse-tree → Scrub species → join with Tree Metadata → Estimate height → Geohash, with Failure Traps]
      Speaker notes: Great, now we have a data product about trees in Palo Alto, which has been enriched by our process. BTW, those geolocation fields are especially important. Also, here’s a conceptual flow diagram, showing a directed, acyclic graph (DAG) of data taps, tuple streams, operations, joins, assertions, aggregations, etc.
    • add to that some road data…
      • traffic class (arterial, truck route, residential, etc.)
      • traffic counts distribution
      • surface type (asphalt, cement; age)
      from which we derive estimators for noise, reflection, etc.
      Speaker notes: Analysis and visualizations from RStudio: frequency of traffic classes; density plot of traffic counts.
    • recommender – data objects
      each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:

          -122.161776959558,37.4518836690781,0.0
          -122.161390381489,37.4516410983794,0.0
          -122.160786011735,37.4512589903357,0.0
          -122.160531178368,37.4510977281699,0.0

      ( lat0, lng0, alt0 ) … ( lat1, lng1, alt1 ) … ( lat2, lng2, alt2 ) … ( lat3, lng3, alt3 )
      NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
      Speaker notes: Each road is listed in the GIS export as a block between two cross roads, and each may have multiple road segments to represent turns.
    • recommender – spatial indexing
      a geohash with 6-digit resolution approximates a 5-block square centered at lat: 37.445, lng: -122.162 ⟹ 9q9jh0
      Speaker notes: We use “geohash” codes for “cheap and dirty” geospatial indexing suited for parallel processing (Hadoop). Much more effective methods exist; however, this is simple to show. 6-digit resolution on a geohash generates approximately a 5-block square.
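      For readers unfamiliar with geohashes, here is a small illustrative encoder (not part of the original slides) showing how the 6-character code above is derived – interleave longitude/latitude bisection bits, then base32-encode every 5 bits:

          public class Geohash {
            private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

            public static String encode(double lat, double lng, int precision) {
              double[] latRange = { -90.0, 90.0 };
              double[] lngRange = { -180.0, 180.0 };
              StringBuilder hash = new StringBuilder();
              boolean evenBit = true; // alternate lng, lat, lng, lat, ...
              int bit = 0, ch = 0;

              while (hash.length() < precision) {
                double[] range = evenBit ? lngRange : latRange;
                double value = evenBit ? lng : lat;
                double mid = (range[0] + range[1]) / 2.0;
                if (value >= mid) { ch = (ch << 1) | 1; range[0] = mid; }
                else { ch = ch << 1; range[1] = mid; }
                evenBit = !evenBit;
                if (++bit == 5) { // every 5 bits becomes one base32 character
                  hash.append(BASE32.charAt(ch));
                  bit = 0;
                  ch = 0;
                }
              }
              return hash.toString();
            }

            public static void main(String[] args) {
              // the example from the slide: downtown Palo Alto
              System.out.println(encode(37.445, -122.162, 6)); // prints 9q9jh0
            }
          }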
    • recommender – estimators
      calculate a sum of moments for tree height × distance from the road segment, as an estimator for shade: ∑( h·d )
      also calculate estimators for traffic frequency and noise
      Speaker notes: Calculate a sum of moments for tree height × distance from center; approximate, but pretty good. Also calculate estimators for traffic frequency and noise.
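      As a tiny illustrative sketch of the ∑( h·d ) shade estimator for one road segment (assumed structure, not from the CoPA source):

          public class ShadeEstimator {
            // sum of moments: tree height times distance from the road segment,
            // over all trees near the segment's geohash cell
            static double sumMoment(double[] heights, double[] distances) {
              double sum = 0.0;
              for (int i = 0; i < heights.length; i++)
                sum += heights[i] * distances[i];
              return sum;
            }

            public static void main(String[] args) {
              // two nearby trees: heights in meters, distances in meters
              System.out.println(sumMoment(new double[] { 23.0, 27.5 },
                                           new double[] { 0.1, 0.05 }));
            }
          }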
    • recommender – personalization via GPS tracks
      Speaker notes: Here’s a Splunk screen showing GPS tracks log data from smartphones.
    • recommender – strategy
      recommenders combine multiple signals, generally via weighted averages, to rank personalized results:
      • GPS of person ∩ road segment
      • frequency and recency of visit
      • traffic class and rate
      • road albedo (sunlight reflection)
      • tree shade estimator
      adjusting the mix allows for further personalization at the end use – see the sketch below
      Speaker notes: One approach to building commercial recommender systems is to take a vector of different preference metrics, combine into a single sortable value, then rank the results before making personalized suggestions. The resulting data in the “reco” output set produces exactly that.
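      A minimal sketch of that weighted-average ranking, with made-up signal names and weights (the actual CoPA mix is not shown on the slide):

          import java.util.Comparator;
          import java.util.List;

          public class RecoRanker {
            // illustrative weights: tune these to personalize the mix
            static final double W_VISIT = 0.2, W_TRAFFIC = 0.3, W_SHADE = 0.5;

            record Segment(String geohash, double visitScore,
                           double trafficScore, double shadeScore) {}

            static double score(Segment s) {
              // weighted average of normalized signals, one sortable value
              return W_VISIT * s.visitScore()
                   + W_TRAFFIC * s.trafficScore()
                   + W_SHADE * s.shadeScore();
            }

            static void rank(List<Segment> segments) {
              // highest combined score first
              segments.sort(Comparator.comparingDouble(RecoRanker::score).reversed());
            }
          }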
    • recommender – results
      ‣ addr: 115 HAWTHORNE AVE
      ‣ lat/lng: 37.446, -122.168
      ‣ geohash: 9q9jh0
      ‣ tree: 413 site 2
      ‣ species: Liquidambar styraciflua
      ‣ est. height: 23 m
      ‣ shade metric: 4.363
      ‣ traffic: local residential, light traffic
      ‣ recent visit: 1972376952532
      ‣ a short walk from my train stop ✔
      Speaker notes: One of the top recommendations for me is about two blocks from my train stop, where a couple of really big American Sweetgum trees provide ample shade on a residential street with not much traffic.
    • Enterprise Data Workflows – summary, references
      [flow diagram: Word Count with stop-word filtering]
    • Cascading – summary
      ‣ leverages a workflow abstraction for Enterprise apps
      ‣ provides a pattern language for system integration
      ‣ utilizes existing practices for JVM-based clusters
      ‣ addresses staffing bottlenecks due to Hadoop adoption
      ‣ manages complexity as data continues to scale massively
      ‣ reduces costs, while servicing risk-averse “conservatism”
      Speaker notes: In summary…
    • references…
      Enterprise Data Workflows with Cascading, by Paco Nathan, O’Reilly, 2013 – amazon.com/dp/1449358721
      Santa Clara, Feb 28, 1:30pm – strataconf.com/strata2013
      Speaker notes: Some of this material comes from an upcoming O’Reilly book, “Enterprise Data Workflows with Cascading”. Also, come to the Strata conference! I’ll be presenting related material at http://strataconf.com/strata2013/public/schedule/detail/27073 – Santa Clara, Feb 28, 1:30pm.
    • drill-down…
      blog, community, code/wiki/gists, maven repo, products:
      cascading.org – zest.to/group11 – github.com/Cascading – conjars.org – goo.gl/KQtUL – concurrentinc.com
      join us, we are hiring!
      Copyright @2013, Concurrent, Inc.
      Speaker notes: Links to our open source projects, developer community, etc… Contact me @pacoid – http://concurrentinc.com/ (we’re hiring too!)