Pattern: PMML for Cascading and Hadoop

19th ACM SIGKDD
KDD 2013 PMML Workshop
Conference on Knowledge Discovery
and Data Mining
kdd13pmml.wordpress.com
Pattern:
PMML for Cascading and Hadoop
Paco Nathan
Mesosphere, Inc.
Girish Kathalagiri
AgilOne, Inc.
Tuesday, 13 August 13

Pattern: PMML for Cascading and Hadoop
P Nathan, G Kathalagiri (2013-08-11)
Chicago Crime Data
Workﬂow Abstraction
Cascading, Pattern, etc.

Pattern: Example App
• example integration of PMML and Cascading, using a sample app
based on the crime dataset from the City of Chicago Open Data
• sample app implements a predictive model for expected crime
rates based on location, hour of day, and month
• modeling performed in R, using the pmml package
• multiple models are captured as PMML, then integrated via
Pattern to implement the entire workﬂow as a single app
• PMML provides a vector for migrating workloads off of SAS,
SPSS, etc., onto Hadoop clusters for more cost-effective scaling

Pattern: Example App
City of Chicago Open Data portal
cityofchicago.org/city/en/narr/foia/CityData.html
Pattern open source project
github.com/Cascading/pattern
Observed beneﬁts include greatly reduced development costs
and less licensing issues at scale, while leveraging the scalability
of Apache Hadoop clusters, existing intellectual property in
predictive models, and the core competencies of analytics staff.
Analysts can train predictive models in popular analytics
frameworks, such as SAS, Microstrategy, R,Weka, SQL Server,
etc., then run those models at scale on Apache Hadoop with
little or no coding required.

API Support for Model Chaining, Transforms, etc.
workﬂow used for data preparation:

API Support for Model Chaining, Transforms, etc.
workﬂow used for model scoring:

Enterprise Data Workﬂows
middleware for Big Data applications is evolving,
with commercial examples that include:
Cascading, Lingual, Pattern, etc.
Concurrent
Anaconda,Wakari, IPython Notebook, etc.
Continuum Analytics
ParAccel Big Data Analytics Platform
Actian
ETL
data
prep
predictive
model
data
sources
end
uses

Anatomy of an Enterprise app
Deﬁnition of a typical Enterprise workﬂow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL

ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic

ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models

ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive modelsANSI SQL for ETL most of the licensing costs…

ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
most of the project costs…

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Cascading allows multiple departments to combine their workﬂow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org

ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );

Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop
Cascading – used for their large-scale production
deployments
• new case studies for Cascading apps are mostly
based on domain-speciﬁc languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-
practices-will-improve-your-return-from-technology/

Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time

Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Data is represented as flows of tuples. Operations
in the flows bring functional programming aspects
into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199

Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Literate Programming
Don Knuth
literateprogramming.com

Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale

Customer
Orders
Classify
Scored
Orders
GroupBy
token
Count
PMML
Model
M R
Failure
Traps
Assert
Confusion
Matrix
Pattern – score a model, using pre-deﬁned Cascading app
cascading.org/pattern

Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Pattern – model scoring
• migrate workloads: SAS,Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
Matrix API, etc.
• leverage PMML as another kind
of DSL
cascading.org/pattern

public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );
// handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
OptionSet options = optParser.parse( args );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}

// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
Pattern – score a model, within an app

Enterprise DataWorkﬂows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/
0636920028536.do
Newsletter for analysis, events, updates:
liber118.com/pxn/

Pattern: PMML for Cascading and Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Pattern: PMML for Cascading and Hadoop

Similar to Pattern: PMML for Cascading and Hadoop (20)

More from Paco Nathan

More from Paco Nathan (20)

Recently uploaded

Recently uploaded (20)

Pattern: PMML for Cascading and Hadoop