19th ACM SIGKDD
KDD 2013 PMML Workshop
Conference on Knowledge Discovery
and Data Mining
kdd13pmml.wordpress.com
Pattern:
PMML for Cascading and Hadoop
Paco Nathan
Mesosphere, Inc.
Girish Kathalagiri
AgilOne, Inc.
Tuesday, 13 August 13
Pattern: PMML for Cascading and Hadoop
P Nathan, G Kathalagiri (2013-08-11)
Chicago Crime Data
Workflow Abstraction
Cascading, Pattern, etc.
Tuesday, 13 August 13
Pattern: Example App
• example integration of PMML and Cascading, using a sample app
based on the crime dataset from the City of Chicago Open Data
• sample app implements a predictive model for expected crime
rates based on location, hour of day, and month
• modeling performed in R, using the pmml package
• multiple models are captured as PMML, then integrated via
Pattern to implement the entire workflow as a single app
• PMML provides a vector for migrating workloads off of SAS,
SPSS, etc., onto Hadoop clusters for more cost-effective scaling
Tuesday, 13 August 13
Pattern: Example App
City of Chicago Open Data portal
cityofchicago.org/city/en/narr/foia/CityData.html
Pattern open source project
github.com/Cascading/pattern
Observed benefits include greatly reduced development costs
and less licensing issues at scale, while leveraging the scalability
of Apache Hadoop clusters, existing intellectual property in
predictive models, and the core competencies of analytics staff.
Analysts can train predictive models in popular analytics
frameworks, such as SAS, Microstrategy, R,Weka, SQL Server,
etc., then run those models at scale on Apache Hadoop with
little or no coding required.
Tuesday, 13 August 13
API Support for Model Chaining, Transforms, etc.
workflow used for data preparation:
Tuesday, 13 August 13
API Support for Model Chaining, Transforms, etc.
workflow used for model scoring:
Tuesday, 13 August 13
Pattern: PMML for Cascading and Hadoop
P Nathan, G Kathalagiri (2013-08-11)
Chicago Crime Data
Workflow Abstraction
Cascading, Pattern, etc.
Tuesday, 13 August 13
Enterprise Data Workflows
middleware for Big Data applications is evolving,
with commercial examples that include:
Cascading, Lingual, Pattern, etc.
Concurrent
Anaconda,Wakari, IPython Notebook, etc.
Continuum Analytics
ParAccel Big Data Analytics Platform
Actian
ETL
data
prep
predictive
model
data
sources
end
uses
Tuesday, 13 August 13
Anatomy of an Enterprise app
Definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
ANSI SQL for ETL
Tuesday, 13 August 13
Anatomy of an Enterprise app
Definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
Tuesday, 13 August 13
Anatomy of an Enterprise app
Definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive models
Tuesday, 13 August 13
Anatomy of an Enterprise app
Definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
uses
SAS for predictive modelsANSI SQL for ETL most of the licensing costs…
Tuesday, 13 August 13
Anatomy of an Enterprise app
Definition of a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
ETL
data
prep
predictive
model
data
sources
end
usesJ2EE for business logic
most of the project costs…
Tuesday, 13 August 13
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org
Tuesday, 13 August 13
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );
 
SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );
 
flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org
Tuesday, 13 August 13
a compiler sees it all…
ETL
data
prep
predictive
model
data
sources
end
uses
Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Anatomy of an Enterprise app
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
 
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();
 
flowDef.addAssemblyPlanner( pmmlPlanner );
Tuesday, 13 August 13
Pattern: PMML for Cascading and Hadoop
P Nathan, G Kathalagiri (2013-08-11)
Chicago Crime Data
Workflow Abstraction
Cascading, Pattern, etc.
Tuesday, 13 August 13
Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop
Cascading – used for their large-scale production
deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-
practices-will-improve-your-return-from-technology/
Tuesday, 13 August 13
Functional Programming for Big Data
WordCount with token scrubbing…
Apache Hive: 52 lines HQL + 8 lines Python (UDF)
compared to
Scalding: 18 lines Scala/Cascading
functional programming languages help reduce
software engineering costs at scale, over time
Tuesday, 13 August 13
Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in Java
to define workflows out of familiar elements:
Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Data is represented as flows of tuples. Operations
in the flows bring functional programming aspects
into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199
Tuesday, 13 August 13
Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Scrub
token
Document
Collection
Tokenize
Word
Count
GroupBy
token
Count
Stop Word
List
Regex
token
HashJoin
Left
RHS
M
R
Literate Programming
Don Knuth
literateprogramming.com
Tuesday, 13 August 13
Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale
Tuesday, 13 August 13
Customer
Orders
Classify
Scored
Orders
GroupBy
token
Count
PMML
Model
M R
Failure
Traps
Assert
Confusion
Matrix
Pattern – score a model, using pre-defined Cascading app
cascading.org/pattern
Tuesday, 13 August 13
Hadoop
Cluster
source
tap
source
tap sink
tap
trap
tap
customer
profile DBsCustomer
Prefs
logs
logs
Logs
Data
Workflow
Cache
Customers
Support
Web
App
Reporting
Analytics
Cubes
sink
tap
Modeling PMML
Pattern – model scoring
• migrate workloads: SAS,Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
Matrix API, etc.
• leverage PMML as another kind
of DSL
cascading.org/pattern
Tuesday, 13 August 13
public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
  // create source and sink taps
Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );
  // handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
  OptionSet options = optParser.parse( args );
 
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );
 
if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}
 
// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
Pattern – score a model, within an app
Tuesday, 13 August 13
Enterprise DataWorkflows with Cascading
O’Reilly, 2013
shop.oreilly.com/product/
0636920028536.do
Newsletter for analysis, events, updates:
liber118.com/pxn/
Tuesday, 13 August 13

Pattern: PMML for Cascading and Hadoop

  • 1.
    19th ACM SIGKDD KDD2013 PMML Workshop Conference on Knowledge Discovery and Data Mining kdd13pmml.wordpress.com Pattern: PMML for Cascading and Hadoop Paco Nathan Mesosphere, Inc. Girish Kathalagiri AgilOne, Inc. Tuesday, 13 August 13
  • 2.
    Pattern: PMML forCascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) Chicago Crime Data Workflow Abstraction Cascading, Pattern, etc. Tuesday, 13 August 13
  • 3.
    Pattern: Example App •example integration of PMML and Cascading, using a sample app based on the crime dataset from the City of Chicago Open Data • sample app implements a predictive model for expected crime rates based on location, hour of day, and month • modeling performed in R, using the pmml package • multiple models are captured as PMML, then integrated via Pattern to implement the entire workflow as a single app • PMML provides a vector for migrating workloads off of SAS, SPSS, etc., onto Hadoop clusters for more cost-effective scaling Tuesday, 13 August 13
  • 4.
    Pattern: Example App Cityof Chicago Open Data portal cityofchicago.org/city/en/narr/foia/CityData.html Pattern open source project github.com/Cascading/pattern Observed benefits include greatly reduced development costs and less licensing issues at scale, while leveraging the scalability of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff. Analysts can train predictive models in popular analytics frameworks, such as SAS, Microstrategy, R,Weka, SQL Server, etc., then run those models at scale on Apache Hadoop with little or no coding required. Tuesday, 13 August 13
  • 5.
    API Support forModel Chaining, Transforms, etc. workflow used for data preparation: Tuesday, 13 August 13
  • 6.
    API Support forModel Chaining, Transforms, etc. workflow used for model scoring: Tuesday, 13 August 13
  • 7.
    Pattern: PMML forCascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) Chicago Crime Data Workflow Abstraction Cascading, Pattern, etc. Tuesday, 13 August 13
  • 8.
    Enterprise Data Workflows middlewarefor Big Data applications is evolving, with commercial examples that include: Cascading, Lingual, Pattern, etc. Concurrent Anaconda,Wakari, IPython Notebook, etc. Continuum Analytics ParAccel Big Data Analytics Platform Actian ETL data prep predictive model data sources end uses Tuesday, 13 August 13
  • 9.
    Anatomy of anEnterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses ANSI SQL for ETL Tuesday, 13 August 13
  • 10.
    Anatomy of anEnterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic Tuesday, 13 August 13
  • 11.
    Anatomy of anEnterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models Tuesday, 13 August 13
  • 12.
    Anatomy of anEnterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive modelsANSI SQL for ETL most of the licensing costs… Tuesday, 13 August 13
  • 13.
    Anatomy of anEnterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic most of the project costs… Tuesday, 13 August 13
  • 14.
    ETL data prep predictive model data sources end uses Lingual: DW → ANSISQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source a compiler sees it all… cascading.org Tuesday, 13 August 13
  • 15.
    a compiler seesit all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap );   SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement );   flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org Tuesday, 13 August 13
  • 16.
    a compiler seesit all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields();   flowDef.addAssemblyPlanner( pmmlPlanner ); Tuesday, 13 August 13
  • 17.
    Pattern: PMML forCascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) Chicago Crime Data Workflow Abstraction Cascading, Pattern, etc. Tuesday, 13 August 13
  • 18.
    Cascading – functionalprogramming • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming- practices-will-improve-your-return-from-technology/ Tuesday, 13 August 13
  • 19.
    Functional Programming forBig Data WordCount with token scrubbing… Apache Hive: 52 lines HQL + 8 lines Python (UDF) compared to Scalding: 18 lines Scala/Cascading functional programming languages help reduce software engineering costs at scale, over time Tuesday, 13 August 13
  • 20.
    Workflow Abstraction –pattern language Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Data is represented as flows of tuples. Operations in the flows bring functional programming aspects into Java A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199 Tuesday, 13 August 13
  • 21.
    Workflow Abstraction –literate programming Cascading workflows generate their own visual documentation: flow diagrams in formal terms, flow diagrams leverage a methodology called literate programming provides intuitive, visual representations for apps – great for cross-team collaboration Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Literate Programming Don Knuth literateprogramming.com Tuesday, 13 August 13
  • 22.
    Workflow Abstraction –business process following the essence of literate programming, Cascading workflows provide statements of business process this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale Tuesday, 13 August 13
  • 23.
    Customer Orders Classify Scored Orders GroupBy token Count PMML Model M R Failure Traps Assert Confusion Matrix Pattern –score a model, using pre-defined Cascading app cascading.org/pattern Tuesday, 13 August 13
  • 24.
    Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap ModelingPMML Pattern – model scoring • migrate workloads: SAS,Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL cascading.org/pattern Tuesday, 13 August 13
  • 25.
    public static voidmain( String[] args ) throws RuntimeException { String inputPath = args[ 0 ]; String classifyPath = args[ 1 ]; // set up the config properties Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );   // create source and sink taps Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath ); Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );   // handle command line options OptionParser optParser = new OptionParser(); optParser.accepts( "pmml" ).withRequiredArg();   OptionSet options = optParser.parse( args );   // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "classify" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   if( options.hasArgument( "pmml" ) ) { String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlPath ) ) .retainOnlyActiveIncomingFields() .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model flowDef.addAssemblyPlanner( pmmlPlanner ); }   // write a DOT file and run the flow Flow classifyFlow = flowConnector.connect( flowDef ); classifyFlow.writeDOT( "dot/classify.dot" ); classifyFlow.complete(); } Pattern – score a model, within an app Tuesday, 13 August 13
  • 26.
    Enterprise DataWorkflows withCascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do Newsletter for analysis, events, updates: liber118.com/pxn/ Tuesday, 13 August 13