Loading…

Flash Player 9 (or above) is needed to view presentations.
We have detected that you do not have it on your computer. To install it, go here.

Like this presentation? Why not share!

Pattern: PMML for Cascading and Hadoop

on

  • 2,159 views

Paper by Paco Nathan (Mesosphere) and Girish Kathalagiri (AgilOne) presented at the PMML Workshop (2013-08-11) at KDD 2013 in Chicago http://kdd13pmml.wordpress.com/ ...

Paper by Paco Nathan (Mesosphere) and Girish Kathalagiri (AgilOne) presented at the PMML Workshop (2013-08-11) at KDD 2013 in Chicago http://kdd13pmml.wordpress.com/

The paper uses Open Data from the City of Chicago to build predictive models for crime based on seasonality, geolocation, and other factors. The modeling illustrates use of the Pattern library https://github.com/Cascading/pattern in Cascading to import PMML -- in this case, the use of model chaining to create ensembles.

Statistics

Views

Total Views
2,159
Views on SlideShare
2,110
Embed Views
49

Actions

Likes
3
Downloads
41
Comments
0

2 Embeds 49

https://twitter.com 40
http://lanyrd.com 9

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution-NonCommercial-NoDerivs LicenseCC Attribution-NonCommercial-NoDerivs LicenseCC Attribution-NonCommercial-NoDerivs License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Pattern: PMML for Cascading and Hadoop Pattern: PMML for Cascading and Hadoop Presentation Transcript

    • 19th ACM SIGKDD KDD 2013 PMML Workshop Conference on Knowledge Discovery and Data Mining kdd13pmml.wordpress.com Pattern: PMML for Cascading and Hadoop Paco Nathan Mesosphere, Inc. Girish Kathalagiri AgilOne, Inc. Tuesday, 13 August 13
    • Pattern: PMML for Cascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) Chicago Crime Data Workflow Abstraction Cascading, Pattern, etc. Tuesday, 13 August 13
    • Pattern: Example App • example integration of PMML and Cascading, using a sample app based on the crime dataset from the City of Chicago Open Data • sample app implements a predictive model for expected crime rates based on location, hour of day, and month • modeling performed in R, using the pmml package • multiple models are captured as PMML, then integrated via Pattern to implement the entire workflow as a single app • PMML provides a vector for migrating workloads off of SAS, SPSS, etc., onto Hadoop clusters for more cost-effective scaling Tuesday, 13 August 13
    • Pattern: Example App City of Chicago Open Data portal cityofchicago.org/city/en/narr/foia/CityData.html Pattern open source project github.com/Cascading/pattern Observed benefits include greatly reduced development costs and less licensing issues at scale, while leveraging the scalability of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff. Analysts can train predictive models in popular analytics frameworks, such as SAS, Microstrategy, R,Weka, SQL Server, etc., then run those models at scale on Apache Hadoop with little or no coding required. Tuesday, 13 August 13
    • API Support for Model Chaining, Transforms, etc. workflow used for data preparation: Tuesday, 13 August 13
    • API Support for Model Chaining, Transforms, etc. workflow used for model scoring: Tuesday, 13 August 13
    • Pattern: PMML for Cascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) Chicago Crime Data Workflow Abstraction Cascading, Pattern, etc. Tuesday, 13 August 13
    • Enterprise Data Workflows middleware for Big Data applications is evolving, with commercial examples that include: Cascading, Lingual, Pattern, etc. Concurrent Anaconda,Wakari, IPython Notebook, etc. Continuum Analytics ParAccel Big Data Analytics Platform Actian ETL data prep predictive model data sources end uses Tuesday, 13 August 13
    • Anatomy of an Enterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses ANSI SQL for ETL Tuesday, 13 August 13
    • Anatomy of an Enterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic Tuesday, 13 August 13
    • Anatomy of an Enterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive models Tuesday, 13 August 13
    • Anatomy of an Enterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end uses SAS for predictive modelsANSI SQL for ETL most of the licensing costs… Tuesday, 13 August 13
    • Anatomy of an Enterprise app Definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies… ETL data prep predictive model data sources end usesJ2EE for business logic most of the project costs… Tuesday, 13 August 13
    • ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source a compiler sees it all… cascading.org Tuesday, 13 August 13
    • a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "etl" ) .addSource( "example.employee", emplTap ) .addSource( "example.sales", salesTap ) .addSink( "results", resultsTap );   SQLPlanner sqlPlanner = new SQLPlanner() .setSql( sqlStatement );   flowDef.addAssemblyPlanner( sqlPlanner ); cascading.org Tuesday, 13 August 13
    • a compiler sees it all… ETL data prep predictive model data sources end uses Lingual: DW → ANSI SQL Pattern: SAS, R, etc. → PMML business logic in Java, Clojure, Scala, etc. sink taps for Memcached, HBase, MongoDB, etc. source taps for Cassandra, JDBC, Splunk, etc. Anatomy of an Enterprise app Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source FlowDef flowDef = FlowDef.flowDef() .setName( "classifier" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlModel ) ) .retainOnlyActiveIncomingFields();   flowDef.addAssemblyPlanner( pmmlPlanner ); Tuesday, 13 August 13
    • Pattern: PMML for Cascading and Hadoop P Nathan, G Kathalagiri (2013-08-11) Chicago Crime Data Workflow Abstraction Cascading, Pattern, etc. Tuesday, 13 August 13
    • Cascading – functional programming • Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010) Scalding in Scala (2012) github.com/nathanmarz/cascalog/wiki github.com/twitter/scalding/wiki Why Adopting the Declarative Programming PracticesWill ImproveYour Return fromTechnology Dan Woods, 2013-04-17 Forbes forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming- practices-will-improve-your-return-from-technology/ Tuesday, 13 August 13
    • Functional Programming for Big Data WordCount with token scrubbing… Apache Hive: 52 lines HQL + 8 lines Python (UDF) compared to Scalding: 18 lines Scala/Cascading functional programming languages help reduce software engineering costs at scale, over time Tuesday, 13 August 13
    • Workflow Abstraction – pattern language Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc. Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Data is represented as flows of tuples. Operations in the flows bring functional programming aspects into Java A Pattern Language Christopher Alexander, et al. amazon.com/dp/0195019199 Tuesday, 13 August 13
    • Workflow Abstraction – literate programming Cascading workflows generate their own visual documentation: flow diagrams in formal terms, flow diagrams leverage a methodology called literate programming provides intuitive, visual representations for apps – great for cross-team collaboration Scrub token Document Collection Tokenize Word Count GroupBy token Count Stop Word List Regex token HashJoin Left RHS M R Literate Programming Don Knuth literateprogramming.com Tuesday, 13 August 13
    • Workflow Abstraction – business process following the essence of literate programming, Cascading workflows provide statements of business process this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data) Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.) this is especially apparent in large-scale Cascalog apps: “Specify what you require, not how to achieve it.” by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale Tuesday, 13 August 13
    • Customer Orders Classify Scored Orders GroupBy token Count PMML Model M R Failure Traps Assert Confusion Matrix Pattern – score a model, using pre-defined Cascading app cascading.org/pattern Tuesday, 13 August 13
    • Hadoop Cluster source tap source tap sink tap trap tap customer profile DBsCustomer Prefs logs logs Logs Data Workflow Cache Customers Support Web App Reporting Analytics Cubes sink tap Modeling PMML Pattern – model scoring • migrate workloads: SAS,Teradata, etc., exporting predictive models as PMML • great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc. • integrate with other libraries – Matrix API, etc. • leverage PMML as another kind of DSL cascading.org/pattern Tuesday, 13 August 13
    • public static void main( String[] args ) throws RuntimeException { String inputPath = args[ 0 ]; String classifyPath = args[ 1 ]; // set up the config properties Properties properties = new Properties(); AppProps.setApplicationJarClass( properties, Main.class ); HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );   // create source and sink taps Tap inputTap = new Hfs( new TextDelimited( true, "t" ), inputPath ); Tap classifyTap = new Hfs( new TextDelimited( true, "t" ), classifyPath );   // handle command line options OptionParser optParser = new OptionParser(); optParser.accepts( "pmml" ).withRequiredArg();   OptionSet options = optParser.parse( args );   // connect the taps, pipes, etc., into a flow FlowDef flowDef = FlowDef.flowDef().setName( "classify" ) .addSource( "input", inputTap ) .addSink( "classify", classifyTap );   if( options.hasArgument( "pmml" ) ) { String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 ); PMMLPlanner pmmlPlanner = new PMMLPlanner() .setPMMLInput( new File( pmmlPath ) ) .retainOnlyActiveIncomingFields() .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model flowDef.addAssemblyPlanner( pmmlPlanner ); }   // write a DOT file and run the flow Flow classifyFlow = flowConnector.connect( flowDef ); classifyFlow.writeDOT( "dot/classify.dot" ); classifyFlow.complete(); } Pattern – score a model, within an app Tuesday, 13 August 13
    • Enterprise DataWorkflows with Cascading O’Reilly, 2013 shop.oreilly.com/product/ 0636920028536.do Newsletter for analysis, events, updates: liber118.com/pxn/ Tuesday, 13 August 13