Enterprise Data Workflows with Cascading
 

Cascading meetup held jointly with Enterprise Big Data meetup at Tata Consultancy Services in Santa Clara on 2012-12-17

http://www.meetup.com/cascading/events/94079162/

Usage Rights: CC Attribution-ShareAlike License

    Enterprise Data Workflows with Cascading: Presentation Transcript

    • Enterprise Data Workflows with Cascading. Paco Nathan, Concurrent, Inc. pnathan@concurrentinc.com, @pacoid. Copyright ©2012, Concurrent, Inc. Monday, 17 December 12.
      [Diagram: Cascading flow from Document Collection through Tokenize, Scrub, HashJoin (left) against a Stop Word List (RHS), Regex, GroupBy token, and Count to Word Count; M and R mark map and reduce phases]
    • Unstructured Data meets Enterprise Scale
        1. Cascading API: a few facts & quotes
        2. Example #1: distributed file copy
        3. Example #2: word count
        4. Pattern Language: workflow abstraction
        5. Compare: Scalding, Cascalog, Hive, Pig
    • Intro to Cascading. Cascading API: a few facts & quotes [Diagram: stop-word word count flow]
    • Enterprise apps, pre-Hadoop [Diagram: analyst, developer, and ops roles around a Data Warehouse; elements shown: data sources, ETL, SQL queries, data sets, Analytics Tools, modeling, insights, Apps, dashboards, ad-hoc queries, domain priorities]
    • Enterprise apps, pre-Hadoop: the devil you know
        ‣ “scale up” as needed – larger proprietary hardware
        ‣ data warehouse: e.g., Oracle, Teradata, etc. – expensive
        ‣ analytics: e.g., SAS, MicroStrategy, etc. – expensive
        ‣ highly trained staff in specific roles – lots of “silos”
      However, to be competitive now, the data rates must scale by orders of magnitude... (alternatively, can we get hired onto the SAS sales team?)
    • Enterprise apps, with Hadoop. Apache Hadoop offers an attractive migration path:
        ‣ open source software – less expensive
        ‣ commodity hardware – less expensive
        ‣ fault tolerance for large-scale parallel workloads
        ‣ great use cases: Yahoo!, Facebook, Twitter, Amazon, Apple, etc.
        ‣ offload workflows from licensed platforms, based on “scale-out”
    • Enterprise apps, with Hadoop [Diagram: analyst (queries, models), developer (Java apps), and ops (ETL needs) all working directly against a Hadoop cluster with its job tracker and name node]
    • Enterprise apps, with Hadoop: anything odd about that diagram? [Diagram: same as the previous slide]
        ‣ demands expert Hadoop developers
        ‣ experts are hard to find, expensive
        ‣ even harder to train from among existing staff
        ‣ early adopter abstractions are not suitable for Enterprise IT
        ‣ importantly: Hadoop is almost never used in isolation
    • Cascading API: purpose
        ‣ simplify data processing development and deployment
        ‣ improve application developer productivity
        ‣ enable data processing application manageability
    • Cascading API: a few facts
        ‣ Java open source project (ASL 2) using Git, Gradle, Maven, JUnit, etc.
        ‣ in production (~5 yrs) at hundreds of enterprise Hadoop deployments: Finance, Health Care, Transportation, other verticals
        ‣ studies published about large use cases: Twitter, Etsy, eBay, Airbnb, Square, Climate Corp, FlightCaster, Williams-Sonoma, Trulia, TeleNav
        ‣ partnerships and distribution with SpringSource, Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC
        ‣ several open source projects built atop, managed by Twitter, Etsy, eBay, etc., which provide substantial Machine Learning libraries
        ‣ DSLs available in Scala, Clojure, Python (Jython), Ruby (JRuby), Groovy
        ‣ data “taps” integrate popular data frameworks via JDBC, Memcached, HBase, plus serialization in Apache Thrift, Avro, Kryo, etc.
        ‣ entire app compiles into a single JAR: fully connected for compiler optimization, exception handling, debug, config, scheduling, notifications, provenance, etc.
    • Cascading API: a few quotes
        ‣ “Cascading gives Java developers the ability to build Big Data applications on Hadoop using their existing skillset … Management can really go out and build a team around folks that are already very experienced with Java. Switching over to this is really a very short exercise.” CIO, Thor Olavsrud, 2012-06-06, cio.com/article/707782/Ease_Big_Data_Hiring_Pain_With_Cascading
        ‣ “Masks the complexity of MapReduce, simplifies the programming, and speeds you on your journey toward actionable analytics … A vast improvement over native MapReduce functions or Pig UDFs.” 2012 BOSSIE Awards, James Borck, 2012-09-18, infoworld.com/slideshow/65089
        ‣ “Company’s promise to application developers is an opportunity to build and test applications on their desktops in the language of choice with familiar constructs and reusable components” Dr. Dobb’s, Adrian Bridgwater, 2012-06-08, drdobbs.com/jvm/where-does-big-data-go-to-get-data-inten/240001759
    • Enterprise concerns. “Notes from the Mystery Machine Bus” by Steve Yegge, Google, goo.gl/SeRZa
        “conservative”                          | “liberal”
        (mostly) Enterprise                     | (mostly) Start-Up
        risk management                         | customer experiments
        assurance                               | flexibility
        well-defined schema                     | schema follows code
        explicit configuration                  | convention
        type-checking compiler                  | interpreted scripts
        wants no surprises                      | wants no impediments
        Java, Scala, Clojure, etc.              | PHP, Ruby, Python, etc.
        Cascading, Scalding, Cascalog, etc.     | Hive, Pig, Hadoop Streaming, etc.
    • Enterprise adoption. As Enterprise apps move into Hadoop and related Big Data frameworks, risk profiles shift toward more conservative programming practices. Cascading provides a popular API – formally speaking, as a pattern language – for defining and managing Enterprise data workflows.
    • Migration of batch toolsets
                            | Enterprise | Migration | Start-Ups
        define pipelines    | J2EE       | Cascading | Pig
        query data          | SQL        | Lingual   | Hive
        predictive models   | SAS        | Pattern   | Mahout
    • Summary. Cascading API benefits:
        ‣ addresses staffing bottlenecks due to Hadoop adoption
        ‣ reduces costs, while servicing risk concerns and “conservatism”
        ‣ manages complexity as the data continues to scale massively
        ‣ provides a pattern language for system integration
        ‣ leverages a workflow abstraction for Enterprise apps
        ‣ utilizes existing practices for JVM-based clusters
    • Intro to Cascading. Code Example #1: distributed file copy [Diagram: stop-word word count flow]
    • 1: distributed file copy [Diagram: Source → M → Sink] 1 mapper, 0 reducers, 10 lines code
        import java.util.Properties;

        import cascading.flow.FlowDef;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.pipe.Pipe;
        import cascading.property.AppProps;
        import cascading.scheme.hadoop.TextDelimited;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;

        public class Main
          {
          public static void main( String[] args )
            {
            String inPath = args[ 0 ];
            String outPath = args[ 1 ];

            Properties props = new Properties();
            AppProps.setApplicationJarClass( props, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

            // create the source tap (tab-delimited, with a header line)
            Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

            // create the sink tap
            Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

            // specify a pipe to connect the taps
            Pipe copyPipe = new Pipe( "copy" );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
             .addSource( copyPipe, inTap )
             .addTailSink( copyPipe, outTap );

            // run the flow
            flowConnector.connect( flowDef ).complete();
            }
          }
    • 1: distributed file copy
      shown:
        ‣ a source tap – input data
        ‣ a sink tap – output data
        ‣ a pipe connecting a source to a sink
        ‣ simplest possible Cascading app
      not shown:
        ‣ what kind of taps? and what size of input data set?
        ‣ could be: JDBC, HBase, Cassandra, XML, flat files, etc.
        ‣ what kind of topology? and what size of cluster?
        ‣ could be: Hadoop, in-memory, etc.
      as system architects, we leverage pattern
    • principle: same JAR, any scale
        ‣ MegaCorp Enterprise IT: PBs of data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+
        ‣ Production Cluster: TBs of data; EMR w/ 50 HPC Instances; Ops monitors results; runtime: hours – days
        ‣ Staging Cluster: GBs of data; EMR + 4 Spot Instances; CI shows red or green lights; runtime: minutes – hours
        ‣ Your Laptop: MBs of data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
    • principle: fail the same way twice. Troubleshooting at scale:
        ‣ physical plan for a query provides a deterministic strategy
        ‣ avoid non-deterministic behavior – expensive when troubleshooting
        ‣ otherwise, edge cases become nightmares on large clusters
        ‣ again, addresses “conservative” need for predictability
        ‣ a core value which is unique to Cascading
    • principle: plan ahead. Flow planner per topology:
        ‣ leverage the flow graph (DAG)
        ‣ catch as many errors as possible before an app gets submitted
        ‣ potential problems caught at compile time or at flow planner stage
        ‣ …long before large, expensive resources start getting consumed
        ‣ …or worse, before the wrong results get propagated downstream
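The "plan ahead" idea can be illustrated without a cluster. The following is a minimal plain-Java sketch, not the Cascading planner: it only checks that each step of a linear pipeline can find the fields it reads before anything runs. The class `PlanCheckSketch`, its `Step` type, and the rule that a step's output replaces the incoming fields are all assumptions made for illustration; the step names mirror the word count plan shown later in the deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Cascading planner: validates field
// dependencies of a linear pipeline before any work is submitted.
final class PlanCheckSketch {

    // One pipeline step: the fields it reads and the fields it emits.
    static final class Step {
        final String name;
        final List<String> requires;
        final List<String> produces;

        Step(String name, List<String> requires, List<String> produces) {
            this.name = name;
            this.requires = requires;
            this.produces = produces;
        }
    }

    // Returns a list of problems; an empty list means every step finds its inputs.
    static List<String> validate(List<String> sourceFields, List<Step> steps) {
        Set<String> available = new HashSet<>(sourceFields);
        List<String> problems = new ArrayList<>();
        for (Step step : steps) {
            for (String field : step.requires) {
                if (!available.contains(field)) {
                    problems.add(step.name + ": missing field '" + field + "'");
                }
            }
            // Simplifying assumption: a step's output replaces the incoming fields.
            available = new HashSet<>(step.produces);
        }
        return problems;
    }

    // A word count plan; groupField lets the caller inject a typo.
    static List<Step> wordCountPlan(String groupField) {
        return Arrays.asList(
            new Step("Each(token)", Arrays.asList("text"), Arrays.asList("token")),
            new Step("GroupBy(wc)", Arrays.asList(groupField), Arrays.asList("token")),
            new Step("Every(wc)", Arrays.asList("token"), Arrays.asList("token", "count")));
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("doc_id", "text");
        System.out.println(validate(source, wordCountPlan("token")));
        System.out.println(validate(source, wordCountPlan("tokn")));
    }
}
```

Running `main` reports no problems for the correct plan and one problem for the plan with the misspelled field, which is the kind of error worth catching before cluster resources are consumed.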
    • Intro to Cascading. Code Example #2: word count [Diagram: stop-word word count flow]
    • 2: word count
      defined: count how often each word appears in a collection of text documents
      a simple program provides a great test case for parallel processing, since it:
        ‣ requires a minimal amount of code
        ‣ demonstrates use of both symbolic and numeric values
        ‣ shows a dependency graph of tuples as an abstraction
        ‣ is not many steps away from useful search indexing
        ‣ serves as a “Hello World” for Hadoop apps
      any distributed computing framework which runs Word Count efficiently in parallel at scale can handle much larger, more interesting compute problems
    • 2: word count [Diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count] 1 mapper, 1 reducer, 18 lines code. gist.github.com/3900702
    • 2: word count [Diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
        String docPath = args[ 0 ];
        String wcPath = args[ 1 ];

        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

        // specify a regex to split "document" text lines into token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
         .addSource( docPipe, docTap )
         .addTailSink( wcPipe, wcTap );

        // write a DOT file and run the flow
        Flow wcFlow = flowConnector.connect( flowDef );
        wcFlow.writeDOT( "dot/wc.dot" );
        wcFlow.complete();
    • 2: word count (flow diagram as written by writeDOT) 1 mapper, 1 reducer, 18 lines code
        [head]
        Hfs[TextDelimited[[doc_id, text]->[ALL]]][data/rain.txt]]
          [{2}:doc_id, text]
          [{2}:doc_id, text]
        map: Each(token)[RegexSplitGenerator[decl:token][args:1]]
          [{1}:token]
          [{1}:token]
        GroupBy(wc)[by:[token]]
          wc[{1}:token]
          [{1}:token]
        reduce: Every(wc)[Count[decl:count]]
          [{2}:token, count]
          [{1}:token]
        Hfs[TextDelimited[[UNKNOWN]->[token, count]]][output/wc]]
          [{2}:token, count]
          [{2}:token, count]
        [tail]
    • 2: word count. Deltas between Example #1 and Example #2:
        ‣ defines source tap as a collection of text documents
        ‣ defines sink tap to produce word count tuples (desired end result)
        ‣ uses named fields, applying structure to unstructured data
        ‣ adds semantics to the workflow, specifying business logic
        ‣ inserts operations into the pipe: Tokenize, GroupBy, Count
        ‣ shows function and aggregation applied to data tuples in parallel
      [Diagrams: Source → M → Sink, compared with Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
    • Intro to Cascading. Pattern Language: the workflow abstraction [Diagram: stop-word word count flow]
    • enterprise data workflows. Tuples, Pipelines, Taps, Operations, Joins, Assertions, Traps, etc. …in other words, “plumbing” as a pattern language for handling Big Data in Enterprise IT [Diagram: stop-word word count flow]
    • pattern language
      defined: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices
      the “plumbing” metaphor of pipes and operators in Cascading helps indicate: algorithms to be used at particular points, appropriate architectural trade-offs, frameworks which must be integrated, etc.
      design patterns: originated in consensus negotiation for architecture, later used in software engineering
      wikipedia.org/wiki/Pattern_language
    • data workflows: team [Diagram: stop-word word count flow]
        ‣ Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
        ‣ Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
        ‣ Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law, etc.
        ‣ Data Architect POV: a physical plan for large-scale data flow management
        ‣ Software Architect POV: a pattern language, similar to plumbing or circuit design
        ‣ App Developer POV: API bindings for Java, Scala, Clojure, Jython, JRuby, etc.
        ‣ Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
    • data workflows: layers
        ‣ business process: domain expertise, business trade-offs, operating parameters, market position, etc.
        ‣ API language: Java, Scala, Clojure, Jython, JRuby, Groovy, etc. …envision whatever runs in a JVM
        ‣ optimize / schedule: major changes in technology now
        ‣ physical plan: “assembler” code [Diagram: stop-word word count flow]
        ‣ topology: Apache Hadoop, in-memory local mode …envision GPUs, streaming, etc.
        ‣ machine data: Splunk, New Relic, Typesafe, Nagios, etc.
    • data workflows: example [Diagram: a Cascading app running on a Hadoop cluster; elements shown: web logs, Customer Profile DBs, customer Support review, source taps, a sink tap, a trap tap, a Memcached cluster, an API, Customers, Recommender System]
    • data workflows: SQL vs. JVM
        abstraction     | SQL
        parser          | SQL parser
        optimizer       | logical plan, optimized based on stats
        planner         | physical plan
        machine data    | query history, table stats
        topology        | b-trees, etc.
        visualization   | ERD
        schema          | table schema
        catalog         | relational catalog
    • data workflows: SQL vs. JVM
        abstraction     | SQL                                      | JVM
        parser          | SQL parser                               | SQL-92 compliant parser (in progress)
        optimizer       | logical plan, optimized based on stats   | logical plan, optimized based on stats
        planner         | physical plan                            | API “plumbing”
        machine data    | query history, table stats               | app history, tuple stats
        topology        | b-trees, etc.                            | heterogeneous, distributed: Hadoop, in-memory, etc.
        visualization   | ERD                                      | flow diagram
        schema          | table schema                             | tuple schema
        catalog         | relational catalog                       | tap usage DB
    • Cascading taxonomy [Diagram relating: Cascading scheduler, app, app instance, Maven repo, flow, step, slice, owner, source tap, sink tap, trap tap, kind (mapper | reducer), topology (hadoop | local)]
    • MapReduce architecture [Diagram credits: Wikipedia, Apache]
        ‣ name node / data node
        ‣ job tracker / task tracker
        ‣ submit queue
        ‣ task slots
        ‣ HDFS
        ‣ distributed cache
    • Summary. If you were leading a team responsible for Enterprise apps:
        ‣ which of the previous two slides seems easier to understand?
        ‣ which is simpler to use for training and managing a team?
        ‣ which costs the most in the long run?
    • Intro to Cascading. Compare & Contrast: other approaches [Diagram: stop-word word count flow]
    • wc: pseudocode [Diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
        void map (String doc_id, String text):
          for each word w in segment(text):
            emit(w, "1");

        void reduce (String word, Iterator partial_counts):
          int count = 0;
          for each pc in partial_counts:
            count += Int(pc);
          emit(word, String(count));
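The pseudocode above can be exercised on a laptop without Hadoop. This is a minimal in-memory sketch in plain Java, assuming a single JVM stands in for the cluster: `map` and `reduce` follow the pseudocode, while the `wordCount` driver plays the role of the shuffle that groups map output by word. The class name `WordCountSketch` and the sample input are made up for illustration; the split pattern is the one from Example #2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {

    // map: split one document's text into words; each word stands for the pair (word, 1)
    static List<String> map(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.split("[ \\[\\]\\(\\),.]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // reduce: sum the partial counts collected for one word
    static int reduce(List<Integer> partialCounts) {
        int count = 0;
        for (int pc : partialCounts) {
            count += pc;
        }
        return count;
    }

    // driver: stands in for the shuffle that groups map output by word
    static Map<String, Integer> wordCount(List<String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : docs) {
            for (String w : map(doc)) {
                grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {again=1, rain=2, shadow=1}
        System.out.println(wordCount(Arrays.asList("rain shadow", "rain, again")));
    }
}
```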
    • Scalding / Scala [Diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
        // Sujit Pal
        // sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
        package com.mycompany.impatient

        import com.twitter.scalding._

        class Part2(args : Args) extends Job(args) {
          val input = Tsv(args("input"), ('docId, 'text))
          val output = Tsv(args("output"))
          input.read.
            flatMap('text -> 'word) { text : String => text.split("""\s+""") }.
            groupBy('word) { group => group.size }.
            write(output)
        }
    • Scalding / Scala. github.com/twitter/scalding/wiki
      notes:
        ‣ code is compact, easy to understand
        ‣ functional programming is great for expressing complex workflows in MapReduce, etc.
        ‣ very large-scale, complex problems can be handled in just a few lines of code
        ‣ many large-scale apps in production deployments
        ‣ significant investments by Twitter, Etsy, eBay, etc., in this open source project
        ‣ extensive libraries are available for linear algebra, machine learning – e.g., “Matrix API”
    • Cascalog / Clojure [Diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
        ; Paul Lam
        ; github.com/Quantisan/Impatient
        (ns impatient.core
          (:use [cascalog.api]
                [cascalog.more-taps :only (hfs-delimited)])
          (:require [clojure.string :as s]
                    [cascalog.ops :as c])
          (:gen-class))

        (defmapcatop split [line]
          "reads in a line of string and splits it by regex"
          (s/split line #"[\[\]\\\(\),.)\s]+"))

        (defn -main [in out & args]
          (?<- (hfs-delimited out)
               [?word ?count]
               ((hfs-delimited in :skip-header? true) _ ?line)
               (split ?line :> ?word)
               (c/count ?count)))
    • Cascalog / Clojure. github.com/nathanmarz/cascalog/wiki
      notes:
        ‣ code is compact, easy to understand
        ‣ functional programming is great for expressing complex workflows in MapReduce, etc.
        ‣ significant investments by Twitter, Climate Corp, etc., in this open source project
        ‣ can run queries from the Clojure REPL
        ‣ compelling for very large-scale use cases where code correctness can be verified before deployment
    • Apache Hive [Diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
        -- Steve Severance
        -- stackoverflow.com/questions/10039949/word-count-program-in-hive
        CREATE TABLE input (line STRING);

        LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

        -- the column is named "line" in the table definition above
        SELECT word, COUNT(*)
        FROM input
          LATERAL VIEW explode(split(line, ' ')) lTable AS word
        GROUP BY word;
    • Apache Hive. hive.apache.org
      pro:
        ‣ most popular abstraction atop Apache Hadoop
        ‣ SQL-like language is syntactically familiar to most analysts
        ‣ simple to load large-scale unstructured data and run ad-hoc queries
      con:
        ‣ not a relational engine, many surprises at scale
        ‣ difficult to represent complex workflows, ML algorithms, etc.
        ‣ one poorly-trained analyst can bottleneck an entire cluster
        ‣ app-level integration requires other coding, outside of script language
        ‣ logical planner mixed with physical planner; cannot collect app stats
        ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
        ‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
    • Apache Pig [Diagram: Document Collection → Tokenize (M) → GroupBy token → Count (R) → Word Count]
        -- kudos to Dmitriy Ryaboy
        docPipe = LOAD '$docPath' USING PigStorage('\t', 'tagsource') AS (doc_id, text);
        docPipe = FILTER docPipe BY doc_id != 'doc_id';

        -- specify regex to split "document" text lines into token stream
        tokenPipe = FOREACH docPipe GENERATE doc_id, FLATTEN(TOKENIZE(text, ' [](),.')) AS token;
        tokenPipe = FILTER tokenPipe BY token MATCHES '\\w.*';

        -- determine the word counts
        tokenGroups = GROUP tokenPipe BY token;
        wcPipe = FOREACH tokenGroups GENERATE group AS token, COUNT(tokenPipe) AS count;

        -- output
        STORE wcPipe INTO '$wcPath' USING PigStorage('\t', 'tagsource');
        EXPLAIN -out dot/wc_pig.dot -dot wcPipe;
    • Apache Pig. pig.apache.org
      pro:
        ‣ easy to learn data manipulation language (DML)
        ‣ interactive prompt (Grunt) makes it simple to prototype apps
        ‣ extensibility through UDFs
      con:
        ‣ not a full programming language; must extend via UDFs outside of language
        ‣ app-level integration requires other coding, outside of script language
        ‣ simple problems are simple to do; hard problems become quite complex
        ‣ difficult to parameterize scripts externally; must rewrite to change taps!
        ‣ logical planner mixed with physical planner; cannot collect app stats
        ‣ non-deterministic exec: number of mappers+reducers changes unexpectedly
        ‣ business logic must cross multiple language boundaries: difficult to troubleshoot, optimize, audit, handle exceptions, set notifications, etc.
    • Intro to Cascading. Code Example #N: City of Palo Alto, etc. [Diagram: stop-word word count flow]
    • extend: wc + scrub + stop words [Diagram: Document Collection, Tokenize, Scrub, HashJoin (left) against a Stop Word List (RHS), Regex, GroupBy token, Count, Word Count] 1 mapper, 1 reducer, 28+10 lines code
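The extended flow adds two ideas to word count: scrubbing tokens and dropping stop words. The following is a small plain-Java sketch of those two ideas only, not the Cascading code the slide refers to. The class `StopWordCountSketch`, the scrub rule (trim and lowercase), and the sample stop word list are assumptions for illustration; a hash set lookup stands in for the HashJoin against the stop word list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class StopWordCountSketch {

    // scrub: normalize a raw token (assumed rule: trim and lowercase)
    static String scrub(String token) {
        return token.trim().toLowerCase();
    }

    // count scrubbed tokens, skipping any that appear in the stop word set
    static Map<String, Integer> count(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : text.split("[ \\[\\]\\(\\),.]+")) {
            String token = scrub(raw);
            // the set lookup plays the role of the join against the stop word list
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is"));
        // prints {area=1, dry=1, rain=1, shadow=1}
        System.out.println(count("A rain shadow is a dry area.", stopWords));
    }
}
```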
    • extend: a simple search engine [Diagram: the stop-word word count flow extended to compute TF-IDF; elements shown: Unique doc_id, Insert 1, SumBy (D), CountBy doc_id, token (TF), Unique token, CountBy token (DF), HashJoin, CoGroup, ExprFunc tf-idf, Sort count, with outputs TF-IDF and Word Count] 10 mappers, 8 reducers, 68+14 lines code
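The diagram names three intermediate counts (D, TF, DF) that feed a tf-idf expression. The slide does not show the expression itself, so the sketch below assumes one common smoothed form, tf × ln(D / (1 + df)); both the formula and the class `TfIdfSketch` are assumptions for illustration, in plain Java rather than Cascading.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TfIdfSketch {

    // Assumed scoring form: tf * ln(D / (1 + df)); the slide does not state it.
    static double tfidf(int tf, int df, int nDocs) {
        return tf * Math.log((double) nDocs / (1.0 + df));
    }

    // docs maps doc_id to its token list; result maps doc_id to (token, score)
    static Map<String, Map<String, Double>> score(Map<String, List<String>> docs) {
        // DF: number of documents that contain each token
        Map<String, Integer> df = new HashMap<>();
        for (List<String> tokens : docs.values()) {
            for (String t : new HashSet<>(tokens)) {
                df.merge(t, 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            // TF: occurrences of each token within this document
            Map<String, Integer> tf = new TreeMap<>();
            for (String t : doc.getValue()) {
                tf.merge(t, 1, Integer::sum);
            }
            Map<String, Double> scores = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                scores.put(e.getKey(), tfidf(e.getValue(), df.get(e.getKey()), docs.size()));
            }
            result.put(doc.getKey(), scores);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("doc01", Arrays.asList("rain", "rain", "shadow"));
        docs.put("doc02", Arrays.asList("shadow"));
        docs.put("doc03", Arrays.asList("lee"));
        System.out.println(score(docs));
    }
}
```

With this form a token that appears in most documents scores at or below zero, which is why a stop word filter upstream still matters.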
    • City of Palo Alto open data. github.com/Cascading/CoPA/wiki
      [Diagram: flow over the CoPA GIS export; elements shown: Regex parsers and filters for tree, road, and park records, Scrub species, HashJoin with Tree Metadata and Road Metadata, Geohash, Checkpoints, Estimate Albedo, Tree Distance filter, GroupBy and CoGroup, GPS logs, Failure Traps, with outputs tree, road, park, shade, and reco]
        ‣ GIS export for parks, roads, trees (unstructured / open data)
        ‣ log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks
        ‣ curated metadata, used to enrich the dataset
        ‣ could extend via mash-up with many available public data APIs
      Enterprise-scale app: road albedo + tree species metadata + geospatial indexing
      “Find a shady spot on a summer day to walk near downtown and take a call…”
    • CoPA: log events
    • CoPA: results [Plot: Estimated Tree Height (meters); density vs. avg_height, shaded by count]
        ‣ addr: 115 HAWTHORNE AVE
        ‣ lat/lng: 37.446, -122.168
        ‣ geohash: 9q9jh0
        ‣ tree: 413 site 2
        ‣ species: Liquidambar styraciflua
        ‣ avg height 23 m
        ‣ road albedo: 0.12
        ‣ distance: 10 m
        ‣ a short walk from my train stop ✔
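The geohash in the result above is what lets trees, roads, and GPS log events be joined on a shared key. As a hedged illustration, here is a minimal plain-Java sketch of standard geohash encoding (interleaved longitude and latitude bits, base-32 alphabet); the class `GeohashSketch` is hypothetical and is not taken from the CoPA code. Note that the slide's coordinates are rounded to three decimals, so the trailing characters produced from them need not match the slide's geohash; only the leading characters are expected to agree.

```java
final class GeohashSketch {

    // standard geohash base-32 alphabet (no a, i, l, o)
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lng, int length) {
        double[] latRange = {-90.0, 90.0};
        double[] lngRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder();
        boolean useLng = true; // bits alternate, starting with longitude
        int bits = 0;
        int value = 0;
        while (hash.length() < length) {
            double[] range = useLng ? lngRange : latRange;
            double coord = useLng ? lng : lat;
            double mid = (range[0] + range[1]) / 2;
            value <<= 1;
            if (coord >= mid) {
                value |= 1;      // upper half of the current interval
                range[0] = mid;
            } else {
                range[1] = mid;  // lower half of the current interval
            }
            useLng = !useLng;
            if (++bits == 5) {   // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(37.446, -122.168, 6));
    }
}
```

Because nearby points share a prefix, truncating the hash gives a coarse spatial bucket that works as an ordinary join or group key.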
    • Intro to Cascading. PMML: predictive modeling [Diagram: stop-word word count flow]
    • PMML model
    • cascading.pattern example:
        1. use customer order history as the training data set
        2. train a risk classifier for orders, using Random Forest
        3. export model from R to PMML
        4. build a Cascading app to execute the PMML model
            4.1. generate a pipeline from PMML description
            4.2. planner builds the flow for a topology (Hadoop)
            4.3. compile app to a JAR file
        5. deploy the app at scale to calculate scores
    • cascading.pattern [Diagram: risk classifier with two dimensions, customer 360 and per-order; elements shown: analyst's laptop, data prep, training data sets, PMML model, Cascading apps, customer transactions, score new orders, predict model costs, detect fraudsters, segment customers, anomaly detection, velocity metrics, Hadoop batch workloads, IMDG real-time workloads, Customer DB, ETL, DW, partner data, chargebacks, etc.]
    • 1: “orders” data set... train/test in R... exported as PMML
    • R modeling
        ## packages used below (data_train, data, and dat_folder are defined elsewhere)
        library(randomForest)
        library(pmml)
        library(XML)

        ## train a RandomForest model
        f <- as.formula("as.factor(label) ~ .")
        fit <- randomForest(f, data_train, ntree=50)

        ## test the model on the holdout test set
        print(fit$importance)
        print(fit)

        predicted <- predict(fit, data)
        data$predicted <- predicted
        confuse <- table(pred = predicted, true = data[,1])
        print(confuse)

        ## export predicted labels to TSV
        write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

        ## export RF model to PMML
        saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
    • R output
             MeanDecreaseGini
        var0        0.6591701
        var1       33.8625179
        var2        8.0290020

        OOB estimate of error rate: 13.83%
        Confusion matrix:
           0  1 class.error
        0 28  5   0.1515152
        1  8 53   0.1311475

        [1] "./data/sample.rf.xml"
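The error figures in the R output can be checked by hand from the confusion matrix. The worked arithmetic below assumes the usual randomForest layout, where each row is a true class and `class.error` is the share of that row that was misclassified.

```latex
\begin{align*}
\text{OOB error rate} &= \frac{5 + 8}{28 + 5 + 8 + 53} = \frac{13}{94} \approx 0.1383 = 13.83\% \\
\text{class.error}_0 &= \frac{5}{28 + 5} = \frac{5}{33} \approx 0.1515 \\
\text{class.error}_1 &= \frac{8}{8 + 53} = \frac{8}{61} \approx 0.1311
\end{align*}
```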
    • 2: Cascading app takes PMML as a parameter...
    • PMML model
        <?xml version="1.0"?>
        <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
         <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
          <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
          <Application name="Rattle/PMML" version="1.2.30"/>
          <Timestamp>2012-10-22 19:39:28</Timestamp>
         </Header>
         <DataDictionary numberOfFields="4">
          <DataField name="label" optype="categorical" dataType="string">
           <Value value="0"/>
           <Value value="1"/>
          </DataField>
          <DataField name="var0" optype="continuous" dataType="double"/>
          <DataField name="var1" optype="continuous" dataType="double"/>
          <DataField name="var2" optype="continuous" dataType="double"/>
         </DataDictionary>
         <MiningModel modelName="randomForest_Model" functionName="classification">
          <MiningSchema>
           <MiningField name="label" usageType="predicted"/>
           <MiningField name="var0" usageType="active"/>
           <MiningField name="var1" usageType="active"/>
           <MiningField name="var2" usageType="active"/>
          </MiningSchema>
          <Segmentation multipleModelMethod="majorityVote">
           <Segment id="1">
            <True/>
            <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
             <MiningSchema>
              <MiningField name="label" usageType="predicted"/>
              <MiningField name="var0" usageType="active"/>
              <MiningField name="var1" usageType="active"/>
              <MiningField name="var2" usageType="active"/>
             </MiningSchema>
        ...
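The `Segmentation` element above declares `multipleModelMethod="majorityVote"`: each tree segment classifies the tuple and the most frequent label wins. A minimal plain-Java sketch of that combination rule follows. It is not the cascading.pattern implementation; the class `MajorityVoteSketch`, the three stub trees, and the tie-break (smallest label) are assumptions for illustration, since the real split rules are elided in the PMML excerpt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class MajorityVoteSketch {

    // Each tree maps a feature vector (var0, var1, var2) to a class label.
    static String vote(List<Function<double[], String>> trees, double[] features) {
        Map<String, Integer> tally = new TreeMap<>();
        for (Function<double[], String> tree : trees) {
            tally.merge(tree.apply(features), 1, Integer::sum);
        }
        String best = null;
        int bestVotes = -1;
        // TreeMap iterates labels in sorted order; strict '>' means ties
        // resolve to the smallest label (an assumption, not from the PMML).
        for (Map.Entry<String, Integer> e : tally.entrySet()) {
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }

    // Hypothetical stub trees; the real forest's splits are not shown in the slides.
    static List<Function<double[], String>> demoTrees() {
        List<Function<double[], String>> trees = new ArrayList<>();
        trees.add(x -> x[1] > 0.5 ? "1" : "0");
        trees.add(x -> x[2] > 0.5 ? "0" : "1");
        trees.add(x -> "1");
        return trees;
    }

    public static void main(String[] args) {
        System.out.println(vote(demoTrees(), new double[] {0, 1, 0})); // prints 1
        System.out.println(vote(demoTrees(), new double[] {0, 0, 1})); // prints 0
    }
}
```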
    • Cascading app
        public class Main {
          public static void main( String[] args ) {
            String pmmlPath = args[ 0 ];
            String ordersPath = args[ 1 ];
            String classifyPath = args[ 2 ];
            String trapPath = args[ 3 ];

            Properties properties = new Properties();
            AppProps.setApplicationJarClass( properties, Main.class );
            HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

            // create source and sink taps
            Tap ordersTap = new Hfs( new TextDelimited( true, "\t" ), ordersPath );
            Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
            Tap trapTap = new Hfs( new TextDelimited( true, "\t" ), trapPath );

            // define a "Classifier" model from PMML to evaluate the orders
            Classifier classifier = new Classifier( pmmlPath );
            Pipe classifyPipe = new Each( new Pipe( "classify" ), classifier.getFields(),
              new ClassifierFunction( new Fields( "score" ), classifier ), Fields.ALL );

            // connect the taps, pipes, etc., into a flow
            FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
             .addSource( classifyPipe, ordersTap )
             .addTrap( classifyPipe, trapTap )
             .addSink( classifyPipe, classifyTap );

            // write a DOT file and run the flow
            Flow classifyFlow = flowConnector.connect( flowDef );
            classifyFlow.writeDOT( "dot/classify.dot" );
            classifyFlow.complete();
          }
        }
    • 3: app deployed on a cluster to score customers at scale...
    • deploy to cloud. aws.amazon.com/elasticmapreduce/
        elastic-mapreduce --create --name "RF" \
          --jar s3n://temp.cascading.org/pattern/pattern.jar \
          --arg s3n://temp.cascading.org/pattern/sample.rf.xml \
          --arg s3n://temp.cascading.org/pattern/sample.tsv \
          --arg s3n://temp.cascading.org/pattern/out/classify \
          --arg s3n://temp.cascading.org/pattern/out/trap
    • results
        bash-3.2$ head output/classify/part-00000
        label   var0    var1    var2    order_id    predicted   score
        1       0       1       0       6f8e1014    1           1
        0       0       0       1       6f8ea22e    0           0
        1       0       1       0       6f8ea435    1           1
        0       0       0       1       6f8ea5e1    0           0
        1       0       1       0       6f8ea785    1           1
        1       0       1       0       6f8ea91e    1           1
        0       1       0       0       6f8eaaba    0           0
        1       0       1       0       6f8eac54    1           1
        0       1       1       0       6f8eade3    1           1
    • drill-down: blog, code/wiki/gists, JARs, community, DevOps products:
        cascading.org
        github.com/Cascading
        conjars.org
        meetup.com/cascading
        goo.gl/KQtUL
        concurrentinc.com
      pnathan@concurrentinc.com, @pacoid. Copyright ©2012, Concurrent, Inc.