Functional programming for optimization problems in Big Data

Invited talk for the INFORMS chapter at Stanford. 2013-03-06.
Usage Rights: CC Attribution-ShareAlike License
Presentation Transcript

    • “Functional programming for optimization problems in Big Data”
      Paco Nathan, Concurrent, Inc., San Francisco, CA – @pacoid
      Copyright @2013, Concurrent, Inc.
    • The Workflow Abstraction
      (flow diagram: Word Count with scrub and stop-word filtering)
      1. Data Science  2. Functional Programming  3. Workflow Abstraction  4. Typical Use Cases  5. Open Data Example
      Notes: Let's consider the trendline subsequent to the Q3 1997 inflection point, which enabled huge ecommerce successes and commercialized Big Data. Where did Big Data come from, and where is this kind of work headed?
    • Q3 1997: inflection point
      Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware. This effort prepared the way for huge Internet successes in the 1997 holiday season: AMZN, EBAY, Inktomi (YHOO Search), then GOOG. MapReduce and the Apache Hadoop open source stack emerged from this.
      Notes: Q3 1997: Greg Linden et al. at Amazon and Randy Shoup et al. at eBay; independent teams arrived at the same conclusion: parallelize workloads onto clusters of commodity servers to scale out horizontally. Google and Inktomi (YHOO Search) were working along the same lines.
    • Circa 1996: pre-inflection point
      (architecture diagram: Stakeholders and Customers drive strategy via Excel pivot tables and PowerPoint slide decks; BI and Product Analysts turn requirements into SQL queries; Engineering ships optimized code for a Web App, with result sets and transactions flowing through an RDBMS)
      Notes: Perl and C++ for CGI :) Feedback loops shown in red represent data innovations at the time; these are rather static. Characterized by slow, manual processes: data modeling / business intelligence; "throw it over the wall" thinking, which led to impossible silos.
    • Circa 2001: post big ecommerce successes
      (architecture diagram: Stakeholders, Product, and UX feed Engineering; Web Apps and Middleware emit event history into Logs, flowing through ETL into a DW alongside the RDBMS; Algorithmic Modeling over aggregated data produces dashboards, recommenders, and classifiers)
      Notes: Machine data (unstructured logs) captured social interactions. Data from aggregated logs fed into algorithmic modeling to produce recommenders, classifiers, and other predictive models; e.g., ad networks automating parts of the marketing funnel, as in our case study. LinkedIn, Facebook, Twitter, Apple, etc., followed early successes. Algorithmic modeling, leveraging machine data, allowed for Big Data to become monetized.
    • Circa 2013: clusters everywhere
      (architecture diagram: Data Products built by teams of Domain Experts, Data Scientists, App Devs, and Ops; Web Apps, Mobile, and services; workflow history and social interactions feed discovery and modeling; batch and near-time workloads run across Hadoop, an in-memory data grid, DW, and a cluster scheduler, alongside existing RDBMS capability and the traditional SDLC)
      Notes: Here's what our more savvy customers are using for architecture and process today: traditional SDLC, but also interdisciplinary Data Science teams. Also, machine data (app history) driving planners and schedulers for an advanced multi-tenant cluster computing fabric. Not unlike a practice at LLL, where much more data gets collected about the machine than about the experiment. We see this feeding into cluster optimization in YARN, Apache Mesos, etc.
    • references…
      Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science, 2001. bit.ly/eUTh9L
      Notes: Leo Breiman wrote an excellent paper in 2001, "Two Cultures", chronicling this evolution and the sea change from data modeling (silos, manual process) to algorithmic modeling (machine data for automation/optimization).
    • references… in their own words:
      Amazon: "Early Amazon: Splitting the website" by Greg Linden. glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
      eBay: "The eBay Architecture" by Randy Shoup and Dan Pritchett. addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html and addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
      Inktomi (YHOO Search): "Inktomi's Wild Ride" by Eric Brewer (0:05:31 ff). youtube.com/watch?v=E91oEn1bnXM
      Google: "Underneath the Covers at Google" by Jeff Dean (0:06:54 ff). youtube.com/watch?v=qsan-GQaeyk and perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
    • core values
      Data Science teams develop actionable insights, building confidence for decisions. That work may influence a few decisions worth billions (e.g., M&A) or billions of small decisions (e.g., AdWords); probably somewhere in-between, solving for pattern, at scale. By definition, this is a multi-disciplinary pursuit which requires teams, not solo players.
    • team process = needs
      discovery: help people ask the right questions
      modeling: allow automation to place informed bets
      integration: deliver products at scale to customers
      apps: build smarts into product features
      systems: keep infrastructure running, cost-effective
      (tool pictured: Gephi)
    • team composition = roles
      (flow diagram: Word Count with scrub and stop-word filtering, annotated by role)
      Domain Expert: business process, stakeholder
      Data Scientist: data prep, discovery, modeling, etc.
      App Dev: software engineering, automation
      Ops: systems engineering, access
      (diagram legend: introduced capability)
      Notes: This is an example of multi-disciplinary team composition for data science, while other emerging problem spaces will require other, more specific kinds of team roles.
    • matrix: evaluate needs × roles
      (matrix figure: needs [discovery, modeling, integration, apps, systems] crossed against roles [stakeholder, scientist, developer, ops])
    • most valuable skills
      Approximately 80% of the costs for data-related projects get spent on data preparation, mostly on cleaning up data quality issues: ETL, log file analysis, etc. Unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after that clean-up. The most valuable skills:
      ‣ learn to use programmable tools that prepare data
      ‣ learn to generate compelling data visualizations
      ‣ learn to estimate the confidence for reported results
      ‣ learn to automate work, making analysis repeatable
      The rest of the skills (modeling, algorithms, etc.) are secondary. (tool pictured: D3)
    • science in data science?
      In a nutshell, what we do:
      ‣ estimate probability
      ‣ calculate analytic variance
      ‣ manipulate order complexity
      ‣ leverage use of learning theory
      + collab with DevOps, Stakeholders
      + reduce work to cron entries
      (background: mirrored screenshot of game-client event-log names, e.g., "NUI:DressUpMode", "Client Inventory Panel Apply Product", "Unique Registration")
    • references…
      by DJ Patil:
      "Data Jujitsu", O'Reilly, 2012. amazon.com/dp/B008HMN5BE
      "Building Data Science Teams", O'Reilly, 2011. amazon.com/dp/B005O4U3ZE
    • The Workflow Abstraction
      (flow diagram: Word Count with scrub and stop-word filtering)
      1. Data Science  2. Functional Programming  3. Workflow Abstraction  4. Typical Use Cases  5. Open Data Example
      Notes: Origin and overview of the Cascading API as a workflow abstraction for Enterprise Big Data apps.
    • Cascading – origins
      API author Chris Wensel worked as a system architect at an Enterprise firm well known for several popular data products. Wensel was following the Nutch open source project, before Hadoop even had a name. He noted that it would become difficult to find Java developers to write complex Enterprise apps directly in Apache Hadoop: a potential blocker for leveraging this new open source technology.
      Notes: Cascading initially grew from interaction with the Nutch project, before Hadoop had a name. API author Chris Wensel recognized that MapReduce would be too complex for J2EE developers to perform substantial work in an Enterprise context without an abstraction layer.
    • Cascading – functional programming
      Key insight: MapReduce is based on functional programming, going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature. To ease staffing problems as "Main Street" Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007 as a new Java API to implement functional programming for large-scale data workflows.
      Notes: Years later, Enterprise app deployments on Hadoop are limited by staffing issues: the difficulty of retraining staff, and the scarcity of Hadoop experts.
    • examples…
      • Twitter, Etsy, eBay, YieldBot, uSwitch, etc., have invested in functional programming open source projects atop Cascading, used for their large-scale production deployments
      • new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming: Cascalog in Clojure (2010), github.com/nathanmarz/cascalog/wiki, and Scalding in Scala (2012), github.com/twitter/scalding/wiki
      Notes: Many case studies, many Enterprise production deployments now for 5+ years.
    • The Ubiquitous Word Count
      Definition: count how often each word appears in a collection of text documents.
      This simple program provides an excellent test case for parallel processing, since it:
      • requires a minimal amount of code
      • demonstrates use of both symbolic and numeric values
      • shows a dependency graph of tuples as an abstraction
      • is not many steps away from useful search indexing
      • serves as a "Hello World" for Hadoop apps

        void map (String doc_id, String text):
          for each word w in segment(text):
            emit(w, "1");

        void reduce (String word, Iterator group):
          int count = 0;
          for each pc in group:
            count += Int(pc);
          emit(word, String(count));

      Any distributed computing framework which can run Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems.
      Notes: Taking a wild guess, most people who've written any MapReduce code have seen this example app already...
    • word count – conceptual flow diagram
      (flow diagram: Document Collection → Tokenize (map) → GroupBy token → Count (reduce) → Word Count)
      1 map, 1 reduce, 18 lines of code. cascading.org/category/impatient and gist.github.com/3900702
      Notes: Based on a Cascading implementation of Word Count, this is a conceptual flow diagram: the pattern language in use to specify the business process, using a literate programming methodology to describe a data workflow.
    • word count – Cascading app in Java

        String docPath = args[ 0 ];
        String wcPath = args[ 1 ];
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

        // specify a regex to split "document" text lines into a token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef()
          .setName( "wc" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );

        // write a DOT file and run the flow
        Flow wcFlow = flowConnector.connect( flowDef );
        wcFlow.writeDOT( "dot/wc.dot" );
        wcFlow.complete();

      Notes: Based on a Cascading implementation of Word Count, here is sample code: approx 1/3 the code size of the Word Count example from Apache Hadoop. The 2nd-to-last line generates a DOT file for the flow diagram.
    • word count – generated flow diagram
      (generated DOT diagram: [head] → Hfs source tap [doc_id, text] → map: Each(token) RegexSplitGenerator → GroupBy(wc) by token → reduce: Every(wc) Count → Hfs sink tap [token, count] → [tail])
      Notes: As a concrete example of literate programming in Cascading, here is the DOT representation of the flow plan, generated by the app itself.
    • word count – Cascalog / Clojure

        (ns impatient.core
          (:use [cascalog.api]
                [cascalog.more-taps :only (hfs-delimited)])
          (:require [clojure.string :as s]
                    [cascalog.ops :as c])
          (:gen-class))

        (defmapcatop split [line]
          "reads in a line of string and splits it by regex"
          (s/split line #"[\[\](),.)\s]+"))

        (defn -main [in out & args]
          (?<- (hfs-delimited out)
               [?word ?count]
               ((hfs-delimited in :skip-header? true) _ ?line)
               (split ?line :> ?word)
               (c/count ?count)))

        ; Paul Lam
        ; github.com/Quantisan/Impatient

      Notes: Here is the same Word Count app written in Clojure, using Cascalog.
    • word count – Cascalog / Clojure
      github.com/nathanmarz/cascalog/wiki
      • implements Datalog in Clojure, with predicates backed by Cascading, for a highly declarative language
      • run ad-hoc queries from the Clojure REPL; approx. 10:1 code reduction compared with SQL
      • composable subqueries, used for test-driven development (TDD) practices at scale
      • Leiningen build: simple, no surprises, in Clojure itself
      • more new deployments than other Cascading DSLs; Climate Corp is the largest use case: 90% Clojure/Cascalog
      • has a learning curve, and a limited number of Clojure developers
      • aggregators are the magic, and those take effort to learn
      Notes: From what we see about language features, customer case studies, and best practices in general, Cascalog represents some of the most sophisticated uses of Cascading, as well as some of the largest deployments. Great for large-scale, complex apps, where small teams must limit the complexities in their process.
    • word count – Scalding / Scala

        import com.twitter.scalding._

        class WordCount(args : Args) extends Job(args) {
          Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
            .read
            .flatMap('text -> 'token) { text : String => text.split("[ \\[\\](),.]") }
            .groupBy('token) { _.size('count) }
            .write(Tsv(args("wc"), writeHeader = true))
        }

      Notes: Here is the same Word Count app written in Scala, using Scalding. Very compact, easy to understand; however, also more imperative than Cascalog.
    • word count – Scalding / Scala
      github.com/twitter/scalding/wiki
      • extends the Scala collections API so that distributed lists become "pipes" backed by Cascading
      • code is compact, easy to understand
      • nearly 1:1 between elements of the conceptual flow diagram and function calls
      • extensive libraries are available for linear algebra, abstract algebra, machine learning: e.g., Matrix API, Algebird, etc.
      • significant investments by Twitter, Etsy, eBay, etc.
      • great for data services at scale
      • less learning curve than Cascalog, though not as much of a high-level language
      Notes: If you wanted to see what a data services architecture for machine learning work at, say, Google scale would look like as an open source project, that's Scalding. That's what they're doing.
    • word count – Scalding / Scala (key takeaway)
      Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process (imagine SOA infra @ Google as an open source project).
      Notes: Arguably, using a functional programming language to build flows is better than trying to represent functional programming constructs within Java…
    • The Workflow Abstraction
      (flow diagram: Word Count with scrub and stop-word filtering)
      1. Data Science  2. Functional Programming  3. Workflow Abstraction  4. Typical Use Cases  5. Open Data Example
      Notes: CS theory related to data workflow abstraction, to manage complexity.
    • Cascading workflows – pattern language
      Cascading uses a "plumbing" metaphor in the Java API to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.
      (flow diagram: Word Count with scrub and stop-word filtering)
      Data is represented as flows of tuples. Operations within the tuple flows bring functional programming aspects into Java apps. In formal terms, this provides a pattern language.
      Notes: A pattern language, based on the metaphor of "plumbing".
    • references…
      pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices. amazon.com/dp/0195019199
      design patterns: the notion originated in consensus negotiation for architecture, later applied in OOP software engineering by the "Gang of Four". amazon.com/dp/0201633612
      Notes: Christopher Alexander originated the use of pattern language in a project called "The Oregon Experiment", in the 1970s.
    • Cascading workflows – literate programming
      Cascading workflows generate their own visual documentation: flow diagrams.
      (flow diagram: Word Count with scrub and stop-word filtering)
      In formal terms, flow diagrams leverage a methodology called literate programming. This provides intuitive, visual representations for apps: great for cross-team collaboration.
      Notes: Formally speaking, the pattern language in Cascading gets leveraged as a visual representation used for literate programming. Several good examples exist, but the phenomenon of different developers troubleshooting a program together over the "cascading-users" email list is most telling: expert developers generally ask a novice to provide a flow diagram first.
    • references…
      by Don Knuth, "Literate Programming", Univ of Chicago Press, 1992. literateprogramming.com
      "Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."
      Notes: Don Knuth originated the notion of literate programming, or code as "literature" which explains itself.
    • examples…
      • Scalding apps have nearly 1:1 correspondence between function calls and the elements in their flow diagrams: excellent elision and literate representation
      • noticed on the cascading-users email list: when troubleshooting issues, Cascading experts ask novices to provide an app's flow diagram (generated as a DOT file), sometimes in lieu of showing code
      In formal terms, a flow diagram is a directed acyclic graph (DAG), on which lots of interesting math applies: query optimization, predictive models about app execution, parallel efficiency metrics, etc.
      (generated DOT diagram for Word Count shown again)
      Notes: Literate programming examples observed on the email list are some of the best illustrations of this methodology.
    • Cascading workflows – business process
      Following the essence of literate programming, Cascading workflows provide statements of business process. This recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data): a separation of concerns between business process and implementation details (Hadoop, etc.). This is especially apparent in large-scale Cascalog apps: "Specify what you require, not how to achieve it." By virtue of the pattern language, the flow planner used in a Cascading app determines how to translate business process into efficient, parallel jobs at scale.
      Notes: Business stakeholder POV: business process management for workflow orchestration (think BPM/BPEL).
    • references…
      by Edgar Codd, "A relational model of data for large shared data banks", Communications of the ACM, 1970. dl.acm.org/citation.cfm?id=362685
      Rather than arguing between SQL vs. NoSQL, or structured vs. unstructured data frameworks, this approach focuses on the process of structuring data. That's what apps do: Making Data Work.
      Notes: Focus on *the process of structuring data*, which must happen before the large-scale joins, predictive models, visualizations, etc. Just because your data is loaded into a "structured" store, that does not imply that your app has finished structuring it for the purpose of making data work. BTW, anybody notice that the O'Reilly "animal" for the Cascading book is an Atlantic cod? (pun intended)
    • Cascading workflows – functional relational programming
      The combination of functional programming, pattern language, DSLs, literate programming, business process, etc., traces back to the original definition of the relational model (Codd, 1970), prior to SQL. Cascalog, in particular, implements more of what Codd intended for a "data sublanguage", and is considered to be close to a full implementation of the functional relational programming paradigm defined in: Moseley & Marks, 2006, "Out of the Tar Pit". goo.gl/SKspn
      Notes: A more contemporary statement along similar lines...
    • Two Avenues…
      Enterprise: must contend with complexity at scale everyday; incumbents extend current practices and infrastructure investments (using J2EE, ANSI SQL, SAS, etc.) to migrate workflows onto Apache Hadoop while leveraging existing staff.
      Start-ups: crave complexity and scale to become viable; new ventures move into the Enterprise space to compete, using relatively lean staff while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding.
      (chart: complexity vs. scale)
      Notes: Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity.
    • Cascading workflows – functional relational programming (key takeaway)
      Several theoretical aspects converge into software engineering practices which mitigate the complexity of building and maintaining Enterprise data workflows.
    • The Workflow Abstraction
      (flow diagram: Word Count with scrub and stop-word filtering)
      1. Data Science  2. Functional Programming  3. Workflow Abstraction  4. Typical Use Cases  5. Open Data Example
      Notes: Here are a few use cases to consider, for Enterprise data workflows.
    • Cascading – deployments
      • 5+ year history of Enterprise production deployments; ASL 2 license; GitHub src; http://conjars.org
      • partners: Amazon AWS, Microsoft Azure, Hortonworks, MapR, EMC, SpringSource, Cloudera
      • case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, etc.
      • use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, genomics, climatology, etc.
      Notes: Several published case studies about Cascading, Cascalog, Scalding, etc. Wide range of use cases. Significant investment by Twitter, Etsy, and other firms for OSS based on Cascading. Partnerships with the various Hadoop distro vendors, cloud providers, etc.
    • Finance: Ecommerce Risk
      Problem: <1% chargeback rate allowed by Visa; others follow.
      • may leverage the CAPTURE/AUTH wait period
      • Cybersource, Vindicia, others haven't stopped fraud
      >15% chargeback rate is common for mobile in the US:
      • not much info shared with the merchant
      • carrier acts as judge/jury/executioner; customer assumed correct
      Most common: professional fraud (identity theft, etc.)
      • patterns of attack change all the time
      • widespread use of IP proxies, to mask location
      • global market for stolen credit card info
      The other common case is friendly fraud, e.g., a teenager billing to a parent's cell phone.
    • Finance: Ecommerce Risk
      KPI:
      chargeback rate (CB): ground truth for how much fraud the bank/carrier claims; 7-120 day latencies from the bank
      false positive rate (FP): estimated cost; predicts customer support issues, i.e., complaints due to incorrect fraud scores on valid orders (or lies)
      false negative rate (FN): estimated risk; how much fraud may pass undetected in future orders; changes with new product features/services/inventory/marketing
    • Finance: Ecommerce Risk
      Data Science issues:
      • chargeback limits imply few training cases
      • sparse data implies lots of missing values, which must be imputed
      • long latency on chargebacks: "good" flips to "bad"
      • most detection occurs within large-scale batch, while decisions are required during real-time event processing
      • not just one pattern to detect: many, and ever-changing
      • many unknowns: blocked orders scare off professional fraud, so inferences cannot be confirmed
      • cannot simply use raw data as input; requires lots of data preparation and statistical modeling
      • each ecommerce firm has shopping/policy nuances which get exploited differently, making solutions hard to generalize
    • Finance: Ecommerce Risk
      Predictive analytics, batch:
      • cluster/segment customers for expected behaviors
      • adjust for seasonal variation
      • geospatial indexing / Bayesian point estimates (fraud by lat/lng)
      • impute missing values ("guesses" to fill in sparse data)
      • run the anti-fraud classifier (customer 360)
      Predictive analytics, real-time:
      • exponential smoothing (estimators for velocity)
      • calculate running medians (anomaly detection)
      • run the anti-fraud classifier (per order)
      A sketch of the two real-time estimators follows.
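      A minimal sketch of those two real-time estimators in Scala; the alpha, window size, and 3× threshold are illustrative assumptions, not figures from the talk:

        // exponential smoothing: s = alpha*x + (1-alpha)*s, a cheap velocity estimate
        final class ExpSmoother(alpha: Double) {
          private var state: Option[Double] = None
          def update(x: Double): Double = {
            val s = state.fold(x)(prev => alpha * x + (1 - alpha) * prev)
            state = Some(s)
            s
          }
        }

        // windowed running median: a robust baseline for anomaly detection
        final class RunningMedian(window: Int) {
          private val buf = scala.collection.mutable.Queue.empty[Double]
          def update(x: Double): Double = {
            buf.enqueue(x)
            if (buf.size > window) buf.dequeue()
            val sorted = buf.toVector.sorted
            val n = sorted.size
            if (n % 2 == 1) sorted(n / 2)
            else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
          }
        }

        // e.g., flag an account when its smoothed spend velocity jumps far above the median
        val velocity = new ExpSmoother(alpha = 0.3)
        val baseline = new RunningMedian(window = 51)
        def anomalous(spendPerHour: Double): Boolean =
          velocity.update(spendPerHour) > 3.0 * baseline.update(spendPerHour)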
    • Finance: Ecommerce Risk
      1. Data Preparation (batch)
      ‣ ETL from the bank, log sessionization, customer profiles, etc.: large-scale joins of customers + orders
      ‣ apply a time window: too long, and patterns lose currency; too short, and there is not enough wait for chargebacks
      ‣ segment customers: temporary fraud (identity theft which has been resolved); confirmed fraud (chargebacks from the bank); estimated fraud (blocked/banned by Customer Support); valid orders (but different clusters of expected behavior)
      ‣ subsample to rebalance data: produce a training set + test holdout; adjust the balance for FP/FN bias (company risk profile)
    • Finance: Ecommerce Risk
      2. Model Creation (analyst)
      ‣ distinguish between different IV data types: continuous (e.g., age), boolean (e.g., paid lead), categorical (e.g., gender), computed (e.g., geo risk, velocities)
      ‣ use geospatial smoothing for lat/lng
      ‣ determine distributions for IVs
      ‣ adjust IVs for seasonal variation, where appropriate
      ‣ impute missing values based on density functions / medians
      ‣ factor analysis: determine which IVs to keep (too many creates problems)
      ‣ train the model: random forest (RF) classifiers predict likely fraud
      ‣ calculate the confusion matrix (TP/FP/TN/FN), as sketched below
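      A minimal sketch of the confusion-matrix rates referenced above, using the standard definitions:

        // counts from scoring the test holdout against known labels
        case class Confusion(tp: Long, fp: Long, tn: Long, fn: Long) {
          def falsePositiveRate: Double = fp.toDouble / (fp + tn) // valid orders wrongly flagged
          def falseNegativeRate: Double = fn.toDouble / (fn + tp) // fraud passing undetected
          def precision: Double         = tp.toDouble / (tp + fp)
          def recall: Double            = tp.toDouble / (tp + fn)
        }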
    • Finance: Ecommerce Risk
      3. Test Model (analyst/batch loop)
      ‣ calculate estimated fraud rates
      ‣ identify potential found-fraud cases
      ‣ report to Customer Support for review
      ‣ generate risk vs. benefit curves
      ‣ visualize the estimated impact of the new model
      4. Decision (stakeholder)
      ‣ decide risk vs. benefit (minimize fraud + customer support costs)
      ‣ coordinate with the bank/carrier if there are current issues
      ‣ determine go/no-go, when to deploy in production, and the size of the rollout
    • Finance: Ecommerce Risk
      5. Production Deployment (near-time)
      ‣ run the model on an in-memory grid / transaction processing
      ‣ A/B test to verify the model in production (progressive rollout)
      ‣ detect anomalies: use running medians on continuous IVs; use exponential smoothing on computed IVs (velocities); trigger notifications
      ‣ monitor KPI and other metrics in dashboards
    • Finance: Ecommerce Risk
      (architecture diagram: analysts do data prep and training on a laptop, producing a PMML model; Cascading apps score new orders with two risk classifiers, customer 360 and per-order; Hadoop runs batch workloads, including ETL of chargebacks and partner DW data, customer segmentation, and velocity metrics; an IMDG runs real-time workloads for anomaly detection; predictions weigh costs, fraudsters, transactions, and orders)
    • Ecommerce: Marketing Funnel
      Problem:
      • must optimize a large ad-spend budget
      • different vendors report different kinds of metrics
      • some campaigns are much smaller than others
      • seasonal variation distorts performance
      • inherent latency in spend vs. effect
      • ad channels cannot scale up immediately
      • must "scrub" leads to dispute payments/refunds
      • hard to predict ROI for incremental ad spend
      • many issues of diminishing returns in general
    • Ecommerce: Marketing Funnel
      KPI:
      cost per paying user (CPP): must align metrics for different ad channels; generally need to estimate to end-of-month
      customer lifetime value (LTV): big differences based on geographic region, age, gender, etc.; assumes that new customers behave like previous customers
      return on investment (ROI): the relationship between CPP and LTV; adjust to invest in marketing (>CPP) vs. extract profit (>LTV)
      other metrics: reach (how many people get a brand message); customer satisfaction (would recommend to a friend, etc.)
      A worked sketch of these relationships follows.
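      A minimal worked sketch of those KPI relationships in Scala; the exact formulas here are one plausible reading, not definitions from the talk:

        // per-channel figures; estLtv is the estimated customer lifetime value
        case class Channel(spend: Double, payingUsers: Long, estLtv: Double)

        def cpp(c: Channel): Double = c.spend / c.payingUsers      // cost per paying user
        def roi(c: Channel): Double = (c.estLtv - cpp(c)) / cpp(c) // return per acquisition dollar

        val search = Channel(spend = 50000.0, payingUsers = 2000, estLtv = 40.0)
        // cpp(search) == 25.0, roi(search) == 0.6:
        // keep investing while estLtv comfortably exceeds CPP; throttle when it doesn't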
    • Ecommerce: Marketing Funnel
      Predictive analytics, batch:
      • log aggregation, followed with cohort analysis
      • Bayesian point estimates compare different-sized ad tests
      • time series analysis normalizes for seasonal variation
      • geolocation adjusts for regional cost/benefit
      • customer lifetime value estimates the ROI of new leads
      • linear programming models estimate elasticity of demand
      Predictive analytics, real-time:
      • determine whether this is actually a new customer…
      • new: modify the initial UX based on ad channel, region, friends, etc.
      • old: recommend products/services/friends based on behaviors
      • adjust spend on poorly performing channels
      • track back to top referring sites/partners
    • Airlines
      Problem:
      • minimize schedule delays
      • re-route around weather and airport conditions
      • manage supplier channels and inventories to minimize AOG
      KPI: forecast of future passenger demand; customer loyalty; aircraft on ground (AOG); mean time between failures (MTBF)
    • Airlines
      Predictive analytics, batch:
      • predict "last mile" failures
      • optimize capacity utilization
      • operations research problem: optimize stocking / minimize fuel waste
      • boost customer loyalty by adjusting incentives in frequent flyer programs
      Predictive analytics, real-time:
      • forecast schedule delays
      • monitor factors for travel conditions: weather, airports, etc.
    • The Workflow Abstraction
      (flow diagram: Word Count with scrub and stop-word filtering)
      1. Data Science  2. Functional Programming  3. Workflow Abstraction  4. Typical Use Cases  5. Open Data Example
      Notes: a Cascalog app for a mobile data API (recommender service) based on City of Palo Alto Open Data.
    • Palo Alto is quite a pleasant place
      • temperate weather
      • lots of parks, enormous trees
      • great coffeehouses
      • walkable downtown
      • not particularly crowded
      On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside and go for a walk. An example open source project: github.com/Cascading/CoPA/wiki
      Notes: On a summer day in Palo Alto, one of the last things anybody really wants is to be stuck in an office on a long phone call. Instead people walk outside and take their calls, probably heading toward a favorite espresso bar or a frozen yogurt shop. On a hot summer day, knowing a nice quiet route to walk in the shade would be great.
    • 1. Open Data about municipal infrastructure (GIS data: trees, roads, parks)
      + 2. Big Data about where people like to walk (smartphone GPS logs)
      + 3. some curated metadata (which surfaces the value)
      (flow diagram: Word Count with scrub and stop-word filtering)
      => 4. personalized recommendations: "Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo."
    • discovery
      The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates. This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good. paloalto.opendata.junar.com/dashboards/7576/geographic-information/
    • discovery
      GIS about trees in Palo Alto (map overlay of tree locations)
    • discovery
      Raw GIS export as CSV, e.g.:

        Geographic_Information,,,
        "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
        "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Trench Severity: none Trench Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Rutting Severity: none Rutting Extent: 0 Ridability Severity: none Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"

      Notes: Here's what we have to work with: a raw GIS export as CSV, with plenty of errors too, for good measure. This illustrates a great example of "unstructured data". Alligator Severity!
    • discovery

        (defn parse-gis [line]
          "leverages parse-csv for complex CSV format in GIS export"
          (first (csv/parse-csv line)))

        (defn etl-gis [gis trap]
          "subquery to parse data sets from the GIS source tap"
          (<- [?blurb ?misc ?geo ?kind]
              (gis ?line)
              (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
              (:trap (hfs-textline trap))))

      (specify what you require, not how to achieve it… the 80/20 rule of data prep cost)
      Notes: Let's use Cascalog to begin our process of structuring that data. Since the GIS export is vaguely in CSV format, here's a simple way to clean up the data. Referring back to DJ Patil's "Data Jujitsu", that clean-up usually accounts for 80% of project costs.
    • discovery
      (ad-hoc queries get refined into composable predicates)
      Example output record:

        Identifier: 474
        Tree ID: 412
        Tree: 412 site 1 at 115 HAWTHORNE AV
        Tree Site: 1
        Street_Name: HAWTHORNE AV
        Situs Number: 115
        Private: -1
        Species: Liquidambar styraciflua
        Source: davey tree
        Hardscape: None
        37.446001565119,-122.167713417554,0.0
        Point

      Notes: First we load `lein repl` to get an interactive prompt for Clojure, bring the Cascalog libraries into Clojure, define functions to use, and execute queries. Then we convert the queries into composable, logical predicates. The TSV output becomes more structured, while the "bad" data has been trapped into a data set for review.
    • discovery
      (curate valuable metadata)
      Notes: Since we can find species and geolocation for each tree, let's add some metadata to infer other valuable data results, e.g., tree height, based on Wikipedia.org, Calflora.org, USDA.gov, etc.
    • discovery

        (defn get-trees [src trap tree_meta]
          "subquery to parse/filter the tree data"
          (<- [?blurb ?tree_id ?situs ?tree_site ?species ?wikipedia ?calflora
               ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash]
              (src ?blurb ?misc ?geo ?kind)
              (re-matches #"^\s+Private.*Tree ID.*" ?misc)
              (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
              ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
              (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
              (avg ?min_height ?max_height :> ?avg_height)
              (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
              (read-string ?tree_lat :> ?lat)
              (read-string ?tree_lng :> ?lng)
              (geohash ?lat ?lng :> ?geohash)
              (:trap (hfs-textline trap))))

      Example output tuple:

        ?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 2
        ?tree_id     412
        ?situs       115
        ?tree_site   1
        ?species     liquidambar styraciflua
        ?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
        ?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-ca
        ?avg_height  27.5
        ?tree_lat    37.446001565119
        ?tree_lng    -122.167713417554
        ?tree_alt    0.0
        ?geohash     9q9jh0

      Notes: Next, refine the data about trees: join with metadata, calculate estimators, etc. Now we have a data product about trees in Palo Alto, which has been enriched by our process. BTW, those geolocation fields are especially important...
    • discovery

        # run analysis and visualization in R
        library(ggplot2)

        dat_folder <- "~/src/concur/CoPA/out/tree"
        data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
                           sep="\t", quote="", na.strings="NULL",
                           header=FALSE, encoding="UTF8")

        summary(data)

        t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
        trees <- as.data.frame.table(t)
        colnames(trees) <- c("species", "count")

        m <- ggplot(data, aes(x=V8))
        m <- m + ggtitle("Estimated Tree Height (meters)")
        m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

        par(mar = c(7, 4, 4, 2) + 0.1)
        plot(trees, xaxt="n", xlab="")
        axis(1, labels=FALSE)
        text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
             labels=trees$species, xpd=TRUE)
        grid(nx=nrow(trees))

      Notes: Another aspect of the discovery phase is to poke at the data: run summary stats, visualize the data, etc. We'll use RStudio for that...
    • discovery
      Analysis of the tree data (charts: frequency by species, with sweetgum the most common, and a density plot of estimated tree heights in Palo Alto)
    • discovery
      (flow diagram, gis → tree: GIS export → Regex parse-gis → Scrub → Regex parse-tree → join with Tree Metadata on species → estimate height → Geohash, with failure traps)
      Notes: Here's a conceptual flow diagram, which shows a directed acyclic graph (DAG) of data taps, tuple streams, operations, joins, assertions, aggregations, etc.
    • discovery
      In addition, the road data provides:
      • traffic class (arterial, truck route, residential, etc.)
      • traffic counts distribution
      • surface type (asphalt, cement; age)
      This leads to estimators for noise, sunlight reflection, etc.
      Notes: more analysis and visualizations from RStudio: frequency of traffic classes, and a density plot of traffic counts.
    • modeling
      A geohash with 6-digit resolution approximates a 5-block square; centered at lat: 37.445, lng: -122.162, it yields 9q9jh0. A sketch of the encoding follows.
      Notes: Shifting into the modeling phase, we use geohash codes for cheap-and-dirty geospatial indexing suited for parallel processing (Hadoop). Much more effective methods exist; however, this one is simple to show. 6-digit resolution on a geohash generates approximately a 5-block square.
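      A minimal sketch of standard geohash encoding in Scala, interleaving longitude and latitude bits into base-32 digits; this assumes the common public algorithm, and the CoPA app relies on an equivalent `geohash` helper:

        object Geohash {
          private val base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

          def encode(lat: Double, lng: Double, precision: Int = 6): String = {
            var (latMin, latMax) = (-90.0, 90.0)
            var (lngMin, lngMax) = (-180.0, 180.0)
            val sb = new StringBuilder
            var bit = 0
            var ch = 0
            var evenBit = true // even bits refine longitude, odd bits latitude
            while (sb.length < precision) {
              if (evenBit) {
                val mid = (lngMin + lngMax) / 2
                if (lng >= mid) { ch = (ch << 1) | 1; lngMin = mid }
                else { ch = ch << 1; lngMax = mid }
              } else {
                val mid = (latMin + latMax) / 2
                if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid }
                else { ch = ch << 1; latMax = mid }
              }
              evenBit = !evenBit
              bit += 1
              if (bit == 5) { sb.append(base32(ch)); bit = 0; ch = 0 } // 5 bits per base-32 digit
            }
            sb.toString
          }
        }

        // Geohash.encode(37.445, -122.162) == "9q9jh0", the cell shown above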
    • modeling
      Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:

        " -122.161776959558,37.4518836690781,0.0
        " -122.161390381489,37.4516410983794,0.0
        " -122.160786011735,37.4512589903357,0.0
        " -122.160531178368,37.4510977281699,0.0

      (diagram: segment points (lat0, lng0, alt0) through (lat3, lng3, alt3))
      NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
    • modeling
      Our app analyzes each road segment as a data tuple, calculating a center point (lat, lng, alt), then uses a geohash to define a boundary: 9q9jh0. A sketch of that midpoint calculation follows.
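      A minimal sketch in Scala of parsing the scrambled (lng, lat, alt) points and computing a segment's center point; the names here are hypothetical, for illustration:

        case class Pt(lat: Double, lng: Double, alt: Double)

        // raw points look like "-122.161776959558,37.4518836690781,0.0"
        def parsePoint(raw: String): Pt = {
          val Array(lng, lat, alt) = raw.trim.split(",").map(_.toDouble)
          Pt(lat, lng, alt) // unscramble (lng, lat, alt) into (lat, lng, alt)
        }

        def midpoint(pts: Seq[Pt]): Pt = // crude centroid, adequate at city-block scale
          Pt(pts.map(_.lat).sum / pts.size,
             pts.map(_.lng).sum / pts.size,
             pts.map(_.alt).sum / pts.size)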
    • modeling
      Query to join a road segment tuple with all the trees within its geohash boundary. (diagram: the 9q9jh0 cell containing a road segment and nearby trees)
    • modeling
      Use distance-to-midpoint to filter trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance from the road segment, as an estimator for shade: ∑(h·d). Also calculate estimators for traffic frequency and noise.
      Notes: Use distance to midpoint to filter out trees which are too far away to provide shade. Calculate a sum of moments for tree height × distance from center; approximate, but pretty good. Also calculate estimators for traffic frequency and noise. A plain Scala sketch follows the Cascalog query below.
    • modeling

        (defn get-shade [trees roads]
          "subquery to join tree and road estimates, maximize for shade"
          (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt
               ?road_metric ?tree_metric]
              (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt
                     ?geohash ?traffic_count _ ?traffic_class _ _ _ _)
              (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
              (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
              (read-string ?avg_height :> ?height)
              ;; limit to trees which are higher than people
              (> ?height 2.0)
              (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
              ;; limit to trees within a one-block radius (not meters)
              (<= ?distance 25.0)
              (/ ?height ?distance :> ?tree_moment)
              (c/sum ?tree_moment :> ?sum_tree_moment)
              ;; magic number 200000.0 used to scale tree moment based on median
              (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))

      Notes: We also filter these estimators, based on a few magic numbers obtained during analysis in R.
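      For readers who don't know Cascalog, here is a minimal Scala sketch of the same shade estimator. The distance function is an assumed stand-in (a planar approximation in meters); the actual `tree-distance` helper and its units may differ:

        case class TreeEst(lat: Double, lng: Double, avgHeight: Double)

        // assumed stand-in for tree-distance: planar approximation, in meters
        def treeDistance(lat1: Double, lng1: Double, lat2: Double, lng2: Double): Double = {
          val x = (lng2 - lng1) * math.cos(math.toRadians((lat1 + lat2) / 2))
          val y = lat2 - lat1
          math.sqrt(x * x + y * y) * 111320.0 // degrees to meters, approx.
        }

        def treeMetric(trees: Seq[TreeEst], roadLat: Double, roadLng: Double): Double = {
          val moments = for {
            t <- trees
            if t.avgHeight > 2.0 // limit to trees higher than people
            d = treeDistance(t.lat, t.lng, roadLat, roadLng)
            if d <= 25.0 // one-block radius (meters here; the query uses other units)
          } yield t.avgHeight / d // per-tree moment, as in the query above
          moments.sum / 200000.0 // same magic scaling number as the query
        }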
    • modeling
      (flow diagram, shade: filter by tree height → calculate distance → filter by distance → compute moments → sum moments → filter sum_moment → join with the road traffic estimate → shade metric per road segment)
    • modeling

        (defn get-gps [gps_logs trap]
          "subquery to aggregate and rank GPS tracks per user"
          (<- [?uuid ?geohash ?gps_count ?recent_visit]
              (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt
                        ?speed ?heading ?elapsed ?distance)
              (read-string ?gps_lat :> ?lat)
              (read-string ?gps_lng :> ?lng)
              (geohash ?lat ?lng :> ?geohash)
              (c/count :> ?gps_count)
              (date-num ?date :> ?visit)
              (c/max ?visit :> ?recent_visit)))

      Example output:

        ?uuid                             ?geohash  ?gps_count  ?recent_visit
        cf660e041e994929b37cc5645209c8ae  9q8yym    7           1972376866448
        342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
        32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
        342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
        342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
        342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
        482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
        b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348

      Notes: Here's a Cascalog function to aggregate GPS tracks per user; in other words, behavioral targeting. This shows aggregation in Cascalog, the subtle but hard part. Now we have a data product about walkable road segments in Palo Alto.
    • modeling

        (defn get-reco [tracks shades]
          "subquery to recommend road segments based on GPS tracks"
          (<- [?uuid ?road ?geohash ?lat ?lng ?alt
               ?gps_count ?recent_visit ?road_metric ?tree_metric]
              (tracks ?uuid ?geohash ?gps_count ?recent_visit)
              (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))

      Recommenders often combine multiple signals, via weighted averages, to rank personalized results:
      • GPS of person ∩ road segment
      • frequency and recency of visit
      • traffic class and rate
      • road albedo (sunlight reflection)
      • tree shade estimator
      Adjusting the mix allows for further personalization at the end use. (a ranking sketch follows)
      Notes: One approach to building commercial recommender systems is to take a vector of different preference metrics, combine them into a single sortable value, then rank the results before making personalized suggestions. The resulting data in the "reco" output set produces exactly that. "tracks" represents behavioral targeting, while "shades" represents our inventory. Overall, the resulting app needs to enable feedback loops involving customers, their GPS tracks, their profile settings, etc.
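      A minimal sketch of that weighted-average ranking in Scala; the weights and the recency decay are illustrative assumptions, not values from the CoPA project:

        case class Reco(road: String, gpsCount: Long, recentVisit: Long,
                        roadMetric: Double, treeMetric: Double)

        def score(r: Reco, now: Long): Double = {
          val recency = 1.0 / (1.0 + (now - r.recentVisit).toDouble / 86400000.0) // decay per day
          0.4 * r.treeMetric +                    // prefer shade
          0.3 * r.roadMetric +                    // prefer quiet, low-albedo roads
          0.2 * math.log1p(r.gpsCount.toDouble) + // frequency of visits
          0.1 * recency                           // recency of visits
        }

        // rank the joined "reco" tuples into the top-k personalized suggestions
        def topReco(recos: Seq[Reco], now: Long, k: Int = 3): Seq[Reco] =
          recos.sortBy(r => -score(r, now)).take(k)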
    • apps
      A top recommendation:
      ‣ addr: 115 HAWTHORNE AVE
      ‣ lat/lng: 37.446, -122.168
      ‣ geohash: 9q9jh0
      ‣ tree: 413 site 2
      ‣ species: Liquidambar styraciflua
      ‣ est. height: 23 m
      ‣ shade metric: 4.363
      ‣ traffic: local residential, light traffic
      ‣ recent visit: 1972376952532
      ‣ a short walk from my train stop ✔
      Notes: One of the top recommendations for me is about two blocks from my train stop, where a couple of really big American sweetgum trees provide ample shade on a residential street with not much traffic.
    • references…
      "Enterprise Data Workflows with Cascading", O'Reilly, 2013. amazon.com/dp/1449358721
      Notes: Some of this material comes from an upcoming O'Reilly book, "Enterprise Data Workflows with Cascading". It should be in Rough Cuts soon, and is scheduled to be out in print this June.
    • drill-down…
      blog, dev community, code/wiki/gists, maven repo, commercial products, career opportunities:
      cascading.org
      zest.to/group11
      github.com/Cascading
      conjars.org
      goo.gl/KQtUL
      concurrentinc.com
      Join us for very interesting work! Copyright @2013, Concurrent, Inc.
      Notes: Links to our open source projects, developer community, etc. Contact me @pacoid, http://concurrentinc.com/ (we're hiring too!)