“Using Cascalog to build
            an app based on
            City of Palo Alto Open Data”

               Paco Nathan
               Concurrent, Inc.
               San Francisco, CA
               @pacoid

               Copyright @2013, Concurrent, Inc.




Monday, 28 January 13                                                                                                1
This project began as a machine
          learning workshop for a graduate
          seminar at CMU West

          Many thanks to:

          Stuart Evans,
          CMU Distinguished Service Professor

          Jonathan Reichental,
          City of Palo Alto CIO

          We use Cascalog to develop
          a Big Data workflow

          Open Source:
          github.com/Cascading/CoPA/wiki




Palo Alto is generally quite
          a pleasant place

           • temperate weather
           • lots of parks, enormous trees
           • great coffeehouses
           • walkable downtown
           • not particularly crowded
           • friendly VCs (sort of)

          On a nice summer day, who wants
          to be stuck indoors on a phone call?
          Instead, take it outside –
          go for a walk


Surely, there must be
          an app for that…

          But wait, there isn’t?

          So let’s build one!




                                   source: Apple




process




source: algaelab.org




1. unstructured data about municipal infrastructure
          (GIS data: trees, roads, parks)
                                              ✚
          2. unstructured data about where people like to walk
          (smartphone GPS logs)
                                              ✚
          3. a wee bit o’ curated metadata

          4. personalized recommendations:
          “Find a shady spot on a summer day in which to walk
           near downtown Palo Alto. While on a long conference call.
           Sippin’ a latte or enjoying some fro-yo.”


“unstructured” vs. “structured” data
          is actually quite a Big Debate
          refer back to Edgar Codd 1969
          to learn about the Relational Model
          relational != SQL
          but I digress…




Data Science work must focus on
          the process of structuring data
          which must occur long before the
          large-scale joins, predictive models,
          visualizations, etc.
          So, the process of structuring data is
          what we examine here:
          i.e., how to build workflows
          for Big Data


          thank you Dr. Codd
          “A relational model of data for large shared data banks”
          dl.acm.org/citation.cfm?id=362685




references

                by DJ Patil

                Data Jujitsu
                O’Reilly, 2012
                amazon.com/dp/B008HMN5BE

                Building Data Science Teams
                O’Reilly, 2011
                amazon.com/dp/B005O4U3ZE

references

                by Leo Breiman
                Statistical Modeling:
                The Two Cultures
                Statistical Science, 2001
                bit.ly/eUTh9L

                also check out RStudio:
                rstudio.org/
                rpubs.com/

Generally speaking, we could approach the matter of developing
          an Open Data app through these steps:
           • clean up the raw, unstructured data from CoPA download (ETL)
           • before modeling, perform visualization and analysis in RStudio
           • spend time on ideation and research for potential use cases
           • iterate on business process for the app workflow
           • integrate with use cases represented by the workflow taps
           • apply best practices and TDD at scale
           • …PROFIT!



                                                          source: South Park




In terms of actual process used in
          Data Science, here’s how my teams
          have worked:

             discovery      help people ask the
                            right questions

             modeling       allow automation to
                            place informed bets

             integration    deliver products at
                            scale to customers

             apps           build smarts into
                            product features

             systems        keep infrastructure
                            running, cost-effective


For the process used with this Open Data app,
          we chose to use Cascalog
          by Nathan Marz, Sam Ritchie, et al., 2010
          a DSL in Clojure which implements
          Datalog, backed by Cascading


          Some aspects of CS theory:

           • Functional Relational Programming
           • mitigates Accidental Complexity
           • has been compared with Codd 1969

          github.com/nathanmarz/cascalog/wiki



Q:
            Who uses Cascalog, other than Twitter?

          A:
           • Climate Corp (they’re hiring, ask for Crea)
           • Factual
           • Nokia Maps
           • Harvard School of Public Health
           • YieldBot (PDX)
           • uSwitch (London)
           • etc.



pro:
           • 10:1 reduction in code volume compared to SQL
           • most advanced uses of Cascading
           • Leiningen build: simple, no surprises, in Clojure itself
           • test-driven development (TDD) for Big Data
           • fault-tolerant workflows which are simple to follow
           • machine learning, map-reduce, etc., started in LISP
               years ago anywho
          con:
           • learning curve, limited number of Clojure developers
           • aggregators are the magic, those take effort to learn


Accidental Complexity:
          Not O(N^2) complexity, but the costs of software
          engineering at scale over time
          What happens when you build recommenders,
          then go work on other projects for six months?
          What does it cost others to maintain your apps?
          Cascalog allows for leveraging the same framework,
          same code base, from Discovery phase through
          to Systems phase
          It focuses on the process of structuring data:
          specify what you require, not how it must be achieved
          Huge implications for software engineering




discovery




source: 2001 A Space Odyssey




discovery
          The City of Palo Alto recently began to support
          Open Data to give the local community greater
          visibility into how their city government operates
          This effort is intended to encourage students,
          entrepreneurs, local organizations, etc., to build
          new apps which contribute to the public good
          paloalto.opendata.junar.com/dashboards/7576/
          geographic-information/




discovery
          GIS about trees in Palo Alto:




discovery
          GIS about roads in Palo Alto:




discovery
      Geographic_Information,,,

          "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","
          Private:        -1      Tree ID:     29      Street_Name:      ADDISON AV        Situs
          Number:        203      Tree Site:      2      Species:    Celtis australis
          Source:        davey tree        Protected:          Designated:           Heritage:
          Appraised Value:               Hardscape:       None     Identifier:      40      Active
          Numeric:        1      Location Feature ID:        13872      Provisional:
          Install Date:             ","37.4409634615283,-122.15648458861,0.0 ","Point"
          "Wilkie Way from West Meadow Drive to Victoria Place","                   Sequence:
          20         Street_Name:      Wilkie Way        From Street PMMS:       West Meadow
          Drive         To Street PMMS:       Victoria Place        Street ID:      598 (Wilkie
          Wy, Palo Alto)           From Street ID PMMS:        689      To Street ID PMMS:
          567         Year Constructed:       1950       Traffic Count:     596      Traffic
          Index:        residential local         Traffic Class:      local residential
          Traffic Date:          08/24/90      Paving Length:       208     Paving Width:        40
          Paving Area:         8320      Surface Type:       asphalt concrete        Surface
          Thickness:         2.0      Base Type Pvmt:       crusher run base        Base
          Thickness:         6.0      Soil Class:       2      Soil Value:      15      Curb Type:
          Curb Thickness:               Gutter Width:       36.0     Book:     22      Page:     1
          District Number:          18      Land Use PMMS:       1    Overlay Year:        1990
          Overlay Thickness:           1.5     Base Failure Year:        1990      Base Failure
          Thickness:         6     Surface Treatment Year:              Surface Treatment
          Type:             Alligator Severity:        none      Alligator Extent:       0
          Block Severity:          none      Block Extent:       0    Longitude and
          Transverse Severity:           none       Longitude and Transverse Extent:          0
          Ravelling Severity:           none      Ravelling Extent:       0      Ridability
          Severity:        none     Trench Severity:        none     Trench Extent:        0

                        (um, bokay…)
discovery
          (defn parse-gis
            "leverages parse-csv for complex CSV format in GIS export"
            [line]
            ;; parse-csv returns a seq of rows; one input line yields one row
            (first (csv/parse-csv line)))

          (defn etl-gis
            "subquery to parse data sets from the GIS source tap"
            [gis trap]
            (<- [?blurb ?misc ?geo ?kind]
                (gis ?line)
                (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
                ;; tuples which fail to parse go to the trap tap
                (:trap (hfs-textline trap))))




                        (specify what you require,
                          not how to achieve it…
                           addressing the 80%)
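
          The same ETL step can be sketched outside of Cascalog. The following is a minimal plain-Python illustration, not the project's code: it assumes each GIS record sits on a single line with exactly four CSV fields, and the names `parse_gis`, `etl_gis`, and the `trapped` list (standing in for the trap tap) are hypothetical. The sample line abbreviates the misc field of the tree record shown earlier.

```python
import csv

def parse_gis(line):
    """Parse one line of the GIS export into (blurb, misc, geo, kind)."""
    row = next(csv.reader([line]))
    if len(row) != 4:
        raise ValueError(f"expected 4 fields, got {len(row)}")
    return tuple(row)

def etl_gis(lines):
    """Return (records, trapped): parsed tuples plus the lines that failed."""
    records, trapped = [], []
    for line in lines:
        try:
            records.append(parse_gis(line))
        except (ValueError, csv.Error):
            # Assumption: a plain list plays the role of the failure trap.
            trapped.append(line)
    return records, trapped

sample = [
    # One record, with the misc field shortened for readability.
    '"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl",'
    '" Private: -1 Tree ID: 29 Species: Celtis australis",'
    '"37.4409634615283,-122.15648458861,0.0 ","Point"',
    # A malformed line, which should land in the trap.
    "only,two",
]

records, trapped = etl_gis(sample)
blurb, misc, geo, kind = records[0]
print(kind, len(trapped))
```

          The quoted first field contains a comma, which is why a CSV parser is used here rather than a plain split.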


discovery




                         (convert ad-hoc queries
                        into logical propositions)

discovery
          Identifier:   474
          Tree ID:      412
          Tree:         412 site 1 at 115 HAWTHORNE AV
          Tree Site:    1
          Street_Name: HAWTHORNE AV
          Situs Number: 115
          Private:      -1
          Species:      Liquidambar styraciflua
          Source:       davey tree
          Hardscape:    None
          37.446001565119,-122.167713417554,0.0
          Point




                        (obtain recognizable
                              results)



discovery




                        (curate valuable metadata)



discovery
          (defn get-trees
            "subquery to parse/filter the tree data"
            [src trap tree_meta]
            (<- [?blurb ?tree_id ?situs ?tree_site
                 ?species ?wikipedia ?calflora ?avg_height
                 ?tree_lat ?tree_lng ?tree_alt ?geohash]
                (src ?blurb ?misc ?geo ?kind)
                ;; keep only the tree records
                (re-matches #"^\s+Private.*Tree ID.*" ?misc)
                (parse-tree
                   ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
                ;; scrub the species name, then join with the curated metadata
                ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
                (tree_meta
                   ?species ?wikipedia ?calflora ?min_height ?max_height)
                (avg ?min_height ?max_height :> ?avg_height)
                (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
                (read-string ?tree_lat :> ?lat)
                (read-string ?tree_lng :> ?lng)
                (geohash ?lat ?lng :> ?geohash)
                (:trap (hfs-textline trap))))
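
          To make the parse, scrub, and join steps concrete, here is a rough plain-Python sketch. It is an illustration under assumptions, not the project's `parse-tree`: the regular expression is a guess based on the field names in the sample export, `parse_tree` and `join_height` are hypothetical names, and the heights in `tree_meta` are placeholder numbers rather than the curated metadata.

```python
import re

# Hypothetical pattern: field names come from the sample export,
# the pattern itself is an assumption.
TREE_PATTERN = re.compile(
    r"^\s+Private:\s+(\S+)\s+Tree ID:\s+(\d+)\s+.*?"
    r"Situs Number:\s+(\d+)\s+Tree Site:\s+(\d+)\s+"
    r"Species:\s+(.*?)\s+Source:"
)

def parse_tree(misc):
    """Extract tree fields from the misc blob; None if it is not a tree."""
    m = TREE_PATTERN.match(misc)
    if m is None:
        return None
    priv, tree_id, situs, tree_site, raw_species = m.groups()
    return {
        "priv": priv,
        "tree_id": tree_id,
        "situs": situs,
        "tree_site": tree_site,
        # Scrub: trim and lower-case, so the name can serve as a join key.
        "species": raw_species.strip().lower(),
    }

def join_height(tree, tree_meta):
    """Inner join on species; adds the mean of min and max height."""
    meta = tree_meta.get(tree["species"])
    if meta is None:
        return None  # no metadata for this species, so the record drops out
    min_height, max_height = meta
    return dict(tree, avg_height=(min_height + max_height) / 2.0)

# Placeholder heights for illustration only.
tree_meta = {"liquidambar styraciflua": (10.0, 20.0)}

misc = ("  Private:  -1  Tree ID:  412  Street_Name:  HAWTHORNE AV  "
        "Situs Number:  115  Tree Site:  1  "
        "Species:  Liquidambar styraciflua  Source:  davey tree")

tree = join_height(parse_tree(misc), tree_meta)
print(tree["species"], tree["avg_height"])
```

          Lower-casing the species before the join matters because the metadata is keyed on a normalized name, while the export preserves the original capitalization.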




discovery
          ?blurb        Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
          ?tree_id      412
          ?situs        115
          ?tree_site    1
          ?species      liquidambar styraciflua
          ?wikipedia    http://en.wikipedia.org/wiki/Liquidambar_styraciflua
          ?calflora     http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
          ?avg_height   27.5
          ?tree_lat     37.446001565119
          ?tree_lng     -122.167713417554
          ?tree_alt     0.0
          ?geohash      9q9jh0




                        (et voilà, a data product)
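
          The geohash field is what later lets trees and roads be joined by location. As a sketch of how such a value can be computed, here is the standard geohash encoding in plain Python. The function name `geohash_encode` is hypothetical, and a precision of six characters is assumed because that is the length shown in the output above; the deck's own `geohash` function may be implemented differently.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    even = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lng_range, lng) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half of the interval
        even = not even
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

# Coordinates of tree 412 from the output above.
print(geohash_encode(37.446001565119, -122.167713417554))  # 9q9jh0
```

          Nearby points usually share a geohash prefix, so grouping on a six-character geohash puts a tree and the road segment next to it in the same cell.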



discovery
          # run some analysis and visualization in R
          library(ggplot2)

          dat_folder <- '~/src/concur/CoPA/out/tree'
          data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
                               sep="\t", quote="", na.strings="NULL",
                               header=FALSE, encoding="UTF-8")

          summary(data)

          # top 20 species by count (V5 is the species column)
          t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
          trees <- as.data.frame.table(t)
          colnames(trees) <- c("species", "count")

          # histogram and density of estimated height (V8 is avg_height)
          m <- ggplot(data, aes(x=V8))
          m <- m + ggtitle("Estimated Tree Height (meters)")
          m + geom_histogram(aes(y = ..density.., fill = ..count..)) +
          geom_density()

          # plot species counts with rotated axis labels
          par(mar = c(7, 4, 4, 2) + 0.1)
          plot(trees, xaxt="n", xlab="")
          axis(1, labels=FALSE)
          text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
               labels=trees$species, xpd=TRUE)
          grid(nx=nrow(trees))


discovery




                        sweetgum




discovery

          [flow diagram: GIS export → Regex parse-gis → src → Regex parse-tree
           → Scrub species → Join with Tree Metadata → Estimate height
           → Geohash → tree; Failure Traps]

                          (flow diagram, gis → tree)



definitions
          The conceptual flow diagram shows a directed, acyclic graph (DAG)
          of taps, tuple streams, functions, joins, aggregations, assertions, etc.
          Cascading is formally a pattern language – patterns of “plumbing”
          fit together to ensure best practices for large-scale parallel processing
          in risk-averse environments – hard requirements of Enterprise IT

          In other words, Cascading forces functional programming
          through an API for JVM-based languages such as Java, Scala, Clojure
          Through this approach, we define Enterprise Data Workflows


definitions
         pattern language: a structured method for
         solving large, complex design problems, where
         the syntax of the language promotes the use
         of best practices

         amazon.com/dp/0195019199


         design patterns: originated in consensus
         negotiation for architecture, later used in
         OOP software engineering

         amazon.com/dp/0201633612



discovery
          (defn get-roads
            "subquery to parse/filter the road data"
            [src trap road_meta]
            (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo
                 ?min_lat ?min_lng ?min_alt ?geohash
                 ?traffic_count ?traffic_index ?traffic_class
                 ?paving_length ?paving_width ?paving_area ?surface_type]
                (src ?blurb ?misc ?geo ?kind)
                ;; keep only the road records
                (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
                (parse-road ?misc :> _
                   ?traffic_count ?traffic_index ?traffic_class
                   ?paving_length ?paving_width ?paving_area ?surface_type
                   ?overlay_year ?bike_lane ?bus_route ?truck_route)
                (road_meta ?surface_type ?albedo_new ?albedo_worn)
                (estimate-albedo
                 ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
                ;; split each road into segments and take their midpoints
                (bigram ?geo :> ?pt0 ?pt1)
                (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
                ;; why filter for min? because there are geo duplicates..
                (c/min ?lat :> ?min_lat)
                (c/min ?lng :> ?min_lng)
                (c/min ?alt :> ?min_alt)
                (geohash ?min_lat ?min_lng :> ?geohash)
                (:trap (hfs-textline trap))))


discovery
          ?blurb          Hawthorne Avenue from Alma Street to High Street
          ?traffic_count  3110
          ?traffic_class  local residential
          ?surface_type   asphalt concrete
          ?albedo         0.12
          ?min_lat        37.446140860599854
          ?min_lng        -122.1674652295435
          ?min_alt        0.0
          ?geohash        9q9jh0




                        (another data product)



discovery
          The road data provides:

           • traffic class (arterial, truck route, residential, etc.)
           • traffic counts distribution
           • surface type (asphalt, cement; age)
          This leads to estimators for noise, reflection, etc.




discovery

          (flow diagram, gis → road)
          stages: GIS export, Regex parse-gis, src, road, Regex parse-road,
          Join with Road Metadata, Estimate Albedo, Road Segments, Geohash, road;
          Failure Traps



modeling




source: America’s Next Top Model




modeling

          GIS data from Palo Alto provides us with
          geolocation about each item in the export:
          latitude, longitude, altitude

          Geo data is great for managing municipal
          infrastructure as well as for mobile apps

          Predictive modeling in our Open Data
          example focuses on leveraging geolocation

          We use spatial indexing by creating
          a grid of geohash values, for efficient
          parallel processing

          Cascalog queries collect items with the
          same geohash values, using them as keys
          for large-scale joins (Hadoop)



modeling




                                 geohash with 6-digit resolution
                                 approximates a 5-block square
                                 centered lat: 37.445, lng: -122.162


                        9q9jh0
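
How a geohash such as 9q9jh0 comes about can be sketched in plain Clojure. This is a minimal illustration of the standard geohash bit-interleaving scheme, not the `org.clojars.sunng/geohash` library the app depends on; the names `base32-alphabet`, `bits->char`, and `geohash-encode` are hypothetical.

```clojure
;; Standard geohash base-32 alphabet (omits a, i, l, o).
(def base32-alphabet "0123456789bcdefghjkmnpqrstuvwxyz")

(defn bits->char
  "Map five bits (most significant first) to a base-32 character."
  [bits]
  (nth base32-alphabet (reduce (fn [acc b] (+ (* 2 acc) b)) 0 bits)))

(defn geohash-encode
  "Encode latitude/longitude as a geohash of `precision` characters.
  Bits alternate between longitude and latitude, starting with longitude;
  each bit records which half of the current interval holds the coordinate."
  [lat lng precision]
  (loop [lat-range [-90.0 90.0]
         lng-range [-180.0 180.0]
         lng-turn? true
         bits      []]
    (if (= (count bits) (* 5 precision))
      (apply str (map bits->char (partition 5 bits)))
      (let [[lo hi] (if lng-turn? lng-range lat-range)
            value   (if lng-turn? lng lat)
            mid     (/ (+ lo hi) 2.0)
            upper?  (>= value mid)
            halved  (if upper? [mid hi] [lo mid])]
        (recur (if lng-turn? lat-range halved)
               (if lng-turn? halved lng-range)
               (not lng-turn?)
               (conj bits (if upper? 1 0)))))))

;; the road midpoint shown in the earlier data product
(println (geohash-encode 37.446140860599854 -122.1674652295435 6))
;; prints 9q9jh0
```

Every point whose first 30 interleaved bits agree lands in the same 6-character cell, which is what makes the geohash usable as a join key.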




modeling

         Each road in the GIS export is listed as a block
         between two cross roads, and each may have
         multiple road segments to represent turns:
               -122.161776959558,37.4518836690781,0.0
               -122.161390381489,37.4516410983794,0.0
               -122.160786011735,37.4512589903357,0.0
               -122.160531178368,37.4510977281699,0.0

                                      ( lat1, lng1, alt1 )
                                                                              ( lat3, lng3, alt3 )




             ( lat0, lng0, alt0 )
                                                             ( lat2, lng2, alt2 )



          NB: segments in the raw GIS have the order
          of geo coordinates scrambled: (lng, lat, alt)


modeling

         Our app analyzes each road segment as a data tuple,
         calculating the center point for each:



                                                  ( lat, lng, alt )
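
The center-point step can be sketched in plain Clojure as follows. This is an illustration only, not the app's own `bigram` and `midpoint` functions; `parse-point`, `segment-midpoints`, and `raw-block` are hypothetical names, and averaging coordinates is assumed to be an acceptable flat approximation over a single city block.

```clojure
(require '[clojure.string :as string])

(defn parse-point
  "Parse one raw \"lng,lat,alt\" string into a [lat lng alt] vector,
  undoing the (lng, lat, alt) order used in the raw GIS export."
  [s]
  (let [[lng lat alt] (map #(Double/parseDouble (string/trim %))
                           (string/split s #","))]
    [lat lng alt]))

(defn segment-midpoints
  "Pair consecutive points into segments and average each pair."
  [points]
  (for [[p0 p1] (partition 2 1 points)]
    (mapv (fn [a b] (/ (+ a b) 2.0)) p0 p1)))

;; the four points listed for one block in the GIS export
(def raw-block
  ["-122.161776959558,37.4518836690781,0.0"
   "-122.161390381489,37.4516410983794,0.0"
   "-122.160786011735,37.4512589903357,0.0"
   "-122.160531178368,37.4510977281699,0.0"])

;; four points yield three segments, hence three midpoints
(def midpoints (segment-midpoints (map parse-point raw-block)))
```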




modeling

         Then uses a geohash to define a grid cell,
         as a boundary (or “canopy”):




                           9q9jh0




modeling

         Query to join a road segment tuple with all the trees
         within its geohash boundary:




                           9q9jh0




modeling

         Use distance-to-midpoint to filter trees which are
         too far away to provide shade:



                                   X                         X




                                          X




modeling

         Calculate a sum of moments for tree height × distance
         from road segment, as an estimator for shade:




                    ∑( h·d )

          We also calculate estimators for traffic frequency
          and noise


modeling
          (defn get-shade
            "subquery to join tree and road estimates, maximize for shade"
            [trees roads]
            (<- [?road_name ?geohash ?road_lat ?road_lng
                 ?road_alt ?road_metric ?tree_metric]
                (roads ?road_name _ _ _
                 ?albedo ?road_lat ?road_lng ?road_alt ?geohash
                 ?traffic_count _ ?traffic_class _ _ _ _)
                (road-metric
                 ?traffic_class ?traffic_count ?albedo :> ?road_metric)
                (trees _ _ _ _ _ _ _
                 ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
                (read-string ?avg_height :> ?height)
                ;; limit to trees which are higher than people
                (> ?height 2.0)
                (tree-distance
                 ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
                ;; limit to trees within a one-block radius (not meters)
                (<= ?distance 25.0)
                (/ ?height ?distance :> ?tree_moment)
                (c/sum ?tree_moment :> ?sum_tree_moment)
                ;; magic number 200000.0 used to scale tree moment
                ;; based on median
                (/ ?sum_tree_moment 200000.0 :> ?tree_metric)
             ))
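
The filtering and aggregation inside this query can be sketched in plain Clojure, outside of Cascalog. This is a minimal sketch, assuming each tree near one road segment is a map with hypothetical `:height` and `:distance` keys; the thresholds and the 200000.0 scale factor come from the query, while the guard against a zero distance is an addition that the query does not have.

```clojure
(defn tree-metric
  "Shade estimate for one road segment: keep trees taller than 2.0 and
  within 25.0 distance units, sum height/distance, then scale."
  [trees]
  (let [moments (for [{:keys [height distance]} trees
                      :when (and (> height 2.0)
                                 (<= distance 25.0)
                                 (pos? distance))] ; added guard, not in the query
                  (/ height distance))]
    (/ (reduce + 0.0 moments) 200000.0)))

;; only the first tree passes both filters
(def sample-trees
  [{:height 10.0 :distance 5.0}    ; kept: moment 2.0
   {:height 1.5  :distance 3.0}    ; dropped: not taller than 2.0
   {:height 20.0 :distance 40.0}]) ; dropped: too far away

(println (tree-metric sample-trees))
```

Dividing height by distance follows the query as written: a tall tree close to the segment midpoint contributes the most.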




modeling
          ?road_name    Hawthorne Avenue from Alma Street to High Street
          ?geohash      9q9jh0
          ?road_lat     37.446140860599854
          ?road_lng     -122.1674652295435
          ?road_alt     0.0
          ?road_metric  [1.0 0.5488121277250486 0.88]
          ?tree_metric  4.36321007861036




                        (another data product)



modeling


          (flow diagram, shade)
          stages: tree, Filter height; road, Estimate traffic; Join,
          Calculate distance, Filter distance, Sum moment, Filter sum_moment, shade



modeling
          (defn get-gps
            "subquery to aggregate and rank GPS tracks per user"
            [gps_logs trap]
            (<- [?uuid ?geohash ?gps_count ?recent_visit]
                (gps_logs
                 ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading
                 ?elapsed ?distance)
                (read-string ?gps_lat :> ?lat)
                (read-string ?gps_lng :> ?lng)
                (geohash ?lat ?lng :> ?geohash)
                (c/count :> ?gps_count)
                (date-num ?date :> ?visit)
                (c/max ?visit :> ?recent_visit)
           ))



                         (behavioral targeting:
                        aggregate GPS tracks by
                          recency, frequency)
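
The same recency/frequency aggregation can be sketched in plain Clojure to show what the query computes per (uuid, geohash) pair. This is an illustration under assumptions: rows are maps with hypothetical `:uuid`, `:geohash`, and `:visit` keys, where `:visit` is already a numeric timestamp.

```clojure
(defn aggregate-tracks
  "Group GPS rows by user and grid cell; count visits (frequency)
  and keep the latest timestamp (recency)."
  [rows]
  (for [[[uuid geohash] visits] (group-by (juxt :uuid :geohash) rows)]
    {:uuid         uuid
     :geohash      geohash
     :gps_count    (count visits)
     :recent_visit (apply max (map :visit visits))}))

;; made-up rows for illustration
(def sample-rows
  [{:uuid "u1" :geohash "9q9jh0" :visit 10}
   {:uuid "u1" :geohash "9q9jh0" :visit 30}
   {:uuid "u2" :geohash "9q9jh0" :visit 20}])

(doseq [row (aggregate-tracks sample-rows)]
  (println row))
```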


modeling




          (flow diagram, gps)
          stages: gps logs, Geohash, Count gps_count, Max recent_visit, gps



modeling
          ?uuid                              ?geohash   ?gps_count   ?recent_visit
          cf660e041e994929b37cc5645209c8ae   9q8yym     7            1972376866448
          342ac6fd3f5f44c6b97724d618d587cf   9q9htz     4            1972376690969
          32cc09e69bc042f1ad22fc16ee275e21   9q9hv3     3            1972376670935
          342ac6fd3f5f44c6b97724d618d587cf   9q9hv3     3            1972376691356
          342ac6fd3f5f44c6b97724d618d587cf   9q9hv6     1            1972376691180
          342ac6fd3f5f44c6b97724d618d587cf   9q9hv8     18           1972376691028
          342ac6fd3f5f44c6b97724d618d587cf   9q9hv9     7            1972376691101
          342ac6fd3f5f44c6b97724d618d587cf   9q9hvb     22           1972376691010
          342ac6fd3f5f44c6b97724d618d587cf   9q9hwn     13           1972376690782
          342ac6fd3f5f44c6b97724d618d587cf   9q9hwp     58           1972376690965
          482dc171ef0342b79134d77de0f31c4f   9q9jh0     15           1972376952532
          b1b4d653f5d9468a8dd18a77edcc5143   9q9jh0     18           1972376945348




                        (GPS personalization)



modeling
          (defn get-reco
            "subquery to recommend road segments based on GPS tracks"
            [tracks shades]
            (<- [?uuid ?road ?geohash ?lat ?lng ?alt
                 ?gps_count ?recent_visit ?road_metric ?tree_metric]
                (tracks ?uuid ?geohash ?gps_count ?recent_visit)
                (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)
             ))




                        (finally, the recommender)



modeling

         Recommenders combine multiple signals,
         generally via weighted averages, to rank
         personalized results:

           • GPS of person ∩ road segment
           • frequency and recency of visit
           • traffic class and rate
           • road albedo (sunlight reflection)
           • tree shade estimator
         Adjusting the mix allows for further
         personalization at the end use
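
A weighted average over such signals might look like the sketch below. This is a hedged illustration, not part of the CoPA app: the signal names, the weights, and the assumption that every signal is already normalized to the range 0 to 1 are all hypothetical.

```clojure
(defn weighted-score
  "Weighted average of the signals; a missing signal counts as 0.0."
  [weights signals]
  (/ (reduce + (for [[k w] weights] (* w (get signals k 0.0))))
     (reduce + (vals weights))))

(defn rank-candidates
  "Sort candidate road segments by descending weighted score."
  [weights candidates]
  (sort-by #(weighted-score weights (:signals %)) > candidates))

;; hypothetical mix: this user cares most about shade
(def weights {:shade 2.0 :quiet 1.0 :recency 1.0})

(def candidates
  [{:road "B" :signals {:shade 0.0 :quiet 1.0 :recency 1.0}}
   {:road "A" :signals {:shade 1.0 :quiet 0.5 :recency 0.0}}])

(println (map :road (rank-candidates weights candidates)))
```

Changing the values in `weights` is the "adjusting the mix" step: the same batch output can be re-ranked per user without rerunning the workflow.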




integration




source: Wolfram




integration

         Hadoop is rarely ever used in isolation
         System integration is a hard problem in Big Data,
         especially social aspects: breaking down silos
         Cascading was built for this purpose:

           • taps across many data frameworks:
               HBase, Cassandra, MongoDB, etc.
           • support for a variety of data serialization:
               Avro, Thrift, Kryo, JSON, etc.
           • planning on multiple topologies:
               MapReduce, in-memory, tuple spaces, etc.
           • test-driven development (TDD) at scale
           • ANSI SQL-92 integration, PMML, etc.

integration

         This example focuses on the batch workflow,
         to examine best practices for parallel processing.
         Integrating with a mobile app requires further steps:

           • push “reco” output to a Redis cluster
               (caching layer) via a Cascading tap
           • leverage Redis “sorted sets” for ranking
               personalized results
           • create lightweight API in Node.js + Nginx
               for low-latency access at scale
           • collect social interactions in Splunk
           • instrument via Nagios, New Relic, Flurry, etc.
         That provides a data service, but it doesn’t even begin
         to address design, user experience, marketing,
         implementation, etc., for a complete app…

integration

          Batch workflow plus a data service:


          (architecture diagram; labels: web logs, GIS export, gps tracks,
          source taps, Cascading app, Recommender, sink tap, trap tap,
          Support review, Hadoop cluster, Redis cluster, web app, mobile API,
          Customers, customer profile DBs, Customer Prefs, Splunk)




integration

         In terms of deploying a batch workflow,
         there are several considerations:

           • build package for a “fat jar” (lein uberjar)
           • continuous integration
           • JAR repository
           • cluster scheduling (e.g., EMR)
           • instrumentation (Concurrent)
           • troubleshooting from app layer




apps




source: Apple




apps

         We work on discovery, modeling, integration – long before
         coding an app. In a linear-logical sense, one might prefer a “waterfall”
         approach; however, that would undermine core values – mitigating
         Accidental Complexity – TDD, scalability, fault-tolerance, etc.
         In lieu of SQL queries, we define a composable set of logical
         propositions which can be executed, instrumented, tested, etc.,
         independently for best practices at scale in parallel
         Back to functional relational programming, particularly Datalog’s
         logic programming, we use subqueries as logical propositions…
         within a functional context… to leverage the relational model

           • scalability: specify what you require, not how
           • testability: disprove the opposites of propositions, to validate
         Taken together in the context of Cascalog, now let’s build the app…


apps
          (defproject cascading-copa "0.1.0-SNAPSHOT"
            :description "City of Palo Alto Open Data recommender in Cascalog"
            :url "https://github.com/Cascading/CoPA"
            :license {:name "Apache License, Version 2.0"
                       :url "http://www.apache.org/licenses/LICENSE-2.0"
                       :distribution :repo
                     }
            :uberjar-name "copa.jar"
            :aot [copa.core]
            :main copa.core
            :source-paths ["src/main/clj"]
            :dependencies [[org.clojure/clojure "1.4.0"]
                            [cascalog "1.10.0"]
                            [cascalog-more-taps "0.3.1-SNAPSHOT"]
                            [clojure-csv/clojure-csv "1.3.2"]
                            [org.clojars.sunng/geohash "1.0.1"]
                            [org.clojure/clojure-contrib "1.2.0"]
                            [date-clj "1.0.1"]
                            ]
            :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
                        :provided {:dependencies [
                             [org.apache.hadoop/hadoop-core "0.20.2-dev"]
                             ]}}
            )




apps



                                                           (results)



             ‣   addr: 115 HAWTHORNE AVE
             ‣   lat/lng: 37.446, -122.168
             ‣   geohash: 9q9jh0
             ‣   tree: 413 site 2
             ‣   species: Liquidambar styraciflua
             ‣   est. height: 23 m
             ‣   shade metric: 4.363
             ‣   traffic: local residential, light traffic
             ‣   recent visit: 1972376952532
             ‣   a short walk from my train stop ✔



apps


          (flow diagram, for the whole enchilada:
          the tree, road, shade, and gps flows joined into reco)
definitions
         Design principles in the Cascading API pattern language,
         which help ensure best practices for Big Data apps in
         an Enterprise context:
          • specify what is required, not how it must be achieved
           • provide the “glue” for system integration
           • same JAR, any scale
           • users want no surprises
           • fail the same way twice
           • plan far ahead
         These points echo arguments about functional relational
         programming (FRP) and Accidental Complexity
         from Moseley/Marks 2006



systems




source: Wired




principle: same JAR, any scale
                                                                      MegaCorp Enterprise IT:
                                                                      PBs of data
                                                                      1000+ node private cluster
                                                                      EVP calls you when app fails
                                                                      runtime: days+

                                                       Production Cluster:
                                                       TBs of data
                                                       EMR w/ many HPC Instances
                                                       Ops monitors results
                                                       runtime: hours – days

                                   Staging Cluster:
                                   GBs of data
                                   EMR + a few Spot Instances
                                   CI shows red or green lights
                                   runtime: minutes – hours

                Your Laptop:
                MBs of data
                Hadoop standalone mode
                passes unit tests, or not
                runtime: seconds – minutes



systems
          #!/bin/bash -ex
          # edit the `BUCKET` variable to use one of your S3 buckets:
          BUCKET=temp.cascading.org/copa
          SINK=out

          # clear previous output (required by Apache Hadoop)
          s3cmd del -r s3://$BUCKET/$SINK
          # load built JAR + input data
          s3cmd put target/copa.jar s3://$BUCKET/
          s3cmd put -r data s3://$BUCKET/

          # launch cluster and run
          elastic-mapreduce --create --name "CoPA" \
            --debug --enable-debugging --log-uri s3n://$BUCKET/logs \
            --jar s3n://$BUCKET/copa.jar \
            --arg s3n://$BUCKET/data/copa.csv \
            --arg s3n://$BUCKET/data/meta_tree.tsv \
            --arg s3n://$BUCKET/data/meta_road.tsv \
            --arg s3n://$BUCKET/data/gps.csv \
            --arg s3n://$BUCKET/$SINK/trap \
            --arg s3n://$BUCKET/$SINK/park \
            --arg s3n://$BUCKET/$SINK/tree \
            --arg s3n://$BUCKET/$SINK/road \
            --arg s3n://$BUCKET/$SINK/shade \
            --arg s3n://$BUCKET/$SINK/gps \
            --arg s3n://$BUCKET/$SINK/reco

systems

                ‣ name node / data node
                ‣ job tracker / task tracker
                ‣ submit queue
                ‣ task slots
                ‣ HDFS
                ‣ distributed cache

                                                     Wikipedia




                                               (under the hood)
                               Apache


bucket list
Could combine this with a variety of data APIs:
          • Trulia neighborhood data, housing prices
          • Factual local business (FB Places, etc.)
          • CommonCrawl open source full web crawl
          • Wunderground local weather data
          • WalkScore neighborhood data, walkability
          • Data.gov US federal open data
          • Data.NASA.gov NASA open data
          • DBpedia datasets derived from Wikipedia
          • GeoWordNet semantic knowledge base
          • Geolytics demographics, GIS, etc.
          • Foursquare, Yelp, CityGrid, Localeze, YP
          • various photo sharing


Data Quality: some species names have
         spelling errors or misclassifications – could
         be cleaned up and provided back to CoPA
         to improve municipal services

         Assumptions have been made about
         missing data – were these appropriate
         for the intended use case?

         There are better ways to handle spatial
         indexing: k-d trees, etc.

         The tree data product needs: photos,
         toxicity, natives vs. invasives,
         common names, etc.


Arguably, this is not a “large” data set:
          • Palo Alto has 65K population
          • great location for a POC
          • prior to deploying in large metro areas
          • CoPA is a leader in e-gov
          • app is simpler to study on a laptop

         Could extend to other cities with Open Data
         initiatives:
              SF, SJ, PDX, Seattle, VanBC…

         Let’s get coverage for all of Ecotopia!




Trulia: optimize sales leads using estimated
         allergy zones, based on buyers’ real estate
         preferences

         Calflora: report new observations of invasives,
         endangered species, etc.; infer regions of affinity
         for releasing beneficial insects

         City of Palo Alto: assess zoning impact,
         e.g., oleanders near day care centers; monitor
         outbreaks of tree diseases (big impact on
         property values)

         start-ups: some invasive species are valuable
         in Chinese medicine while others can be
         converted to biodiesel – potential win-win
         for targeted harvest services

summary points
                  • geo data is great for municipal infrastructure and for mobile apps
                  • Cascading as a pattern language for Enterprise Data Workflows
                  • design principles in the API/pattern language ensure best practices
                  • focus on the process of structuring data; not un/structured
                  • Cascalog subqueries as composable logical propositions
                  • FRP mitigates the engineering costs of Accidental Complexity
                  • Data Science process: discovery, modeling, integration, apps, systems
                  • Hadoop is rarely used in isolation; breaking down silos is the
                        hard problem, which must be socialized to resolve


references

                leiningen.org
                github.com/nathanmarz/cascalog/wiki
                sritchie.github.com
                vimeo.com/16398892
                manning.com/marz
                java.dzone.com/articles/using-lucene-and-cascalog-fast


references

                by Paco Nathan
                Enterprise Data Workflows
                with Cascading
                O’Reilly, 2013
                amazon.com/dp/1449358721


                Santa Clara, Feb 28, 1:30pm
                strataconf.com/strata2013

drill-down

                 blog, code/wiki/gists, maven repo, community, products:
                 cascading.org
                 github.com/Cascading
                 conjars.org
                 meetup.com/cascading
                 goo.gl/KQtUL
                 concurrentinc.com


                 we are hiring!

  • 11. Generally speaking, we could approach the matter of developing an Open Data app through these steps: • clean up the raw, unstructured data from CoPA download (ETL) • before modeling, perform visualization and analysis in RStudio • spend time on ideation and research for potential use cases • iterate on business process for the app workflow • integrate with use cases represented by the workflow taps • apply best practices and TDD at scale • …PROFIT! source: South Park Monday, 28 January 13 11
  • 12. In terms of actual process used in Data Science, here’s how my teams have worked: • discovery: help people ask the right questions • modeling: allow automation to place informed bets • integration: deliver products at scale to customers • apps: build smarts into product features • systems: keep infrastructure running, cost-effective
  • 13. For the process used with this Open Data app, we chose to use Cascalog (by Nathan Marz, Sam Ritchie, et al., 2010), a DSL in Clojure which implements Datalog, backed by Cascading. Some aspects of CS theory: • Functional Relational Programming • mitigates Accidental Complexity • has been compared with Codd 1969. github.com/nathanmarz/cascalog/wiki
  • 14. Q: Who uses Cascalog, other than Twitter? A: • Climate Corp (they’re hiring, ask for Crea) • Factual • Nokia Maps • Harvard School of Public Health • YieldBot (PDX) • uSwitch (London) • etc.
  • 15. pro: • 10:1 reduction in code volume compared to SQL • most advanced uses of Cascading • Leiningen build: simple, no surprises, in Clojure itself • test-driven development (TDD) for Big Data • fault-tolerant workflows which are simple to follow • machine learning, map-reduce, etc., started in LISP years ago anywho. con: • learning curve, limited number of Clojure developers • aggregators are the magic, those take effort to learn
  • 16. Accidental Complexity: not O(N^2) complexity, but the costs of software engineering at scale over time. What happens when you build recommenders, then go work on other projects for six months? What does it cost others to maintain your apps? Cascalog allows for leveraging the same framework, same code base, from Discovery phase through to Systems phase. It focuses on the process of structuring data: specify what you require, not how it must be achieved. Huge implications for software engineering.
  • 17. discovery (image source: 2001: A Space Odyssey)
  • 18. discovery: The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates. This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good. paloalto.opendata.junar.com/dashboards/7576/geographic-information/
  • 19. discovery: GIS about trees in Palo Alto:
  • 20. discovery: GIS about roads in Palo Alto:
  • 21. discovery (raw GIS export sample):
      Geographic_Information,,,
      "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
      "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: Thickness: 2.0 6.0 Base Type Pvmt: Soil Class: 2 crusher run base Soil Value: 15 Base Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0
      (um, bokay…)
  • 22. discovery
      (defn parse-gis
        "leverages parse-csv for complex CSV format in GIS export"
        [line]
        (first (csv/parse-csv line)))

      (defn etl-gis
        "subquery to parse data sets from the GIS source tap"
        [gis trap]
        (<- [?blurb ?misc ?geo ?kind]
            (gis ?line)
            (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
            (:trap (hfs-textline trap))))

      (specify what you require, not how to achieve it… addressing the 80%)
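The parsing step above can be illustrated outside of Cascalog. The following is a minimal Python sketch (not the deck's Clojure code) of why a real CSV parser is needed for the GIS export: the first quoted field contains a comma, so a naive split on commas would break it. The name `parse_gis` mirrors the slide, and the `sample` line is an abbreviated, hand-shortened version of the tree record shown on slide 21; both are assumptions for illustration.

```python
import csv

def parse_gis(line):
    """Split one GIS export line into its CSV fields, honoring quoted commas."""
    return next(csv.reader([line]))

# Abbreviated tree record (shortened from slide 21 for illustration only).
sample = ('"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl",'
          '" Private: -1 Tree ID: 29",'
          '"37.4409634615283,-122.15648458861,0.0","Point"')

blurb, misc, geo, kind = parse_gis(sample)
print(kind)
```

The four unpacked names correspond to the `?blurb ?misc ?geo ?kind` fields in the subquery.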
  • 23. discovery (convert ad-hoc queries into logical propositions)
  • 24. discovery: Identifier: 474 Tree ID: 412 Tree: 412 site 1 at 115 HAWTHORNE AV Tree Site: 1 Street_Name: HAWTHORNE AV Situs Number: 115 Private: -1 Species: Liquidambar styraciflua Source: davey tree Hardscape: None 37.446001565119,-122.167713417554,0.0 Point (obtain recognizable results)
  • 25. discovery (curate valuable metadata)
  • 26. discovery
      (defn get-trees
        "subquery to parse/filter the tree data"
        [src trap tree_meta]
        (<- [?blurb ?tree_id ?situs ?tree_site
             ?species ?wikipedia ?calflora ?avg_height
             ?tree_lat ?tree_lng ?tree_alt ?geohash]
            (src ?blurb ?misc ?geo ?kind)
            (re-matches #"^\s+Private.*Tree ID.*" ?misc)
            (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
            ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
            (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
            (avg ?min_height ?max_height :> ?avg_height)
            (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
            (read-string ?tree_lat :> ?lat)
            (read-string ?tree_lng :> ?lng)
            (geohash ?lat ?lng :> ?geohash)
            (:trap (hfs-textline trap))))
  • 27. discovery
      ?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
      ?tree_id     412
      ?situs       115
      ?tree_site   1
      ?species     liquidambar styraciflua
      ?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
      ?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
      ?avg_height  27.5
      ?tree_lat    37.446001565119
      ?tree_lng    -122.167713417554
      ?tree_alt    0.0
      ?geohash     9q9jh0
      (et voilà, a data product)
  • 28. discovery
      # run some analysis and visualization in R
      library(ggplot2)

      dat_folder <- '~/src/concur/CoPA/out/tree'
      data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
                         sep="\t", quote="", na.strings="NULL",
                         header=FALSE, encoding="UTF8")

      summary(data)

      # top 20 species by count
      t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
      trees <- as.data.frame.table(t)
      colnames(trees) <- c("species", "count")

      m <- ggplot(data, aes(x=V8))
      m <- m + ggtitle("Estimated Tree Height (meters)")
      m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

      par(mar = c(7, 4, 4, 2) + 0.1)
      plot(trees, xaxt="n", xlab="")
      axis(1, labels=FALSE)
      text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
           labels=trees$species, xpd=TRUE)
      grid(nx=nrow(trees))
  • 29. discovery [plot, annotated: sweetgum]
  • 30. discovery (flow diagram, gis tree) [diagram labels: GIS export, Regex parse-gis, src, Regex parse-tree, tree, Scrub species, Tree Metadata, Join, Estimate height, Geohash, Failure Traps]
  • 31. definitions: The conceptual flow diagram shows a directed, acyclic graph (DAG) of taps, tuple streams, functions, joins, aggregations, assertions, etc. Cascading is formally a pattern language: patterns of “plumbing” fit together to ensure best practices for large-scale parallel processing in risk-averse environments, which are hard requirements of Enterprise IT. In other words, Cascading forces functional programming through an API for JVM-based languages such as Java, Scala, Clojure. Through this approach, we define Enterprise Data Workflows.
  • 32. definitions: pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices (amazon.com/dp/0195019199). design patterns: originated in consensus negotiation for architecture, later used in OOP software engineering (amazon.com/dp/0201633612).
  • 33. discovery
      (defn get-roads
        "subquery to parse/filter the road data"
        [src trap road_meta]
        (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo
             ?min_lat ?min_lng ?min_alt ?geohash
             ?traffic_count ?traffic_index ?traffic_class
             ?paving_length ?paving_width ?paving_area ?surface_type]
            (src ?blurb ?misc ?geo ?kind)
            (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
            (parse-road ?misc :> _
                        ?traffic_count ?traffic_index ?traffic_class
                        ?paving_length ?paving_width ?paving_area ?surface_type
                        ?overlay_year ?bike_lane ?bus_route ?truck_route)
            (road_meta ?surface_type ?albedo_new ?albedo_worn)
            (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
            (bigram ?geo :> ?pt0 ?pt1)
            (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
            ;; why filter for min? because there are geo duplicates..
            (c/min ?lat :> ?min_lat)
            (c/min ?lng :> ?min_lng)
            (c/min ?alt :> ?min_alt)
            (geohash ?min_lat ?min_lng :> ?geohash)
            (:trap (hfs-textline trap))))
  • 34. discovery
      ?blurb          Hawthorne Avenue from Alma Street to High Street
      ?traffic_count  3110
      ?traffic_class  local residential
      ?surface_type   asphalt concrete
      ?albedo         0.12
      ?min_lat        37.446140860599854
      ?min_lng        -122.1674652295435
      ?min_alt        0.0
      ?geohash        9q9jh0
      (another data product)
  • 35. discovery: The road data provides: • traffic class (arterial, truck route, residential, etc.) • traffic counts distribution • surface type (asphalt, cement; age). This leads to estimators for noise, reflection, etc.
  • 36. discovery (flow diagram, gis road) [diagram labels: GIS export, Regex parse-gis, src, Regex parse-road, road, Road Metadata, Join, Estimate Albedo, Road Segments, Geohash, Failure Traps]
  • 37. modeling (image source: America’s Next Top Model)
  • 38. modeling: GIS data from Palo Alto provides us with geolocation about each item in the export: latitude, longitude, altitude. Geo data is great for managing municipal infrastructure as well as for mobile apps. Predictive modeling in our Open Data example focuses on leveraging geolocation. We use spatial indexing by creating a grid of geohash values, for efficient parallel processing. Cascalog queries collect items with the same geohash values, using them as keys for large-scale joins (Hadoop).
  • 39. modeling: a geohash with 6-digit resolution approximates a 5-block square; e.g., 9q9jh0 is centered near lat: 37.445, lng: -122.162
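To make the geohash grid concrete, here is a minimal Python sketch of the standard geohash encoding (alternating longitude/latitude bisection, five bits per base-32 character). It is a standalone illustration, not the geohash library the project depends on; the function name `geohash_encode` is an assumption. The input coordinates are the tree location from slide 27.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    bits = []
    use_lng = True  # geohash interleaves bits, starting with longitude
    while len(bits) < precision * 5:
        rng, value = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if value >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lng = not use_lng
    chars = []
    for i in range(0, len(bits), 5):
        index = 0
        for bit in bits[i:i + 5]:
            index = (index << 1) | bit
        chars.append(BASE32[index])
    return "".join(chars)

print(geohash_encode(37.446001565119, -122.167713417554))
```

Nearby points share a prefix, which is what lets the 6-character value act as a join key for everything inside one grid cell.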
  • 40. modeling: Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:
      " -122.161776959558,37.4518836690781,0.0
      " -122.161390381489,37.4516410983794,0.0
      " -122.160786011735,37.4512589903357,0.0
      " -122.160531178368,37.4510977281699,0.0
      [diagram: segment endpoints labeled ( lat0, lng0, alt0 ) through ( lat3, lng3, alt3 )]
      NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
  • 41. modeling: Our app analyzes each road segment as a data tuple, calculating the center point for each: ( lat, lng, alt )
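As an illustration of the center-point step, the following Python sketch parses two raw coordinate strings from slide 40, reorders them from (lng, lat, alt) to (lat, lng, alt), and averages the endpoints. The helper names `parse_point` and `midpoint` are hypothetical, and a plain arithmetic mean is an assumption; the deck does not show how its own `midpoint` function is implemented.

```python
def parse_point(raw):
    """Parse a raw 'lng,lat,alt' string and return (lat, lng, alt)."""
    lng, lat, alt = (float(x) for x in raw.strip().split(","))
    return (lat, lng, alt)

def midpoint(pt0, pt1):
    """Component-wise mean of two (lat, lng, alt) points (assumed method)."""
    return tuple((a + b) / 2.0 for a, b in zip(pt0, pt1))

pt0 = parse_point(" -122.161776959558,37.4518836690781,0.0 ")
pt1 = parse_point(" -122.161390381489,37.4516410983794,0.0 ")
center = midpoint(pt0, pt1)
print(center)
```

Over distances of a block or so, averaging latitude and longitude directly is a reasonable approximation of the geographic midpoint.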
  • 42. modeling: Then uses a geohash to define a grid cell, as a boundary (or “canopy”): 9q9jh0
  • 43. modeling: Query to join a road segment tuple with all the trees within its geohash boundary: 9q9jh0
  • 44. modeling: Use distance-to-midpoint to filter trees which are too far away to provide shade
  • 45. modeling: Calculate a sum of moments for tree height × distance from road segment, as an estimator for shade: ∑( h·d ). We also calculate estimators for traffic frequency and noise.
  • 46. modeling
      (defn get-shade
        "subquery to join tree and road estimates, maximize for shade"
        [trees roads]
        (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt ?road_metric ?tree_metric]
            (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt ?geohash
                   ?traffic_count _ ?traffic_class _ _ _ _)
            (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
            (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
            (read-string ?avg_height :> ?height)
            ;; limit to trees which are higher than people
            (> ?height 2.0)
            (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
            ;; limit to trees within a one-block radius (not meters)
            (<= ?distance 25.0)
            (/ ?height ?distance :> ?tree_moment)
            (c/sum ?tree_moment :> ?sum_tree_moment)
            ;; magic number 200000.0 used to scale tree moment
            ;; based on median
            (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
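The arithmetic in this query can be traced with a small Python sketch. It follows the query as written: keep trees taller than 2.0, keep trees within a distance of 25.0, sum height divided by distance, then divide by 200000.0. Note that the query divides height by distance, whereas the preceding slide writes the moment as a product. The function name `tree_metric`, the input format, and the guard against a zero distance are assumptions added for illustration; the deck's `tree-distance` function is not shown, so distances are taken as given.

```python
def tree_metric(trees, min_height=2.0, max_distance=25.0, scale=200000.0):
    """Shade estimator for one road segment.

    trees: iterable of (height, distance) pairs, distance in the same
    (non-meter) units the query uses.
    """
    total = 0.0
    for height, distance in trees:
        # zero-distance guard is an addition, not part of the original query
        if height > min_height and 0.0 < distance <= max_distance:
            total += height / distance
    return total / scale

# Hypothetical trees: two qualify, one is too short, one is too far away.
sample = [(10.0, 5.0), (1.5, 1.0), (20.0, 30.0), (8.0, 4.0)]
print(tree_metric(sample))
```

With height over distance, tall trees close to the segment midpoint contribute the most to the metric.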
  • 47. modeling
      ?road_name    Hawthorne Avenue from Alma Street to High Street
      ?geohash      9q9jh0
      ?road_lat     37.446140860599854
      ?road_lng     -122.1674652295435
      ?road_alt     0.0
      ?road_metric  [1.0 0.5488121277250486 0.88]
      ?tree_metric  4.36321007861036
      (another data product)
  • 48. modeling (flow diagram, shade) [diagram labels: tree, road, Filter height, Join, Calculate distance, Filter distance, Sum moment, Filter sum_moment, Estimate traffic, shade]
  • 51. modeling
      (defn get-gps
        "subquery to aggregate and rank GPS tracks per user"
        [gps_logs trap]
        (<- [?uuid ?geohash ?gps_count ?recent_visit]
            (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading ?elapsed ?distance)
            (read-string ?gps_lat :> ?lat)
            (read-string ?gps_lng :> ?lng)
            (geohash ?lat ?lng :> ?geohash)
            (c/count :> ?gps_count)
            (date-num ?date :> ?visit)
            (c/max ?visit :> ?recent_visit)))

      (behavioral targeting: aggregate GPS tracks by recency, frequency)
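The aggregation in this query groups GPS points by user and grid cell, then keeps a count (frequency) and the latest timestamp (recency). A minimal Python sketch of the same grouping follows; the function name `aggregate_gps` and the sample rows are hypothetical, and the rows are assumed to already carry a geohash and a numeric visit time.

```python
def aggregate_gps(rows):
    """Group (uuid, geohash, visit) rows by (uuid, geohash).

    Returns {(uuid, geohash): (gps_count, recent_visit)}.
    """
    stats = {}
    for uuid, cell, visit in rows:
        count, recent = stats.get((uuid, cell), (0, visit))
        stats[(uuid, cell)] = (count + 1, max(recent, visit))
    return stats

# Hypothetical GPS rows: (user id, geohash cell, numeric timestamp).
rows = [
    ("user-a", "9q9jh0", 100),
    ("user-a", "9q9jh0", 250),
    ("user-a", "9q9hv3", 175),
    ("user-b", "9q9jh0", 300),
]
print(aggregate_gps(rows))
```

In Cascalog the grouping is implicit: the non-aggregated output variables (`?uuid`, `?geohash`) act as the group-by keys.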
  • 52. modeling (flow diagram, gps) [diagram labels: gps logs, Geohash, Count gps_count, Max recent_visit, gps]
  • 53. modeling (GPS personalization)
      ?uuid                             ?geohash  ?gps_count  ?recent_visit
      cf660e041e994929b37cc5645209c8ae  9q8yym    7           1972376866448
      342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
      32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv6    1           1972376691180
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv8    18          1972376691028
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv9    7           1972376691101
      342ac6fd3f5f44c6b97724d618d587cf  9q9hvb    22          1972376691010
      342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
      342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
      482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
      b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348
  • 54. modeling
      (defn get-reco
        "subquery to recommend road segments based on GPS tracks"
        [tracks shades]
        (<- [?uuid ?road ?geohash ?lat ?lng ?alt
             ?gps_count ?recent_visit ?road_metric ?tree_metric]
            (tracks ?uuid ?geohash ?gps_count ?recent_visit)
            (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))

      (finally, the recommender)
  • 55. modeling: Recommenders combine multiple signals, generally via weighted averages, to rank personalized results: • GPS of person ∩ road segment • frequency and recency of visit • traffic class and rate • road albedo (sunlight reflection) • tree shade estimator. Adjusting the mix allows for further personalization at the end use.
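A weighted average over signals can be sketched in a few lines of Python. This is an illustration of the general idea only: the signal names, the assumption that each signal is already normalized to the range 0 to 1, and the weights are all hypothetical, and the deck does not show how the app actually mixes its metrics.

```python
def reco_score(signals, weights):
    """Weighted average of named signals; both arguments are dicts keyed by signal name."""
    total_weight = sum(weights[name] for name in signals)
    weighted_sum = sum(weights[name] * value for name, value in signals.items())
    return weighted_sum / total_weight

# Hypothetical, pre-normalized signals for one (user, road segment) pair.
signals = {"gps_frequency": 0.5, "tree_shade": 1.0}
# Hypothetical weights; changing these is the "adjusting the mix" step.
weights = {"gps_frequency": 1.0, "tree_shade": 3.0}

print(reco_score(signals, weights))
```

Ranking the road segments for a user then amounts to sorting them by this score in descending order.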
  • 57. integration: Hadoop is rarely ever used in isolation. System integration is a hard problem in Big Data, especially social aspects: breaking down silos. Cascading was built for this purpose: • taps across many data frameworks: HBase, Cassandra, MongoDB, etc. • support for a variety of data serialization: Avro, Thrift, Kryo, JSON, etc. • planning on multiple topologies: MapReduce, in-memory, tuple spaces, etc. • test-driven development (TDD) at scale • ANSI SQL-92 integration, PMML, etc.
  • 58. integration: This example focuses on the batch workflow to examine best practices for parallel processing. Integrating with a mobile app requires next steps: • push “reco” output to a Redis cluster (caching layer) via a Cascading tap • leverage Redis “sorted sets” for ranking personalized results • create lightweight API in Node.js + Nginx for low-latency access at scale • collect social interactions in Splunk • instrument via Nagios, New Relic, Flurry, etc. That provides a data service; it doesn’t even begin to address design, user experience, marketing, implementation, etc., for a complete app…
  • 59. integration: Batch workflow plus a data service: [architecture diagram; labels include: web logs, GIS export, gps tracks, customer profile DBs, Customer Prefs, Cascading app (source taps, sink tap, trap tap), Recommender, Hadoop cluster, Redis cluster, API, web app, mobile app, Customers, Support, review, Splunk]
  • 60. integration: In terms of deploying a batch workflow, there are several considerations: • build package for a “fat jar” (lein uberjar) • continuous integration • JAR repository • cluster scheduling (e.g., EMR) • instrumentation (Concurrent) • troubleshooting from app layer
  • 62. apps: We work on discovery, modeling, integration – long before coding an app. In a linear-logical sense, one might prefer a “waterfall” approach; however, that would undermine core values: mitigating Accidental Complexity, TDD, scalability, fault-tolerance, etc. In lieu of SQL queries, we define a composable set of logical propositions which can be executed, instrumented, tested, etc., independently for best practices at scale in parallel. Back to functional relational programming, particularly Datalog’s logic programming: we use subqueries as logical propositions… within a functional context… to leverage the relational model. • scalability: specify what you require, not how • testability: disprove the opposites of propositions, to validate. Taken together in the context of Cascalog, now let’s build the app…
  • 63. apps
      (defproject cascading-copa "0.1.0-SNAPSHOT"
        :description "City of Palo Alto Open Data recommender in Cascalog"
        :url "https://github.com/Cascading/CoPA"
        :license {:name "Apache License, Version 2.0"
                  :url "http://www.apache.org/licenses/LICENSE-2.0"
                  :distribution :repo}
        :uberjar-name "copa.jar"
        :aot [copa.core]
        :main copa.core
        :source-paths ["src/main/clj"]
        :dependencies [[org.clojure/clojure "1.4.0"]
                       [cascalog "1.10.0"]
                       [cascalog-more-taps "0.3.1-SNAPSHOT"]
                       [clojure-csv/clojure-csv "1.3.2"]
                       [org.clojars.sunng/geohash "1.0.1"]
                       [org.clojure/clojure-contrib "1.2.0"]
                       [date-clj "1.0.1"]]
        :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
                   :provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}})
  • 65. apps (results) ‣ addr: 115 HAWTHORNE AVE ‣ lat/lng: 37.446, -122.168 ‣ geohash: 9q9jh0 ‣ tree: 413 site 2 ‣ species: Liquidambar styraciflua ‣ est. height: 23 m ‣ shade metric: 4.363 ‣ traffic: local residential, light traffic ‣ recent visit: 1972376952532 ‣ a short walk from my train stop ✔
  • 66. apps (flow diagram, for the whole enchilada) [diagram combining the gis tree, gis road, shade, and gps flows shown earlier, joined into reco]
  • 67. definitions: Design principles in the Cascading API pattern language, which help ensure best practices for Big Data apps in an Enterprise context: • specify what is required, not how it must be achieved • provide the “glue” for system integration • same JAR, any scale • users want no surprises • fail the same way twice • plan far ahead. These points echo arguments about functional relational programming (FRP) and Accidental Complexity from Moseley/Marks 2006.
  • 69. principle: same JAR, any scale
      MegaCorp Enterprise IT: Pb’s data; 1000+ node private cluster; EVP calls you when app fails; runtime: days+
      Production Cluster: Tb’s data; EMR w/ many HPC Instances; Ops monitors results; runtime: hours – days
      Staging Cluster: Gb’s data; EMR + a few Spot Instances; CI shows red or green lights; runtime: minutes – hours
      Your Laptop: Mb’s data; Hadoop standalone mode; passes unit tests, or not; runtime: seconds – minutes
  • 70. systems
      #!/bin/bash -ex
      # edit the `BUCKET` variable to use one of your S3 buckets:
      BUCKET=temp.cascading.org/copa
      SINK=out

      # clear previous output (required by Apache Hadoop)
      s3cmd del -r s3://$BUCKET/$SINK

      # load built JAR + input data
      s3cmd put target/copa.jar s3://$BUCKET/
      s3cmd put -r data s3://$BUCKET/

      # launch cluster and run
      elastic-mapreduce --create --name "CoPA" \
        --debug \
        --enable-debugging \
        --log-uri s3n://$BUCKET/logs \
        --jar s3n://$BUCKET/copa.jar \
        --arg s3n://$BUCKET/data/copa.csv \
        --arg s3n://$BUCKET/data/meta_tree.tsv \
        --arg s3n://$BUCKET/data/meta_road.tsv \
        --arg s3n://$BUCKET/data/gps.csv \
        --arg s3n://$BUCKET/$SINK/trap \
        --arg s3n://$BUCKET/$SINK/park \
        --arg s3n://$BUCKET/$SINK/tree \
        --arg s3n://$BUCKET/$SINK/road \
        --arg s3n://$BUCKET/$SINK/shade \
        --arg s3n://$BUCKET/$SINK/gps \
        --arg s3n://$BUCKET/$SINK/reco
  • 72. systems (under the hood) ‣ name node / data node ‣ job tracker / task tracker ‣ submit queue ‣ task slots ‣ HDFS ‣ distributed cache (image sources: Wikipedia, Apache)
  • 73. bucket list
  • 74. Could combine this with a variety of data APIs: • Trulia: neighborhood data, housing prices • Factual: local business (FB Places, etc.) • CommonCrawl: open source full web crawl • Wunderground: local weather data • WalkScore: neighborhood data, walkability • Data.gov: US federal open data • Data.NASA.gov: NASA open data • DBpedia: datasets derived from Wikipedia • GeoWordNet: semantic knowledge base • Geolytics: demographics, GIS, etc. • Foursquare, Yelp, CityGrid, Localeze, YP • various photo sharing
  • 75. Data Quality: some species names have spelling errors or misclassifications, which could be cleaned up and provided back to CoPA to improve municipal services. Assumptions have been made about missing data: were these appropriate for the intended use case? There are better ways to handle spatial indexing: k-d trees, etc. The tree data product needs: photos, toxicity, natives vs. invasives, common names, etc.
  • 76. Arguably, this is not a “large” data set: • Palo Alto has 65K population • great location for a POC • prior to deploying in large metro areas • CoPA is a leader in e-gov • app is simpler to study on a laptop. Could extend to other cities with Open Data initiatives: SF, SJ, PDX, Seattle, VanBC… Let’s get coverage for all of Ecotopia!
  • 77. Trulia: optimize sales leads using estimated allergy zones, based on buyers’ real estate preferences. Calflora: report new observations of invasives, endangered species, etc.; infer regions of affinity for releasing beneficial insects. City of Palo Alto: assess zoning impact, e.g., oleanders near day care centers; monitor outbreaks of tree diseases (big impact on property values). start-ups: some invasive species are valuable in Chinese medicine while others can be converted to biodiesel, a potential win-win for targeted harvest services.
  • 78. summary points • geo data is great for municipal infrastructure and for mobile apps • Cascading as a pattern language for Enterprise Data Workflows • design principles in the API/pattern language ensure best practices • focus on the process of structuring data; not un/structured • Cascalog subqueries as composable logical propositions • FRP mitigates the engineering costs of Accidental Complexity • Data Science process: discovery, modeling, integration, apps, systems • Hadoop is rarely ever used in isolation; breaking down silos is the hard problem, which must be socialized to resolve
  • 79. references: leiningen.org; github.com/nathanmarz/cascalog/wiki; sritchie.github.com; vimeo.com/16398892; manning.com/marz; java.dzone.com/articles/using-lucene-and-cascalog-fast
  • 80. references by Paco Nathan: Enterprise Data Workflows with Cascading, O’Reilly, 2013, amazon.com/dp/1449358721; Santa Clara, Feb 28, 1:30pm, strataconf.com/strata2013
  • 81. drill-down: blog, code/wiki/gists, maven repo, community, products: cascading.org; github.com/Cascading; conjars.org; meetup.com/cascading; goo.gl/KQtUL; concurrentinc.com. We are hiring! Copyright @2013, Concurrent, Inc.