Slides for Open Data Bay Area meetup on 2013-01-29 in SF: http://www.meetup.com/Open-Data-Bay-Area/events/98445822/

Using Cascalog to build
 an app based on City of Palo Alto Open Data

  1. “Using Cascalog to build an app based on City of Palo Alto Open Data”
     Paco Nathan, Concurrent, Inc., San Francisco, CA, @pacoid
     Copyright ©2013, Concurrent, Inc.
     Monday, 28 January 13
     (background figure: Word Count flow diagram)
  2. This project began as a machine learning workshop for a graduate seminar at CMU West
     Many thanks to:
     Stuart Evans, CMU Distinguished Service Professor
     Jonathan Reichental, City of Palo Alto CIO
     We use Cascalog to develop a Big Data workflow
     Open Source: github.com/Cascading/CoPA/wiki
  3. Palo Alto is generally quite a pleasant place
     • temperate weather
     • lots of parks, enormous trees
     • great coffeehouses
     • walkable downtown
     • not particularly crowded
     • friendly VCs (sort of)
     On a nice summer day, who wants to be stuck indoors on a phone call?
     Instead, take it outside – go for a walk
  4. Surely, there must be an app for that…
     But wait, there isn’t?
     So let’s build one!
     source: Apple
  5. process
     source: algaelab.org
  6. 1. unstructured data about municipal infrastructure (GIS data: trees, roads, parks)
     ✚
     2. unstructured data about where people like to walk (smartphone GPS logs)
     ✚
     3. a wee bit o’ curated metadata
     4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.”
     (background figure: Word Count flow diagram)
  7. “unstructured” vs. “structured” data is actually quite a Big Debate
     refer back to Edgar Codd 1969 to learn about the Relational Model
     relational != SQL
     but I digress…
  8. Data Science work must focus on the process of structuring data, which must occur long before the large-scale joins, predictive models, visualizations, etc.
     So, the process of structuring data is what we examine here: i.e., how to build workflows for Big Data
     thank you Dr. Codd
     “A relational model of data for large shared data banks”
     dl.acm.org/citation.cfm?id=362685
  9. references
     by DJ Patil
     Data Jujitsu, O’Reilly, 2012, amazon.com/dp/B008HMN5BE
     Building Data Science Teams, O’Reilly, 2011, amazon.com/dp/B005O4U3ZE
  10. references
      by Leo Breiman
      Statistical Modeling: The Two Cultures, Statistical Science, 2001, bit.ly/eUTh9L
      also check out RStudio: rstudio.org/ rpubs.com/
  11. Generally speaking, we could approach the matter of developing an Open Data app through these steps:
      • clean up the raw, unstructured data from CoPA download (ETL)
      • before modeling, perform visualization and analysis in RStudio
      • spend time on ideation and research for potential use cases
      • iterate on business process for the app workflow
      • integrate with use cases represented by the workflow taps
      • apply best practices and TDD at scale
      • …PROFIT!
      source: South Park
  12. In terms of actual process used in Data Science, here’s how my teams have worked:
      • discovery: help people ask the right questions
      • modeling: allow automation to place informed bets
      • integration: deliver products at scale to customers
      • apps: build smarts into product features
      • systems: keep infrastructure running, cost-effective
      (background figure: mirrored text from an event-flow diagram, not recoverable)
  13. For the process used with this Open Data app, we chose to use Cascalog
      by Nathan Marz, Sam Ritchie, et al., 2010
      a DSL in Clojure which implements Datalog, backed by Cascading
      Some aspects of CS theory:
      • Functional Relational Programming
      • mitigates Accidental Complexity
      • has been compared with Codd 1969
      github.com/nathanmarz/cascalog/wiki
  14. Q: Who uses Cascalog, other than Twitter?
      A:
      • Climate Corp (they’re hiring, ask for Crea)
      • Factual
      • Nokia Maps
      • Harvard School of Public Health
      • YieldBot (PDX)
      • uSwitch (London)
      • etc.
  15. pro:
      • 10:1 reduction in code volume compared to SQL
      • most advanced uses of Cascading
      • Leiningen build: simple, no surprises, in Clojure itself
      • test-driven development (TDD) for Big Data
      • fault-tolerant workflows which are simple to follow
      • machine learning, map-reduce, etc., started in LISP years ago anywho
      con:
      • learning curve, limited number of Clojure developers
      • aggregators are the magic, those take effort to learn
  16. Accidental Complexity:
      Not O(N^2) complexity, but the costs of software engineering at scale over time
      What happens when you build recommenders, then go work on other projects for six months? What does it cost others to maintain your apps?
      Cascalog allows for leveraging the same framework, same code base, from Discovery phase through to Systems phase
      It focuses on the process of structuring data: specify what you require, not how it must be achieved
      Huge implications for software engineering
  17. discovery
      source: 2001: A Space Odyssey
  18. discovery
      The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates
      This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good
      paloalto.opendata.junar.com/dashboards/7576/geographic-information/
  19. discovery
      GIS about trees in Palo Alto:
  20. discovery
      GIS about roads in Palo Alto:
  21. discovery
      Geographic_Information,,,

      "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl","
      Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"

      "Wilkie Way from West Meadow Drive to Victoria Place","
      Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: Thickness: 2.0 6.0 (um, bokay…) Base Type Pvmt: Soil Class: 2 crusher run base Soil Value: 15 Base Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0
  22. discovery
      (defn parse-gis
        "leverages parse-csv for complex CSV format in GIS export"
        [line]
        ;; parse-csv returns a sequence of rows; each input line holds one row
        (first (csv/parse-csv line)))

      (defn etl-gis
        "subquery to parse data sets from the GIS source tap"
        [gis trap]
        (<- [?blurb ?misc ?geo ?kind]
            (gis ?line)
            (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
            ;; tuples which fail to parse are diverted to the trap tap
            (:trap (hfs-textline trap))))

      (specify what you require, not how to achieve it… addressing the 80%)
  23. discovery
      (convert ad-hoc queries into logical propositions)
  24. discovery
      Identifier: 474
      Tree ID: 412
      Tree: 412 site 1 at 115 HAWTHORNE AV
      Tree Site: 1
      Street_Name: HAWTHORNE AV
      Situs Number: 115
      Private: -1
      Species: Liquidambar styraciflua
      Source: davey tree
      Hardscape: None
      37.446001565119,-122.167713417554,0.0
      Point
      (obtain recognizable results)
  25. discovery
      (curate valuable metadata)
  26. discovery
      (defn get-trees
        "subquery to parse/filter the tree data"
        [src trap tree_meta]
        (<- [?blurb ?tree_id ?situs ?tree_site
             ?species ?wikipedia ?calflora ?avg_height
             ?tree_lat ?tree_lng ?tree_alt ?geohash]
            (src ?blurb ?misc ?geo ?kind)
            ;; keep only tree records
            (re-matches #"^\s+Private.*Tree ID.*" ?misc)
            (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
            ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
            ;; join with the curated species metadata
            (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
            (avg ?min_height ?max_height :> ?avg_height)
            (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
            (read-string ?tree_lat :> ?lat)
            (read-string ?tree_lng :> ?lng)
            (geohash ?lat ?lng :> ?geohash)
            (:trap (hfs-textline trap))))
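The query on slide 26 calls `parse-tree` on the `?misc` field without showing its body. A minimal sketch of what such a helper could look like, assuming the field is a single-line run of `Key: value` pairs in the order of the sample tree record on slide 21. The pattern, the `scrub-species` name, and the returned vector shape are illustrative assumptions, not the CoPA project's implementation.

```clojure
(require '[clojure.string :as s])

;; Hypothetical pattern: field order follows the sample tree record on slide 21.
(def tree-pattern
  #"^\s+Private:\s+(\S+)\s+Tree ID:\s+(\d+)\s+.*Situs Number:\s+(\d+)\s+Tree Site:\s+(\d+)\s+Species:\s+(\S.*?)\s+Source:.*")

(defn parse-tree
  "Returns [whole-match priv tree-id situs tree-site raw-species],
   or nil when the field is not a tree record."
  [misc]
  (re-matches tree-pattern misc))

(defn scrub-species
  "Normalizes a species name the same way the query does: trim, then lower-case."
  [raw]
  (s/lower-case (s/trim raw)))

;; Shortened copy of the tree record from slide 21.
(def sample-misc
  " Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage:")

(let [[_ priv tree-id situs tree-site raw-species] (parse-tree sample-misc)]
  (println priv tree-id situs tree-site (scrub-species raw-species)))
```

The leading element of the match vector is the whole matched string, which is why the query binds it to `_`. A road record returns nil here, so it would drop out of the query.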
  27. discovery
      ?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
      ?tree_id     412
      ?situs       115
      ?tree_site   1
      ?species     liquidambar styraciflua
      ?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
      ?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
      ?avg_height  27.5
      ?tree_lat    37.446001565119
      ?tree_lng    -122.167713417554
      ?tree_alt    0.0
      ?geohash     9q9jh0
      (et voilà, a data product)
  28. discovery
      # run some analysis and visualization in R
      library(ggplot2)

      dat_folder <- '~/src/concur/CoPA/out/tree'
      data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
                         sep="\t", quote="", na.strings="NULL",
                         header=FALSE, encoding="UTF8")

      summary(data)
      # the 20 most common species
      t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
      trees <- as.data.frame.table(t)
      colnames(trees) <- c("species", "count")

      m <- ggplot(data, aes(x=V8))
      m <- m + ggtitle("Estimated Tree Height (meters)")
      m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

      par(mar = c(7, 4, 4, 2) + 0.1)
      plot(trees, xaxt="n", xlab="")
      axis(1, labels=FALSE)
      text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
           labels=trees$species, xpd=TRUE)
      grid(nx=nrow(trees))
  29. discovery
      sweetgum
  30. discovery
      (flow diagram, gis tree; nodes: GIS export, src, parse-gis (Regex), parse-tree (Regex), Scrub species, Tree Metadata, Join, Estimate height, Geohash, tree, Failure Traps)
  31. definitions
      The conceptual flow diagram shows a directed, acyclic graph (DAG) of taps, tuple streams, functions, joins, aggregations, assertions, etc.
      Cascading is formally a pattern language – patterns of “plumbing” fit together to ensure best practices for large-scale parallel processing in risk-averse environments – hard requirements of Enterprise IT
      In other words, Cascading forces functional programming through an API for JVM-based languages such as Java, Scala, Clojure
      Through this approach, we define Enterprise Data Workflows
  32. definitions
      pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices
      amazon.com/dp/0195019199
      design patterns: originated in consensus negotiation for architecture, later used in OOP software engineering
      amazon.com/dp/0201633612
  33. discovery
      (defn get-roads
        "subquery to parse/filter the road data"
        [src trap road_meta]
        (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo
             ?min_lat ?min_lng ?min_alt ?geohash
             ?traffic_count ?traffic_index ?traffic_class
             ?paving_length ?paving_width ?paving_area ?surface_type]
            (src ?blurb ?misc ?geo ?kind)
            ;; keep only road records
            (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
            (parse-road ?misc :> _
                        ?traffic_count ?traffic_index ?traffic_class
                        ?paving_length ?paving_width ?paving_area ?surface_type
                        ?overlay_year ?bike_lane ?bus_route ?truck_route)
            ;; join with the curated road surface metadata
            (road_meta ?surface_type ?albedo_new ?albedo_worn)
            (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
            (bigram ?geo :> ?pt0 ?pt1)
            (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
            ;; why filter for min? because there are geo duplicates..
            (c/min ?lat :> ?min_lat)
            (c/min ?lng :> ?min_lng)
            (c/min ?alt :> ?min_alt)
            (geohash ?min_lat ?min_lng :> ?geohash)
            (:trap (hfs-textline trap))))
  34. discovery
      ?blurb          Hawthorne Avenue from Alma Street to High Street
      ?traffic_count  3110
      ?traffic_class  local residential
      ?surface_type   asphalt concrete
      ?albedo         0.12
      ?min_lat        37.446140860599854
      ?min_lng        -122.1674652295435
      ?min_alt        0.0
      ?geohash        9q9jh0
      (another data product)
  35. discovery
      The road data provides:
      • traffic class (arterial, truck route, residential, etc.)
      • traffic counts distribution
      • surface type (asphalt, cement; age)
      This leads to estimators for noise, reflection, etc.
  36. discovery
      (flow diagram, gis road; nodes: GIS export, src, parse-gis (Regex), parse-road (Regex), Road Metadata, Join, Estimate Albedo, Geohash, Road Segments, road, Failure Traps)
  37. modeling
      source: America’s Next Top Model
  38. modeling
      GIS data from Palo Alto provides us with geolocation about each item in the export: latitude, longitude, altitude
      Geo data is great for managing municipal infrastructure as well as for mobile apps
      Predictive modeling in our Open Data example focuses on leveraging geolocation
      We use spatial indexing by creating a grid of geohash values, for efficient parallel processing
      Cascalog queries collect items with the same geohash values – using them as keys for large-scale joins (Hadoop)
  39. modeling
      geohash with 6-digit resolution approximates a 5-block square
      centered lat: 37.445, lng: -122.162
      9q9jh0
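The deck uses a `geohash` function without showing it. As a rough illustration of how a 6-character key such as `9q9jh0` arises, here is a sketch of the standard geohash encoding in plain Clojure. The function names are made up for this example, and the CoPA code may well rely on a library instead.

```clojure
;; Standard geohash base-32 alphabet (no a, i, l, o).
(def base32 "0123456789bcdefghjkmnpqrstuvwxyz")

(defn bisect
  "Halves the interval [lo hi] around v; returns [bit narrowed-interval]."
  [[lo hi] v]
  (let [mid (/ (+ lo hi) 2.0)]
    (if (>= v mid)
      [1 [mid hi]]
      [0 [lo mid]])))

(defn bits->char
  "Reads five bits, most significant first, as an index into the alphabet."
  [chunk]
  (nth base32 (reduce (fn [acc bit] (+ (* 2 acc) bit)) 0 chunk)))

(defn geohash-encode
  "Encodes lat/lng as a geohash string of `precision` characters.
   Bits alternate between longitude and latitude, longitude first."
  [lat lng precision]
  (loop [lat-iv [-90.0 90.0]
         lng-iv [-180.0 180.0]
         lng-turn? true
         bits []]
    (cond
      (= (count bits) (* 5 precision))
      (apply str (map bits->char (partition 5 bits)))

      lng-turn?
      (let [[bit iv] (bisect lng-iv lng)]
        (recur lat-iv iv false (conj bits bit)))

      :else
      (let [[bit iv] (bisect lat-iv lat)]
        (recur iv lng-iv true (conj bits bit))))))

;; Tree 412 (slide 27) and the Hawthorne Avenue midpoint (slide 34).
(println (geohash-encode 37.446001565119 -122.167713417554 6))
(println (geohash-encode 37.446140860599854 -122.1674652295435 6))
```

Both calls return `9q9jh0`, matching the `?geohash` values on those slides. Every point inside the same grid cell shares that key.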
  40. modeling
      Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:

        -122.161776959558,37.4518836690781,0.0
        -122.161390381489,37.4516410983794,0.0
        -122.160786011735,37.4512589903357,0.0
        -122.160531178368,37.4510977281699,0.0

      (figure: the four points along the road, labeled ( lat0, lng0, alt0 ) through ( lat3, lng3, alt3 ))
      NB: segments in the raw GIS list their geo coordinates in the order (lng, lat, alt), not (lat, lng, alt).
  41. modeling
      Our app analyzes each road segment as a data tuple, calculating the center point for each: ( lat, lng, alt )
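The segment handling described above can be sketched in plain Python. The deck's `bigram` and `midpoint` Cascalog functions are not shown on the slides, so this is an assumption about their behavior: consecutive points are paired, and the center of each pair is taken as a simple arithmetic mean. The names `parse_point`, `bigrams`, and `midpoint` here are hypothetical.

```python
def parse_point(text):
    """Parse a raw 'lng,lat,alt' string into a (lat, lng, alt) tuple.

    The raw GIS export lists longitude first, so the order is swapped here.
    """
    lng, lat, alt = (float(x) for x in text.strip().split(","))
    return (lat, lng, alt)


def bigrams(points):
    """Pair each point with its successor: one pair per road segment."""
    return list(zip(points, points[1:]))


def midpoint(p0, p1):
    """Arithmetic mean of two (lat, lng, alt) points (assumed definition)."""
    return tuple((a + b) / 2.0 for a, b in zip(p0, p1))


# The four raw coordinates shown on the slide.
raw = [
    "-122.161776959558,37.4518836690781,0.0",
    "-122.161390381489,37.4516410983794,0.0",
    "-122.160786011735,37.4512589903357,0.0",
    "-122.160531178368,37.4510977281699,0.0",
]
points = [parse_point(r) for r in raw]
centers = [midpoint(p0, p1) for p0, p1 in bigrams(points)]
```

Four points yield three segments, so `centers` holds three ( lat, lng, alt ) tuples. Averaging degrees directly is a reasonable approximation only because the segments are a block or less in length.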
  42. modeling
      Then uses a geohash to define a grid cell, as a boundary (or “canopy”): 9q9jh0
  43. modeling
      Query to join a road segment tuple with all the trees within its geohash boundary: 9q9jh0
  44. modeling
      Use distance-to-midpoint to filter trees which are too far away to provide shade.
      (figure: trees outside the distance threshold are marked with an X)
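  45. modeling
      Calculate a sum of moments combining tree height and distance from the road segment, as an estimator for shade: ∑( h·d )
      Note that the get-shade query on the following slide computes each moment as height / distance, so nearer trees contribute more.
      We also calculate estimators for traffic frequency and noise.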
  46. modeling

      (defn get-shade [trees roads]
        "subquery to join tree and road estimates, maximize for shade"
        (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt
             ?road_metric ?tree_metric]
            (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt ?geohash
                   ?traffic_count _ ?traffic_class _ _ _ _)
            (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
            (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
            (read-string ?avg_height :> ?height)
            ;; limit to trees which are higher than people
            (> ?height 2.0)
            (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
            ;; limit to trees within a one-block radius (not meters)
            (<= ?distance 25.0)
            (/ ?height ?distance :> ?tree_moment)
            (c/sum ?tree_moment :> ?sum_tree_moment)
            ;; magic number 200000.0 used to scale tree moment
            ;; based on median
            (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
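The filters and the aggregation in get-shade can be restated for a single road segment in plain Python. This is only a sketch: `tree_metric` is a hypothetical function, the distances are taken as already computed (the deck's `tree-distance` is not shown, and its units are "not meters"), and the guard against a zero distance is an added assumption that the original query does not contain. The thresholds 2.0 and 25.0 and the scale 200000.0 are copied from the query.

```python
def tree_metric(trees, min_height=2.0, max_distance=25.0, scale=200000.0):
    """Shade estimator for one road segment.

    trees: iterable of (height, distance) pairs for trees sharing the
    segment's geohash cell. Distances are in the same (unspecified)
    units that the deck's tree-distance function returns.
    """
    total = 0.0
    for height, distance in trees:
        # keep trees taller than people and within the distance limit;
        # the distance > 0 guard is an assumption added for this sketch
        if height > min_height and 0.0 < distance <= max_distance:
            total += height / distance  # the "moment" for one tree
    return total / scale


# Made-up sample values: the second tree is too short, the third too far.
sample = [(10.0, 5.0), (1.5, 3.0), (20.0, 40.0), (8.0, 4.0)]
metric = tree_metric(sample)
```

In the Cascalog query the same sum is computed per road, because `c/sum` aggregates over the tuples grouped by the remaining output fields.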
  47. modeling

      ?road_name    Hawthorne Avenue from Alma Street to High Street
      ?geohash      9q9jh0
      ?road_lat     37.446140860599854
      ?road_lng     -122.1674652295435
      ?road_alt     0.0
      ?road_metric  [1.0 0.5488121277250486 0.88]
      ?tree_metric  4.36321007861036

      (another data product)
  48. modeling
      (flow diagram, shade; node labels: tree, Filter height, Calculate distance, Filter distance, Sum moment, Filter sum_moment, road, Estimate traffic, Join, shade)
  49. modeling
  50. modeling
  51. modeling

      (defn get-gps [gps_logs trap]
        "subquery to aggregate and rank GPS tracks per user"
        (<- [?uuid ?geohash ?gps_count ?recent_visit]
            (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading
                      ?elapsed ?distance)
            (read-string ?gps_lat :> ?lat)
            (read-string ?gps_lng :> ?lng)
            (geohash ?lat ?lng :> ?geohash)
            (c/count :> ?gps_count)
            (date-num ?date :> ?visit)
            (c/max ?visit :> ?recent_visit)))

      (behavioral targeting: aggregate GPS tracks by recency, frequency)
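As a rough plain-Python equivalent of the aggregation in get-gps, the sketch below groups GPS records by user and geohash cell, counting visits (frequency) and keeping the latest timestamp (recency). `aggregate_gps` is a hypothetical name, and the records are assumed to be already geohashed and converted to numeric timestamps, which the query does with `geohash` and `date-num`.

```python
def aggregate_gps(records):
    """Group GPS records by (uuid, geohash).

    records: iterable of (uuid, geohash, visit) tuples, where visit is a
    numeric timestamp. Returns {(uuid, geohash): (gps_count, recent_visit)}.
    """
    stats = {}
    for uuid, cell, visit in records:
        count, recent = stats.get((uuid, cell), (0, visit))
        stats[(uuid, cell)] = (count + 1, max(recent, visit))
    return stats


# Made-up sample records, for illustration only.
records = [
    ("u1", "9q9jh0", 100),
    ("u1", "9q9jh0", 300),
    ("u1", "9q9hv3", 200),
    ("u2", "9q9jh0", 50),
]
tracks = aggregate_gps(records)
```

Each key in `tracks` corresponds to one row of the GPS output: a user, a cell, a visit count, and the most recent visit.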
  52. modeling
      (flow diagram, gps; node labels: gps logs, Geohash, Count gps_count, Max recent_visit, gps)
  53. modeling

      ?uuid                             ?geohash  ?gps_count  ?recent_visit
      cf660e041e994929b37cc5645209c8ae  9q8yym     7          1972376866448
      342ac6fd3f5f44c6b97724d618d587cf  9q9htz     4          1972376690969
      32cc09e69bc042f1ad22fc16ee275e21  9q9hv3     3          1972376670935
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv3     3          1972376691356
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv6     1          1972376691180
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv8    18          1972376691028
      342ac6fd3f5f44c6b97724d618d587cf  9q9hv9     7          1972376691101
      342ac6fd3f5f44c6b97724d618d587cf  9q9hvb    22          1972376691010
      342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
      342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
      482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
      b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348

      (GPS personalization)
  54. modeling

      (defn get-reco [tracks shades]
        "subquery to recommend road segments based on GPS tracks"
        (<- [?uuid ?road ?geohash ?lat ?lng ?alt
             ?gps_count ?recent_visit ?road_metric ?tree_metric]
            (tracks ?uuid ?geohash ?gps_count ?recent_visit)
            (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))

      (finally, the recommender)
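In get-reco the join is implicit: `?geohash` appears in both `tracks` and `shades`, so Cascalog matches tuples on that shared variable. A plain-Python sketch of the same inner join follows, using a dictionary keyed by geohash. `join_on_geohash` is a hypothetical name, and the two sample rows reuse values from the GPS and shade outputs shown earlier in the deck.

```python
def join_on_geohash(tracks, shades):
    """Inner join of GPS tracks and shade estimates on the geohash key."""
    by_geohash = {}
    for shade in shades:
        by_geohash.setdefault(shade["geohash"], []).append(shade)
    recos = []
    for track in tracks:
        # tracks in cells with no shade estimate produce no output row
        for shade in by_geohash.get(track["geohash"], []):
            recos.append({**track, **shade})
    return recos


tracks = [
    {"uuid": "482dc171ef0342b79134d77de0f31c4f", "geohash": "9q9jh0",
     "gps_count": 15, "recent_visit": 1972376952532},
    {"uuid": "32cc09e69bc042f1ad22fc16ee275e21", "geohash": "9q9hv3",
     "gps_count": 3, "recent_visit": 1972376670935},
]
shades = [
    {"road": "Hawthorne Avenue from Alma Street to High Street",
     "geohash": "9q9jh0", "tree_metric": 4.36321007861036},
]
recos = join_on_geohash(tracks, shades)
```

Only the first track shares a cell with a shade estimate, so `recos` contains a single combined row. On Hadoop the same matching happens in the reduce phase, with the geohash as the grouping key.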
  55. modeling
      Recommenders combine multiple signals, generally via weighted averages, to rank personalized results:
      • GPS of person ∩ road segment
      • frequency and recency of visit
      • traffic class and rate
      • road albedo (sunlight reflection)
      • tree shade estimator
      Adjusting the mix allows for further personalization at the end use.
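A weighted average over signals might look like the following plain-Python sketch. Everything here is hypothetical: the signal names, the assumption that each signal is already normalized to the range 0 to 1, and the weights themselves. The deck does not specify how the final mix is computed.

```python
def score(signals, weights):
    """Weighted average of the named signals (assumed normalized to 0..1)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * signals[name] for name in weights) / total_weight


def rank(candidates, weights):
    """Sort candidate road segments from best to worst score."""
    return sorted(candidates,
                  key=lambda c: score(c["signals"], weights),
                  reverse=True)


# Made-up candidates and signal values, for illustration only.
candidates = [
    {"road": "A", "signals": {"shade": 0.9, "quiet": 0.2}},
    {"road": "B", "signals": {"shade": 0.3, "quiet": 0.9}},
]
shade_lover = rank(candidates, {"shade": 2.0, "quiet": 1.0})
quiet_lover = rank(candidates, {"shade": 1.0, "quiet": 3.0})
```

Changing only the weights reorders the same candidates, which is one way to read "adjusting the mix": the batch workflow produces the signals once, and the end use applies a per-user weighting.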
  56. integration (source: Wolfram)
  57. integration
      Hadoop is rarely used in isolation.
      System integration is a hard problem in Big Data, especially the social aspects: breaking down silos.
      Cascading was built for this purpose:
      • taps across many data frameworks: HBase, Cassandra, MongoDB, etc.
      • support for a variety of data serialization: Avro, Thrift, Kryo, JSON, etc.
      • planning on multiple topologies: MapReduce, in-memory, tuple spaces, etc.
      • test-driven development (TDD) at scale
      • ANSI SQL-92 integration, PMML, etc.
      (inset flow diagram, gis tree; node labels: GIS export, src, Regex parse-gis, Regex parse-tree, Scrub species, Estimate height, Join, Tree Metadata, Geohash, Failure Traps, tree)
  58. integration
      This example focuses on the batch workflow to examine best practices for parallel processing.
      Integrating with a mobile app requires next steps:
      • push “reco” output to a Redis cluster (caching layer) via a Cascading tap
      • leverage Redis “sorted sets” for ranking personalized results
      • create lightweight API in Node.js + Nginx for low-latency access at scale
      • collect social interactions in Splunk
      • instrument via Nagios, New Relic, Flurry, etc.
      That provides a data service. It doesn’t even begin to address design, user experience, marketing, implementation, etc., for a complete app…
  59. integration
      Batch workflow plus a data service:
      (architecture diagram; labels: Customers, web app, mobile API, Redis cluster, web logs, GIS export, gps tracks, Cascading app with source, sink, and trap taps, Recommender, Hadoop cluster, customer profile DBs, Customer Prefs, Support, review, Splunk)
  60. integration
      In terms of deploying a batch workflow, there are several considerations:
      • build package for a “fat jar” (lein uberjar)
      • continuous integration
      • JAR repository
      • cluster scheduling (e.g., EMR)
      • instrumentation (Concurrent)
      • troubleshooting from app layer
  61. apps (source: Apple)
  62. apps
      We work on discovery, modeling, and integration long before coding an app.
      In a linear-logical sense, one might prefer a “waterfall” approach; however, that would undermine core values: mitigating Accidental Complexity, TDD, scalability, fault-tolerance, etc.
      In lieu of SQL queries, we define a composable set of logical propositions which can be executed, instrumented, tested, etc., independently, for best practices at scale in parallel.
      Back to functional relational programming, particularly Datalog’s logic programming: we use subqueries as logical propositions, within a functional context, to leverage the relational model.
      • scalability: specify what you require, not how
      • testability: disprove the opposites of propositions, to validate
      Taken together in the context of Cascalog, now let’s build the app…
  63. apps

      (defproject cascading-copa "0.1.0-SNAPSHOT"
        :description "City of Palo Alto Open Data recommender in Cascalog"
        :url "https://github.com/Cascading/CoPA"
        :license {:name "Apache License, Version 2.0"
                  :url "http://www.apache.org/licenses/LICENSE-2.0"
                  :distribution :repo}
        :uberjar-name "copa.jar"
        :aot [copa.core]
        :main copa.core
        :source-paths ["src/main/clj"]
        :dependencies [[org.clojure/clojure "1.4.0"]
                       [cascalog "1.10.0"]
                       [cascalog-more-taps "0.3.1-SNAPSHOT"]
                       [clojure-csv/clojure-csv "1.3.2"]
                       [org.clojars.sunng/geohash "1.0.1"]
                       [org.clojure/clojure-contrib "1.2.0"]
                       [date-clj "1.0.1"]]
        :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
                   :provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}})
  64. apps
  65. apps (results)
      ‣ addr: 115 HAWTHORNE AVE
      ‣ lat/lng: 37.446, -122.168
      ‣ geohash: 9q9jh0
      ‣ tree: 413 site 2
      ‣ species: Liquidambar styraciflua
      ‣ est. height: 23 m
      ‣ shade metric: 4.363
      ‣ traffic: local residential, light traffic
      ‣ recent visit: 1972376952532
      ‣ a short walk from my train stop ✔
  66. apps
      (flow diagram for the whole enchilada, combining the tree, road, shade, gps, and reco flows)
  67. definitions
      Design principles in the Cascading API pattern language, which help ensure best practices for Big Data apps in an Enterprise context:
      • specify what is required, not how it must be achieved
      • provide the “glue” for system integration
      • same JAR, any scale
      • users want no surprises
      • fail the same way twice
      • plan far ahead
      These points echo arguments about functional relational programming (FRP) and Accidental Complexity from Moseley/Marks 2006.
  68. systems (source: Wired)
  69. principle: same JAR, any scale

      MegaCorp Enterprise IT:
        • PBs of data
        • 1000+ node private cluster
        • EVP calls you when app fails
        • runtime: days+

      Production Cluster:
        • TBs of data
        • EMR w/ many HPC Instances
        • Ops monitors results
        • runtime: hours – days

      Staging Cluster:
        • GBs of data
        • EMR + a few Spot Instances
        • CI shows red or green lights
        • runtime: minutes – hours

      Your Laptop:
        • MBs of data
        • Hadoop standalone mode
        • passes unit tests, or not
        • runtime: seconds – minutes
  70. systems

      #!/bin/bash -ex
      # edit the `BUCKET` variable to use one of your S3 buckets:
      BUCKET=temp.cascading.org/copa
      SINK=out

      # clear previous output (required by Apache Hadoop)
      s3cmd del -r s3://$BUCKET/$SINK

      # load built JAR + input data
      s3cmd put target/copa.jar s3://$BUCKET/
      s3cmd put -r data s3://$BUCKET/

      # launch cluster and run
      elastic-mapreduce --create --name "CoPA" \
        --debug \
        --enable-debugging \
        --log-uri s3n://$BUCKET/logs \
        --jar s3n://$BUCKET/copa.jar \
        --arg s3n://$BUCKET/data/copa.csv \
        --arg s3n://$BUCKET/data/meta_tree.tsv \
        --arg s3n://$BUCKET/data/meta_road.tsv \
        --arg s3n://$BUCKET/data/gps.csv \
        --arg s3n://$BUCKET/$SINK/trap \
        --arg s3n://$BUCKET/$SINK/park \
        --arg s3n://$BUCKET/$SINK/tree \
        --arg s3n://$BUCKET/$SINK/road \
        --arg s3n://$BUCKET/$SINK/shade \
        --arg s3n://$BUCKET/$SINK/gps \
        --arg s3n://$BUCKET/$SINK/reco
  71. systems
  72. systems (under the hood)
      ‣ name node / data node
      ‣ job tracker / task tracker
      ‣ submit queue
      ‣ task slots
      ‣ HDFS
      ‣ distributed cache
      (image sources: Wikipedia, Apache)
  73. bucket list
  74. Could combine this with a variety of data APIs:
      • Trulia: neighborhood data, housing prices
      • Factual: local business (FB Places, etc.)
      • CommonCrawl: open source full web crawl
      • Wunderground: local weather data
      • WalkScore: neighborhood data, walkability
      • Data.gov: US federal open data
      • Data.NASA.gov: NASA open data
      • DBpedia: datasets derived from Wikipedia
      • GeoWordNet: semantic knowledge base
      • Geolytics: demographics, GIS, etc.
      • Foursquare, Yelp, CityGrid, Localeze, YP
      • various photo sharing
  75. Data Quality: some species names have spelling errors or misclassifications. These could be cleaned up and provided back to CoPA to improve municipal services.
      Assumptions have been made about missing data. Were these appropriate for the intended use case?
      There are better ways to handle spatial indexing: k-d trees, etc.
      The tree data product needs: photos, toxicity, natives vs. invasives, common names, etc.
  76. Arguably, this is not a “large” data set:
      • Palo Alto has a population of 65K
      • great location for a POC
      • prior to deploying in large metro areas
      • CoPA is a leader in e-gov
      • app is simpler to study on a laptop
      Could extend to other cities with Open Data initiatives: SF, SJ, PDX, Seattle, VanBC…
      Let’s get coverage for all of Ecotopia!
  77. Trulia: optimize sales leads using estimated allergy zones, based on buyers’ real estate preferences
      Calflora: report new observations of invasives, endangered species, etc.; infer regions of affinity for releasing beneficial insects
      City of Palo Alto: assess zoning impact, e.g., oleanders near day care centers; monitor outbreaks of tree diseases (big impact on property values)
      start-ups: some invasive species are valuable in Chinese medicine while others can be converted to biodiesel; a potential win-win for targeted harvest services
  78. summary points
      • geo data is great for municipal infrastructure and for mobile apps
      • Cascading as a pattern language for Enterprise Data Workflows
      • design principles in the API/pattern language ensure best practices
      • focus on the process of structuring data; not un/structured
      • Cascalog subqueries as composable logical propositions
      • FRP mitigates the engineering costs of Accidental Complexity
      • Data Science process: discovery, modeling, integration, apps, systems
      • Hadoop is rarely ever used in isolation; breaking down silos is the hard problem, which must be socialized to resolve
  79. references
      • leiningen.org
      • github.com/nathanmarz/cascalog/wiki
      • sritchie.github.com
      • vimeo.com/16398892
      • manning.com/marz
      • java.dzone.com/articles/using-lucene-and-cascalog-fast
  80. references
      Enterprise Data Workflows with Cascading, by Paco Nathan. O’Reilly, 2013. amazon.com/dp/1449358721
      Santa Clara, Feb 28, 1:30pm: strataconf.com/strata2013
  81. drill-down
      blog, code/wiki/gists, maven repo, community, products:
      • cascading.org
      • github.com/Cascading
      • conjars.org
      • meetup.com/cascading
      • goo.gl/KQtUL
      • concurrentinc.com
      we are hiring!
      Copyright @2013, Concurrent, Inc.

Slides for Open Data Bay Area meetup on 2013-01-29 in SF: http://www.meetup.com/Open-Data-Bay-Area/events/98445822/
