Using Cascalog to build an app based on City of Palo Alto Open Data

Slides for Open Data Bay Area meetup on 2013-01-29 in SF: http://www.meetup.com/Open-Data-Bay-Area/events/98445822/


Usage Rights: CC Attribution-ShareAlike License

Presentation Transcript
  • “Using Cascalog to build an app based on City of Palo Alto Open Data” – Paco Nathan, Concurrent, Inc., San Francisco, CA – @pacoid – Copyright @2013, Concurrent, Inc.
  • This project began as a machine learning workshop for a graduate seminar at CMU West. Many thanks to: Stuart Evans, CMU Distinguished Service Professor; Jonathan Reichental, City of Palo Alto CIO. We use Cascalog to develop a Big Data workflow. Open Source: github.com/Cascading/CoPA/wiki
  • Palo Alto is generally quite a pleasant place:
    • temperate weather
    • lots of parks, enormous trees
    • great coffeehouses
    • walkable downtown
    • not particularly crowded
    • friendly VCs (sort of)
    On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside – go for a walk
  • Surely, there must be an app for that… But wait, there isn’t? So let’s build one! (source: Apple)
  • process (source: algaelab.org)
  • 1. unstructured data about municipal infrastructure (GIS data: trees, roads, parks) ✚
    2. unstructured data about where people like to walk (smartphone GPS logs) ✚
    3. a wee bit o’ curated metadata
    4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.”
  • “unstructured” vs. “structured” data is actually quite a Big Debate – refer back to Edgar Codd 1969 to learn about the Relational Model. relational != SQL, but I digress…
  • Data Science work must focus on the process of structuring data, which must occur long before the large-scale joins, predictive models, visualizations, etc. So, the process of structuring data is what we examine here: i.e., how to build workflows for Big Data. thank you Dr. Codd – “A relational model of data for large shared data banks” dl.acm.org/citation.cfm?id=362685
  • references by DJ Patil:
    Data Jujitsu, O’Reilly, 2012 – amazon.com/dp/B008HMN5BE
    Building Data Science Teams, O’Reilly, 2011 – amazon.com/dp/B005O4U3ZE
  • references by Leo Breiman:
    Statistical Modeling: The Two Cultures, Statistical Science, 2001 – bit.ly/eUTh9L
    also check out RStudio: rstudio.org/ rpubs.com/
  • Generally speaking, we could approach the matter of developing an Open Data app through these steps:
    • clean up the raw, unstructured data from CoPA download (ETL)
    • before modeling, perform visualization and analysis in RStudio
    • spend time on ideation and research for potential use cases
    • iterate on business process for the app workflow
    • integrate with use cases represented by the workflow taps
    • apply best practices and TDD at scale
    • …PROFIT! (source: South Park)
  • In terms of actual process used in Data Science, here’s how my teams have worked:
    discovery – help people ask the right questions
    modeling – allow automation to place informed bets
    integration – deliver products at scale to customers
    apps – build smarts into product features
    systems – keep infrastructure running, cost-effective
  • For the process used with this Open Data app, we chose to use Cascalog – by Nathan Marz, Sam Ritchie, et al., 2010 – a DSL in Clojure which implements Datalog, backed by Cascading. Some aspects of CS theory:
    • Functional Relational Programming
    • mitigates Accidental Complexity
    • has been compared with Codd 1969
    github.com/nathanmarz/cascalog/wiki
  • Q: Who uses Cascalog, other than Twitter? A:
    • Climate Corp (they’re hiring, ask for Crea)
    • Factual
    • Nokia Maps
    • Harvard School of Public Health
    • YieldBot (PDX)
    • uSwitch (London)
    • etc.
  • pro:
    • 10:1 reduction in code volume compared to SQL
    • most advanced uses of Cascading
    • Leiningen build: simple, no surprises, in Clojure itself
    • test-driven development (TDD) for Big Data
    • fault-tolerant workflows which are simple to follow
    • machine learning, map-reduce, etc., started in LISP years ago anywho
    con:
    • learning curve, limited number of Clojure developers
    • aggregators are the magic, those take effort to learn
  • Accidental Complexity: not O(N^2) complexity, but the costs of software engineering at scale over time. What happens when you build recommenders, then go work on other projects for six months? What does it cost others to maintain your apps? Cascalog allows for leveraging the same framework, same code base, from Discovery phase through to Systems phase. It focuses on the process of structuring data: specify what you require, not how it must be achieved. Huge implications for software engineering
  • discovery (source: 2001 A Space Odyssey)
  • discovery: The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates. This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good. paloalto.opendata.junar.com/dashboards/7576/geographic-information/
  • discovery: GIS about trees in Palo Alto:
  • discovery: GIS about roads in Palo Alto:
  • discovery – raw export sample:
    Geographic_Information,,,
    "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
    "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: Thickness: 2.0 6.0 (um, bokay…) Base Type Pvmt: Soil Class: 2 crusher run base Soil Value: 15 Base Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0
  • discovery

    (defn parse-gis [line]
      "leverages parse-csv for complex CSV format in GIS export"
      (first (csv/parse-csv line)))

    (defn etl-gis [gis trap]
      "subquery to parse data sets from the GIS source tap"
      (<- [?blurb ?misc ?geo ?kind]
          (gis ?line)
          (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
          (:trap (hfs-textline trap))))

    (specify what you require, not how to achieve it… addressing the 80%)
  • discovery (convert ad-hoc queries into logical propositions)
  • discovery
    Identifier: 474
    Tree ID: 412
    Tree: 412 site 1 at 115 HAWTHORNE AV
    Tree Site: 1
    Street_Name: HAWTHORNE AV
    Situs Number: 115
    Private: -1
    Species: Liquidambar styraciflua
    Source: davey tree
    Hardscape: None
    37.446001565119,-122.167713417554,0.0
    Point
    (obtain recognizable results)
  • discovery (curate valuable metadata)
  • discovery

    (defn get-trees [src trap tree_meta]
      "subquery to parse/filter the tree data"
      (<- [?blurb ?tree_id ?situs ?tree_site ?species ?wikipedia ?calflora
           ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash]
          (src ?blurb ?misc ?geo ?kind)
          (re-matches #"^\s+Private.*Tree ID.*" ?misc)
          (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
          ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
          (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
          (avg ?min_height ?max_height :> ?avg_height)
          (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
          (read-string ?tree_lat :> ?lat)
          (read-string ?tree_lng :> ?lng)
          (geohash ?lat ?lng :> ?geohash)
          (:trap (hfs-textline trap))))
  • discovery
    ?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
    ?tree_id     412
    ?situs       115
    ?tree_site   1
    ?species     liquidambar styraciflua
    ?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
    ?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
    ?avg_height  27.5
    ?tree_lat    37.446001565119
    ?tree_lng    -122.167713417554
    ?tree_alt    0.0
    ?geohash     9q9jh0
    (et voilà, a data product)
  • discovery

    # run some analysis and visualization in R
    library(ggplot2)
    dat_folder <- "~/src/concur/CoPA/out/tree"
    data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
                       sep="\t", quote="", na.strings="NULL",
                       header=FALSE, encoding="UTF8")

    summary(data)
    t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
    trees <- as.data.frame.table(t)
    colnames(trees) <- c("species", "count")

    m <- ggplot(data, aes(x=V8))
    m <- m + ggtitle("Estimated Tree Height (meters)")
    m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

    par(mar = c(7, 4, 4, 2) + 0.1)
    plot(trees, xaxt="n", xlab="")
    axis(1, labels=FALSE)
    text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
         labels=trees$species, xpd=TRUE)
    grid(nx=nrow(trees))
  • discovery: sweetgum
  • discovery (flow diagram, gis → tree: GIS export, Regex parse-gis, Scrub species, Regex parse-tree, Join with Tree Metadata, Estimate height, Geohash, tree sink, Failure Traps)
  • definitions: The conceptual flow diagram shows a directed, acyclic graph (DAG) of taps, tuple streams, functions, joins, aggregations, assertions, etc. Cascading is formally a pattern language – patterns of “plumbing” fit together to ensure best practices for large-scale parallel processing in risk-aversive environments – hard requirements of Enterprise IT. In other words, Cascading forces functional programming through an API for JVM-based languages such as Java, Scala, Clojure. Through this approach, we define Enterprise Data Workflows
  • definitions:
    pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices – amazon.com/dp/0195019199
    design patterns: originated in consensus negotiation for architecture, later used in OOP software engineering – amazon.com/dp/0201633612
  • discovery

    (defn get-roads [src trap road_meta]
      "subquery to parse/filter the road data"
      (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo
           ?min_lat ?min_lng ?min_alt ?geohash
           ?traffic_count ?traffic_index ?traffic_class
           ?paving_length ?paving_width ?paving_area ?surface_type]
          (src ?blurb ?misc ?geo ?kind)
          (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
          (parse-road ?misc :> _ ?traffic_count ?traffic_index ?traffic_class
                      ?paving_length ?paving_width ?paving_area ?surface_type
                      ?overlay_year ?bike_lane ?bus_route ?truck_route)
          (road_meta ?surface_type ?albedo_new ?albedo_worn)
          (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
          (bigram ?geo :> ?pt0 ?pt1)
          (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
          ;; why filter for min? because there are geo duplicates..
          (c/min ?lat :> ?min_lat)
          (c/min ?lng :> ?min_lng)
          (c/min ?alt :> ?min_alt)
          (geohash ?min_lat ?min_lng :> ?geohash)
          (:trap (hfs-textline trap))))
  • discovery
    ?blurb          Hawthorne Avenue from Alma Street to High Street
    ?traffic_count  3110
    ?traffic_class  local residential
    ?surface_type   asphalt concrete
    ?albedo         0.12
    ?min_lat        37.446140860599854
    ?min_lng        -122.1674652295435
    ?min_alt        0.0
    ?geohash        9q9jh0
    (another data product)
  • discovery: The road data provides:
    • traffic class (arterial, truck route, residential, etc.)
    • traffic counts distribution
    • surface type (asphalt, cement; age)
    This leads to estimators for noise, reflection, etc.
  • discovery (flow diagram, gis → road: GIS export, Regex parse-gis, Regex parse-road, Join with Road Metadata, Estimate albedo, Geohash, Road Segments sink, Failure Traps)
  • modeling (source: America’s Next Top Model)
  • modeling: GIS data from Palo Alto provides us with geolocation about each item in the export: latitude, longitude, altitude. Geo data is great for managing municipal infrastructure as well as for mobile apps. Predictive modeling in our Open Data example focuses on leveraging geolocation. We use spatial indexing by creating a grid of geohash values, for efficient parallel processing. Cascalog queries collect items with the same geohash values – using them as keys for large-scale joins (Hadoop)
  • modeling: a geohash with 6-digit resolution approximates a 5-block square, centered at lat: 37.445, lng: -122.162 → 9q9jh0
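    For illustration of how those geohash keys get computed: the CoPA build pulls in a geohash library (org.clojars.sunng/geohash, per the project.clj later in this deck), but the standard base32 bisection algorithm is small enough to sketch in plain Clojure. This is a sketch only, not that library’s API:

    (def ^:private base32 "0123456789bcdefghjkmnpqrstuvwxyz")

    (defn geohash-encode
      "Encode lat/lng as a geohash of `precision` characters.
       Illustrative sketch; the CoPA code uses a geohash library instead."
      [lat lng precision]
      (loop [bits [] lat-rng [-90.0 90.0] lng-rng [-180.0 180.0] even? true]
        (if (= (count bits) (* 5 precision))
          ;; pack each run of 5 bits into one base32 character
          (apply str (map #(nth base32 (reduce (fn [acc b] (+ (* 2 acc) b)) 0 %))
                          (partition 5 bits)))
          (let [[lo hi] (if even? lng-rng lat-rng)   ; even bit positions bisect longitude
                mid     (/ (+ lo hi) 2.0)
                v       (if even? lng lat)
                bit     (if (>= v mid) 1 0)
                rng     (if (= bit 1) [mid hi] [lo mid])]
            (recur (conj bits bit)
                   (if even? lat-rng rng)
                   (if even? rng lng-rng)
                   (not even?))))))

    ;; (geohash-encode 37.445 -122.162 6) ;=> "9q9jh0", the grid cell shown on this slide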
  • modeling: Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:
    -122.161776959558,37.4518836690781,0.0
    -122.161390381489,37.4516410983794,0.0
    -122.160786011735,37.4512589903357,0.0
    -122.160531178368,37.4510977281699,0.0
    NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
  • modeling: Our app analyzes each road segment as a data tuple, calculating the center point ( lat, lng, alt ) for each
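    The repo refers to bigram and midpoint helpers for this step. As a rough sketch – hypothetical helper names, a plain arithmetic mean rather than whatever the project actually does – the un-scrambling and center-point calculation might look like:

    (require '[clojure.string :as str])

    (defn parse-geo-point
      "Parse one raw 'lng,lat,alt' triple from the GIS geometry string,
       re-ordering it into [lat lng alt]."
      [s]
      (let [[lng lat alt] (map #(Double/parseDouble %) (str/split (str/trim s) #","))]
        [lat lng alt]))

    (defn segment-midpoint
      "Center point of a road segment given its two endpoints, as [lat lng alt].
       A simple arithmetic mean is adequate at city-block scale."
      [[lat0 lng0 alt0] [lat1 lng1 alt1]]
      [(/ (+ lat0 lat1) 2.0) (/ (+ lng0 lng1) 2.0) (/ (+ alt0 alt1) 2.0)])

    ;; e.g., the first two points of the segment above:
    ;; (segment-midpoint (parse-geo-point "-122.161776959558,37.4518836690781,0.0")
    ;;                   (parse-geo-point "-122.161390381489,37.4516410983794,0.0"))
    ;; ;=> [37.4517623837... -122.1615836705... 0.0]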
  • modeling: Then uses a geohash to define a grid cell, as a boundary (or “canopy”): 9q9jh0
  • modeling: Query to join a road segment tuple with all the trees within its geohash boundary: 9q9jh0
  • modeling: Use distance-to-midpoint to filter trees which are too far away to provide shade
  • modeling: Calculate a sum of moments for tree height × distance from road segment, as an estimator for shade: ∑( h·d ). We also calculate estimators for traffic frequency and noise
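    The shade subquery on the next slide calls a tree-distance helper. One plausible implementation is the haversine great-circle distance – an assumption, since the repo’s version (and its units) may differ; its own comment notes the 25.0 cutoff is “not meters”:

    (defn tree-distance
      "Great-circle (haversine) distance in meters between a tree and a
       road-segment midpoint. Sketch only; the project's helper may use
       different units or a flat-earth approximation at this scale."
      [tree-lat tree-lng road-lat road-lng]
      (let [r    6371000.0                        ; mean Earth radius, meters
            rad  #(Math/toRadians (double %))
            dlat (rad (- road-lat tree-lat))
            dlng (rad (- road-lng tree-lng))
            a    (+ (Math/pow (Math/sin (/ dlat 2)) 2)
                    (* (Math/cos (rad tree-lat)) (Math/cos (rad road-lat))
                       (Math/pow (Math/sin (/ dlng 2)) 2)))]
        (* r 2 (Math/asin (Math/sqrt a)))))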
  • modeling

    (defn get-shade [trees roads]
      "subquery to join tree and road estimates, maximize for shade"
      (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt
           ?road_metric ?tree_metric]
          (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt
                 ?geohash ?traffic_count _ ?traffic_class _ _ _ _)
          (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
          (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
          (read-string ?avg_height :> ?height)
          ;; limit to trees which are higher than people
          (> ?height 2.0)
          (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
          ;; limit to trees within a one-block radius (not meters)
          (<= ?distance 25.0)
          (/ ?height ?distance :> ?tree_moment)
          (c/sum ?tree_moment :> ?sum_tree_moment)
          ;; magic number 200000.0 used to scale tree moment based on median
          (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
  • modeling
    ?road_name    Hawthorne Avenue from Alma Street to High Street
    ?geohash      9q9jh0
    ?road_lat     37.446140860599854
    ?road_lng     -122.1674652295435
    ?road_alt     0.0
    ?road_metric  [1.0 0.5488121277250486 0.88]
    ?tree_metric  4.36321007861036
    (another data product)
  • modeling (flow diagram, shade: join road ⋈ tree, filter tree height, calculate distance, filter distance, sum moment, filter sum_moment, estimate traffic → shade)
  • modeling

    (defn get-gps [gps_logs trap]
      "subquery to aggregate and rank GPS tracks per user"
      (<- [?uuid ?geohash ?gps_count ?recent_visit]
          (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt
                    ?speed ?heading ?elapsed ?distance)
          (read-string ?gps_lat :> ?lat)
          (read-string ?gps_lng :> ?lng)
          (geohash ?lat ?lng :> ?geohash)
          (c/count :> ?gps_count)
          (date-num ?date :> ?visit)
          (c/max ?visit :> ?recent_visit)))

    (behavioral targeting: aggregate GPS tracks by recency, frequency)
  • modeling (flow diagram, gps: gps logs → Geohash → Count gps_count, Max recent_visit → gps)
  • modeling
    ?uuid                             ?geohash  ?gps_count  ?recent_visit
    cf660e041e994929b37cc5645209c8ae  9q8yym     7          1972376866448
    342ac6fd3f5f44c6b97724d618d587cf  9q9htz     4          1972376690969
    32cc09e69bc042f1ad22fc16ee275e21  9q9hv3     3          1972376670935
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv3     3          1972376691356
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv6     1          1972376691180
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv8    18          1972376691028
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv9     7          1972376691101
    342ac6fd3f5f44c6b97724d618d587cf  9q9hvb    22          1972376691010
    342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
    342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
    482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
    b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348
    (GPS personalization)
  • modeling

    (defn get-reco [tracks shades]
      "subquery to recommend road segments based on GPS tracks"
      (<- [?uuid ?road ?geohash ?lat ?lng ?alt
           ?gps_count ?recent_visit ?road_metric ?tree_metric]
          (tracks ?uuid ?geohash ?gps_count ?recent_visit)
          (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))

    (finally, the recommender)
  • modeling: Recommenders combine multiple signals, generally via weighted averages, to rank personalized results:
    • GPS of person ∩ road segment
    • frequency and recency of visit
    • traffic class and rate
    • road albedo (sunlight reflection)
    • tree shade estimator
    Adjusting the mix allows for further personalization at the end use (see the sketch below)
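    As a concrete illustration of that weighted blend – the weights and normalization here are made-up placeholders, not values from the CoPA code – a per-segment ranking score over the fields that get-reco emits might look like:

    (defn reco-score
      "Blend per-segment signals into one ranking score via a weighted sum.
       Field names mirror get-reco's output; assumes ?road_metric has already
       been collapsed to a single number. Weights are illustrative only."
      [now {:keys [gps_count recent_visit road_metric tree_metric]}]
      (let [freq    (Math/log (inc (double gps_count)))        ; damp very frequent visitors
            recency (/ 1.0 (inc (max 0 (- now recent_visit))))] ; newer visits score higher
        (+ (* 0.3 freq)
           (* 0.2 recency)
           (* 0.2 road_metric)     ; low traffic, high albedo
           (* 0.3 tree_metric))))  ; estimated shade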
  • integration (source: Wolfram)
  • integration: Hadoop is rarely ever used in isolation. System integration is a hard problem in Big Data, especially social aspects: breaking down silos. Cascading was built for this purpose:
    • taps across many data frameworks: HBase, Cassandra, MongoDB, etc.
    • support for a variety of data serialization: Avro, Thrift, Kryo, JSON, etc.
    • planning on multiple topologies: MapReduce, in-memory, tuple spaces, etc.
    • test-driven development (TDD) at scale
    • ANSI SQL-92 integration, PMML, etc.
  • integration: This example focuses on the batch workflow, to examine best practices for parallel processing. Integrating with a mobile app requires next steps:
    • push “reco” output to a Redis cluster (caching layer) via a Cascading tap – see the sketch after this list
    • leverage Redis “sorted sets” for ranking personalized results
    • create lightweight API in Node.js + Nginx for low-latency access at scale
    • collect social interactions in Splunk
    • instrument via Nagios, New Relic, Flurry, etc.
    That provides a data service – it doesn’t even begin to address design, user experience, marketing, implementation, etc., for a complete app…
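    For the first two bullets, a minimal sketch of what the Redis side could look like – assuming the carmine Clojure client (not part of the CoPA project.clj; connection-spec and API details vary by carmine version):

    (require '[taoensso.carmine :as car :refer [wcar]])

    (def redis-conn {:pool {} :spec {:host "127.0.0.1" :port 6379}})

    (defn push-reco!
      "Write one recommendation into a per-user sorted set, scored so that
       a reverse range returns the best road segments first."
      [uuid road-name score]
      (wcar redis-conn
        (car/zadd (str "reco:" uuid) score road-name)))

    (defn top-recos
      "Fetch the top-k ranked road segments for a user, with scores."
      [uuid k]
      (wcar redis-conn
        (car/zrevrange (str "reco:" uuid) 0 (dec k) "withscores")))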
  • integration (diagram – batch workflow plus a data service: web logs, GIS export, and GPS tracks flow through source taps into the Cascading recommender app on a Hadoop cluster, joined with customer prefs and profile DBs; the sink tap feeds a Redis cluster behind the mobile app API; a trap tap feeds Splunk for customer support, alongside web review logs)
  • integration: In terms of deploying a batch workflow, there are several considerations:
    • build package for a “fat jar” (lein uberjar)
    • continuous integration
    • JAR repository
    • cluster scheduling (e.g., EMR)
    • instrumentation (Concurrent)
    • troubleshooting from app layer
  • apps (source: Apple)
  • apps: We work on discovery, modeling, integration – long before coding an app. In a linear-logical sense, one might prefer a “waterfall” approach; however, that would undermine core values – mitigating Accidental Complexity, TDD, scalability, fault-tolerance, etc. In lieu of SQL queries, we define a composable set of logical propositions which can be executed, instrumented, tested, etc., independently – best practices at scale, in parallel. Back to functional relational programming, particularly Datalog’s logic programming: we use subqueries as logical propositions… within a functional context… to leverage the relational model
    • scalability: specify what you require, not how
    • testability: disprove the opposites of propositions, to validate
    Taken together in the context of Cascalog, now let’s build the app…
  • apps

    (defproject cascading-copa "0.1.0-SNAPSHOT"
      :description "City of Palo Alto Open Data recommender in Cascalog"
      :url "https://github.com/Cascading/CoPA"
      :license {:name "Apache License, Version 2.0"
                :url "http://www.apache.org/licenses/LICENSE-2.0"
                :distribution :repo}
      :uberjar-name "copa.jar"
      :aot [copa.core]
      :main copa.core
      :source-paths ["src/main/clj"]
      :dependencies [[org.clojure/clojure "1.4.0"]
                     [cascalog "1.10.0"]
                     [cascalog-more-taps "0.3.1-SNAPSHOT"]
                     [clojure-csv/clojure-csv "1.3.2"]
                     [org.clojars.sunng/geohash "1.0.1"]
                     [org.clojure/clojure-contrib "1.2.0"]
                     [date-clj "1.0.1"]]
      :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
                 :provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}})
  • apps (results)
    ‣ addr: 115 HAWTHORNE AVE
    ‣ lat/lng: 37.446, -122.168
    ‣ geohash: 9q9jh0
    ‣ tree: 413 site 2
    ‣ species: Liquidambar styraciflua
    ‣ est. height: 23 m
    ‣ shade metric: 4.363
    ‣ traffic: local residential, light traffic
    ‣ recent visit: 1972376952532
    ‣ a short walk from my train stop ✔
  • apps (flow diagram for the whole enchilada: gis → tree, gis → road, shade, gps, and the final reco join)
  • definitions: Design principles in the Cascading API pattern language, which help ensure best practices for Big Data apps in an Enterprise context:
    • specify what is required, not how it must be achieved
    • provide the “glue” for system integration
    • same JAR, any scale
    • users want no surprises
    • fail the same way twice
    • plan far ahead
    These points echo arguments about functional relational programming (FRP) and Accidental Complexity from Moseley/Marks 2006
  • systems (source: Wired)
  • principle: same JAR, any scale
    • MegaCorp Enterprise IT: Pb’s data, 1000+ node private cluster, EVP calls you when app fails, runtime: days+
    • Production Cluster: Tb’s data, EMR w/ many HPC Instances, Ops monitors results, runtime: hours – days
    • Staging Cluster: Gb’s data, EMR + a few Spot Instances, CI shows red or green lights, runtime: minutes – hours
    • Your Laptop: Mb’s data, Hadoop standalone mode, passes unit tests, or not, runtime: seconds – minutes
  • systems

    #!/bin/bash -ex
    # edit the `BUCKET` variable to use one of your S3 buckets:
    BUCKET=temp.cascading.org/copa
    SINK=out

    # clear previous output (required by Apache Hadoop)
    s3cmd del -r s3://$BUCKET/$SINK

    # load built JAR + input data
    s3cmd put target/copa.jar s3://$BUCKET/
    s3cmd put -r data s3://$BUCKET/

    # launch cluster and run
    elastic-mapreduce --create --name "CoPA" \
      --debug --enable-debugging --log-uri s3n://$BUCKET/logs \
      --jar s3n://$BUCKET/copa.jar \
      --arg s3n://$BUCKET/data/copa.csv \
      --arg s3n://$BUCKET/data/meta_tree.tsv \
      --arg s3n://$BUCKET/data/meta_road.tsv \
      --arg s3n://$BUCKET/data/gps.csv \
      --arg s3n://$BUCKET/$SINK/trap \
      --arg s3n://$BUCKET/$SINK/park \
      --arg s3n://$BUCKET/$SINK/tree \
      --arg s3n://$BUCKET/$SINK/road \
      --arg s3n://$BUCKET/$SINK/shade \
      --arg s3n://$BUCKET/$SINK/gps \
      --arg s3n://$BUCKET/$SINK/reco
  • systems (under the hood):
    ‣ name node / data node
    ‣ job tracker / task tracker
    ‣ submit queue
    ‣ task slots
    ‣ HDFS
    ‣ distributed cache
    sources: Wikipedia, Apache
  • bucket list
  • Could combine this with a variety of data APIs:
    • Trulia – neighborhood data, housing prices
    • Factual – local business (FB Places, etc.)
    • CommonCrawl – open source full web crawl
    • Wunderground – local weather data
    • WalkScore – neighborhood data, walkability
    • Data.gov – US federal open data
    • Data.NASA.gov – NASA open data
    • DBpedia – datasets derived from Wikipedia
    • GeoWordNet – semantic knowledge base
    • Geolytics – demographics, GIS, etc.
    • Foursquare, Yelp, CityGrid, Localeze, YP
    • various photo sharing
  • Data Quality: some species names have spelling errors or misclassifications – these could be cleaned up and provided back to CoPA to improve municipal services. Assumptions have been made about missing data – were these appropriate for the intended use case? There are better ways to handle spatial indexing: k-d trees, etc. The tree data product needs: photos, toxicity, natives vs. invasives, common names, etc.
  • Arguably, this is not a “large” data set:
    • Palo Alto has 65K population
    • great location for a POC
    • prior to deploying in large metro areas
    • CoPA is a leader in e-gov
    • app is simpler to study on a laptop
    Could extend to other cities with Open Data initiatives: SF, SJ, PDX, Seattle, VanBC… Let’s get coverage for all of Ecotopia!
  • Trulia: optimize sales leads using estimated allergy zones, based on buyers’ real estate preferences.
    Calflora: report new observations of invasives, endangered species, etc.; infer regions of affinity for releasing beneficial insects.
    City of Palo Alto: assess zoning impact, e.g., oleanders near day care centers; monitor outbreaks of tree diseases (big impact on property values).
    start-ups: some invasive species are valuable in Chinese medicine, while others can be converted to biodiesel – a potential win-win for targeted harvest services
  • summary points:
    • geo data is great for municipal infrastructure and for mobile apps
    • Cascading as a pattern language for Enterprise Data Workflows
    • design principles in the API/pattern language ensure best practices
    • focus on the process of structuring data; not un/structured
    • Cascalog subqueries as composable logical propositions
    • FRP mitigates the engineering costs of Accidental Complexity
    • Data Science process: discovery, modeling, integration, apps, systems
    • Hadoop is rarely ever used in isolation; breaking down silos is the hard problem, which must be socialized to resolve
  • references:
    leiningen.org
    github.com/nathanmarz/cascalog/wiki
    sritchie.github.com
    vimeo.com/16398892
    manning.com/marz
    java.dzone.com/articles/using-lucene-and-cascalog-fast
  • references by Paco Nathan:
    Enterprise Data Workflows with Cascading, O’Reilly, 2013 – amazon.com/dp/1449358721
    Santa Clara, Feb 28, 1:30pm – strataconf.com/strata2013
  • drill-down – blog, code/wiki/gists, maven repo, community, products:
    cascading.org
    github.org/Cascading
    conjars.org
    meetup.com/cascading
    goo.gl/KQtUL
    concurrentinc.com
    we are hiring!
    Copyright @2013, Concurrent, Inc.