Using Cascalog to build an app based on City of Palo Alto Open Data

Slides for Open Data Bay Area meetup on 2013-01-29 in SF: http://www.meetup.com/Open-Data-Bay-Area/events/98445822/


Usage Rights: CC Attribution-ShareAlike License

Presentation Transcript
  • “Using Cascalog to build an app based on City of Palo Alto Open Data” – Paco Nathan, Concurrent, Inc., San Francisco, CA – @pacoid – Copyright @2013, Concurrent, Inc.
  • This project began as a machine learning workshop for a graduate seminar at CMU West. Many thanks to: Stuart Evans, CMU Distinguished Service Professor; Jonathan Reichental, City of Palo Alto CIO. We use Cascalog to develop a Big Data workflow. Open Source: github.com/Cascading/CoPA/wiki
  • Palo Alto is generally quite a pleasant place:
    • temperate weather
    • lots of parks, enormous trees
    • great coffeehouses
    • walkable downtown
    • not particularly crowded
    • friendly VCs (sort of)
    On a nice summer day, who wants to be stuck indoors on a phone call? Instead, take it outside – go for a walk
  • Surely, there must be an app for that… But wait, there isn’t? So let’s build one! (source: Apple)
  • process (source: algaelab.org)
  • 1. unstructured data about municipal infrastructure (GIS data: trees, roads, parks) ✚
    2. unstructured data about where people like to walk (smartphone GPS logs) ✚
    3. a wee bit o’ curated metadata
    4. personalized recommendations: “Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sippin’ a latte or enjoying some fro-yo.”
  • “unstructured” vs. “structured” data is actually quite a Big Debate – refer back to Edgar Codd 1969 to learn about the Relational Model. relational != SQL, but I digress…
  • Data Science work must focus on the process of structuring data, which must occur long before the large-scale joins, predictive models, visualizations, etc. So, the process of structuring data is what we examine here: i.e., how to build workflows for Big Data. thank you Dr. Codd – “A relational model of data for large shared data banks” dl.acm.org/citation.cfm?id=362685
  • references by DJ Patil:
    Data Jujitsu, O’Reilly, 2012 – amazon.com/dp/B008HMN5BE
    Building Data Science Teams, O’Reilly, 2011 – amazon.com/dp/B005O4U3ZE
  • references by Leo Breiman:
    Statistical Modeling: The Two Cultures, Statistical Science, 2001 – bit.ly/eUTh9L
    also check out RStudio: rstudio.org/ rpubs.com/
  • Generally speaking, we could approach the matter of developing an Open Data app through these steps:
    • clean up the raw, unstructured data from CoPA download (ETL)
    • before modeling, perform visualization and analysis in RStudio
    • spend time on ideation and research for potential use cases
    • iterate on business process for the app workflow
    • integrate with use cases represented by the workflow taps
    • apply best practices and TDD at scale
    • …PROFIT! (source: South Park)
  • In terms of actual process used in Data Science, here’s how my teams have worked:
    discovery – help people ask the right questions
    modeling – allow automation to place informed bets
    integration – deliver products at scale to customers
    apps – build smarts into product features
    systems – keep infrastructure running, cost-effective
  • For the process used with this Open Data app, we chose to use Cascalog – by Nathan Marz, Sam Ritchie, et al., 2010 – a DSL in Clojure which implements Datalog, backed by Cascading. Some aspects of CS theory:
    • Functional Relational Programming
    • mitigates Accidental Complexity
    • has been compared with Codd 1969
    github.com/nathanmarz/cascalog/wiki
  • Q: Who uses Cascalog, other than Twitter? A:
    • Climate Corp (they’re hiring, ask for Crea)
    • Factual
    • Nokia Maps
    • Harvard School of Public Health
    • YieldBot (PDX)
    • uSwitch (London)
    • etc.
  • pro:
    • 10:1 reduction in code volume compared to SQL
    • most advanced uses of Cascading
    • Leiningen build: simple, no surprises, in Clojure itself
    • test-driven development (TDD) for Big Data
    • fault-tolerant workflows which are simple to follow
    • machine learning, map-reduce, etc., started in LISP years ago anywho
    con:
    • learning curve, limited number of Clojure developers
    • aggregators are the magic, those take effort to learn
  • Accidental Complexity: not O(N^2) complexity, but the costs of software engineering at scale over time. What happens when you build recommenders, then go work on other projects for six months? What does it cost others to maintain your apps? Cascalog allows for leveraging the same framework, same code base, from Discovery phase through to Systems phase. It focuses on the process of structuring data: specify what you require, not how it must be achieved. Huge implications for software engineering
  • discovery (source: 2001 A Space Odyssey)
  • discovery: The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates. This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good. paloalto.opendata.junar.com/dashboards/7576/geographic-information/
  • discovery: GIS about trees in Palo Alto:
  • discovery: GIS about roads in Palo Alto:
  • discovery – raw export sample:
    Geographic_Information,,,
    "Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
    "Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: Thickness: 2.0 6.0 (um, bokay…) Base Type Pvmt: Soil Class: 2 crusher run base Soil Value: 15 Base Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0
  • discovery

    (defn parse-gis [line]
      "leverages parse-csv for complex CSV format in GIS export"
      (first (csv/parse-csv line)))

    (defn etl-gis [gis trap]
      "subquery to parse data sets from the GIS source tap"
      (<- [?blurb ?misc ?geo ?kind]
          (gis ?line)
          (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
          (:trap (hfs-textline trap))))

    (specify what you require, not how to achieve it… addressing the 80%)
  • discovery (convert ad-hoc queries into logical propositions)
  • discovery
    Identifier: 474
    Tree ID: 412
    Tree: 412 site 1 at 115 HAWTHORNE AV
    Tree Site: 1
    Street_Name: HAWTHORNE AV
    Situs Number: 115
    Private: -1
    Species: Liquidambar styraciflua
    Source: davey tree
    Hardscape: None
    37.446001565119,-122.167713417554,0.0
    Point
    (obtain recognizable results)
  • discovery (curate valuable metadata)
  • discovery

    (defn get-trees [src trap tree_meta]
      "subquery to parse/filter the tree data"
      (<- [?blurb ?tree_id ?situs ?tree_site ?species ?wikipedia ?calflora
           ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash]
          (src ?blurb ?misc ?geo ?kind)
          (re-matches #"^\s+Private.*Tree ID.*" ?misc)
          (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
          ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
          (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
          (avg ?min_height ?max_height :> ?avg_height)
          (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
          (read-string ?tree_lat :> ?lat)
          (read-string ?tree_lng :> ?lng)
          (geohash ?lat ?lng :> ?geohash)
          (:trap (hfs-textline trap))))
  • discovery
    ?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
    ?tree_id     412
    ?situs       115
    ?tree_site   1
    ?species     liquidambar styraciflua
    ?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
    ?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
    ?avg_height  27.5
    ?tree_lat    37.446001565119
    ?tree_lng    -122.167713417554
    ?tree_alt    0.0
    ?geohash     9q9jh0
    (et voilà, a data product)
  • discovery

    # run some analysis and visualization in R
    library(ggplot2)
    dat_folder <- "~/src/concur/CoPA/out/tree"
    data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
                       sep="\t", quote="", na.strings="NULL",
                       header=FALSE, encoding="UTF8")

    summary(data)
    t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
    trees <- as.data.frame.table(t)
    colnames(trees) <- c("species", "count")

    m <- ggplot(data, aes(x=V8))
    m <- m + ggtitle("Estimated Tree Height (meters)")
    m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

    par(mar = c(7, 4, 4, 2) + 0.1)
    plot(trees, xaxt="n", xlab="")
    axis(1, labels=FALSE)
    text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
         labels=trees$species, xpd=TRUE)
    grid(nx=nrow(trees))
  • discovery: sweetgum
  • discovery (flow diagram, gis → tree: GIS export, Regex parse-gis, Scrub species, Regex parse-tree, Join with Tree Metadata, Estimate height, Geohash, tree sink, Failure Traps)
  • definitions: The conceptual flow diagram shows a directed, acyclic graph (DAG) of taps, tuple streams, functions, joins, aggregations, assertions, etc. Cascading is formally a pattern language – patterns of “plumbing” fit together to ensure best practices for large-scale parallel processing in risk-aversive environments – hard requirements of Enterprise IT. In other words, Cascading forces functional programming through an API for JVM-based languages such as Java, Scala, Clojure. Through this approach, we define Enterprise Data Workflows
  • definitions:
    pattern language: a structured method for solving large, complex design problems, where the syntax of the language promotes the use of best practices – amazon.com/dp/0195019199
    design patterns: originated in consensus negotiation for architecture, later used in OOP software engineering – amazon.com/dp/0201633612
  • discovery

    (defn get-roads [src trap road_meta]
      "subquery to parse/filter the road data"
      (<- [?blurb ?bike_lane ?bus_route ?truck_route ?albedo
           ?min_lat ?min_lng ?min_alt ?geohash
           ?traffic_count ?traffic_index ?traffic_class
           ?paving_length ?paving_width ?paving_area ?surface_type]
          (src ?blurb ?misc ?geo ?kind)
          (re-matches #"^\s+Sequence.*Traffic Count.*" ?misc)
          (parse-road ?misc :> _ ?traffic_count ?traffic_index ?traffic_class
                      ?paving_length ?paving_width ?paving_area ?surface_type
                      ?overlay_year ?bike_lane ?bus_route ?truck_route)
          (road_meta ?surface_type ?albedo_new ?albedo_worn)
          (estimate-albedo ?overlay_year ?albedo_new ?albedo_worn :> ?albedo)
          (bigram ?geo :> ?pt0 ?pt1)
          (midpoint ?pt0 ?pt1 :> ?lat ?lng ?alt)
          ;; why filter for min? because there are geo duplicates..
          (c/min ?lat :> ?min_lat)
          (c/min ?lng :> ?min_lng)
          (c/min ?alt :> ?min_alt)
          (geohash ?min_lat ?min_lng :> ?geohash)
          (:trap (hfs-textline trap))))
  • discovery
    ?blurb          Hawthorne Avenue from Alma Street to High Street
    ?traffic_count  3110
    ?traffic_class  local residential
    ?surface_type   asphalt concrete
    ?albedo         0.12
    ?min_lat        37.446140860599854
    ?min_lng        -122.1674652295435
    ?min_alt        0.0
    ?geohash        9q9jh0
    (another data product)
  • discovery: The road data provides:
    • traffic class (arterial, truck route, residential, etc.)
    • traffic counts distribution
    • surface type (asphalt, cement; age)
    This leads to estimators for noise, reflection, etc.
  • discovery (flow diagram, gis → road: GIS export, Regex parse-gis, Regex parse-road, Join with Road Metadata, Estimate albedo, Geohash, Road Segments sink, Failure Traps)
  • modeling (source: America’s Next Top Model)
  • modeling: GIS data from Palo Alto provides us with geolocation about each item in the export: latitude, longitude, altitude. Geo data is great for managing municipal infrastructure as well as for mobile apps. Predictive modeling in our Open Data example focuses on leveraging geolocation. We use spatial indexing by creating a grid of geohash values, for efficient parallel processing. Cascalog queries collect items with the same geohash values – using them as keys for large-scale joins (Hadoop)
  • modeling: a geohash with 6-digit resolution approximates a 5-block square, centered at lat: 37.445, lng: -122.162 → 9q9jh0
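    For illustration of how those geohash keys get computed: the CoPA build pulls in a geohash library (org.clojars.sunng/geohash, per the project.clj later in this deck), but the standard base32 bisection algorithm is small enough to sketch in plain Clojure. This is a sketch only, not that library’s API:

    (def ^:private base32 "0123456789bcdefghjkmnpqrstuvwxyz")

    (defn geohash-encode
      "Encode lat/lng as a geohash of `precision` characters.
       Illustrative sketch; the CoPA code uses a geohash library instead."
      [lat lng precision]
      (loop [bits [] lat-rng [-90.0 90.0] lng-rng [-180.0 180.0] even? true]
        (if (= (count bits) (* 5 precision))
          ;; pack each run of 5 bits into one base32 character
          (apply str (map #(nth base32 (reduce (fn [acc b] (+ (* 2 acc) b)) 0 %))
                          (partition 5 bits)))
          (let [[lo hi] (if even? lng-rng lat-rng)   ; even bit positions bisect longitude
                mid     (/ (+ lo hi) 2.0)
                v       (if even? lng lat)
                bit     (if (>= v mid) 1 0)
                rng     (if (= bit 1) [mid hi] [lo mid])]
            (recur (conj bits bit)
                   (if even? lat-rng rng)
                   (if even? rng lng-rng)
                   (not even?))))))

    ;; (geohash-encode 37.445 -122.162 6) ;=> "9q9jh0", the grid cell shown on this slide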
  • modeling: Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:
    -122.161776959558,37.4518836690781,0.0
    -122.161390381489,37.4516410983794,0.0
    -122.160786011735,37.4512589903357,0.0
    -122.160531178368,37.4510977281699,0.0
    NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
  • modeling: Our app analyzes each road segment as a data tuple, calculating the center point ( lat, lng, alt ) for each
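    The repo refers to bigram and midpoint helpers for this step. As a rough sketch – hypothetical helper names, a plain arithmetic mean rather than whatever the project actually does – the un-scrambling and center-point calculation might look like:

    (require '[clojure.string :as str])

    (defn parse-geo-point
      "Parse one raw 'lng,lat,alt' triple from the GIS geometry string,
       re-ordering it into [lat lng alt]."
      [s]
      (let [[lng lat alt] (map #(Double/parseDouble %) (str/split (str/trim s) #","))]
        [lat lng alt]))

    (defn segment-midpoint
      "Center point of a road segment given its two endpoints, as [lat lng alt].
       A simple arithmetic mean is adequate at city-block scale."
      [[lat0 lng0 alt0] [lat1 lng1 alt1]]
      [(/ (+ lat0 lat1) 2.0) (/ (+ lng0 lng1) 2.0) (/ (+ alt0 alt1) 2.0)])

    ;; e.g., the first two points of the segment above:
    ;; (segment-midpoint (parse-geo-point "-122.161776959558,37.4518836690781,0.0")
    ;;                   (parse-geo-point "-122.161390381489,37.4516410983794,0.0"))
    ;; ;=> [37.4517623837... -122.1615836705... 0.0]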
  • modeling: Then uses a geohash to define a grid cell, as a boundary (or “canopy”): 9q9jh0
  • modeling: Query to join a road segment tuple with all the trees within its geohash boundary: 9q9jh0
  • modeling: Use distance-to-midpoint to filter trees which are too far away to provide shade
  • modeling: Calculate a sum of moments for tree height × distance from road segment, as an estimator for shade: ∑( h·d ). We also calculate estimators for traffic frequency and noise
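    The shade subquery on the next slide calls a tree-distance helper. One plausible implementation is the haversine great-circle distance – an assumption, since the repo’s version (and its units) may differ; its own comment notes the 25.0 cutoff is “not meters”:

    (defn tree-distance
      "Great-circle (haversine) distance in meters between a tree and a
       road-segment midpoint. Sketch only; the project's helper may use
       different units or a flat-earth approximation at this scale."
      [tree-lat tree-lng road-lat road-lng]
      (let [r    6371000.0                        ; mean Earth radius, meters
            rad  #(Math/toRadians (double %))
            dlat (rad (- road-lat tree-lat))
            dlng (rad (- road-lng tree-lng))
            a    (+ (Math/pow (Math/sin (/ dlat 2)) 2)
                    (* (Math/cos (rad tree-lat)) (Math/cos (rad road-lat))
                       (Math/pow (Math/sin (/ dlng 2)) 2)))]
        (* r 2 (Math/asin (Math/sqrt a)))))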
  • modeling

    (defn get-shade [trees roads]
      "subquery to join tree and road estimates, maximize for shade"
      (<- [?road_name ?geohash ?road_lat ?road_lng ?road_alt
           ?road_metric ?tree_metric]
          (roads ?road_name _ _ _ ?albedo ?road_lat ?road_lng ?road_alt
                 ?geohash ?traffic_count _ ?traffic_class _ _ _ _)
          (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
          (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
          (read-string ?avg_height :> ?height)
          ;; limit to trees which are higher than people
          (> ?height 2.0)
          (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
          ;; limit to trees within a one-block radius (not meters)
          (<= ?distance 25.0)
          (/ ?height ?distance :> ?tree_moment)
          (c/sum ?tree_moment :> ?sum_tree_moment)
          ;; magic number 200000.0 used to scale tree moment based on median
          (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
  • modeling
    ?road_name    Hawthorne Avenue from Alma Street to High Street
    ?geohash      9q9jh0
    ?road_lat     37.446140860599854
    ?road_lng     -122.1674652295435
    ?road_alt     0.0
    ?road_metric  [1.0 0.5488121277250486 0.88]
    ?tree_metric  4.36321007861036
    (another data product)
  • modeling (flow diagram, shade: join road ⋈ tree, filter tree height, calculate distance, filter distance, sum moment, filter sum_moment, estimate traffic → shade)
  • modeling

    (defn get-gps [gps_logs trap]
      "subquery to aggregate and rank GPS tracks per user"
      (<- [?uuid ?geohash ?gps_count ?recent_visit]
          (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt
                    ?speed ?heading ?elapsed ?distance)
          (read-string ?gps_lat :> ?lat)
          (read-string ?gps_lng :> ?lng)
          (geohash ?lat ?lng :> ?geohash)
          (c/count :> ?gps_count)
          (date-num ?date :> ?visit)
          (c/max ?visit :> ?recent_visit)))

    (behavioral targeting: aggregate GPS tracks by recency, frequency)
  • modeling (flow diagram, gps: gps logs → Geohash → Count gps_count, Max recent_visit → gps)
  • modeling
    ?uuid                             ?geohash  ?gps_count  ?recent_visit
    cf660e041e994929b37cc5645209c8ae  9q8yym     7          1972376866448
    342ac6fd3f5f44c6b97724d618d587cf  9q9htz     4          1972376690969
    32cc09e69bc042f1ad22fc16ee275e21  9q9hv3     3          1972376670935
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv3     3          1972376691356
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv6     1          1972376691180
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv8    18          1972376691028
    342ac6fd3f5f44c6b97724d618d587cf  9q9hv9     7          1972376691101
    342ac6fd3f5f44c6b97724d618d587cf  9q9hvb    22          1972376691010
    342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
    342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
    482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
    b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348
    (GPS personalization)
  • modeling

    (defn get-reco [tracks shades]
      "subquery to recommend road segments based on GPS tracks"
      (<- [?uuid ?road ?geohash ?lat ?lng ?alt
           ?gps_count ?recent_visit ?road_metric ?tree_metric]
          (tracks ?uuid ?geohash ?gps_count ?recent_visit)
          (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))

    (finally, the recommender)
  • modeling: Recommenders combine multiple signals, generally via weighted averages, to rank personalized results:
    • GPS of person ∩ road segment
    • frequency and recency of visit
    • traffic class and rate
    • road albedo (sunlight reflection)
    • tree shade estimator
    Adjusting the mix allows for further personalization at the end use (see the sketch below)
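    As a concrete illustration of that weighted blend – the weights and normalization here are made-up placeholders, not values from the CoPA code – a per-segment ranking score over the fields that get-reco emits might look like:

    (defn reco-score
      "Blend per-segment signals into one ranking score via a weighted sum.
       Field names mirror get-reco's output; assumes ?road_metric has already
       been collapsed to a single number. Weights are illustrative only."
      [now {:keys [gps_count recent_visit road_metric tree_metric]}]
      (let [freq    (Math/log (inc (double gps_count)))        ; damp very frequent visitors
            recency (/ 1.0 (inc (max 0 (- now recent_visit))))] ; newer visits score higher
        (+ (* 0.3 freq)
           (* 0.2 recency)
           (* 0.2 road_metric)     ; low traffic, high albedo
           (* 0.3 tree_metric))))  ; estimated shade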
  • integration (source: Wolfram)
  • integration: Hadoop is rarely ever used in isolation. System integration is a hard problem in Big Data, especially social aspects: breaking down silos. Cascading was built for this purpose:
    • taps across many data frameworks: HBase, Cassandra, MongoDB, etc.
    • support for a variety of data serialization: Avro, Thrift, Kryo, JSON, etc.
    • planning on multiple topologies: MapReduce, in-memory, tuple spaces, etc.
    • test-driven development (TDD) at scale
    • ANSI SQL-92 integration, PMML, etc.
  • integration: This example focuses on the batch workflow, to examine best practices for parallel processing. Integrating with a mobile app requires next steps:
    • push “reco” output to a Redis cluster (caching layer) via a Cascading tap – see the sketch after this list
    • leverage Redis “sorted sets” for ranking personalized results
    • create lightweight API in Node.js + Nginx for low-latency access at scale
    • collect social interactions in Splunk
    • instrument via Nagios, New Relic, Flurry, etc.
    That provides a data service – it doesn’t even begin to address design, user experience, marketing, implementation, etc., for a complete app…
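    For the first two bullets, a minimal sketch of what the Redis side could look like – assuming the carmine Clojure client (not part of the CoPA project.clj; connection-spec and API details vary by carmine version):

    (require '[taoensso.carmine :as car :refer [wcar]])

    (def redis-conn {:pool {} :spec {:host "127.0.0.1" :port 6379}})

    (defn push-reco!
      "Write one recommendation into a per-user sorted set, scored so that
       a reverse range returns the best road segments first."
      [uuid road-name score]
      (wcar redis-conn
        (car/zadd (str "reco:" uuid) score road-name)))

    (defn top-recos
      "Fetch the top-k ranked road segments for a user, with scores."
      [uuid k]
      (wcar redis-conn
        (car/zrevrange (str "reco:" uuid) 0 (dec k) "withscores")))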
  • integration (diagram – batch workflow plus a data service: web logs, GIS export, and GPS tracks flow through source taps into the Cascading recommender app on a Hadoop cluster, joined with customer prefs and profile DBs; the sink tap feeds a Redis cluster behind the mobile app API; a trap tap feeds Splunk for customer support, alongside web review logs)
  • integration: In terms of deploying a batch workflow, there are several considerations:
    • build package for a “fat jar” (lein uberjar)
    • continuous integration
    • JAR repository
    • cluster scheduling (e.g., EMR)
    • instrumentation (Concurrent)
    • troubleshooting from app layer
  • apps (source: Apple)
  • apps: We work on discovery, modeling, integration – long before coding an app. In a linear-logical sense, one might prefer a “waterfall” approach; however, that would undermine core values – mitigating Accidental Complexity, TDD, scalability, fault-tolerance, etc. In lieu of SQL queries, we define a composable set of logical propositions which can be executed, instrumented, tested, etc., independently – best practices at scale, in parallel. Back to functional relational programming, particularly Datalog’s logic programming: we use subqueries as logical propositions… within a functional context… to leverage the relational model
    • scalability: specify what you require, not how
    • testability: disprove the opposites of propositions, to validate
    Taken together in the context of Cascalog, now let’s build the app…
  • apps

    (defproject cascading-copa "0.1.0-SNAPSHOT"
      :description "City of Palo Alto Open Data recommender in Cascalog"
      :url "https://github.com/Cascading/CoPA"
      :license {:name "Apache License, Version 2.0"
                :url "http://www.apache.org/licenses/LICENSE-2.0"
                :distribution :repo}
      :uberjar-name "copa.jar"
      :aot [copa.core]
      :main copa.core
      :source-paths ["src/main/clj"]
      :dependencies [[org.clojure/clojure "1.4.0"]
                     [cascalog "1.10.0"]
                     [cascalog-more-taps "0.3.1-SNAPSHOT"]
                     [clojure-csv/clojure-csv "1.3.2"]
                     [org.clojars.sunng/geohash "1.0.1"]
                     [org.clojure/clojure-contrib "1.2.0"]
                     [date-clj "1.0.1"]]
      :profiles {:dev {:dependencies [[midje-cascalog "0.4.0"]]}
                 :provided {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}})
  • apps (results)
    ‣ addr: 115 HAWTHORNE AVE
    ‣ lat/lng: 37.446, -122.168
    ‣ geohash: 9q9jh0
    ‣ tree: 413 site 2
    ‣ species: Liquidambar styraciflua
    ‣ est. height: 23 m
    ‣ shade metric: 4.363
    ‣ traffic: local residential, light traffic
    ‣ recent visit: 1972376952532
    ‣ a short walk from my train stop ✔
  • apps (flow diagram for the whole enchilada: gis → tree, gis → road, shade, gps, and the final reco join)
  • definitions: Design principles in the Cascading API pattern language, which help ensure best practices for Big Data apps in an Enterprise context:
    • specify what is required, not how it must be achieved
    • provide the “glue” for system integration
    • same JAR, any scale
    • users want no surprises
    • fail the same way twice
    • plan far ahead
    These points echo arguments about functional relational programming (FRP) and Accidental Complexity from Moseley/Marks 2006
  • systems (source: Wired)
  • principle: same JAR, any scale
    • MegaCorp Enterprise IT: Pb’s data, 1000+ node private cluster, EVP calls you when app fails, runtime: days+
    • Production Cluster: Tb’s data, EMR w/ many HPC Instances, Ops monitors results, runtime: hours – days
    • Staging Cluster: Gb’s data, EMR + a few Spot Instances, CI shows red or green lights, runtime: minutes – hours
    • Your Laptop: Mb’s data, Hadoop standalone mode, passes unit tests, or not, runtime: seconds – minutes
  • systems

    #!/bin/bash -ex
    # edit the `BUCKET` variable to use one of your S3 buckets:
    BUCKET=temp.cascading.org/copa
    SINK=out

    # clear previous output (required by Apache Hadoop)
    s3cmd del -r s3://$BUCKET/$SINK

    # load built JAR + input data
    s3cmd put target/copa.jar s3://$BUCKET/
    s3cmd put -r data s3://$BUCKET/

    # launch cluster and run
    elastic-mapreduce --create --name "CoPA" \
      --debug --enable-debugging --log-uri s3n://$BUCKET/logs \
      --jar s3n://$BUCKET/copa.jar \
      --arg s3n://$BUCKET/data/copa.csv \
      --arg s3n://$BUCKET/data/meta_tree.tsv \
      --arg s3n://$BUCKET/data/meta_road.tsv \
      --arg s3n://$BUCKET/data/gps.csv \
      --arg s3n://$BUCKET/$SINK/trap \
      --arg s3n://$BUCKET/$SINK/park \
      --arg s3n://$BUCKET/$SINK/tree \
      --arg s3n://$BUCKET/$SINK/road \
      --arg s3n://$BUCKET/$SINK/shade \
      --arg s3n://$BUCKET/$SINK/gps \
      --arg s3n://$BUCKET/$SINK/reco
  • systems (under the hood):
    ‣ name node / data node
    ‣ job tracker / task tracker
    ‣ submit queue
    ‣ task slots
    ‣ HDFS
    ‣ distributed cache
    sources: Wikipedia, Apache
  • bucket list
  • Could combine this with a variety of data APIs:
    • Trulia – neighborhood data, housing prices
    • Factual – local business (FB Places, etc.)
    • CommonCrawl – open source full web crawl
    • Wunderground – local weather data
    • WalkScore – neighborhood data, walkability
    • Data.gov – US federal open data
    • Data.NASA.gov – NASA open data
    • DBpedia – datasets derived from Wikipedia
    • GeoWordNet – semantic knowledge base
    • Geolytics – demographics, GIS, etc.
    • Foursquare, Yelp, CityGrid, Localeze, YP
    • various photo sharing
  • Data Quality: some species names have spelling errors or misclassifications – these could be cleaned up and provided back to CoPA to improve municipal services. Assumptions have been made about missing data – were these appropriate for the intended use case? There are better ways to handle spatial indexing: k-d trees, etc. The tree data product needs: photos, toxicity, natives vs. invasives, common names, etc.
  • Arguably, this is not a “large” data set:
    • Palo Alto has 65K population
    • great location for a POC
    • prior to deploying in large metro areas
    • CoPA is a leader in e-gov
    • app is simpler to study on a laptop
    Could extend to other cities with Open Data initiatives: SF, SJ, PDX, Seattle, VanBC… Let’s get coverage for all of Ecotopia!
  • Trulia: optimize sales leads using estimated allergy zones, based on buyers’ real estate preferences.
    Calflora: report new observations of invasives, endangered species, etc.; infer regions of affinity for releasing beneficial insects.
    City of Palo Alto: assess zoning impact, e.g., oleanders near day care centers; monitor outbreaks of tree diseases (big impact on property values).
    start-ups: some invasive species are valuable in Chinese medicine, while others can be converted to biodiesel – a potential win-win for targeted harvest services
  • summary points:
    • geo data is great for municipal infrastructure and for mobile apps
    • Cascading as a pattern language for Enterprise Data Workflows
    • design principles in the API/pattern language ensure best practices
    • focus on the process of structuring data; not un/structured
    • Cascalog subqueries as composable logical propositions
    • FRP mitigates the engineering costs of Accidental Complexity
    • Data Science process: discovery, modeling, integration, apps, systems
    • Hadoop is rarely ever used in isolation; breaking down silos is the hard problem, which must be socialized to resolve
  • references:
    leiningen.org
    github.com/nathanmarz/cascalog/wiki
    sritchie.github.com
    vimeo.com/16398892
    manning.com/marz
    java.dzone.com/articles/using-lucene-and-cascalog-fast
  • references by Paco Nathan:
    Enterprise Data Workflows with Cascading, O’Reilly, 2013 – amazon.com/dp/1449358721
    Santa Clara, Feb 28, 1:30pm – strataconf.com/strata2013
  • drill-down – blog, code/wiki/gists, maven repo, community, products:
    cascading.org
    github.org/Cascading
    conjars.org
    meetup.com/cascading
    goo.gl/KQtUL
    concurrentinc.com
    we are hiring!
    Copyright @2013, Concurrent, Inc.