July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"

Paco Nathan
liber118.com/pxn/
“Using Cascalog with
Palo Alto Open Data”
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
LA Clojure User Group
Friday, 19 July 2013

Cascading / Cascalog / Scalding
Enterprise Data Workflows with Cascading
Cluster Computing with Mesos
Using Cascalog with Palo Alto Open Data

Cascading – origins
API author Chris Wensel worked as a system architect
at an Enterprise firm well-known for many popular
data products.
Wensel was following the Nutch open source project –
where Hadoop started.
Observation: would be difficult to find Java developers
to write complex Enterprise apps in MapReduce –
potential blocker for leveraging new open source
technology.

Cascading – functional programming
Key insight: MapReduce is based on functional programming
– back to LISP in 1970s. Apache Hadoop use cases are
mostly about data pipelines, which are functional in nature.
To ease staffing problems as “Main Street” Enterprise firms
began to embrace Hadoop, Cascading was introduced
in late 2007, as a new Java API to implement functional
programming for large-scale data workflows:
• leverages JVM and Java-based tools without any
need to create new languages
• allows programmers who have J2EE expertise
to leverage the economics of Hadoop clusters

Cascading – functional programming
• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc.,
have invested in open source projects atop Cascading
– used for their large-scale production deployments
• new case studies for Cascading apps are mostly
based on domain-specific languages (DSLs) in JVM
languages which emphasize functional programming:
Cascalog in Clojure (2010)
Scalding in Scala (2012)
github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki
Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

Cascading – integrations
[Diagram: a data workflow on a Hadoop cluster, with source taps (customer profile DBs, customer prefs, logs), sink taps (cache, support, web app, reporting, analytics cubes, customers), a trap tap, and modeling via PMML]
• partners: Microsoft Azure, Hortonworks,
Amazon AWS, MapR, EMC, SpringSource,
Cloudera
• taps: Memcached, Cassandra, MongoDB,
HBase, JDBC, Parquet, etc.
• serialization: Avro, Thrift, Kryo,
JSON, etc.
• topologies: Apache Hadoop,
tuple spaces, local mode

Cascading – deployments
• case studies: Climate Corp, Twitter, Etsy,
Williams-Sonoma, uSwitch, Airbnb, Nokia,
YieldBot, Square, Harvard, Factual, etc.
• use cases: ETL, marketing funnel, anti-fraud,
social media, retail pricing, search analytics,
recommenders, eCRM, utility grids, telecom,
genomics, climatology, agronomics, etc.
workflow abstraction addresses:
• staffing bottleneck;
• system integration;
• operational complexity;
• test-driven development

WordCount – conceptual flow diagram
[Flow diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; 1 map (M), 1 reduce (R)]
18 lines of code: gist.github.com/3900702
cascading.org/category/impatient
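
The conceptual flow above (tokenize each line, group by token, count each group) can be sketched in plain Java, with no Cascading or Hadoop involved. This is only an illustration of the logic: the class name WordCountSketch and the sample lines are assumptions, and unlike a raw regex split the sketch drops empty tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: the WordCount flow as in-memory steps.
class WordCountSketch {
  // "Tokenize": split on spaces, brackets, parentheses, commas, periods
  static List<String> tokenize(String line) {
    List<String> tokens = new ArrayList<>();
    for (String t : line.split("[ \\[\\]\\(\\),.]")) {
      if (!t.isEmpty()) { // assumption: empty tokens are not counted
        tokens.add(t);
      }
    }
    return tokens;
  }

  // "GroupBy token" then "Count": tally each distinct token
  static Map<String, Integer> count(List<String> lines) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      for (String token : tokenize(line)) {
        counts.merge(token, 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("rain shadow (dry area)", "rain, more rain.");
    System.out.println(count(lines));
  }
}
```

On a cluster the tokenizing would run in the map phase and the counting in the reduce phase; here both are ordinary loops.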

WordCount – Cascading app in Java
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

WordCount – generated flow diagram
[Generated flow diagram (MapReduce), head to tail:
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt'] → Each('token')[RegexSplitGenerator[decl:'token'][args:1]] → GroupBy('wc')[by:['token']] → Every('wc')[Count[decl:'count']] → Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']
Tuple fields along the edges: [{2}:'doc_id', 'text'], then [{1}:'token'], then [{2}:'token', 'count']]

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))
(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))
(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))
; Paul Lam
; github.com/Quantisan/Impatient
WordCount – Cascalog / Clojure

github.com/nathanmarz/cascalog/wiki
• implements Datalog in Clojure, with predicates backed
by Cascading – for a highly declarative language
• run ad-hoc queries from the Clojure REPL –
approx. 10:1 code reduction compared with SQL
• composable subqueries, used for test-driven development
(TDD) practices at scale
• Leiningen build: simple, no surprises, in Clojure itself
• more new deployments than other Cascading DSLs –
Climate Corp is largest use case: 90% Clojure/Cascalog
• has a learning curve, limited number of Clojure developers
• aggregators are the magic, and those take effort to learn
WordCount – Cascalog / Clojure

import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
Tsv(args("doc"),
('doc_id, 'text),
skipHeader = true)
.read
.flatMap('text -> 'token) {
text : String => text.split("[ \\[\\]\\(\\),.]")
}
.groupBy('token) { _.size('count) }
.write(Tsv(args("wc"), writeHeader = true))
}
WordCount – Scalding / Scala

github.com/twitter/scalding/wiki
• extends the Scala collections API so that distributed lists
become “pipes” backed by Cascading
• code is compact, easy to understand
• nearly 1:1 between elements of conceptual flow diagram
and function calls
• extensive libraries are available for linear algebra, abstract
algebra, machine learning – e.g., Matrix API, Algebird, etc.
• significant investments by Twitter, Etsy, eBay, etc.
• great for data services at scale
• gentler learning curve than Cascalog
WordCount – Scalding / Scala

Workflow Abstraction – pattern language
Cascading uses a “plumbing” metaphor in the Java API,
to define workflows out of familiar elements: Pipes, Taps,
Tuple Flows, Filters, Joins, Traps, etc.
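[Flow diagram: WordCount extended with a stop word filter; elements: Document Collection, Tokenize, Scrub token, Regex token, HashJoin Left (RHS: Stop Word List), GroupBy token, Count, Word Count; map (M) and reduce (R) phases]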
Data is represented as flows of tuples. Operations within
the flows bring functional programming aspects into Java
A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199

Workflow Abstraction – literate programming
Cascading workflows generate their own visual
documentation: flow diagrams
in formal terms, flow diagrams leverage a methodology
called literate programming
provides intuitive, visual representations for apps –
great for cross-team collaboration
Literate Programming
Don Knuth
literateprogramming.com

Workflow Abstraction – business process
following the essence of literate programming, Cascading
workflows provide statements of business process
this recalls a sense of business process management
for Enterprise apps (think BPM/BPEL for Big Data)
Cascading creates a separation of concerns between
business process and implementation details (Hadoop, etc.)
this is especially apparent in large-scale Cascalog apps:
“Specify what you require, not how to achieve it.”
by virtue of the pattern language, the flow planner then
determines how to translate business process into efficient,
parallel jobs at scale

Anatomy of an Enterprise app
Definition: a typical Enterprise workflow which crosses through
multiple departments, languages, and technologies…
[Diagram: data sources → ETL → data prep → predictive model → end uses]

ANSI SQL for ETL

J2EE for business logic

SAS for predictive models

SAS for predictive models
ANSI SQL for ETL
most of the licensing costs…

J2EE for business logic
most of the project costs…

Lingual:
DW → ANSI SQL
Pattern:
SAS, R, etc. → PMML
business logic in Java,
Clojure, Scala, etc.
sink taps for
Memcached, HBase,
MongoDB, etc.
source taps for
Cassandra, JDBC,
Splunk, etc.
Cascading allows multiple departments to combine their workflow components
into an integrated app – one among many, typically – based on 100% open source
a compiler sees it all…
cascading.org

Lingual: DW → ANSI SQL
FlowDef flowDef = FlowDef.flowDef()
.setName( "etl" )
.addSource( "example.employee", emplTap )
.addSource( "example.sales", salesTap )
.addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
.setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );
cascading.org

Pattern: SAS, R, etc. → PMML
FlowDef flowDef = FlowDef.flowDef()
.setName( "classifier" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlModel ) )
.retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );

cascading.org
visual collaboration for the business logic is a great
way to improve how teams work together
[Flow diagram: employee and quarterly sales sources, Join, PMML classifier, Count leads, bonus allocation, Failure Traps]

Lingual – CSV data in local file system
cascading.org/lingual

Lingual – shell prompt, catalog

Lingual – queries

# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver",
"~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv,
"jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection,
"SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)
library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()
Lingual – connecting Hadoop and R

> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92
Lingual – connecting Hadoop and R

Pattern – model scoring
• migrate workloads: SAS, Teradata, etc.,
exporting predictive models as PMML
• great open source tools – R, Weka,
KNIME, Matlab, RapidMiner, etc.
• integrate with other libraries –
Matrix API, etc.
• leverage PMML as another kind
of DSL
cascading.org/pattern

• established XML standard for predictive model markup
• organized by Data Mining Group (DMG), since 1997
http://dmg.org/
• members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy,
Microsoft, etc.
• PMML concepts for metadata, ensembles, etc., translate
directly into Cascading tuple flows
“PMML is the leading standard for statistical and data mining models and
supported by over 20 vendors and organizations. With PMML, it is easy
to develop a model on one system using one application and deploy the
model on another system using another application.”
PMML – standard
wikipedia.org/wiki/Predictive_Model_Markup_Language

PMML – vendor coverage

• Association Rules: AssociationModel element
• Cluster Models: ClusteringModel element
• Decision Trees: TreeModel element
• Naïve Bayes Classifiers: NaiveBayesModel element
• Neural Networks: NeuralNetwork element
• Regression: RegressionModel and GeneralRegressionModel elements
• Rulesets: RuleSetModel element
• Sequences: SequenceModel element
• Support Vector Machines: SupportVectorMachineModel element
• Text Models: TextModel element
• Time Series: TimeSeriesModel element
PMML – model coverage
ibm.com/developerworks/industry/library/ind-PMML2/

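## load the packages used below (XML provides saveXML)

library(randomForest)
library(pmml)
library(XML)

## train a RandomForest model

f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)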

## test the model on the holdout test set

print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV

write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"),
quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML

saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))
Pattern – create a model in R

<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.dmg.org/PMML-4_0
http://www.dmg.org/v4-0/pmml-4-0.xsd">
<Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
</DataDictionary>
<MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
     </MiningSchema>
...
Pattern – capture model parameters as PMML
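
The PMML excerpt above combines its tree segments with multipleModelMethod="majorityVote": each tree predicts a label and the most frequent label wins. A minimal Java sketch of that vote follows. It is an illustration only, not Pattern's implementation; the class name MajorityVoteSketch is hypothetical, and breaking ties in favor of the label that reaches the top count first is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of majority voting across an ensemble of trees.
class MajorityVoteSketch {
  // votes: one predicted label per tree segment
  static String majorityVote(List<String> votes) {
    Map<String, Integer> tally = new HashMap<>();
    String winner = null;
    int best = 0;
    for (String label : votes) {
      int n = tally.merge(label, 1, Integer::sum);
      if (n > best) { // assumption: the first label to reach the top count wins ties
        best = n;
        winner = label;
      }
    }
    return winner; // null when there are no votes
  }

  public static void main(String[] args) {
    // three trees voting on the "label" field, whose values are "0" and "1"
    System.out.println(majorityVote(Arrays.asList("1", "0", "1")));
  }
}
```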

public static void main( String[] args ) throws RuntimeException {
String inputPath = args[ 0 ];
String classifyPath = args[ 1 ];
// set up the config properties
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );
// create source and sink taps
Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );
// handle command line options
OptionParser optParser = new OptionParser();
optParser.accepts( "pmml" ).withRequiredArg();
OptionSet options = optParser.parse( args );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
.addSource( "input", inputTap )
.addSink( "classify", classifyTap );

if( options.hasArgument( "pmml" ) ) {
String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
PMMLPlanner pmmlPlanner = new PMMLPlanner()
.setPMMLInput( new File( pmmlPath ) )
.retainOnlyActiveIncomingFields()
.setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
flowDef.addAssemblyPlanner( pmmlPlanner );
}

// write a DOT file and run the flow
Flow classifyFlow = flowConnector.connect( flowDef );
classifyFlow.writeDOT( "dot/classify.dot" );
classifyFlow.complete();
}
Pattern – score a model, within an app

Pattern – score a model, using pre-defined Cascading app
[Flow diagram: Customer Orders, Classify (PMML Model), Scored Orders, Assert, GroupBy token, Count, Confusion Matrix, Failure Traps; map (M) and reduce (R) phases]

Roadmap – existing algorithms for scoring
• Random Forest
• Decision Trees
• Linear Regression
• GLM
• Logistic Regression
• K-Means Clustering
• Hierarchical Clustering
• Multinomial
• Support Vector Machines (prepared for release)
also, model chaining and general support for ensembles

Roadmap – next priorities for scoring
• Time Series (ARIMA forecast)
• Association Rules (basket analysis)
• Naïve Bayes
• Neural Networks
algorithms extended based on customer use cases –
contact groups.google.com/forum/?fromgroups#!forum/pattern-user

Q3 1997: inflection point
Four independent teams were working toward horizontal
scale-out of workﬂows based on commodity hardware
This effort prepared the way for huge Internet successes
in the 1997 holiday season… AMZN, EBAY, Inktomi
(YHOO Search), then GOOG
MapReduce and the Apache Hadoop open source stack
emerged from this

Circa 1996: pre-inflection point
[Diagram: Stakeholder, Product, Engineering, BI Analysts, and Customers around a Web App backed by an RDBMS; labels: strategy, requirements, optimized code, transactions, SQL Query, result sets, Excel pivot tables, PowerPoint slide decks]

Circa 1996: pre-inflection point
“throw it over the wall”

Circa 2001: after the big ecommerce successes
[Diagram: Stakeholder, Product, Engineering, UX, and Customers; Web Apps with Middleware servlets and models; customer transactions in an RDBMS; Logs and event history feeding DW ETL, aggregation, dashboards (SQL Query, result sets), and Algorithmic Modeling for recommenders + classifiers]

Circa 2001: after the big ecommerce successes
“data products”

Circa 2013: clusters everywhere
[Diagram: Use Cases Across Topologies. Web Apps, Mobile, etc. produce transactions, content, social interactions, and Log Events; a Workflow spans RDBMS, DW, In-Memory Data Grid, Hadoop, etc., under a Cluster Scheduler and Planner, with batch and near-time services feeding Data Products to Customers. Roles: Domain Expert (business process), Data Scientist (discovery + modeling), App Dev (s/w dev, taps), Ops (dashboard metrics, optimized capacity), Prod, Eng. The existing SDLC is contrasted with introduced capability]

Circa 2013: clusters everywhere
“optimize topologies”

Operating Systems, redux
meanwhile, GOOG is 3+ generations ahead,
with much improved ROI on data centers
John Wilkes, et al.
Borg/Omega: data center “secret sauce”
youtu.be/0ZFMlO98Jkc
[Charts: CPU load (0% to 100%) over time (t), shown separately for Rails, Memcached, and Hadoop, and as a combined CPU load (Rails, Memcached, Hadoop)]
Florian Leibert, Chronos/Mesos @ Airbnb
Mesos, open source cloud OS – like Borg
goo.gl/jPtTP

Mesos
mesos.apache.org
Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon
Cade Metz
wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/

Mesos
a common substrate for cluster computing
• scale to 10,000s of nodes using fast, event-driven C++ impl
• improve utilization across workloads
• run long-lived services (e.g., Hypertable and HBase) on the
same nodes as batch app and share resources
• build new cluster computing frameworks without reinventing low-level
facilities, and have them coexist with existing work
• run multiple instances/versions of Hadoop on the same cluster to isolate
production and experimental jobs
• reshape cluster resources based on ML from app history
• reduce latency in transferring data products from one cluster to another
• enable new kinds of apps, which combine frameworks with lower latency

Palo Alto is quite a pleasant place
• temperate weather
• lots of parks, enormous trees
• great coffeehouses
• walkable downtown
• not particularly crowded
On a nice summer day, who wants to be stuck
indoors on a phone call?
Instead, take it outside – go for a walk
An example open source project:
github.com/Cascading/CoPA/wiki

1. Open Data about municipal infrastructure
(GIS data: trees, roads, parks)
✚
2. Big Data about where people like to walk
(smartphone GPS logs)
✚
3. some curated metadata
(which surfaces the value)
4. personalized recommendations:
“Find a shady spot on a summer day in which to walk
near downtown Palo Alto. While on a long conference call.
Sipping a latte or enjoying some fro-yo.”

The City of Palo Alto recently began to support Open Data
to give the local community greater visibility into how
their city government operates
This effort is intended to encourage students, entrepreneurs,
local organizations, etc., to build new apps which contribute
to the public good
paloalto.opendata.junar.com/dashboards/7576/geographic-information/
discovery

GIS about trees in Palo Alto:
discovery

Geographic_Information,,,
"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29
Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis
Source: davey tree Protected: Designated: Heritage: Appraised Value:
Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872
Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"
"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie
Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID:
598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year
Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic
Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width:
40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base
Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15
Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District
Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base
Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface
Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity:
none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse
Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none
Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0
Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0
Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent
Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols
Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0
-122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0
-122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0
-122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"
discovery
(unstructured data…)

(defn parse-gis [line]
"leverages parse-csv for complex CSV format in GIS export"
(first (csv/parse-csv line))
)

(defn etl-gis [gis trap]
"subquery to parse data sets from the GIS source tap"
(<- [?blurb ?misc ?geo ?kind]
(gis ?line)
(parse-gis ?line :> ?blurb ?misc ?geo ?kind)
(:trap (hfs-textline trap))
))
discovery
(specify what you require,
not how to achieve it…
80/20 rule of data prep cost)
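
The parse-gis step above relies on a CSV parser because the GIS export quotes its fields, and the quoted fields themselves contain commas. A rough Java sketch of that kind of quote-aware splitting follows; it is a hand-rolled stand-in for illustration, not the parser the talk uses, and the class name GisCsvSketch and the shortened sample line are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one CSV line, honoring double-quoted fields.
class GisCsvSketch {
  static List<String> parseLine(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder field = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
            field.append('"'); // a doubled quote is a literal quote
            i++;
          } else {
            inQuotes = false;  // closing quote
          }
        } else {
          field.append(c);     // commas inside quotes do not split
        }
      } else if (c == '"') {
        inQuotes = true;
      } else if (c == ',') {
        fields.add(field.toString());
        field.setLength(0);
      } else {
        field.append(c);
      }
    }
    fields.add(field.toString());
    return fields;
  }

  public static void main(String[] args) {
    // shortened sample in the shape of the export: blurb, misc, geo, kind
    String line = "\"Tree: 29, on ADDISON AV\",\" Private: -1 Tree ID: 29\","
        + "\"37.44,-122.15,0.0 \",\"Point\"";
    System.out.println(parseLine(line));
  }
}
```

The four fields returned correspond to ?blurb, ?misc, ?geo and ?kind in the subquery.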

discovery
(ad-hoc queries get refined
into composable predicates)
Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point

discovery
(curate valuable metadata)

(defn get-trees [src trap tree_meta]
"subquery to parse/filter the tree data"
(<- [?blurb ?tree_id ?situs ?tree_site
?species ?wikipedia ?calflora ?avg_height
?tree_lat ?tree_lng ?tree_alt ?geohash
]
(src ?blurb ?misc ?geo ?kind)
(re-matches #"^\s+Private.*Tree ID.*" ?misc)
(parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
((c/comp s/trim s/lower-case) ?raw_species :> ?species)
(tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
(avg ?min_height ?max_height :> ?avg_height)
(geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
(read-string ?tree_lat :> ?lat)
(read-string ?tree_lng :> ?lng)
(geohash ?lat ?lng :> ?geohash)
(:trap (hfs-textline trap))
))
discovery
?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22
?tree_id     412
?situs       115
?tree_site   1
?species     liquidambar styraciflua
?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calre
?avg_height  27.5
?tree_lat    37.446001565119
?tree_lng    -122.167713417554
?tree_alt    0.0
?geohash     9q9jh0

# run analysis and visualization in R
library(ggplot2)
dat_folder <- '~/src/concur/CoPA/out/tree'
data <- read.table(file=paste(dat_folder, "part-00000", sep="/"),
sep="\t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")

summary(data)
t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
trees <- as.data.frame.table(t)
colnames(trees) <- c("species", "count")

m <- ggplot(data, aes(x=V8))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

par(mar = c(7, 4, 4, 2) + 0.1)
plot(trees, xaxt="n", xlab="")
axis(1, labels=FALSE)
text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1,
labels=trees$species, xpd=TRUE)
grid(nx=nrow(trees))
discovery

discovery
sweetgum
analysis of the tree data:

[Flow diagram: tree GIS export, Regex parse-gis, src, Regex parse-tree, Scrub species, Join with Tree Metadata, Estimate height, Geohash, tree, Failure Traps; map phases (M) only]
discovery
(flow diagram, gis tree)

9q9jh0
geohash with 6-digit resolution
approximates a 5-block square
centered lat: 37.445, lng: -122.162
modeling
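
A geohash is built by repeatedly halving the longitude and latitude ranges, emitting one bit per halving (longitude first), and writing every five bits as one base-32 character. The Java sketch below illustrates that standard encoding; it is not the geohash function used in the talk's code, and the class name GeohashSketch is hypothetical.

```java
// Hypothetical sketch of standard geohash encoding.
class GeohashSketch {
  private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

  static String encode(double lat, double lng, int precision) {
    double latLo = -90.0, latHi = 90.0;
    double lngLo = -180.0, lngHi = 180.0;
    StringBuilder hash = new StringBuilder();
    boolean useLng = true; // bits alternate, starting with longitude
    int bits = 0;
    int value = 0;
    while (hash.length() < precision) {
      if (useLng) {
        double mid = (lngLo + lngHi) / 2;
        if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
        else            { value = value << 1;       lngHi = mid; }
      } else {
        double mid = (latLo + latHi) / 2;
        if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
        else            { value = value << 1;       latHi = mid; }
      }
      useLng = !useLng;
      if (++bits == 5) { // five bits make one base-32 character
        hash.append(BASE32.charAt(value));
        bits = 0;
        value = 0;
      }
    }
    return hash.toString();
  }

  public static void main(String[] args) {
    // the sweetgum tree at 115 HAWTHORNE AV from the earlier slide
    System.out.println(encode(37.446001565119, -122.167713417554, 6)); // prints 9q9jh0
  }
}
```

Six characters carry 30 bits, 15 for each axis, which is what bounds the cell size quoted above.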

Each road in the GIS export is listed as a block between two
cross roads, and each may have multiple road segments to
represent turns:
-122.161776959558,37.4518836690781,0.0
-122.161390381489,37.4516410983794,0.0
-122.160786011735,37.4512589903357,0.0
-122.160531178368,37.4510977281699,0.0
modeling
( lat0, lng0, alt0 )
NB: segments in the raw GIS have the order of geo coordinates
scrambled: (lng, lat, alt)
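
Because the road segments arrive as (lng, lat, alt), each triple has to be reordered before it can be compared with the tree coordinates. A small Java sketch of that step, assuming a comma-separated triple as shown above; the class and method names are hypothetical.

```java
// Hypothetical sketch: reorder a raw "lng,lat,alt" triple as {lat, lng, alt}.
class GeoPointSketch {
  static double[] parseLngLatAlt(String triple) {
    String[] parts = triple.trim().split(",");
    double lng = Double.parseDouble(parts[0]);
    double lat = Double.parseDouble(parts[1]);
    double alt = Double.parseDouble(parts[2]);
    return new double[] { lat, lng, alt };
  }

  public static void main(String[] args) {
    double[] p = parseLngLatAlt(" -122.161776959558,37.4518836690781,0.0");
    System.out.println("lat=" + p[0] + " lng=" + p[1] + " alt=" + p[2]);
  }
}
```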

[Map: geohash cell 9q9jh0, with points marked X]
Filter trees which are too far away to provide shade. Calculate a sum
of moments for tree height ÷ distance, as an estimator for shade:
modeling

(defn get-shade [trees roads]
"subquery to join tree and road estimates, maximize for shade"
(<- [?road_name ?geohash ?road_lat ?road_lng
?road_alt ?road_metric ?tree_metric]
(roads ?road_name _ _ _
?albedo ?road_lat ?road_lng ?road_alt ?geohash
?traffic_count _ ?traffic_class _ _ _ _)
(road-metric
?traffic_class ?traffic_count ?albedo :> ?road_metric)
(trees _ _ _ _ _ _ _
?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
(read-string ?avg_height :> ?height)
;; limit to trees which are higher than people
(> ?height 2.0)
(tree-distance
?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
;; limit to trees within a one-block radius (not meters)
(<= ?distance 25.0)
(/ ?height ?distance :> ?tree_moment)
(c/sum ?tree_moment :> ?sum_tree_moment)
;; magic number 200000.0 used to scale tree moment
;; based on median
(/ ?sum_tree_moment 200000.0 :> ?tree_metric)
))
modeling
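
The arithmetic inside get-shade can be followed in isolation: keep trees taller than 2.0 and within a distance of 25.0, sum height / distance over them, then divide by 200000.0. The Java sketch below mirrors only that arithmetic for one road segment. The class name ShadeSketch and the sample trees are hypothetical, and distances are assumed to be greater than zero.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the tree metric arithmetic for one road segment.
class ShadeSketch {
  static final double MIN_HEIGHT = 2.0;    // from the subquery: taller than people
  static final double MAX_DISTANCE = 25.0; // from the subquery: one-block radius
  static final double SCALE = 200000.0;    // from the subquery: scaling constant

  // each tree is {height, distance to the road segment}; distance assumed > 0
  static double treeMetric(List<double[]> trees) {
    double sum = 0.0;
    for (double[] tree : trees) {
      double height = tree[0];
      double distance = tree[1];
      if (height > MIN_HEIGHT && distance <= MAX_DISTANCE) {
        sum += height / distance; // the per-tree "moment"
      }
    }
    return sum / SCALE;
  }

  public static void main(String[] args) {
    List<double[]> trees = Arrays.asList(
        new double[] { 10.0, 5.0 },   // kept
        new double[] { 20.0, 10.0 },  // kept
        new double[] { 1.5, 5.0 },    // too short, filtered out
        new double[] { 30.0, 40.0 }); // too far, filtered out
    System.out.println(treeMetric(trees));
  }
}
```

In the Cascalog subquery the same sum is computed per road segment and geohash, through the join on ?geohash and the c/sum aggregator.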

[Flow diagram: tree, road, Estimate traffic, Join, Calculate distance, Filter distance, Filter height, Sum moment, Filter sum_moment, shade; map (M) and reduce (R) phases]
(flow diagram, shade)
modeling

(defn get-gps [gps_logs trap]
"subquery to aggregate and rank GPS tracks per user"
(<- [?uuid ?geohash ?gps_count ?recent_visit]
(gps_logs
?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading
?elapsed ?distance)
(read-string ?gps_lat :> ?lat)
(read-string ?gps_lng :> ?lng)
(geohash ?lat ?lng :> ?geohash)
(c/count :> ?gps_count)
(date-num ?date :> ?visit)
(c/max ?visit :> ?recent_visit)
))
modeling
?uuid ?geohash ?gps_count ?recent_visit
cf660e041e994929b37cc5645209c8ae 9q8yym 7 1972376866448
342ac6fd3f5f44c6b97724d618d587cf 9q9htz 4 1972376690969
32cc09e69bc042f1ad22fc16ee275e21 9q9hv3 3 1972376670935
342ac6fd3f5f44c6b97724d618d587cf 9q9hv3 3 1972376691356
342ac6fd3f5f44c6b97724d618d587cf 9q9hwn 13 1972376690782
342ac6fd3f5f44c6b97724d618d587cf 9q9hwp 58 1972376690965
482dc171ef0342b79134d77de0f31c4f 9q9jh0 15 1972376952532
b1b4d653f5d9468a8dd18a77edcc5143 9q9jh0 18 1972376945348
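
The aggregation in get-gps groups log entries by user and geohash, then takes a count and the most recent visit per group. A plain Java sketch of that grouping follows, with hypothetical names and made-up sample entries; it illustrates the logic only and is not how Cascalog executes it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count and most recent visit per (uuid, geohash).
class GpsAggregateSketch {
  // each entry is {uuid, geohash, visit timestamp as a decimal string}
  // returns "uuid geohash" -> {gps_count, recent_visit}
  static Map<String, long[]> aggregate(List<String[]> logs) {
    Map<String, long[]> result = new TreeMap<>();
    for (String[] entry : logs) {
      String key = entry[0] + " " + entry[1];
      long visit = Long.parseLong(entry[2]);
      long[] agg = result.get(key);
      if (agg == null) {
        result.put(key, new long[] { 1L, visit });
      } else {
        agg[0] += 1;                      // like c/count
        agg[1] = Math.max(agg[1], visit); // like c/max
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String[]> logs = Arrays.asList(
        new String[] { "u1", "9q9jh0", "100" },
        new String[] { "u1", "9q9jh0", "300" },
        new String[] { "u2", "9q9jh0", "50" });
    for (Map.Entry<String, long[]> e : aggregate(logs).entrySet()) {
      System.out.println(e.getKey() + " " + e.getValue()[0] + " " + e.getValue()[1]);
    }
  }
}
```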

Recommenders often combine multiple signals, via weighted
averages, to rank personalized results:
• GPS of person ∩ road segment
• frequency and recency of visit
• traffic class and rate
• road albedo (sunlight reflection)
• tree shade estimator
Adjusting the mix allows for further personalization at the end use
modeling
(defn get-reco [tracks shades]
"subquery to recommend road segments based on GPS tracks"
(<- [?uuid ?road ?geohash ?lat ?lng ?alt
?gps_count ?recent_visit ?road_metric ?tree_metric]
(tracks ?uuid ?geohash ?gps_count ?recent_visit)
(shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)
))
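
The get-reco subquery only joins the signals; the weighted average mentioned above would be applied at the end use. One possible shape for that last step is sketched below in Java. Everything here is an assumption made for illustration: the class name, the signals being pre-normalized to the range 0 to 1, and the weights themselves.

```java
// Hypothetical sketch: combine normalized signals with a weighted average.
class RecoScoreSketch {
  // signals and weights are parallel arrays; weights need not sum to 1
  static double score(double[] signals, double[] weights) {
    if (signals.length != weights.length) {
      throw new IllegalArgumentException("signals and weights differ in length");
    }
    double weighted = 0.0;
    double total = 0.0;
    for (int i = 0; i < signals.length; i++) {
      weighted += signals[i] * weights[i];
      total += weights[i];
    }
    if (total == 0.0) {
      throw new IllegalArgumentException("weights sum to zero");
    }
    return weighted / total;
  }

  public static void main(String[] args) {
    // made-up normalized values for: visit frequency, road metric, tree metric
    double[] signals = { 1.0, 0.5, 0.0 };
    // made-up weights; changing this mix is the personalization knob
    double[] weights = { 2.0, 2.0, 4.0 };
    System.out.println(score(signals, weights));
  }
}
```

Raising the weight on the tree metric would favor shade, while raising the weight on visit frequency would favor familiar routes.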

‣ addr: 115 HAWTHORNE AVE
‣ lat/lng: 37.446, -122.168
‣ geohash: 9q9jh0
‣ tree: 413 site 2
‣ species: Liquidambar styraciflua
‣ est. height: 23 m
‣ shade metric: 4.363
‣ traffic: local residential, light traffic
‣ recent visit: 1972376952532
‣ a short walk from my train stop ✔
apps

Could combine this with a variety of data APIs:
• Trulia neighborhood data, housing prices
• Factual local business (FB Places, etc.)
• CommonCrawl open source full web crawl
• Wunderground local weather data
• WalkScore neighborhood data, walkability
• Data.gov US federal open data
• Data.NASA.gov NASA open data
• DBpedia datasets derived from Wikipedia
• GeoWordNet semantic knowledge base
• Geolytics demographics, GIS, etc.
• Foursquare, Yelp, CityGrid, Localeze, YP
• various photo sharing
apps

Enterprise Data Workflows
with Cascading
O’Reilly, 2013
shop.oreilly.com/product/0636920028536.do
Follow-Up…
also, check out this newsletter
for updates:
liber118.com/pxn/

Follow-Up…
blog, developer community, code/wiki/gists, maven repo,
commercial products, etc.:
cascading.org
zest.to/group11
github.com/Cascading
conjars.org
goo.gl/KQtUL
concurrentinc.com
