Using Split-Apply-Combine for Data Analysis in Clojure
Bay Area Clojure Group
June 6, 2013
Tom Faulhaber
twitter: @tomfaulhaber
github: tomfaulhaber
Saturday, June 8, 13
Data Structures for Data Analysis
The Vector
[265.0 259.98 266.89 262.22 ...]
(mean [265.0 259.98 266.89 262.22 ...]) ➜ 263.697
(apply min [265.0 259.98 266.89 262.22 ...]) ➜ 257.21
(apply max [265.0 259.98 266.89 262.22 ...]) ➜ 269.75
(sd [265.0 259.98 266.89 262.22 ...]) ➜ 3.815
(quantile [265.0 259.98 266.89 262.22 ...]) ➜ [257.21 260.105 264.27 266.175 269.75]
(histogram [265.0 259.98 266.89 262.22 ...]) ➜ [Figure: histogram]
(line-chart [265.0 259.98 266.89 262.22 ...]) ➜ [Figure: line chart]
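The summary functions on these slides (mean, sd, quantile) are presumably Incanter's. As a dependency-free sketch, a similar summary can be computed in plain Clojure. The helper names mean* and sd* are hypothetical, and only the four prices visible on the slide are used, so the numbers will not match the slide's results for the full series.

```clojure
;; Only the four values visible on the slide; the rest of the series is elided there.
(def closes [265.0 259.98 266.89 262.22])

(defn mean*
  "Arithmetic mean of a non-empty collection of numbers (hypothetical helper)."
  [xs]
  (/ (reduce + xs) (count xs)))

(defn sd*
  "Sample standard deviation (n - 1 denominator); needs at least two values."
  [xs]
  (let [m (mean* xs)
        squares (map #(let [d (- % m)] (* d d)) xs)]
    (Math/sqrt (/ (reduce + squares) (dec (count xs))))))

;; One map holding the summary statistics for the vector.
(def summary
  {:mean (mean* closes)
   :min  (apply min closes)
   :max  (apply max closes)
   :sd   (sd* closes)})
```

A vector is enough here because every statistic treats the data as one undifferentiated series.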
The Matrix
[Figure: index grids illustrating matrices of 1 Dimension, 2 Dimensions, and 3 Dimensions]
Key-Value Pairs
{	"IBM" 	 [205.18 203.79 202.79 201.02 ...],
	"MSFT" 	[27.93 27.44 27.5 27.34 ...],
	"AMZN" 	[265.0 259.98 266.89 262.22 ...]}
Key-value pairs can organize multiple data units (such as trials or
customers) or collect parameter data
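As a small sketch of working with this structure, the map below keys each price vector by its symbol and applies a function to every value while keeping the keys. The helper map-vals is hypothetical, and only the four prices visible per symbol on the slide are used.

```clojure
;; Price vectors keyed by symbol, truncated to the values visible on the slide.
(def closes-by-symbol
  {"IBM"  [205.18 203.79 202.79 201.02]
   "MSFT" [27.93 27.44 27.5 27.34]
   "AMZN" [265.0 259.98 266.89 262.22]})

(defn map-vals
  "Apply f to every value of map m, keeping the keys (hypothetical helper)."
  [f m]
  (into {} (for [[k v] m] [k (f v)])))

;; One aggregate per data unit (here, per symbol).
(def high-by-symbol (map-vals #(apply max %) closes-by-symbol))
(def low-by-symbol  (map-vals #(apply min %) closes-by-symbol))
```

This is already a miniature split-apply-combine: the map holds the split pieces, map-vals applies a function to each, and the resulting map is the combined output.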
The Dataset

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-01  AMZN    268.93  268.93  262.80  265.00  6115000   265.00
2013-02-01  IBM     204.65  205.35  203.84  205.18  3370700   203.37
2013-02-01  MSFT    27.67   28.05   27.55   27.93   55565900  27.51
2013-02-04  AMZN    262.78  264.68  259.07  259.98  3723600   259.98
2013-02-04  IBM     204.19  205.02  203.57  203.79  3188800   201.99
2013-02-04  MSFT    27.87   28.02   27.42   27.44   50540000  27.03
2013-02-05  AMZN    262.00  268.03  261.46  266.89  4012900   266.89
...

Items in a column have the same type
Across a row, there may be different types
Identifiers
Measurements
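One way to picture a dataset in plain Clojure is as a vector of row maps. The sketch below is an illustration only: the keyword column names, the choice of :Date and :Symbol as the identifier columns, and the helper column are assumptions, and only a few columns of three rows are included.

```clojure
;; Each row is a map; different columns may hold different types.
(def stock-rows
  [{:Date "2013-02-01" :Symbol "AMZN" :Close 265.00 :Volume 6115000}
   {:Date "2013-02-01" :Symbol "IBM"  :Close 205.18 :Volume 3370700}
   {:Date "2013-02-04" :Symbol "AMZN" :Close 259.98 :Volume 3723600}])

(def identifier-cols  [:Date :Symbol])   ; assumed identifier columns
(def measurement-cols [:Close :Volume])  ; a subset of the measurement columns

(defn column
  "All values of column k, in row order (hypothetical helper)."
  [k rows]
  (mapv k rows))

;; Within one column, every item has the same type.
(def closes (column :Close stock-rows))

;; The identifiers of a row say which unit and time its measurements describe.
(def first-row-ids (select-keys (first stock-rows) identifier-cols))
```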
Split-Apply-Combine
Pattern described by Hadley
Wickham and implemented in the
plyr library for R.
Home page: http://plyr.had.co.nz
Journal of Statistical Software
April 2011, Volume 40, Issue 1. http://www.jstatsoft.org/
The Split-Apply-Combine Strategy for Data Analysis
Hadley Wickham
Rice University
Abstract
Many data analysis problems involve the application of a split-apply-combine strategy,
where you break up a big problem into manageable pieces, operate on each piece
independently and then put all the pieces back together. This insight gives rise to a new R
package that allows you to smoothly apply this strategy, without having to worry about
the type of structure in which your data is stored.
The paper includes two case studies showing how these insights make it easier to work
with batting records for veteran baseball players and a large 3d array of spatio-temporal
ozone measurements.
Keywords: R, apply, split, data analysis.
1. Introduction
What do we do when we analyze data? What are common actions and what are common
mistakes? Given the importance of this activity in statistics, there is remarkably little research
on how data analysis happens. This paper attempts to remedy a very small part of that lack by
describing one common data analysis pattern: Split-apply-combine. You see the
split-apply-combine strategy whenever you break up a big problem into manageable pieces, operate on
each piece independently and then put all the pieces back together. This crops up in all stages
of an analysis:
• During data preparation, when performing group-wise ranking, standardization, or
  normalization, or in general when creating new variables that are most easily calculated on
  a per-group basis.
• When creating summaries for display or analysis, for example, when calculating marginal
  means, or conditioning a table of counts by dividing out group sums.
Split the object based on dimension(s) or identifiers (yielding segments of the same type)
Apply a function to each segment, producing a new segment of the target type. The function
can aggregate or transform the segment.
Combine the results into an output type (possibly of higher dimension)
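The three steps can be sketched with core Clojure functions alone. This is a minimal illustration, not the implementation presented later in the talk: the function name split-apply-combine, the row maps, and the choice of a map as the output type are assumptions.

```clojure
(def rows
  [{:Symbol "AMZN" :Close 265.00}
   {:Symbol "IBM"  :Close 205.18}
   {:Symbol "AMZN" :Close 259.98}
   {:Symbol "IBM"  :Close 203.79}])

(defn split-apply-combine
  "Split rows by key-fn, apply f to each segment, combine the results into a map."
  [key-fn f rows]
  (->> rows
       (group-by key-fn)                         ; split: key -> vector of rows
       (map (fn [[k segment]] [k (f segment)]))  ; apply: one result per segment
       (into {})))                               ; combine: key -> result

;; Aggregating example: the highest close per symbol.
(def max-close
  (split-apply-combine :Symbol #(apply max (map :Close %)) rows))
```

Because each segment is processed independently, the apply step could be swapped for any other function without touching the split or combine steps.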
Variations based on interface
Input \ Output   Array   Data.Frame   List    Discarded
Array            aaply   adply        alply   a_ply
Data.Frame       daply   ddply        dlply   d_ply
List             laply   ldply        llply   l_ply

From: Wickham, The Split-Apply-Combine Strategy for Data Analysis
Splitting Matrices - 2D
[Figure: a 2D matrix (rows 0-6, columns 0-7) split three ways]
Split each element to a scalar
Split each column to a vector
Split each row to a vector
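The three 2D splits can be sketched on a nested vector, assuming the matrix is stored as a vector of rows. The names below are illustrative only.

```clojure
;; A 2 x 3 matrix stored as a vector of row vectors (an assumed representation).
(def m [[1 2 3]
        [4 5 6]])

;; Split each row to a vector: the rows are already the top-level elements.
(def by-row m)

;; Split each column to a vector: transpose the rows.
(def by-column (apply mapv vector m))    ; [[1 4] [2 5] [3 6]]

;; Split each element to a scalar: flatten one level.
(def by-element (vec (apply concat m)))  ; [1 2 3 4 5 6]
```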
Splitting Matrices - 3D
[Figure: a 3D matrix split seven ways]
Split each element to a scalar
Split each row x=c1, y=c2 to a vector
Split each row x=c1, z=c2 to a vector
Split each row y=c1, z=c2 to a vector
Split each slice x=c to a 2D matrix
Split each slice y=c to a 2D matrix
Split each slice z=c to a 2D matrix
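A few of the 3D splits can be sketched the same way, assuming the array is a nested vector indexed as x, then y, then z. The function names are hypothetical.

```clojure
;; A 2 x 2 x 2 array; (get-in cube [x y z]) reads one element.
(def cube
  [[[0 1] [2 3]]     ; x = 0
   [[4 5] [6 7]]])   ; x = 1

(defn slice-x "2D matrix (y by z) at x = c." [a c] (nth a c))
(defn slice-y "2D matrix (x by z) at y = c." [a c] (mapv #(nth % c) a))
(defn slice-z "2D matrix (x by y) at z = c."
  [a c]
  (mapv (fn [plane] (mapv #(nth % c) plane)) a))

(defn row-xy "Vector along z at x = c1, y = c2." [a c1 c2] (get-in a [c1 c2]))
(defn row-xz "Vector along y at x = c1, z = c2."
  [a c1 c2]
  (mapv #(nth % c2) (nth a c1)))
```

Fixing one coordinate yields a 2D slice, while fixing two yields a vector, which matches the slice and row variants listed on the slide.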
Splitting a Dataset

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-01  AMZN    268.93  268.93  262.80  265.00  6115000   265.00
2013-02-01  IBM     204.65  205.35  203.84  205.18  3370700   203.37
2013-02-01  MSFT    27.67   28.05   27.55   27.93   55565900  27.51
2013-02-04  AMZN    262.78  264.68  259.07  259.98  3723600   259.98
2013-02-04  IBM     204.19  205.02  203.57  203.79  3188800   201.99
2013-02-04  MSFT    27.87   28.02   27.42   27.44   50540000  27.03
2013-02-05  AMZN    262.00  268.03  261.46  266.89  4012900   266.89
...

Split by Symbol

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-01  AMZN    268.93  268.93  262.80  265.00  6115000   265.00
2013-02-04  AMZN    262.78  264.68  259.07  259.98  3723600   259.98
2013-02-05  AMZN    262.00  268.03  261.46  266.89  4012900   266.89

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-01  IBM     204.65  205.35  203.84  205.18  3370700   203.37
2013-02-04  IBM     204.19  205.02  203.57  203.79  3188800   201.99

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-01  MSFT    27.67   28.05   27.55   27.93   55565900  27.51
2013-02-04  MSFT    27.87   28.02   27.42   27.44   50540000  27.03

Split by Date

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-01  AMZN    268.93  268.93  262.80  265.00  6115000   265.00
2013-02-01  IBM     204.65  205.35  203.84  205.18  3370700   203.37
2013-02-01  MSFT    27.67   28.05   27.55   27.93   55565900  27.51

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-04  AMZN    262.78  264.68  259.07  259.98  3723600   259.98
2013-02-04  IBM     204.19  205.02  203.57  203.79  3188800   201.99
2013-02-04  MSFT    27.87   28.02   27.42   27.44   50540000  27.03

Date        Symbol  Open    High    Low     Close   Volume    Adj Close
2013-02-05  AMZN    262.00  268.03  261.46  266.89  4012900   266.89

We'll see more advanced splitting in the case study
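Splitting a dataset on an identifier column can be sketched with clojure.core's group-by, assuming the rows are held as maps with keyword column names. Only the identifier columns and :Close are included here.

```clojure
(def stock-rows
  [{:Date "2013-02-01" :Symbol "AMZN" :Close 265.00}
   {:Date "2013-02-01" :Symbol "IBM"  :Close 205.18}
   {:Date "2013-02-04" :Symbol "AMZN" :Close 259.98}
   {:Date "2013-02-04" :Symbol "IBM"  :Close 203.79}
   {:Date "2013-02-05" :Symbol "AMZN" :Close 266.89}])

;; The same rows, binned two different ways; every segment is again a
;; collection of rows with the same columns as the input.
(def by-symbol (group-by :Symbol stock-rows))
(def by-date   (group-by :Date stock-rows))
```

The choice of identifier decides which question the later apply step can answer: per-symbol statistics over time, or per-day statistics across symbols.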
Apply
(func segment) ➜ result
The result must be appropriate for the output type
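The two kinds of apply function, aggregating and transforming, can be sketched on one segment. Both functions below are hypothetical examples that return a collection of rows, on the assumption that the output type is a dataset.

```clojure
;; One segment, as produced by splitting on :Symbol.
(def segment
  [{:Symbol "AMZN" :Close 265.00}
   {:Symbol "AMZN" :Close 259.98}
   {:Symbol "AMZN" :Close 266.89}])

(defn max-close
  "Aggregate: reduce the whole segment to a single row."
  [rows]
  [{:MaxClose (apply max (map :Close rows))}])

(defn add-change
  "Transform: keep every row and add :Change, the difference from the
  previous row's :Close (0.0 for the first row)."
  [rows]
  (mapv (fn [row prev] (assoc row :Change (- (:Close row) prev)))
        rows
        (cons (:Close (first rows)) (map :Close rows))))

(def aggregated  (max-close segment))
(def transformed (add-change segment))
```

An aggregate shrinks the segment while a transform preserves its row count; either is acceptable as long as the combine step can assemble the results.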
Combine
Assemble apply results into output
[Figure: apply results assembled into the output]
Implementing ddply in
Clojure
(ns split-apply-combine.ply
  "Implementation of the split-apply-combine functions, similar to R's plyr library."
  (:use [incanter.core :only [$data col-names conj-rows dataset]])
  (:require [split-apply-combine.core :as sac]))

(defn fast-conj-rows
  "A simple version of conj-rows that runs much faster"
  [& datasets]
  (when (seq datasets)
    (dataset (col-names (first datasets))
             (mapcat :rows datasets))))
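
;; Builds the source form of a one-argument function of a row, binding a
;; symbol to the row's value for each keyword in kw-map.
(defn expr-to-fn
  [expr]
  (let [row-param (gensym "row-")
        kw-map (sac/build-keyword-map expr)]
    `(fn [~row-param]
       (let [~@(apply concat
                      (for [[kw sym] kw-map]
                        [sym `(get ~row-param ~kw ~kw)]))]
         ~(sac/convert-keywords expr kw-map)))))

;; Replaces [keyword expr] pairs in group-by with [keyword fn-form] pairs,
;; leaving plain keywords and explicit fn forms untouched.
(defn exprs-to-fns
  [group-by]
  (if (coll? group-by)
    (vec (for [item group-by]
           (if (and (coll? item)
                    (coll? (second item))
                    (not (#{'fn 'fn*} (first (second item)))))
             [(first item) (expr-to-fn (second item))]
             item)))
    group-by))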

(defn split-ds
  "Perform a split operation on data, which must be a dataset, using the group-by-fns
  to choose bins. group-by-fns is a collection of one or more functions. With more
  than one function, the results are combined (using juxt) to create a key for
  the bin. Returns a sequence of [key dataset] pairs, where each dataset includes
  all the rows that had the given key.
  Note that keyword column names are the most common functions to use for the
  group-by."
  [group-by-fns data]
  (let [cols (col-names data)
        group-by-fn (if (= 1 (count group-by-fns))
                      (first group-by-fns)
                      (apply juxt group-by-fns))]
    (loop [cur (:rows data) row-groups {}]
      (if (empty? cur)
        (for [[group rows] row-groups] [group (dataset cols rows)])
        (recur (next cur)
               (let [row (first cur)
                     k (group-by-fn row)
                     a (row-groups k)]
                 (assoc row-groups k (if a (conj a row) [row]))))))))

(defn apply-ds
  "Apply fun to each group in grouped-data, returning a sequence of pairs of the
  original group keys and the result of applying the function to the dataset. See
  split-ds for information on the grouped-data data structure."
  [fun grouped-data]
  (for [[group split-data] grouped-data]
    [group (fun split-data)]))

(defn combine-ds
  "Combine the datasets in grouped-data into a single dataset including the
  columns specified in the group-by argument as having the values found in
  the keys in the grouped data.
  If there are columns that are in both the key and the dataset, the values
  in the key have precedence."
  [group-by grouped-data]
  (let [group-by (if (coll? group-by) group-by [group-by])
        group-by-filter (complement (set group-by))]
    (apply fast-conj-rows
           (for [[group data] grouped-data]
             (let [;; split-ds yields a bare key (not a vector of keys) when there
                   ;; is a single group-by function, so wrap it before zipmap
                   group (if (= 1 (count group-by)) [group] group)
                   grouped-cols (zipmap group-by group)
                   union-cols (concat group-by
                                      (filter group-by-filter
                                              (col-names data)))]
               (dataset union-cols
                        (map #(merge % grouped-cols)
                             (:rows data))))))))
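
;; The combine step can be sketched without Incanter, assuming each group is a
;; pair of [vector-of-key-values rows] and rows are plain maps. combine-rows is
;; a hypothetical stand-in for combine-ds, not part of the library.

```clojure
(defn combine-rows
  "Flatten grouped row maps into one vector, adding the group key values
  under key-cols. Key values take precedence over same-named row values."
  [key-cols grouped]
  (vec (for [[key-vals rows] grouped
             row rows]
         (merge row (zipmap key-cols key-vals)))))

(def grouped
  [[["AMZN"] [{:MaxClose 266.89}]]
   [["IBM"]  [{:MaxClose 205.18 :Symbol "overwritten"}]]])

(def combined (combine-rows [:Symbol] grouped))
```

;; Merging the key map last is what gives the key values precedence.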

(defn ddply*
  "Split-apply-combine from datasets to datasets.
  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and combines the results
  back into a single dataset.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. When
  the groups are combined, keyword is used as the column name for the resulting
  column. The two types of group-by specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Examples:
    (ddply* :Symbol
            (transform :Change = (diff0 :Close))
            stock-data)
    (ddply* [[:Month #((juxt year month) (:timestamp %))]]
            (colwise :Volume sum)
            stock-data)"
  ([group-by fun]
   (ddply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         ;; normalize every item to a [column-name key-fn] pair
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (->> data
          (split-ds (map second group-by))
          (apply-ds fun)
          (combine-ds (map first group-by))))))

(defmacro ddply
  "Split-apply-combine from datasets to datasets. This macro is a wrapper on ddply*
  which provides translation of simple column-referencing expressions in the group-by
  argument.
  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and combines the results
  back into a single dataset.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword key-expr] where the
  expression key-expr is transformed to a function and keywords in key-expr are
  expanded to accessors on rows. The resulting function is applied to each row to
  generate the key for that row. When the groups are combined, keyword is used as
  the column name for the resulting column. The two types of group-by
  specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Examples:
    (ddply :Symbol
           (transform :Change = (diff0 :Close))
           stock-data)
    (ddply [[:Month ((juxt year month) :timestamp)]]
           (colwise :Volume sum)
           stock-data)"
  ([group-by fun]
   `(ddply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))
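
;; The keyword-to-accessor translation that the ddply macro relies on can be
;; sketched in a few lines. row-fn is a hypothetical, simplified macro: it
;; rewrites every keyword in the expression to a lookup on the row, whereas
;; expr-to-fn above binds symbols through helpers in split-apply-combine.core
;; that are not shown on the slide.

```clojure
(require '[clojure.walk :as walk])

(defmacro row-fn
  "Expand expr into (fn [row] ...) where each keyword k becomes (get row k)."
  [expr]
  (let [row (gensym "row-")]
    `(fn [~row]
       ~(walk/postwalk (fn [x] (if (keyword? x) `(get ~row ~x) x))
                       expr))))

;; :Close and :Volume are read from whichever row the function receives.
(def traded-value (row-fn (* :Close :Volume)))

(def example-result (traded-value {:Close 2.0 :Volume 10}))
```

;; Doing the rewrite in a macro lets the caller write :Close as if it were a
;; variable, at the cost of treating every keyword in the expression as a column.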

(defn d_ply*
  "Split-apply-combine from datasets to nothing. This version ignores the output of
  fun and is used for fun's side effects.
  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and then drops the result.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. When
  the groups are combined, keyword is used as the column name for the resulting
  column. The two types of group-by specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Example:
    (d_ply* :Symbol
            #(view (bar-chart :Date :Volume :data %))
            stock-data)"
  ([group-by fun]
   ;; call d_ply* (not ddply*) so the two-argument form also discards results
   (d_ply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (dorun
      (->> data
           (split-ds (map second group-by))
           (apply-ds fun))))))

(defmacro d_ply
  "Split-apply-combine from datasets to nothing. This version ignores the output of
  fun and is used for fun's side effects. This macro is a wrapper on d_ply*
  which provides translation of simple column-referencing expressions in the group-by
  argument.
  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and then drops the result.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. When
  the groups are combined, keyword is used as the column name for the resulting
  column. The two types of group-by specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Example:
    (d_ply :Symbol
           #(view (bar-chart :Date :Volume :data %))
           stock-data)"
  ([group-by fun]
   `(d_ply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(d_ply* ~(exprs-to-fns group-by) ~fun ~data)))
Implementing ddply - Split
(defn split-ds
  [group-by-fns data]
  (let [cols (col-names data)
        group-by-fn (if (= 1 (count group-by-fns))
                      (first group-by-fns)
                      (apply juxt group-by-fns))]
    (loop [cur (:rows data) row-groups {}]
      (if (empty? cur)
        (for [[group rows] row-groups] [group (dataset cols rows)])
        (recur (next cur)
               (let [row (first cur)
                     k (group-by-fn row)
                     a (row-groups k)]
                 (assoc row-groups k (if a (conj a row) [row]))))))))
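
The binning logic of split-ds can be tried without Incanter on plain row maps. The sketch below is a hypothetical analogue, not the library function: split-rows always builds the key with juxt, so keys are vectors even for a single key function, and it returns a map from key to rows instead of [key dataset] pairs.

```clojure
(def stock-rows
  [{:Date "2013-02-01" :Symbol "AMZN" :Close 265.00}
   {:Date "2013-02-01" :Symbol "IBM"  :Close 205.18}
   {:Date "2013-02-04" :Symbol "AMZN" :Close 259.98}])

(defn split-rows
  "Bin rows by the combined results of key-fns."
  [key-fns rows]
  (group-by (apply juxt key-fns) rows))

;; Keyword column names work as key functions because keywords are
;; functions of maps.
(def by-symbol          (split-rows [:Symbol] stock-rows))
(def by-symbol-and-date (split-rows [:Symbol :Date] stock-rows))
```

Always returning vector keys trades a little verbosity for a key shape that does not depend on how many grouping functions were given.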
Implementing ddply - Apply
(ns split-apply-combine.ply
"Implementation of the split-apply-combine functions, similar to R's plyr library."
(:use [incanter.core :only [$data col-names conj-rows dataset]])
(:require [split-apply-combine.core :as sac]))
(defn fast-conj-rows
"A simple version of conj-rows that runs much faster"
[& datasets]
(when (seq datasets)
(dataset (col-names (first datasets))
(mapcat :rows datasets))))
(defn expr-to-fn
[expr]
(let [row-param (gensym "row-")
kw-map (sac/build-keyword-map expr)]
`(fn [~row-param]
(let [~@(apply concat
(for [[kw sym] kw-map]
[sym `(get ~row-param ~kw ~kw)]))]
~(sac/convert-keywords expr kw-map)))))
(defn exprs-to-fns
[group-by]
(if (coll? group-by)
(vec (for [item group-by]
(if (and (coll? item)
(coll? (second item))
(not (#{'fn 'fn*} (first (second item)))))
[(first item) (expr-to-fn (second item))]
item)))
group-by))
(defn split-ds
"Perform a split operation on data, which must be a dataset, using the group-by-fns
to choose bins. group-by-fns can either be a single function or a collection of
functions. In the latter case, the results will be combined to create a key for
the bin. Returns a map of the group-by-fns results to datasets including all
the rows that had the given result.
Note that keyword column names are the most common functions to use for the
group-by."
[group-by-fns data]
(let [cols (col-names data)
group-by-fn (if (= 1 (count group-by-fns))
(first group-by-fns)
(apply juxt group-by-fns))]
(loop [cur (:rows data) row-groups {}]
(if (empty? cur)
(for [[group rows] row-groups] [group (dataset cols rows)])
(recur (next cur)
(let [row (first cur)
k (group-by-fn row)
a (row-groups k)]
(assoc row-groups k (if a (conj a row) [row]))))))))
(defn apply-ds
"Apply fun to each group in grouped-data returning a sequence of pairs of the
original group-keys and the result of applying the function the dataset. See
split-ds for information on the grouped-data data structure."
[fun grouped-data]
(for [[group split-data] grouped-data]
[group (fun split-data)]))
(defn combine-ds
  "Combine the datasets in grouped-data into a single dataset including the
  columns specified in the group-by argument as having the values found in
  the keys in the grouped data.

  If there are columns that are in both the key and the dataset, the values
  in the key have precedence."
  [group-by grouped-data]
  (let [group-by (if (coll? group-by) group-by [group-by])
        group-by-filter (complement (set group-by))]
    (apply fast-conj-rows
           (for [[group data] grouped-data]
             (let [grouped-cols (zipmap group-by group)
                   union-cols (concat group-by
                                      (filter group-by-filter
                                              (col-names data)))]
               (dataset union-cols
                        (map #(merge % grouped-cols)
                             (:rows data))))))))
(defn ddply*
  "Split-apply-combine from datasets to datasets.

  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets, and combines the results
  back into a single dataset.

  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. When
  the groups are combined, keyword is used as the column name for the resulting
  column. The two types of group-by specifications can be mixed.

  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.

  If data is not specified, it defaults to the currently bound value of $data.

  Examples:

    (ddply* :Symbol
            (transform :Change = (diff0 :Close))
            stock-data)

    (ddply* [[:Month #((juxt year month) (:timestamp %))]]
            (colwise :Volume sum)
            stock-data)"
  ([group-by fun]
   (ddply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (->> data
          (split-ds (map second group-by))
          (apply-ds fun)
          (combine-ds (map first group-by))))))
(defmacro ddply
  "Split-apply-combine from datasets to datasets. This macro is a wrapper on
  ddply* which provides translation of simple column-referencing expressions in
  the group-by argument.

  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets, and combines the results
  back into a single dataset.

  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword key-expr] where the
  expression key-expr is transformed to a function and the keywords in key-expr
  are expanded to accessors on rows. The resulting function is applied to each
  row to generate the key for that row. When the groups are combined, keyword is
  used as the column name for the resulting column. The two types of group-by
  specifications can be mixed.

  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.

  If data is not specified, it defaults to the currently bound value of $data.

  Examples:

    (ddply :Symbol
           (transform :Change = (diff0 :Close))
           stock-data)

    (ddply [[:Month ((juxt year month) :timestamp)]]
           (colwise :Volume sum)
           stock-data)"
  ([group-by fun]
   `(ddply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))
(defn d_ply*
  "Split-apply-combine from datasets to nothing. This version ignores the output
  of fun and is used for fun's side effects.

  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets, and then drops the result.

  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. The
  two types of group-by specifications can be mixed.

  If data is not specified, it defaults to the currently bound value of $data.

  Example:

    (d_ply* :Symbol
            #(view (bar-chart :Date :Volume :data %))
            stock-data)"
  ([group-by fun]
   ;; Delegate to d_ply* itself (not ddply*) so the result is still dropped.
   (d_ply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (dorun
      (->> data
           (split-ds (map second group-by))
           (apply-ds fun))))))
(defmacro d_ply
  "Split-apply-combine from datasets to nothing. This version ignores the output
  of fun and is used for fun's side effects. This macro is a wrapper on d_ply*
  which provides translation of simple column-referencing expressions in the
  group-by argument.

  Splits data into a group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets, and then drops the result.

  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword key-expr] where the
  expression key-expr is transformed to a function that is applied to each row
  to generate the key for that row. The two types of group-by specifications can
  be mixed.

  If data is not specified, it defaults to the currently bound value of $data.

  Example:

    (d_ply :Symbol
           #(view (bar-chart :Date :Volume :data %))
           stock-data)"
  ([group-by fun]
   `(d_ply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(d_ply* ~(exprs-to-fns group-by) ~fun ~data)))
(defn apply-ds
  [fun grouped-data]
  (for [[group split-data] grouped-data]
    [group (fun split-data)]))
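To see the apply step in isolation, here is a minimal sketch that uses plain vectors of row maps in place of Incanter datasets. The sample rows and the names `grouped`, `apply-pairs`, and `total-volume` are made up for illustration; `apply-pairs` only mirrors the shape of apply-ds.

```clojure
;; Hypothetical grouped data: a sequence of [group-key rows] pairs, the same
;; shape the split step produces, but with plain vectors instead of datasets.
(def grouped
  [["AMZN" [{:Symbol "AMZN" :Volume 10} {:Symbol "AMZN" :Volume 20}]]
   ["IBM"  [{:Symbol "IBM" :Volume 5}]]])

(defn apply-pairs
  "Same logic as apply-ds: run fun on each group's data and keep the key."
  [fun grouped-data]
  (for [[group split-data] grouped-data]
    [group (fun split-data)]))

(defn total-volume
  "Illustrative apply function: collapse a group to a single summary row."
  [rows]
  [{:Volume (reduce + (map :Volume rows))}])

(def applied (apply-pairs total-volume grouped))
```

Each pair keeps its original key, so the combine step can later put the key values back into the result rows.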
Implementing ddply - Combine
(defn combine-ds
  [group-by grouped-data]
  (let [group-by (if (coll? group-by) group-by [group-by])
        group-by-filter (complement (set group-by))]
    (apply fast-conj-rows
           (for [[group data] grouped-data]
             (let [grouped-cols (zipmap group-by group)
                   union-cols (concat group-by
                                      (filter group-by-filter
                                              (col-names data)))]
               (dataset union-cols
                        (map #(merge % grouped-cols)
                             (:rows data))))))))
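The combine step can be sketched the same way on plain data. This is an illustration rather than the library code: `combine-pairs` is a hypothetical name, rows are plain maps, and each group key is assumed to be a vector with one value per group-by column.

```clojure
(defn combine-pairs
  "Plain-data analogue of combine-ds: merge each group's key values back into
  its rows, then concatenate the rows of all groups. Each key is a vector
  holding one value per group-by column."
  [group-by-cols grouped-data]
  (vec (mapcat (fn [[group rows]]
                 (let [grouped-cols (zipmap group-by-cols group)]
                   ;; grouped-cols is merged last, so key values take
                   ;; precedence over same-named columns in the rows.
                   (map #(merge % grouped-cols) rows)))
               grouped-data)))

(def combined
  (combine-pairs [:Symbol]
                 [[["AMZN"] [{:Volume 30}]]
                  [["IBM"]  [{:Volume 5}]]]))
```

Because the key columns are merged in last, a row that already has a `:Symbol` value ends up with the value from the key, which matches the precedence rule described for combine-ds.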
Implementing ddply - Putting it all together
(defn ddply*
  ([group-by fun]
   (ddply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (->> data
          (split-ds (map second group-by))
          (apply-ds fun)
          (combine-ds (map first group-by))))))
(defmacro ddply
  ([group-by fun]
   `(ddply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))
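The whole pipeline can be tried without Incanter. The following is a rough sketch on vectors of row maps, not the talk's implementation: `split-apply-combine`, `stock-rows`, and `sum-volume` are hypothetical names, `group-by` from clojure.core stands in for split-ds, and the final sort is only there to make the output order predictable.

```clojure
(defn split-apply-combine
  "Split rows by the key columns, apply fun to each group, and combine the
  results, adding the key columns back to every result row."
  [key-cols fun rows]
  (let [key-fn (apply juxt key-cols)]
    (->> (group-by key-fn rows)                        ; split
         (map (fn [[k grp]] [k (fun grp)]))            ; apply
         (mapcat (fn [[k result]]                      ; combine
                   (map #(merge % (zipmap key-cols k)) result)))
         (sort-by key-fn)                              ; stable output order
         vec)))

;; Hypothetical sample data.
(def stock-rows
  [{:Symbol "AMZN" :Volume 10}
   {:Symbol "IBM"  :Volume 5}
   {:Symbol "AMZN" :Volume 20}])

(defn sum-volume
  "Illustrative apply function returning one summary row per group."
  [rows]
  [{:Volume (reduce + (map :Volume rows))}])

(def totals (split-apply-combine [:Symbol] sum-volume stock-rows))
;; totals => [{:Volume 30, :Symbol "AMZN"} {:Volume 5, :Symbol "IBM"}]
```

As in ddply*, the apply function is free to return one row, the same number of rows, or a different number; the combine step only adds the key columns back.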
Support functions - colwise
(ddply :Symbol
       (colwise :num stats/mean)
       tech-stocks)
Support functions - transform
(ddply :Symbol
       (transform :Change = (diff0 :Close)
                  :Date =* (time-format/parse
                            (time-format/formatters :year-month-day)
                            :Date))
       tech-stocks)
A Case Study
“SpaceCurve delivers instantaneous
intelligence for location-based services,
commodities, defense, emergency
services and other markets. The
company is developing Big Data
solutions that continuously store and
immediately analyze massive amounts
of multidimensional data.”
Performance analysis of large-scale geospatial-temporal ingest and query on the SpaceCurve multidimensional DB
Our Sample Problem
[Diagram: six systems (10.0.1.101, 10.0.1.102, 10.0.1.107, 10.0.1.109, 10.0.1.111, 10.0.1.112), each showing cores cpu00 through cpu23; a switch at 10GB/s/channel; external clients]
‣ CPU load data
‣ 6 systems
‣ 24 cores each
‣ 6 data points
‣ 1 sample/second
‣ ~38 minutes run time
Total of ~2 million data points
Small subset of the overall SpaceCurve analysis
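The ~2 million figure above can be checked by multiplying out the listed quantities. This quick calculation assumes that the 6 data points are recorded per core on every sample and treats the run as exactly 38 minutes; both are approximations, not details confirmed by the slide.

```clojure
(def systems 6)
(def cores-per-system 24)
(def data-points-per-sample 6)   ; assumed: per core, per sample
(def samples-per-second 1)
(def run-seconds (* 38 60))      ; ~38 minutes, taken as exactly 38

(def total-data-points
  (* systems cores-per-system data-points-per-sample
     samples-per-second run-seconds))
;; total-data-points => 1969920, i.e. roughly 2 million
```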
Time to see it work...
Where to?
• A full library implementation of Split-Apply-Combine and helpers
• Add to Incanter?
• Performance optimizations (mutable intermediate results, column-oriented datasets)
• Implementation based on reducers and parallelism
• Explore the continuum from data exploration tools (R, Incanter) to large-scale data analysis (Hadoop, Cascalog, SpaceCurve, etc.)
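As one possible direction for the reducers item in the list above, the split step could be expressed as a fold. This is only a sketch of the idea, not code from the talk: it groups plain row maps with `clojure.core.reducers/fold`, and `merge-groups`, `add-row`, and `parallel-split` are hypothetical names.

```clojure
(require '[clojure.core.reducers :as r])

(defn merge-groups
  "Combining function for r/fold: with no arguments it returns the empty
  grouping; with two it merges groupings, concatenating rows that share a key."
  ([] {})
  ([a b] (merge-with into a b)))

(defn add-row
  "Returns a reducing function that files each row under (key-fn row)."
  [key-fn]
  (fn [groups row]
    (update groups (key-fn row) (fnil conj []) row)))

(defn parallel-split
  "Group rows by key-fn. r/fold can process large vectors in parallel chunks
  and merges the partial groupings with merge-groups."
  [key-fn rows]
  (r/fold merge-groups (add-row key-fn) (vec rows)))
```

Small inputs are reduced sequentially, while larger vectors are partitioned and the partial results merged, so rows within each group keep their original order.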
Discussion
References
• Source for this presentation: https://www.github.com/tomfaulhaber/split-apply-combine
• The R Project: http://www.r-project.org
• The plyr home page: http://plyr.had.co.nz
• Hadley Wickham, The Split-Apply-Combine Strategy for Data Analysis, Journal of Statistical Software, April 2011, Volume 40, Issue 1
• Incanter project: http://incanter.org
• Eric Rochester, The Clojure Data Analysis Cookbook, Packt Publishing, 2013
• Bruce Durling, Quick and Dirty Data Science with Incanter, talk from EuroClojure 2012, http://confreaks.com/videos/2071-euroclojure2012-quick-and-dirty-data-science-with-incanter
• SpaceCurve: http://www.spacecurve.com
Tom Faulhaber
twitter: @tomfaulhaber
github: tomfaulhaber
Photo Credits
• Florida Home - anoldent on flickr (http://www.flickr.com/photos/anoldent/2405722434/)
• Midland Coal Mine - jasonwoodhead23 on flickr (http://www.flickr.com/photos/woodhead/8522679843/)
• Paradise - Antti Simonen on flickr (http://www.flickr.com/photos/anttisimonen/6041095682/)
• Traders on the Exchange - thetaxhaven on flickr (http://www.flickr.com/photos/83532250@N06/7651028854)
• Louvre - dynamosquito on flickr (http://www.flickr.com/photos/25182210@N07/2802458437/)
• Construction - Aapo Haapanen on flickr (http://www.flickr.com/photos/decade_null/214247988/)
• Server farm - from the Spacecurve website (http://www.spacecurve.com)
• Sailboat race - Ryk Van Toronto on flickr (http://www.flickr.com/photos/sydandsaskia/394507351)
• Arguing Philosophers - David Schroeter on flickr (http://www.flickr.com/photos/53477785@N00/92134612/)

More Related Content

Similar to Clojure Data Analysis Using Split-Apply-Combine

Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfcookie1969
 
Links to Estimation Techniques Tim Shaughnessy, Chapter 7 .docx
Links to Estimation Techniques Tim Shaughnessy, Chapter 7 .docxLinks to Estimation Techniques Tim Shaughnessy, Chapter 7 .docx
Links to Estimation Techniques Tim Shaughnessy, Chapter 7 .docxsmile790243
 
Statistical quality control, sampling
Statistical quality control, samplingStatistical quality control, sampling
Statistical quality control, samplingSana Fatima
 
Apache Commons Math @ FOSDEM 2013
Apache Commons Math @ FOSDEM 2013Apache Commons Math @ FOSDEM 2013
Apache Commons Math @ FOSDEM 2013netomi
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Introduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesIntroduction to Datamining Concept and Techniques
Introduction to Datamining Concept and TechniquesSơn Còm Nhom
 
RuleML2015: Compact representation of conditional probability for rule-based...
RuleML2015:  Compact representation of conditional probability for rule-based...RuleML2015:  Compact representation of conditional probability for rule-based...
RuleML2015: Compact representation of conditional probability for rule-based...RuleML
 
Stevens-Benchmarking Sorting Algorithms
Stevens-Benchmarking Sorting AlgorithmsStevens-Benchmarking Sorting Algorithms
Stevens-Benchmarking Sorting AlgorithmsJames Stevens
 
IRJET- Segregation of Machines According to the Noise Emitted by Differen...
IRJET-  	  Segregation of Machines According to the Noise Emitted by Differen...IRJET-  	  Segregation of Machines According to the Noise Emitted by Differen...
IRJET- Segregation of Machines According to the Noise Emitted by Differen...IRJET Journal
 
Multi state churn analysis with a subscription product
Multi state churn analysis with a subscription productMulti state churn analysis with a subscription product
Multi state churn analysis with a subscription productVienna Data Science Group
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxnikshaikh786
 
Advanced Topics in Agile Planning
Advanced Topics in Agile PlanningAdvanced Topics in Agile Planning
Advanced Topics in Agile PlanningMike Cohn
 

Clojure Data Analysis Using Split-Apply-Combine

  • 1. Using Split-Apply-Combine for Data Analysis in Clojure Bay Area Clojure Group June 6, 2013 Tom Faulhaber twitter: @tomfaulhaber github: tomfaulhaber Saturday, June 8, 13
  • 7. Data Structures for Data Analysis
  • 8. The Vector [265.0 259.98 266.89 262.22 ...]
  • 9. The Vector (mean [265.0 259.98 266.89 262.22 ...]) ➜ 263.697
  • 10. The Vector (apply min [265.0 259.98 266.89 262.22 ...]) ➜ 257.21
  • 11. The Vector (apply max [265.0 259.98 266.89 262.22 ...]) ➜ 269.75
  • 12. The Vector (sd [265.0 259.98 266.89 262.22 ...]) ➜ 3.815
  • 13. The Vector (quantile [265.0 259.98 266.89 262.22 ...]) ➜ [257.21 260.105 264.27 266.175 269.75]
  • 15. The Vector (histogram [265.0 259.98 266.89 262.22 ...]) ➜ [figure: histogram]
  • 17. The Vector (line-chart [265.0 259.98 266.89 262.22 ...]) ➜ [figure: line chart]
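The summary functions on the vector slides come from a library. As a rough illustration of what such summaries compute, here is a minimal sketch in plain Clojure. The names `mean` and `sample-sd` are defined here for illustration, and dividing by n - 1 is an assumption about which standard deviation is meant. Only the four values visible on the slide are used, so the results will not match the numbers shown above, which were computed from the full (elided) vector.

```clojure
;; Plain-Clojure sketch; these are illustrative definitions, not library code.
(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample-sd
  "Sample standard deviation (divides by n - 1; an assumption)."
  [xs]
  (let [m  (mean xs)
        ss (reduce + (map #(let [d (- % m)] (* d d)) xs))]
    (Math/sqrt (/ ss (dec (count xs))))))

;; Only the values visible on the slide; the rest are elided there.
(def closes [265.0 259.98 266.89 262.22])

(mean closes)
(apply min closes)   ; smallest of the four visible values
(apply max closes)   ; largest of the four visible values
(sample-sd closes)
```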
  • 19. The Matrix: 1 Dimension [figure: indexed cells of a 1-dimensional array]
  • 20. The Matrix: 1 Dimension, 2 Dimensions [figures: indexed cells of a 1-dimensional and a 2-dimensional array]
  • 21. The Matrix: 1 Dimension, 2 Dimensions, 3 Dimensions [figures: indexed cells of a 1-, 2-, and 3-dimensional array]
  • 22. Key-Value Pairs
        {"IBM"  [205.18 203.79 202.79 201.02 ...],
         "MSFT" [27.93 27.44 27.5 27.34 ...],
         "AMZN" [265.0 259.98 266.89 262.22 ...]}
    Key-value pairs can organize multiple data units (such as trials or customers) or collect parameter data.
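One way to work with such a map is to apply the same vector function to every value. The sketch below assumes a small helper, `map-vals`, which is a hypothetical name and not part of the deck, and uses only the four values per symbol that are visible on the slide.

```clojure
;; Closing prices keyed by symbol; only the values visible on the slide.
(def closes-by-symbol
  {"IBM"  [205.18 203.79 202.79 201.02]
   "MSFT" [27.93 27.44 27.5 27.34]
   "AMZN" [265.0 259.98 266.89 262.22]})

(defn map-vals
  "Hypothetical helper: apply f to every value of map m, keeping the keys."
  [f m]
  (into {} (for [[k v] m] [k (f v)])))

;; Largest visible close per symbol.
(map-vals #(apply max %) closes-by-symbol)
;; => {"IBM" 205.18, "MSFT" 27.93, "AMZN" 266.89}
```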
  • 23. The Dataset
        Date        Symbol  Open    High    Low     Close   Volume    Adj Close
        2013-02-01  AMZN    268.93  268.93  262.80  265.00   6115000  265.00
        2013-02-01  IBM     204.65  205.35  203.84  205.18   3370700  203.37
        2013-02-01  MSFT     27.67   28.05   27.55   27.93  55565900   27.51
        2013-02-04  AMZN    262.78  264.68  259.07  259.98   3723600  259.98
        2013-02-04  IBM     204.19  205.02  203.57  203.79   3188800  201.99
        2013-02-04  MSFT     27.87   28.02   27.42   27.44  50540000   27.03
        2013-02-05  AMZN    262.00  268.03  261.46  266.89   4012900  266.89
        ...
  • 24. The Dataset (table as on slide 23). Items in a column have the same type.
  • 25. The Dataset (table as on slide 23). Across a row, there may be different types.
  • 27. The Dataset (table as on slide 23). [columns marked: Identifiers]
  • 28. The Dataset (table as on slide 23). [columns marked: Identifiers, Measurements]
  • 30. Split-Apply-Combine: pattern described by Hadley Wickham and implemented in the plyr library for R. Home page: http://plyr.had.co.nz
    [image: first page of the paper] Hadley Wickham (Rice University), "The Split-Apply-Combine Strategy for Data Analysis", Journal of Statistical Software, April 2011, Volume 40, Issue 1. http://www.jstatsoft.org/
    Abstract: Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored. The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements.
    Keywords: R, apply, split, data analysis.
    1. Introduction. What do we do when we analyze data? What are common actions and what are common mistakes? Given the importance of this activity in statistics, there is remarkably little research on how data analysis happens. This paper attempts to remedy a very small part of that lack by describing one common data analysis pattern: Split-apply-combine. You see the split-apply-combine strategy whenever you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This crops up in all stages of an analysis:
    During data preparation, when performing group-wise ranking, standardization, or normalization, or in general when creating new variables that are most easily calculated on a per-group basis.
    When creating summaries for display or analysis, for example, when calculating marginal means, or conditioning a table of counts by dividing out group sums.
  • 32. Split the object based on dimension(s) or identifiers (yielding segments of the same type).
  • 33. Apply a function to each segment, producing a new segment of the target type. The function can aggregate or transform the segment.
  • 34. Combine the results into an output type (possibly of higher dimension).
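The three steps can be sketched in plain Clojure over a sequence of row maps. This is a minimal illustration under stated assumptions, not the deck's implementation: the function names (`split-rows`, `apply-groups`, `combine-groups`, `sac`) and the sample data are hypothetical, and rows are ordinary maps rather than a dataset type.

```clojure
(defn split-rows
  "Split: bin rows by the value of key-fn."
  [key-fn rows]
  (group-by key-fn rows))

(defn apply-groups
  "Apply: run f on each group's rows, keeping the group key."
  [f groups]
  (for [[k rows] groups] [k (f rows)]))

(defn combine-groups
  "Combine: put every result row back into one vector, tagged with its key."
  [key-name results]
  (vec (mapcat (fn [[k rows]] (map #(assoc % key-name k) rows)) results)))

(defn sac [key-name f rows]
  (->> rows
       (split-rows key-name)
       (apply-groups f)
       (combine-groups key-name)))

;; Hypothetical sample data.
(def rows
  [{:group :a :x 1} {:group :b :x 10} {:group :a :x 3} {:group :b :x 30}])

;; An aggregating function: one output row per group.
(defn total [rs] [{:total (reduce + (map :x rs))}])

;; A transforming function: same number of rows as the input group.
(defn centre [rs]
  (let [m (/ (reduce + (map :x rs)) (count rs))]
    (map #(assoc % :centred (- (:x %) m)) rs)))

(sac :group total rows)   ; one row per group (group order is not guaranteed)
(sac :group centre rows)  ; four rows, each with a :centred value
```

The same pipeline handles both cases because the apply step only has to return a sequence of rows; whether that sequence has one row or many is up to the function.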
  • 35. Variations based on interface
        Input \ Output   Array   Data.Frame   List    Discarded
        Array            aaply   adply        alply   a_ply
        Data.Frame       daply   ddply        dlply   d_ply
        List             laply   ldply        llply   l_ply
    From: Wickham, The Split-Apply-Combine Strategy for Data Analysis
  • 36. Splitting Matrices - 2D: Split each element to a scalar [figure]
  • 37. Splitting Matrices - 2D: Split each column to a vector [figure]
  • 38. Splitting Matrices - 2D: Split each row to a vector [figure]
  • 39. Splitting Matrices - 2D (summary): Split each element to a scalar; split each column to a vector; split each row to a vector [figures]
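The three 2D splits can be sketched with a matrix represented as a vector of row vectors. That representation and the names `by-row`, `by-column`, and `by-element` are assumptions made for illustration; a matrix library would offer its own accessors.

```clojure
;; A 2 x 3 matrix as a vector of rows (an assumed representation).
(def m [[1 2 3]
        [4 5 6]])

(defn by-row
  "Split each row to a vector."
  [m]
  (vec m))

(defn by-column
  "Split each column to a vector."
  [m]
  (apply mapv vector m))

(defn by-element
  "Split each element to a scalar."
  [m]
  (vec (apply concat m)))

(by-row m)      ;; => [[1 2 3] [4 5 6]]
(by-column m)   ;; => [[1 4] [2 5] [3 6]]
(by-element m)  ;; => [1 2 3 4 5 6]

;; Apply a sum to each segment and combine the results into a vector.
(mapv #(reduce + %) (by-column m))  ;; => [5 7 9]
(mapv #(reduce + %) (by-row m))     ;; => [6 15]
```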
  • 40. Splitting Matrices - 3D: Split each element to a scalar [figure]
  • 41. Splitting Matrices - 3D: Split each row x=c1, y=c2 to a vector [figure]
  • 42. Splitting Matrices - 3D: Split each row x=c1, z=c2 to a vector [figure]
  • 43. Splitting Matrices - 3D: Split each row y=c1, z=c2 to a vector [figure]
  • 44. Splitting Matrices - 3D: Split each slice x=c to a 2D matrix [figure]
  • 45. Splitting Matrices - 3D: Split each slice y=c to a 2D matrix [figure]
  • 46. Splitting Matrices - 3D: Split each slice z=c to a 2D matrix [figure]
  • 47. Splitting Matrices - 3D (summary): Split each element to a scalar; split each row (x=c1, y=c2), (x=c1, z=c2), or (y=c1, z=c2) to a vector; split each slice x=c, y=c, or z=c to a 2D matrix [figures]
  • 48. Splitting a Dataset: Split by Symbol [figure: the dataset from slide 23 divided into three datasets with the same columns: AMZN (rows for 2013-02-01, 2013-02-04, 2013-02-05), IBM (rows for 2013-02-01, 2013-02-04), MSFT (rows for 2013-02-01, 2013-02-04)]
  • 49. Splitting a Dataset: Split by Date [figure: the dataset from slide 23 divided into three datasets with the same columns: 2013-02-01 (AMZN, IBM, MSFT), 2013-02-04 (AMZN, IBM, MSFT), 2013-02-05 (AMZN)]
  • 50. Splitting a Dataset: Split by Date [figure as on slide 49]. We’ll see more advanced splitting in the case study.
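Splitting by an identifier can be sketched with `group-by` on a sequence of row maps. The rows below keep only the Date, Symbol, and Close columns, with values read from the flattened slide text, so the pairing of values with columns is an assumption. Representing rows as plain maps and the helper name `group-sizes` are also choices made for illustration.

```clojure
;; Date, Symbol and Close only; values as read from the slide (an assumption).
(def stock-rows
  [{:Date "2013-02-01" :Symbol "AMZN" :Close 265.00}
   {:Date "2013-02-01" :Symbol "IBM"  :Close 205.18}
   {:Date "2013-02-01" :Symbol "MSFT" :Close 27.93}
   {:Date "2013-02-04" :Symbol "AMZN" :Close 259.98}
   {:Date "2013-02-04" :Symbol "IBM"  :Close 203.79}
   {:Date "2013-02-04" :Symbol "MSFT" :Close 27.44}
   {:Date "2013-02-05" :Symbol "AMZN" :Close 266.89}])

;; Keywords are functions of maps, so they work directly as split keys.
(def by-symbol (group-by :Symbol stock-rows))
(def by-date   (group-by :Date stock-rows))

(defn group-sizes
  "Hypothetical helper: number of rows in each group."
  [groups]
  (into {} (for [[k rows] groups] [k (count rows)])))

(group-sizes by-symbol)  ;; => {"AMZN" 3, "IBM" 2, "MSFT" 2}
(group-sizes by-date)    ;; => {"2013-02-01" 3, "2013-02-04" 3, "2013-02-05" 1}

;; juxt builds a compound key from several identifiers.
(def by-symbol-and-date (group-by (juxt :Symbol :Date) stock-rows))
(count by-symbol-and-date)  ;; => 7, one group per row in this sample
```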
  • 53. Apply: (func [figure: one segment]) ➜ result
  • 54. Apply: (func [figure: one segment]) ➜ result. The result must be appropriate for the output type.
  • 55. Combine: Assemble apply results into output [figure]
  • 57. Implementing ddply

    (ns split-apply-combine.ply
      "Implementation of the split-apply-combine functions, similar to R's
      plyr library."
      (:use [incanter.core :only [$data col-names conj-rows dataset]])
      (:require [split-apply-combine.core :as sac]))

    (defn fast-conj-rows
      "A simple version of conj-rows that runs much faster"
      [& datasets]
      (when (seq datasets)
        (dataset (col-names (first datasets))
                 (mapcat :rows datasets))))

    ;; Build the source of a one-argument function of a row from an
    ;; expression that refers to columns by keyword.
    (defn expr-to-fn [expr]
      (let [row-param (gensym "row-")
            kw-map (sac/build-keyword-map expr)]
        `(fn [~row-param]
           (let [~@(apply concat
                          (for [[kw sym] kw-map]
                            [sym `(get ~row-param ~kw ~kw)]))]
             ~(sac/convert-keywords expr kw-map)))))

    ;; Translate [keyword expr] pairs in a group-by spec into
    ;; [keyword fn-form] pairs; explicit fn forms and bare keywords
    ;; are passed through unchanged.
    (defn exprs-to-fns [group-by]
      (if (coll? group-by)
        (vec (for [item group-by]
               (if (and (coll? item)
                        (coll? (second item))
                        (not (#{'fn 'fn*} (first (second item)))))
                 [(first item) (expr-to-fn (second item))]
                 item)))
        group-by))

    (defn split-ds
      "Perform a split operation on data, which must be a dataset, using the
      group-by-fns to choose bins. group-by-fns is a collection of functions;
      their results for a row are combined into a vector that is the key for
      the bin.

      Returns a sequence of pairs [key dataset], where each dataset includes
      all the rows that had the given key.

      Note that keyword column names are the most common functions to use for
      the group-by."
      [group-by-fns data]
      (let [cols (col-names data)
            ;; juxt always yields a vector key, even for a single function,
            ;; so combine-ds can zip the key against the column names.
            group-by-fn (apply juxt group-by-fns)]
        (loop [cur (:rows data)
               row-groups {}]
          (if (empty? cur)
            (for [[group rows] row-groups]
              [group (dataset cols rows)])
            (recur (next cur)
                   (let [row (first cur)
                         k (group-by-fn row)
                         a (row-groups k)]
                     (assoc row-groups k (if a (conj a row) [row]))))))))

    (defn apply-ds
      "Apply fun to each group in grouped-data, returning a sequence of pairs
      of the original group keys and the result of applying the function to
      the dataset.

      See split-ds for information on the grouped-data data structure."
      [fun grouped-data]
      (for [[group split-data] grouped-data]
        [group (fun split-data)]))

    (defn combine-ds
      "Combine the datasets in grouped-data into a single dataset including
      the columns specified in the group-by argument as having the values
      found in the keys in the grouped data.

      If there are columns that are in both the key and the dataset, the
      values in the key have precedence."
      [group-by grouped-data]
      (let [group-by (if (coll? group-by) group-by [group-by])
            group-by-filter (complement (set group-by))]
        (apply fast-conj-rows
               (for [[group data] grouped-data]
                 (let [grouped-cols (zipmap group-by group)
                       union-cols (concat group-by
                                          (filter group-by-filter (col-names data)))]
                   (dataset union-cols
                            (map #(merge % grouped-cols) (:rows data))))))))

    (defn ddply*
      "Split-apply-combine from datasets to datasets.

      Splits data into the group of datasets specified by the group-by
      argument, applies fun to each of the resulting datasets, and combines
      the results back into a single dataset.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword keyfn] where the function keyfn is applied to each row to
      generate the key for that row. When the groups are combined, keyword is
      used as the column name for the resulting column. The two types of
      group-by specifications can be mixed.

      The result of the apply function can contain the same column names as
      the original dataset or different ones. It can contain the same number
      of rows as the original, a different number, or a single row.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Examples:
        (ddply* :Symbol (transform :Change = (diff0 :Close)) stock-data)
        (ddply* [[:Month #((juxt year month) (:timestamp %))]]
                (colwise :Volume sum) stock-data)"
      ([group-by fun] (ddply* group-by fun $data))
      ([group-by fun data]
       (let [group-by (if (coll? group-by) group-by [group-by])
             ;; normalize every entry to a [column-name key-fn] pair
             group-by (for [item group-by]
                        (if (coll? item) item [item item]))]
         (->> data
              (split-ds (map second group-by))
              (apply-ds fun)
              (combine-ds (map first group-by))))))

    (defmacro ddply
      "Split-apply-combine from datasets to datasets.

      This macro is a wrapper on ddply* which provides translation of simple
      column-referencing expressions in the group-by argument.

      Splits data into the group of datasets specified by the group-by
      argument, applies fun to each of the resulting datasets, and combines
      the results back into a single dataset.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword key-expr] where the expression key-expr is transformed into a
      function and keywords in the expression are expanded to accessors on
      rows. The resulting function is applied to each row to generate the key
      for that row. When the groups are combined, keyword is used as the
      column name for the resulting column. The two types of group-by
      specifications can be mixed.

      The result of the apply function can contain the same column names as
      the original dataset or different ones. It can contain the same number
      of rows as the original, a different number, or a single row.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Examples:
        (ddply :Symbol (transform :Change = (diff0 :Close)) stock-data)
        (ddply [[:Month ((juxt year month) :timestamp)]]
               (colwise :Volume sum) stock-data)"
      ([group-by fun]
       `(ddply* ~(exprs-to-fns group-by) ~fun $data))
      ([group-by fun data]
       `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))

    (defn d_ply*
      "Split-apply-combine from datasets to nothing. This version ignores the
      output of fun and is used for fun's side effects.

      Splits data into the group of datasets specified by the group-by
      argument, applies fun to each of the resulting datasets, and then drops
      the result.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword keyfn] where the function keyfn is applied to each row to
      generate the key for that row. The two types of group-by specifications
      can be mixed.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Example:
        (d_ply* :Symbol #(view (bar-chart :Date :Volume :data %)) stock-data)"
      ([group-by fun] (d_ply* group-by fun $data))
      ([group-by fun data]
       (let [group-by (if (coll? group-by) group-by [group-by])
             group-by (for [item group-by]
                        (if (coll? item) item [item item]))]
         ;; dorun forces the lazy sequence so the side effects happen
         (dorun
          (->> data
               (split-ds (map second group-by))
               (apply-ds fun))))))

    (defmacro d_ply
      "Split-apply-combine from datasets to nothing. This version ignores the
      output of fun and is used for fun's side effects.

      This macro is a wrapper on d_ply* which provides translation of simple
      column-referencing expressions in the group-by argument.

      Splits data into the group of datasets specified by the group-by
      argument, applies fun to each of the resulting datasets, and then drops
      the result.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword key-expr] where the expression key-expr is transformed into a
      function that is applied to each row to generate the key for that row.
      The two types of group-by specifications can be mixed.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Example:
        (d_ply :Symbol #(view (bar-chart :Date :Volume :data %)) stock-data)"
      ([group-by fun]
       `(d_ply* ~(exprs-to-fns group-by) ~fun $data))
      ([group-by fun data]
       `(d_ply* ~(exprs-to-fns group-by) ~fun ~data)))
  • 58. Implementing ddply - Split

    (ns split-apply-combine.ply
      "Implementation of the split-apply-combine functions, similar to R's
      plyr library."
      (:use [incanter.core :only [$data col-names conj-rows dataset]])
      (:require [split-apply-combine.core :as sac]))

    (defn fast-conj-rows
      "A simple version of conj-rows that runs much faster"
      [& datasets]
      (when (seq datasets)
        (dataset (col-names (first datasets))
                 (mapcat :rows datasets))))

    (defn expr-to-fn [expr]
      (let [row-param (gensym "row-")
            kw-map (sac/build-keyword-map expr)]
        `(fn [~row-param]
           (let [~@(apply concat
                          (for [[kw sym] kw-map]
                            [sym `(get ~row-param ~kw ~kw)]))]
             ~(sac/convert-keywords expr kw-map)))))

    (defn exprs-to-fns [group-by]
      (if (coll? group-by)
        (vec (for [item group-by]
               (if (and (coll? item)
                        (coll? (second item))
                        (not (#{'fn 'fn*} (first (second item)))))
                 [(first item) (expr-to-fn (second item))]
                 item)))
        group-by))

    (defn split-ds
      "Perform a split operation on data, which must be a dataset, using the
      group-by-fns to choose bins. group-by-fns is a collection of functions;
      when it holds more than one, their results are combined to create a key
      for the bin.

      Returns a sequence of [key dataset] pairs, where each dataset includes
      all the rows that produced that key.

      Note that keyword column names are the most common functions to use for
      the group-by."
      [group-by-fns data]
      (let [cols (col-names data)
            group-by-fn (if (= 1 (count group-by-fns))
                          (first group-by-fns)
                          (apply juxt group-by-fns))]
        (loop [cur (:rows data)
               row-groups {}]
          (if (empty? cur)
            (for [[group rows] row-groups]
              [group (dataset cols rows)])
            (recur (next cur)
                   (let [row (first cur)
                         k (group-by-fn row)
                         a (row-groups k)]
                     (assoc row-groups k (if a (conj a row) [row]))))))))

    (defn apply-ds
      "Apply fun to each group in grouped-data, returning a sequence of pairs
      of the original group keys and the result of applying the function to
      the dataset. See split-ds for information on the grouped-data data
      structure."
      [fun grouped-data]
      (for [[group split-data] grouped-data]
        [group (fun split-data)]))

    (defn combine-ds
      "Combine the datasets in grouped-data into a single dataset including
      the columns specified in the group-by argument as having the values
      found in the keys in the grouped data.

      If there are columns that are in both the key and the dataset, the
      values in the key have precedence."
      [group-by grouped-data]
      (let [group-by (if (coll? group-by) group-by [group-by])
            group-by-filter (complement (set group-by))]
        (apply fast-conj-rows
               (for [[group data] grouped-data]
                 (let [grouped-cols (zipmap group-by group)
                       union-cols (concat group-by
                                          (filter group-by-filter (col-names data)))]
                   (dataset union-cols
                            (map #(merge % grouped-cols) (:rows data))))))))

    (defn ddply*
      "Split-apply-combine from datasets to datasets.

      Splits data into a group of datasets as specified by the group-by
      argument, applies fun to each of the resulting datasets and combines
      the result of that back into a single dataset.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword keyfn] where the function keyfn is applied to each row to
      generate the key for that row. When the groups are combined, keyword is
      used as the column name for the resulting column. The two types of
      group-by specifications can be mixed.

      The result of the apply function can contain the same column names as
      the original dataset or different ones. It can contain the same number
      of rows as the original, a different number, or a single row.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Examples:
        (ddply* :Symbol (transform :Change = (diff0 :Close)) stock-data)
        (ddply* [[:Month #((juxt year month) (:timestamp %))]]
                (colwise :Volume sum) stock-data)"
      ([group-by fun]
       (ddply* group-by fun $data))
      ([group-by fun data]
       (let [group-by (if (coll? group-by) group-by [group-by])
             group-by (for [item group-by]
                        (if (coll? item) item [item item]))]
         (->> data
              (split-ds (map second group-by))
              (apply-ds fun)
              (combine-ds (map first group-by))))))

    (defmacro ddply
      "Split-apply-combine from datasets to datasets.

      This macro is a wrapper on ddply* which provides translation of simple
      column-referencing expressions in the group-by argument.

      Splits data into a group of datasets as specified by the group-by
      argument, applies fun to each of the resulting datasets and combines
      the result of that back into a single dataset.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword key-expr] where the expression key-expr is transformed to a
      function and keywords in the expression are expanded to accessors on
      rows. The resulting function is applied to each row to generate the key
      for that row. When the groups are combined, keyword is used as the
      column name for the resulting column. The two types of group-by
      specifications can be mixed.

      The result of the apply function can contain the same column names as
      the original dataset or different ones. It can contain the same number
      of rows as the original, a different number, or a single row.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Examples:
        (ddply :Symbol (transform :Change = (diff0 :Close)) stock-data)
        (ddply [[:Month ((juxt year month) :timestamp)]]
               (colwise :Volume sum) stock-data)"
      ([group-by fun]
       `(ddply* ~(exprs-to-fns group-by) ~fun $data))
      ([group-by fun data]
       `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))

    (defn d_ply*
      "Split-apply-combine from datasets to nothing. This version ignores the
      output of fun and is used for fun's side effects.

      Splits data into a group of datasets as specified by the group-by
      argument, applies fun to each of the resulting datasets and then drops
      the result.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword keyfn] where the function keyfn is applied to each row to
      generate the key for that row. When the groups are combined, keyword is
      used as the column name for the resulting column. The two types of
      group-by specifications can be mixed.

      The result of the apply function can contain the same column names as
      the original dataset or different ones. It can contain the same number
      of rows as the original, a different number, or a single row.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Example:
        (d_ply* :Symbol #(view (bar-chart :Date :Volume :data %)) stock-data)"
      ([group-by fun]
       ;; The slide calls ddply* here; d_ply* matches the documented behavior.
       (d_ply* group-by fun $data))
      ([group-by fun data]
       (let [group-by (if (coll? group-by) group-by [group-by])
             group-by (for [item group-by]
                        (if (coll? item) item [item item]))]
         (dorun
          (->> data
               (split-ds (map second group-by))
               (apply-ds fun))))))

    (defmacro d_ply
      "Split-apply-combine from datasets to nothing. This version ignores the
      output of fun and is used for fun's side effects.

      This macro is a wrapper on d_ply* which provides translation of simple
      column-referencing expressions in the group-by argument.

      Splits data into a group of datasets as specified by the group-by
      argument, applies fun to each of the resulting datasets and then drops
      the result.

      The group-by argument can be a keyword or collection of keywords which
      specify the columns to group by. It can also include pairs
      [keyword keyfn] where the function keyfn is applied to each row to
      generate the key for that row. When the groups are combined, keyword is
      used as the column name for the resulting column. The two types of
      group-by specifications can be mixed.

      The result of the apply function can contain the same column names as
      the original dataset or different ones. It can contain the same number
      of rows as the original, a different number, or a single row.

      If data is not specified, it defaults to the currently bound value of
      $data.

      Example:
        (d_ply :Symbol #(view (bar-chart :Date :Volume :data %)) stock-data)"
      ([group-by fun]
       `(d_ply* ~(exprs-to-fns group-by) ~fun $data))
      ([group-by fun data]
       `(d_ply* ~(exprs-to-fns group-by) ~fun ~data)))
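The split step can be tried without Incanter. The following is a minimal sketch, assuming rows are plain maps in a vector rather than an Incanter dataset; `split-rows` and the sample rows are hypothetical names and values made up for illustration, not part of the talk's library.

```clojure
;; Plain-Clojure sketch of the split step.
;; Assumption: a "dataset" is just a vector of row maps.
(def rows
  [{:Symbol "AMZN" :Close 265.0}
   {:Symbol "MSFT" :Close 27.93}
   {:Symbol "AMZN" :Close 259.98}])

(defn split-rows
  "Groups rows by the combined result of key-fns.
  Returns a map from key vector to a vector of rows."
  [key-fns rows]
  (group-by (apply juxt key-fns) rows))

(def groups (split-rows [:Symbol] rows))

;; Each key is a vector, one element per grouping function.
(println (keys groups))
```

Unlike the `split-ds` on the slide, which uses the bare key when only one grouping function is given, this sketch always wraps the key in a vector because it always goes through `juxt`.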
  • 59. Implementing ddply - Apply

    (defn apply-ds
      [fun grouped-data]
      (for [[group split-data] grouped-data]
        [group (fun split-data)]))
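To see what the apply step does to grouped data, here is a small sketch on plain maps. The grouped input, `apply-groups`, and `total-volume` are illustrative assumptions, not functions from the talk.

```clojure
;; Sketch of the apply step on already-split data.
;; Assumption: groups are held in a map from key vector to row vector.
(def grouped
  {["AMZN"] [{:Symbol "AMZN" :Volume 10} {:Symbol "AMZN" :Volume 30}]
   ["MSFT"] [{:Symbol "MSFT" :Volume 5}]})

(defn apply-groups
  "Applies f to the rows of every group, keeping the group keys."
  [f grouped]
  (into {} (for [[k rows] grouped]
             [k (f rows)])))

(defn total-volume
  "Collapses a group to a single summary row."
  [rows]
  [{:Volume (reduce + (map :Volume rows))}])

(def applied (apply-groups total-volume grouped))
```

The per-group function is free to return a different shape than it received, which is why the summary row here drops `:Symbol`; the combine step is what puts the key back.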
  • 60. Implementing ddply - Combine

    (defn combine-ds
      [group-by grouped-data]
      (let [group-by (if (coll? group-by) group-by [group-by])
            group-by-filter (complement (set group-by))]
        (apply fast-conj-rows
               (for [[group data] grouped-data]
                 (let [grouped-cols (zipmap group-by group)
                       union-cols (concat group-by
                                          (filter group-by-filter (col-names data)))]
                   (dataset union-cols
                            (map #(merge % grouped-cols) (:rows data))))))))
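The combine step re-attaches each group's key to its result rows and concatenates everything. A minimal sketch on plain maps follows; `combine-groups` and the sample input are hypothetical, and keys are assumed to be vectors with one element per key column.

```clojure
;; Sketch of the combine step.
;; Assumption: each key is a vector aligned with key-cols.
(defn combine-groups
  "Merges the key columns back into every result row and returns
  all rows as one vector. Key values win over row values."
  [key-cols grouped]
  (vec (for [[k rows] grouped
             row rows]
         (merge row (zipmap key-cols k)))))

(def applied
  {["AMZN"] [{:Volume 40}]
   ["MSFT"] [{:Volume 5}]})

(def combined (combine-groups [:Symbol] applied))
```

Because `merge` is called with the key columns last, a column present in both the key and the row takes the key's value, matching the precedence the `combine-ds` docstring describes.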
  • 61. Implementing ddply - Putting it all together

    (defn ddply*
      ([group-by fun]
       (ddply* group-by fun $data))
      ([group-by fun data]
       (let [group-by (if (coll? group-by) group-by [group-by])
             group-by (for [item group-by]
                        (if (coll? item) item [item item]))]
         (->> data
              (split-ds (map second group-by))
              (apply-ds fun)
              (combine-ds (map first group-by))))))

    (defmacro ddply
      ([group-by fun]
       `(ddply* ~(exprs-to-fns group-by) ~fun $data))
      ([group-by fun data]
       `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))
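The whole pipeline, including the way `ddply*` normalizes its group-by argument into `[column key-fn]` pairs, can be sketched end to end on plain maps. This is an illustration under the assumption that rows are maps in a vector; `normalize-group-by`, `ply-rows`, and `mean-close` are hypothetical names.

```clojure
;; End-to-end sketch: normalize the group-by spec, split, apply, combine.
(defn normalize-group-by
  "Turns :col or [:col [:name key-fn] ...] into a vector of
  [column-name key-fn] pairs; a bare keyword is its own key function."
  [by]
  (let [items (if (coll? by) by [by])]
    (vec (for [item items]
           (if (coll? item) item [item item])))))

(defn ply-rows
  "Splits rows by the group-by spec, applies f to each group, and
  merges the key columns back into the rows f returns."
  [by f rows]
  (let [spec     (normalize-group-by by)
        key-fn   (apply juxt (map second spec))
        key-cols (map first spec)]
    (vec (for [[k group] (group-by key-fn rows)
               out-row   (f group)]
           (merge out-row (zipmap key-cols k))))))

(def rows
  [{:Symbol "AMZN" :Close 10.0}
   {:Symbol "AMZN" :Close 20.0}
   {:Symbol "MSFT" :Close 4.0}])

(defn mean-close [group]
  [{:MeanClose (/ (reduce + (map :Close group)) (count group))}])

(def result (ply-rows :Symbol mean-close rows))
```

The first element of each pair names the output column and the second computes the key, which is why the slide's code passes `(map second group-by)` to the split and `(map first group-by)` to the combine.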
  • 62. Support functions - colwise

    (ddply :Symbol (colwise :num stats/mean) tech-stocks)
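The slide does not show how `colwise` is implemented. One plausible reading of `(colwise :num stats/mean)` is "apply the function to every numeric column and return a one-row summary"; the sketch below follows that reading on plain maps. `colwise-num` and `mean` are hypothetical stand-ins, and the interpretation of `:num` is an assumption.

```clojure
;; Sketch of a column-wise summary.
;; Assumptions: rows are maps with keyword columns, and "numeric column"
;; means every value in that column satisfies number?.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn colwise-num
  "Returns a function that applies f to each numeric column of a group
  and yields a single summary row."
  [f]
  (fn [rows]
    (let [cols     (keys (first rows))
          numeric? (fn [c] (every? number? (map c rows)))]
      [(into {} (for [c cols :when (numeric? c)]
                  [c (f (map c rows))]))])))

(def summary
  ((colwise-num mean)
   [{:Symbol "AMZN" :Close 10.0 :Volume 100}
    {:Symbol "AMZN" :Close 20.0 :Volume 300}]))
```

Non-numeric columns such as `:Symbol` are dropped from the summary row; in a split-apply-combine call the combine step would add the grouping column back.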
  • 63. Support functions - transform

    (ddply :Symbol
           (transform :Change = (diff0 :Close)
                      :Date =* (time-format/parse
                                (time-format/formatters :year-month-day)
                                :Date))
           tech-stocks)
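The slide does not define `diff0` or spell out how `transform` distinguishes `=` from `=*`. As a rough sketch of the `:Change` column only, the code below assumes `diff0` means "first differences with a leading 0" and uses a plain helper, `add-column`, in place of the `transform` macro; both are guesses made for illustration.

```clojure
;; Sketch of adding a derived column computed from a whole column.
(defn diff0
  "Assumed behavior: successive differences of xs, with 0 in the first
  position so the result is as long as xs."
  [xs]
  (vec (cons 0 (map - (rest xs) xs))))

(defn add-column
  "Adds column col to rows, taking its values from (col-fn rows)."
  [rows col col-fn]
  (mapv (fn [row v] (assoc row col v)) rows (col-fn rows)))

(def closes [{:Close 10} {:Close 13} {:Close 11}])

(def with-change
  (add-column closes :Change #(diff0 (map :Close %))))
```

Running this per group, as `ddply :Symbol` does, keeps the first row of each symbol at a change of 0 instead of differencing across symbols.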
  • 65. A Case Study
      “SpaceCurve delivers instantaneous intelligence for location-based services, commodities, defense, emergency services and other markets. The company is developing Big Data solutions that continuously store and immediately analyze massive amounts of multidimensional data.”
      Performance analysis of large-scale geospatial-temporal ingest and query on the SpaceCurve multidimensional DB
  • 66. Our Sample Problem
      [Diagram: six systems (10.0.1.101, 10.0.1.102, 10.0.1.107, 10.0.1.109, 10.0.1.111, 10.0.1.112), each with 24 cores labeled cpu00 to cpu23, connected to external clients through a 10GB/s/channel switch]
      ‣ CPU load data
      ‣ 6 systems
      ‣ 24 cores each
      ‣ 6 data points
      ‣ 1 sample/second
      ‣ ~38 minutes run time
      Total of ~2 million data points
      Small subset of the overall SpaceCurve analysis
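The total of roughly 2 million can be checked from the bullet values on the slide. The calculation below assumes the 6 data points are recorded for every core at every one-second sample, which the slide does not state explicitly.

```clojure
(def systems 6)
(def cores-per-system 24)
(def data-points-per-core 6)   ; assumed to be per core, per sample
(def samples-per-second 1)
(def run-seconds (* 38 60))    ; ~38 minutes

(def total-data-points
  (* systems cores-per-system data-points-per-core
     samples-per-second run-seconds))

total-data-points
;; => 1969920
```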
  • 67. Time to see it work...
  • 74. Where to?
      • A full library implementation of Split-Apply-Combine and helpers
      • Add to Incanter?
      • Performance optimizations (mutable intermediate results, column-oriented datasets)
      • Implementation based on reducers and parallelism
      • Explore the continuum from data exploration tools (R, Incanter) to large-scale data analysis (Hadoop, Cascalog, SpaceCurve, etc.)
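As one hypothetical direction for the reducers item above, a grouped aggregation can be written with fold from clojure.core.reducers, which splits a vector into chunks, reduces each chunk, and merges the partial results. This is a sketch under the assumption that rows are plain maps; it is not from the talk, and grouped-sum is an illustrative name.

```clojure
(require '[clojure.core.reducers :as r])

(defn grouped-sum
  "Sums (val-fn row) per (key-fn row). Each chunk of rows is reduced
  into its own map, and the per-chunk maps are merged with +."
  [key-fn val-fn rows]
  (r/fold
   ;; combine step: called with no args for the identity value,
   ;; and with two partial maps to merge them
   (fn
     ([] {})
     ([a b] (merge-with + a b)))
   ;; reduce step: fold one row into the chunk's map
   (fn [acc row]
     (let [k (key-fn row)]
       (assoc acc k (+ (get acc k 0) (val-fn row)))))
   ;; fold only runs in parallel on foldable collections such as vectors
   (vec rows)))

(grouped-sum :Symbol :Volume
             [{:Symbol "AMZN" :Volume 100}
              {:Symbol "AMZN" :Volume 300}
              {:Symbol "MSFT" :Volume 200}])
;; => {"AMZN" 400, "MSFT" 200}
```

Here split and combine collapse into a single pass, which works for sums because the per-group operation is associative; an arbitrary apply function would still need the groups materialized first.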
  • 76. References
      • Source for this presentation: https://www.github.com/tomfaulhaber/split-apply-combine
      • The R Project: http://www.r-project.org
      • The plyr home page: http://plyr.had.co.nz
      • Hadley Wickham, The Split-Apply-Combine Strategy for Data Analysis, Journal of Statistical Software, April 2011, Volume 40, Issue 1
      • Incanter project: http://incanter.org
      • Eric Rochester, The Clojure Data Analysis Cookbook, Packt Publishing, 2013
      • Bruce Durling, Quick and Dirty Data Science with Incanter, talk from EuroClojure 2012, http://confreaks.com/videos/2071-euroclojure2012-quick-and-dirty-data-science-with-incanter
      • SpaceCurve: http://www.spacecurve.com
      Tom Faulhaber
      twitter: @tomfaulhaber
      github: tomfaulhaber
  • 77. Photo Credits
      • Florida Home - anoldent on flickr (http://www.flickr.com/photos/anoldent/2405722434/)
      • Midland Coal Mine - jasonwoodhead23 on flickr (http://www.flickr.com/photos/woodhead/8522679843/)
      • Paradise - Antti Simonen on flickr (http://www.flickr.com/photos/anttisimonen/6041095682/)
      • Traders on the Exchange - thetaxhaven on flickr (http://www.flickr.com/photos/83532250@N06/7651028854)
      • Louvre - dynamosquito on flickr (http://www.flickr.com/photos/25182210@N07/2802458437/)
      • Construction - Aapo Haapanen on flickr (http://www.flickr.com/photos/decade_null/214247988/)
      • Server farm - from the SpaceCurve website (http://www.spacecurve.com)
      • Sailboat race - Ryk Van Toronto on flickr (http://www.flickr.com/photos/sydandsaskia/394507351)
      • Arguing Philosophers - David Schroeter on flickr (http://www.flickr.com/photos/53477785@N00/92134612/)