Implementing the Split-Apply-Combine model in Clojure and Incanter

These are the slides from my talk to the Bay Area Clojure Group meeting in San Francisco on June 6, 2013.

The slides are not meant to stand alone, so they may not be completely useful if you did not attend.

Here is the description of the talk sent out in advance:

Tom Faulhaber will talk about interactive data analysis focusing on data organization and the split-apply-combine pattern. You'll find that split-apply-combine is a powerful tool that applies to many of the data problems that we look at in Clojure. This pattern is the basis of the popular plyr package developed by Hadley Wickham in the R language.

Tom will demonstrate some basic ideas of data analysis and show how they're implemented in the Incanter system. We'll discuss split-apply-combine and how it's used in Incanter today. Then, we'll discuss how to implement a full version of split-apply-combine in Clojure on top of Incanter's dataset type. Finally, we'll use our implementation to learn about some real data.

Published in: Technology

Transcript

  • 1. Using Split-Apply-Combine for Data Analysis in Clojure. Bay Area Clojure Group, June 6, 2013. Tom Faulhaber (twitter: @tomfaulhaber, github: tomfaulhaber). Saturday, June 8, 13
  • 7. Data Structures for Data Analysis
  • 8. The Vector: [265.0 259.98 266.89 262.22 ...]
  • 9. The Vector: (mean [265.0 259.98 266.89 262.22 ...]) ➜ 263.697
  • 10. The Vector: (apply min [265.0 259.98 266.89 262.22 ...]) ➜ 257.21
  • 11. The Vector: (apply max [265.0 259.98 266.89 262.22 ...]) ➜ 269.75
  • 12. The Vector: (sd [265.0 259.98 266.89 262.22 ...]) ➜ 3.815
  • 13. The Vector: (quantile [265.0 259.98 266.89 262.22 ...]) ➜ [257.21 260.105 264.27 266.175 269.75]
  • 14. The Vector: [265.0 259.98 266.89 262.22 ...]
  • 15. The Vector: (histogram [265.0 259.98 266.89 262.22 ...]) ➜ [figure: histogram]
  • 16. The Vector: [265.0 259.98 266.89 262.22 ...]
  • 17. The Vector: (line-chart [265.0 259.98 266.89 262.22 ...]) ➜ [figure: line chart]
  • 18. The Matrix
  • 19. The Matrix: 1 Dimension [figure]
  • 20. The Matrix: 1 Dimension, 2 Dimensions [figure]
  • 21. The Matrix: 1 Dimension, 2 Dimensions, 3 Dimensions [figure]
  • 22. Key-Value Pairs
      {"IBM"  [205.18 203.79 202.79 201.02 ...],
       "MSFT" [27.93 27.44 27.5 27.34 ...],
       "AMZN" [265.0 259.98 266.89 262.22 ...]}
      Using key-value pairs can organize multiple data units (such as trials, customers, etc.) or collect parameter data
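The key-value organization on slide 22 can be sketched in plain Clojure. This is a minimal illustration and not code from the talk: it uses only the four values per symbol that are visible on the slide (the elided values are not reconstructed), and the names prices and summarize are hypothetical.

```clojure
;; Only the four values shown on the slide; the "..." part is left out.
(def prices
  {"IBM"  [205.18 203.79 202.79 201.02]
   "MSFT" [27.93 27.44 27.5 27.34]
   "AMZN" [265.0 259.98 266.89 262.22]})

(defn summarize
  "Apply f to the vector stored under every key, keeping the keys.
   Hypothetical helper for illustration only."
  [f kv]
  (into {} (for [[k xs] kv] [k (f xs)])))

;; Highest and lowest value per symbol.
(def highs (summarize #(apply max %) prices))
(def lows  (summarize #(apply min %) prices))

highs ;; => {"IBM" 205.18, "MSFT" 27.93, "AMZN" 266.89}
lows  ;; => {"IBM" 201.02, "MSFT" 27.34, "AMZN" 259.98}
```

Each key names one data unit, and the same function is applied to every unit's vector independently, which already has the shape of the split-apply-combine pattern.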
  • 23. The Dataset
      Date        Symbol  Open    High    Low     Close   Volume    Adj Close
      2013-02-01  AMZN    268.93  268.93  262.80  265.00   6115000  265.00
      2013-02-01  IBM     204.65  205.35  203.84  205.18   3370700  203.37
      2013-02-01  MSFT     27.67   28.05   27.55   27.93  55565900   27.51
      2013-02-04  AMZN    262.78  264.68  259.07  259.98   3723600  259.98
      2013-02-04  IBM     204.19  205.02  203.57  203.79   3188800  201.99
      2013-02-04  MSFT     27.87   28.02   27.42   27.44  50540000   27.03
      2013-02-05  AMZN    262.00  268.03  261.46  266.89   4012900  266.89
      ...
  • 24. The Dataset (same table as slide 23): Items in a column have the same type
  • 25. The Dataset (same table as slide 23): Across a row, there may be different types
  • 26. The Dataset (same table as slide 23)
  • 27. The Dataset (same table as slide 23): Identifiers
  • 28. The Dataset (same table as slide 23): Identifiers, Measurements
  • 29. Split-Apply-Combine
  • 30. Split-Apply-Combine: Pattern described by Hadley Wickham and implemented in the plyr library for R. Home page: http://plyr.had.co.nz
      JSS Journal of Statistical Software, April 2011, Volume 40, Issue 1. http://www.jstatsoft.org/
      The Split-Apply-Combine Strategy for Data Analysis
      Hadley Wickham, Rice University
      Abstract: Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This insight gives rise to a new R package that allows you to smoothly apply this strategy, without having to worry about the type of structure in which your data is stored. The paper includes two case studies showing how these insights make it easier to work with batting records for veteran baseball players and a large 3d array of spatio-temporal ozone measurements.
      Keywords: R, apply, split, data analysis.
      1. Introduction
      What do we do when we analyze data? What are common actions and what are common mistakes? Given the importance of this activity in statistics, there is remarkably little research on how data analysis happens. This paper attempts to remedy a very small part of that lack by describing one common data analysis pattern: Split-apply-combine. You see the split-apply-combine strategy whenever you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together. This crops up in all stages of an analysis:
      During data preparation, when performing group-wise ranking, standardization, or normalization, or in general when creating new variables that are most easily calculated on a per-group basis.
      When creating summaries for display or analysis, for example, when calculating marginal means, or conditioning a table of counts by dividing out group sums.
  • 31. Split. Apply. Combine.
  • 32. Split the object based on dimension(s) or identifiers (yielding segments of the same type). Apply. Combine.
  • 33. Split the object based on dimension(s) or identifiers (yielding segments of the same type). Apply a function to each segment, producing a new segment of the target type. The function can aggregate or transform the segment. Combine.
  • 34. Split the object based on dimension(s) or identifiers (yielding segments of the same type). Apply a function to each segment, producing a new segment of the target type. The function can aggregate or transform the segment. Combine the results into an output type (possibly of higher dimension).
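The three steps on slide 34 can be sketched with nothing but clojure.core. This is a minimal illustration under stated assumptions, not the implementation from the talk: rows are plain maps rather than an Incanter dataset, only the Date, Symbol, and Volume columns of the sample data are used, and split-rows, apply-groups, combine-rows, and total-volume are hypothetical names.

```clojure
;; Date, Symbol, and Volume from the sample stock data on the dataset slides.
(def rows
  [{:Date "2013-02-01" :Symbol "AMZN" :Volume 6115000}
   {:Date "2013-02-01" :Symbol "IBM"  :Volume 3370700}
   {:Date "2013-02-01" :Symbol "MSFT" :Volume 55565900}
   {:Date "2013-02-04" :Symbol "AMZN" :Volume 3723600}
   {:Date "2013-02-04" :Symbol "IBM"  :Volume 3188800}
   {:Date "2013-02-04" :Symbol "MSFT" :Volume 50540000}
   {:Date "2013-02-05" :Symbol "AMZN" :Volume 4012900}])

(defn split-rows
  "Split: group rows by key-fn, giving a map of key -> vector of rows."
  [key-fn rows]
  (group-by key-fn rows))

(defn apply-groups
  "Apply: run f on every segment, giving a seq of [key result-rows] pairs."
  [f groups]
  (for [[k segment] groups]
    [k (f segment)]))

(defn combine-rows
  "Combine: flatten the per-group results into one vector of rows,
   writing the group key back into each row under key-name."
  [key-name results]
  (vec (for [[k result-rows] results
             row result-rows]
         (assoc row key-name k))))

(defn total-volume
  "An aggregating apply function: one output row per segment."
  [segment]
  [{:TotalVolume (reduce + (map :Volume segment))}])

(def result
  (->> rows
       (split-rows :Symbol)
       (apply-groups total-volume)
       (combine-rows :Symbol)))

;; Group order is not guaranteed, so compare as a set.
(set result)
;; => #{{:TotalVolume 13851500, :Symbol "AMZN"}
;;      {:TotalVolume 6559500, :Symbol "IBM"}
;;      {:TotalVolume 106105900, :Symbol "MSFT"}}
```

Because the apply function returns a sequence of rows, the same pipeline handles both aggregation (one row per segment, as here) and transformation (as many rows as the segment had).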
  • 35. Variations based on interface
      Input \ Output   Array   Data.Frame   List    Discarded
      Array            aaply   adply        alply   a_ply
      Data.Frame       daply   ddply        dlply   d_ply
      List             laply   ldply        llply   l_ply
      From: Wickham, The Split-Apply-Combine Strategy for Data Analysis
  • 36. Splitting Matrices - 2D: Split each element to a scalar [figure]
  • 37. Splitting Matrices - 2D: Split each column to a vector [figure]
  • 38. Splitting Matrices - 2D: Split each row to a vector [figure]
  • 39. Splitting Matrices - 2D: Split each element to a scalar; Split each column to a vector; Split each row to a vector [figure]
  • 40. Splitting Matrices - 3D: Split each element to a scalar [figure]
  • 41. Splitting Matrices - 3D: Split each row x=c1, y=c2 to a vector [figure]
  • 42. Splitting Matrices - 3D: Split each row x=c1, z=c2 to a vector [figure]
  • 43. Splitting Matrices - 3D: Split each row y=c1, z=c2 to a vector [figure]
  • 44. Splitting Matrices - 3D: Split each slice x=c to a 2D matrix [figure]
  • 45. Splitting Matrices - 3D: Split each slice y=c to a 2D matrix [figure]
  • 46. Splitting Matrices - 3D: Split each slice z=c to a 2D matrix [figure]
  • 47. Splitting Matrices - 3D: Split each element to a scalar; Split each row x=c1, y=c2 to a vector; Split each row x=c1, z=c2 to a vector; Split each row y=c1, z=c2 to a vector; Split each slice x=c to a 2D matrix; Split each slice y=c to a 2D matrix; Split each slice z=c to a 2D matrix [figure]
  • 48. Splitting a Dataset: Split by Symbol (input: the table from slide 23)
      Date        Symbol  Open    High    Low     Close   Volume    Adj Close
      2013-02-01  AMZN    268.93  268.93  262.80  265.00   6115000  265.00
      2013-02-04  AMZN    262.78  264.68  259.07  259.98   3723600  259.98
      2013-02-05  AMZN    262.00  268.03  261.46  266.89   4012900  266.89

      Date        Symbol  Open    High    Low     Close   Volume    Adj Close
      2013-02-01  IBM     204.65  205.35  203.84  205.18   3370700  203.37
      2013-02-04  IBM     204.19  205.02  203.57  203.79   3188800  201.99

      Date        Symbol  Open    High    Low     Close   Volume    Adj Close
      2013-02-01  MSFT     27.67   28.05   27.55   27.93  55565900   27.51
      2013-02-04  MSFT     27.87   28.02   27.42   27.44  50540000   27.03
  • 49. Splitting a Dataset: Split by Date (input: the table from slide 23)
      Date        Symbol  Open    High    Low     Close   Volume    Adj Close
      2013-02-01  AMZN    268.93  268.93  262.80  265.00   6115000  265.00
      2013-02-01  IBM     204.65  205.35  203.84  205.18   3370700  203.37
      2013-02-01  MSFT     27.67   28.05   27.55   27.93  55565900   27.51

      Date        Symbol  Open    High    Low     Close   Volume    Adj Close
      2013-02-04  AMZN    262.78  264.68  259.07  259.98   3723600  259.98
      2013-02-04  IBM     204.19  205.02  203.57  203.79   3188800  201.99
      2013-02-04  MSFT     27.87   28.02   27.42   27.44  50540000   27.03

      Date        Symbol  Open    High    Low     Close   Volume    Adj Close
      2013-02-05  AMZN    262.00  268.03  261.46  266.89   4012900  266.89
  • 50. Splitting a Dataset: Split by Date (same split as slide 49). We'll see more advanced splitting in the case study
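Splitting by a single column, and by a key built from several functions, can be sketched as follows. This is an illustration in plain Clojure, assuming rows are maps rather than an Incanter dataset; split-by and month-of are hypothetical names, and the composite key built with juxt mirrors the approach split-ds takes in the talk's code.

```clojure
;; Date, Symbol, and Volume from the sample stock data on the dataset slides.
(def rows
  [{:Date "2013-02-01" :Symbol "AMZN" :Volume 6115000}
   {:Date "2013-02-01" :Symbol "IBM"  :Volume 3370700}
   {:Date "2013-02-01" :Symbol "MSFT" :Volume 55565900}
   {:Date "2013-02-04" :Symbol "AMZN" :Volume 3723600}
   {:Date "2013-02-04" :Symbol "IBM"  :Volume 3188800}
   {:Date "2013-02-04" :Symbol "MSFT" :Volume 50540000}
   {:Date "2013-02-05" :Symbol "AMZN" :Volume 4012900}])

(defn split-by
  "Split rows into a map of key -> vector of rows. With one key function
   the key is its result; with several, the key is the vector of all
   their results (built with juxt)."
  [key-fns rows]
  (let [key-fn (if (= 1 (count key-fns))
                 (first key-fns)
                 (apply juxt key-fns))]
    (group-by key-fn rows)))

(defn month-of
  "Derived key: the YYYY-MM prefix of the row's date string."
  [row]
  (subs (:Date row) 0 7))

;; Split by one column (keywords act as functions on row maps).
(def by-date (split-by [:Date] rows))

;; Split by a column plus a derived key.
(def by-symbol-month (split-by [:Symbol month-of] rows))

(count (by-date "2013-02-04"))                ;; => 3
(count (by-symbol-month ["AMZN" "2013-02"]))  ;; => 3
(count (by-symbol-month ["MSFT" "2013-02"]))  ;; => 2
```

Since any function of a row can serve as a key, splits are not limited to existing columns, which is what makes derived groupings such as "by month" possible.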
  • 51. Apply [figure]
  • 52. Apply: (func [figure])
  • 53. Apply: (func [figure]) ➜ result
  • 54. Apply: (func [figure]) ➜ result. The result must be appropriate for the output type
  • 55. Combine: Assemble apply results into output [figure]
  • 56. Implementing ddply in Clojure
  • 57. Implementing ddply
      (ns split-apply-combine.ply
        "Implementation of the split-apply-combine functions, similar to R's plyr library."
        (:use [incanter.core :only [$data col-names conj-rows dataset]])
        (:require [split-apply-combine.core :as sac]))

      (defn fast-conj-rows
        "A simple version of conj-rows that runs much faster"
        [& datasets]
        (when (seq datasets)
          (dataset (col-names (first datasets))
                   (mapcat :rows datasets))))

      (defn expr-to-fn
        [expr]
        (let [row-param (gensym "row-")
              kw-map (sac/build-keyword-map expr)]
          `(fn [~row-param]
             (let [~@(apply concat
                            (for [[kw sym] kw-map]
                              [sym `(get ~row-param ~kw ~kw)]))]
               ~(sac/convert-keywords expr kw-map)))))

      (defn exprs-to-fns
        [group-by]
        (if (coll? group-by)
          (vec (for [item group-by]
                 (if (and (coll? item)
                          (coll? (second item))
                          (not (#{'fn 'fn*} (first (second item)))))
                   [(first item) (expr-to-fn (second item))]
                   item)))
          group-by))

      (defn split-ds
        "Perform a split operation on data, which must be a dataset, using the group-by-fns
        to choose bins. group-by-fns can either be a single function or a collection of
        functions. In the latter case, the results will be combined to create a key for
        the bin. Returns a sequence of pairs of the group-by-fns results and datasets
        including all the rows that had the given result.

        Note that keyword column names are the most common functions to use for the
        group-by."
        [group-by-fns data]
        (let [cols (col-names data)
              group-by-fn (if (= 1 (count group-by-fns))
                            (first group-by-fns)
                            (apply juxt group-by-fns))]
          (loop [cur (:rows data) row-groups {}]
            (if (empty? cur)
              (for [[group rows] row-groups] [group (dataset cols rows)])
              (recur (next cur)
                     (let [row (first cur)
                           k (group-by-fn row)
                           a (row-groups k)]
                       (assoc row-groups k (if a (conj a row) [row]))))))))

      (defn apply-ds
        "Apply fun to each group in grouped-data returning a sequence of pairs of the
        original group-keys and the result of applying the function to the dataset.

        See split-ds for information on the grouped-data data structure."
        [fun grouped-data]
        (for [[group split-data] grouped-data]
          [group (fun split-data)]))

      (defn combine-ds
        "Combine the datasets in grouped-data into a single dataset including the
        columns specified in the group-by argument as having the values found in
        the keys in the grouped data.

        If there are columns that are in both the key and the dataset, the values
        in the key have precedence."
        [group-by grouped-data]
        (let [group-by (if (coll? group-by) group-by [group-by])
              group-by-filter (complement (set group-by))]
          (apply fast-conj-rows
                 (for [[group data] grouped-data]
                   (let [grouped-cols (zipmap group-by group)
                         union-cols (concat group-by
                                            (filter group-by-filter
                                                    (col-names data)))]
                     (dataset union-cols
                              (map #(merge % grouped-cols)
                                   (:rows data))))))))

      (defn ddply*
        "Split-apply-combine from datasets to datasets.

        Splits data into a group of datasets as specified by the group-by argument,
        applies fun to each of the resulting datasets and combines the result of that
        back into a single dataset.

        The group-by argument can be a keyword or collection of keywords which specify
        the columns to group by. It can also include pairs [keyword keyfn] where the
        function keyfn is applied to each row to generate the key for that row. When
        the groups are combined, keyword is used as the column name for the resulting
        column. The two types of group-by specifications can be mixed.

        The result of the apply function can contain the same column names as the
        original dataset or different ones. It can contain the same number of rows as
        the original, a different number, or a single row.

        If data is not specified, it defaults to the currently bound value of $data.

        Examples:
        (ddply* :Symbol
                (transform :Change = (diff0 :Close))
                stock-data)

        (ddply* [[:Month #((juxt year month) (:timestamp %))]]
                (colwise :Volume sum)
                stock-data)"
        ([group-by fun]
         (ddply* group-by fun $data))
        ([group-by fun data]
         (let [group-by (if (coll? group-by) group-by [group-by])
               group-by (for [item group-by]
                          (if (coll? item) item [item item]))]
           (->> data
                (split-ds (map second group-by))
                (apply-ds fun)
                (combine-ds (map first group-by))))))

      (defmacro ddply
        "Split-apply-combine from datasets to datasets. This macro is a wrapper on ddply*
        which provides translation of simple column-referencing expressions in the group-by
        argument.

        Splits data into a group of datasets as specified by the group-by argument,
        applies fun to each of the resulting datasets and combines the result of that
        back into a single dataset.

        The group-by argument can be a keyword or collection of keywords which specify
        the columns to group by. It can also include pairs [keyword key-expr] where the
        expression key-expr is transformed to a function and keywords in the expression
        are expanded to accessors on rows. The resulting function is applied to each row
        to generate the key for that row. When the groups are combined, keyword is used
        as the column name for the resulting column. The two types of group-by
        specifications can be mixed.

        The result of the apply function can contain the same column names as the
        original dataset or different ones. It can contain the same number of rows as
        the original, a different number, or a single row.

        If data is not specified, it defaults to the currently bound value of $data.

        Examples:
        (ddply :Symbol
               (transform :Change = (diff0 :Close))
               stock-data)

        (ddply [[:Month ((juxt year month) :timestamp)]]
               (colwise :Volume sum)
               stock-data)"
        ([group-by fun]
         `(ddply* ~(exprs-to-fns group-by) ~fun $data))
        ([group-by fun data]
         `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))

      (defn d_ply*
        "Split-apply-combine from datasets to nothing. This version ignores the output of
        fun and is used for fun's side effects.

        Splits data into a group of datasets as specified by the group-by argument,
        applies fun to each of the resulting datasets and then drops the result.

        The group-by argument can be a keyword or collection of keywords which specify
        the columns to group by. It can also include pairs [keyword keyfn] where the
        function keyfn is applied to each row to generate the key for that row.

        If data is not specified, it defaults to the currently bound value of $data.

        Example:
        (d_ply* :Symbol
                #(view (bar-chart :Date :Volume :data %))
                stock-data)"
        ([group-by fun]
         (d_ply* group-by fun $data))
        ([group-by fun data]
         (let [group-by (if (coll? group-by) group-by [group-by])
               group-by (for [item group-by]
                          (if (coll? item) item [item item]))]
           (dorun
            (->> data
                 (split-ds (map second group-by))
                 (apply-ds fun))))))

      (defmacro d_ply
        "Split-apply-combine from datasets to nothing. This version ignores the output of
        fun and is used for fun's side effects. This macro is a wrapper on d_ply*
        which provides translation of simple column-referencing expressions in the group-by
        argument.

        Splits data into a group of datasets as specified by the group-by argument,
        applies fun to each of the resulting datasets and then drops the result.

        The group-by argument can be a keyword or collection of keywords which specify
        the columns to group by. It can also include pairs [keyword keyfn] where the
        function keyfn is applied to each row to generate the key for that row.

        If data is not specified, it defaults to the currently bound value of $data.

        Example:
        (d_ply :Symbol
               #(view (bar-chart :Date :Volume :data %))
               stock-data)"
        ([group-by fun]
         `(d_ply* ~(exprs-to-fns group-by) ~fun $data))
        ([group-by fun data]
         `(d_ply* ~(exprs-to-fns group-by) ~fun ~data)))
  • 58. Implementing ddply - Split
      (defn split-ds
        [group-by-fns data]
        (let [cols (col-names data)
              group-by-fn (if (= 1 (count group-by-fns))
                            (first group-by-fns)
                            (apply juxt group-by-fns))]
          (loop [cur (:rows data) row-groups {}]
            (if (empty? cur)
              (for [[group rows] row-groups] [group (dataset cols rows)])
              (recur (next cur)
                     (let [row (first cur)
                           k (group-by-fn row)
                           a (row-groups k)]
                       (assoc row-groups k (if a (conj a row) [row]))))))))
  • 59. Implementing ddply - Apply
(ns split-apply-combine.ply
  "Implementation of the split-apply-combine functions, similar to R's plyr library."
  (:use [incanter.core :only [$data col-names conj-rows dataset]])
  (:require [split-apply-combine.core :as sac]))

(defn fast-conj-rows
  "A simple version of conj-rows that runs much faster"
  [& datasets]
  (when (seq datasets)
    (dataset (col-names (first datasets))
             (mapcat :rows datasets))))

(defn expr-to-fn
  [expr]
  (let [row-param (gensym "row-")
        kw-map (sac/build-keyword-map expr)]
    `(fn [~row-param]
       (let [~@(apply concat
                      (for [[kw sym] kw-map]
                        [sym `(get ~row-param ~kw ~kw)]))]
         ~(sac/convert-keywords expr kw-map)))))

(defn exprs-to-fns
  [group-by]
  (if (coll? group-by)
    (vec (for [item group-by]
           (if (and (coll? item)
                    (coll? (second item))
                    ;; the set must be quoted: fn is a macro and fn* a special form
                    (not ('#{fn fn*} (first (second item)))))
             [(first item) (expr-to-fn (second item))]
             item)))
    group-by))

(defn split-ds
  "Perform a split operation on data, which must be a dataset, using the group-by-fns
  to choose bins. group-by-fns is a collection of functions. If it holds more than
  one, their results are combined to create a key for the bin. Returns a sequence
  of pairs of the group-by-fns result and a dataset including all the rows that
  had the given result.
  Note that keyword column names are the most common functions to use for the
  group-by."
  [group-by-fns data]
  (let [cols (col-names data)
        group-by-fn (if (= 1 (count group-by-fns))
                      (first group-by-fns)
                      (apply juxt group-by-fns))]
    (loop [cur (:rows data) row-groups {}]
      (if (empty? cur)
        (for [[group rows] row-groups] [group (dataset cols rows)])
        (recur (next cur)
               (let [row (first cur)
                     k (group-by-fn row)
                     a (row-groups k)]
                 (assoc row-groups k (if a (conj a row) [row]))))))))

(defn apply-ds
  "Apply fun to each group in grouped-data returning a sequence of pairs of the
  original group-keys and the result of applying the function to the dataset.
  See split-ds for information on the grouped-data data structure."
  [fun grouped-data]
  (for [[group split-data] grouped-data]
    [group (fun split-data)]))

(defn combine-ds
  "Combine the datasets in grouped-data into a single dataset including the
  columns specified in the group-by argument as having the values found in
  the keys in the grouped data.
  If there are columns that are in both the key and the dataset, the values
  in the key have precedence."
  [group-by grouped-data]
  (let [group-by (if (coll? group-by) group-by [group-by])
        group-by-filter (complement (set group-by))]
    (apply fast-conj-rows
           (for [[group data] grouped-data]
             (let [grouped-cols (zipmap group-by
                                        ;; with a single group-by function, split-ds
                                        ;; yields a bare key rather than a vector of
                                        ;; keys, so wrap it before zipping
                                        (if (= 1 (count group-by)) [group] group))
                   union-cols (concat group-by
                                      (filter group-by-filter
                                              (col-names data)))]
               (dataset union-cols
                        (map #(merge % grouped-cols)
                             (:rows data))))))))

(defn ddply*
  "Split-apply-combine from datasets to datasets.
  Splits data into the group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and combines the result of that
  back into a single dataset.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. When
  the groups are combined, keyword is used as the column name for the resulting
  column. The two types of group-by specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Examples:
  (ddply* :Symbol
          (transform :Change = (diff0 :Close))
          stock-data)
  (ddply* [[:Month #((juxt year month) (:timestamp %))]]
          (colwise :Volume sum)
          stock-data)"
  ([group-by fun]
   (ddply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (->> data
          (split-ds (map second group-by))
          (apply-ds fun)
          (combine-ds (map first group-by))))))

(defmacro ddply
  "Split-apply-combine from datasets to datasets. This macro is a wrapper on ddply*
  which provides translation of simple column-referencing expressions in the group-by
  argument.
  Splits data into the group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and combines the result of that
  back into a single dataset.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword key-expr] where the
  expression key-expr is transformed to a function and keywords in the expression
  are expanded to accessors on rows. The resulting function is applied to each row
  to generate the key for that row. When the groups are combined, keyword is used
  as the column name for the resulting column. The two types of group-by
  specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Examples:
  (ddply :Symbol
         (transform :Change = (diff0 :Close))
         stock-data)
  (ddply [[:Month ((juxt year month) :timestamp)]]
         (colwise :Volume sum)
         stock-data)"
  ([group-by fun]
   `(ddply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))

(defn d_ply*
  "Split-apply-combine from datasets to nothing. This version ignores the output of
  fun and is used for fun's side effects.
  Splits data into the group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and then drops the result.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. When
  the groups are combined, keyword is used as the column name for the resulting
  column. The two types of group-by specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Example:
  (d_ply* :Symbol
          #(view (bar-chart :Date :Volume :data %))
          stock-data)"
  ([group-by fun]
   ;; the one-argument-short arity must call d_ply*, not ddply*
   (d_ply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (dorun
      (->> data
           (split-ds (map second group-by))
           (apply-ds fun))))))

(defmacro d_ply
  "Split-apply-combine from datasets to nothing. This version ignores the output of
  fun and is used for fun's side effects. This macro is a wrapper on d_ply*
  which provides translation of simple column-referencing expressions in the group-by
  argument.
  Splits data into the group of datasets as specified by the group-by argument,
  applies fun to each of the resulting datasets and then drops the result.
  The group-by argument can be a keyword or collection of keywords which specify
  the columns to group by. It can also include pairs [keyword keyfn] where the
  function keyfn is applied to each row to generate the key for that row. When
  the groups are combined, keyword is used as the column name for the resulting
  column. The two types of group-by specifications can be mixed.
  The result of the apply function can contain the same column names as the
  original dataset or different ones. It can contain the same number of rows as
  the original, a different number, or a single row.
  If data is not specified, it defaults to the currently bound value of $data.
  Example:
  (d_ply :Symbol
         #(view (bar-chart :Date :Volume :data %))
         stock-data)"
  ([group-by fun]
   `(d_ply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(d_ply* ~(exprs-to-fns group-by) ~fun ~data)))

Highlighted on this slide:
(defn apply-ds
  [fun grouped-data]
  (for [[group split-data] grouped-data]
    [group (fun split-data)]))
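The apply step can be sketched the same way. This assumes groups are held as a map from key vectors to vectors of row maps rather than Incanter datasets. `apply-groups` and `total-volume` are illustrative names, not part of the talk's library.

```clojure
;; Sketch only: groups are key-vector -> row-map vectors (an assumption).
(def grouped
  {["AAPL"] [{:Symbol "AAPL" :Volume 100} {:Symbol "AAPL" :Volume 300}]
   ["GOOG"] [{:Symbol "GOOG" :Volume 250}]})

(defn apply-groups
  "Call f on each group's rows, keeping the group key next to the result."
  [f grouped]
  (for [[k rows] grouped]
    [k (f rows)]))

(defn total-volume
  "Summarise a group as a single row holding the summed :Volume."
  [rows]
  [{:Volume (reduce + (map :Volume rows))}])

;; Collected into a map only to make lookups by key convenient.
(def applied (into {} (apply-groups total-volume grouped)))
;; (applied ["AAPL"]) => [{:Volume 400}]
```

The function passed in is free to return one row, the same number of rows, or a different number, which mirrors what the docstrings say about `fun`.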
  • 60. Implementing ddply - Combine
(The slide repeats the full split-apply-combine.ply listing; the function highlighted on this slide is shown here.)
(defn combine-ds
  [group-by grouped-data]
  (let [group-by (if (coll? group-by) group-by [group-by])
        group-by-filter (complement (set group-by))]
    (apply fast-conj-rows
           (for [[group data] grouped-data]
             (let [grouped-cols (zipmap group-by
                                        ;; with a single group-by function, split-ds
                                        ;; yields a bare key rather than a vector of
                                        ;; keys, so wrap it before zipping
                                        (if (= 1 (count group-by)) [group] group))
                   union-cols (concat group-by
                                      (filter group-by-filter
                                              (col-names data)))]
               (dataset union-cols
                        (map #(merge % grouped-cols)
                             (:rows data))))))))
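A minimal sketch of the combine step follows. It assumes each group's result is a vector of row maps paired with a key vector. `combine-groups` is a hypothetical name. As in `combine-ds`, values from the key take precedence over same-named columns in the rows.

```clojure
;; Sketch only: results are [key-vector rows] pairs of plain maps (an assumption).
(def applied-groups
  [[["AAPL"] [{:Volume 400}]]
   [["GOOG"] [{:Volume 250}]]])

(defn combine-groups
  "Concatenate all result rows, writing each group's key values back
  under key-names. Key values win over columns of the same name."
  [key-names applied]
  (vec (for [[k rows] applied
             row rows]
         (merge row (zipmap key-names k)))))

(def combined (combine-groups [:Symbol] applied-groups))
;; combined is [{:Volume 400, :Symbol "AAPL"} {:Volume 250, :Symbol "GOOG"}]
```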
  • 61. Implementing ddply - Putting it all together
(The slide repeats the full split-apply-combine.ply listing; the definitions highlighted on this slide are shown here.)
(defn ddply*
  ([group-by fun]
   (ddply* group-by fun $data))
  ([group-by fun data]
   (let [group-by (if (coll? group-by) group-by [group-by])
         group-by (for [item group-by]
                    (if (coll? item) item [item item]))]
     (->> data
          (split-ds (map second group-by))
          (apply-ds fun)
          (combine-ds (map first group-by))))))

(defmacro ddply
  ([group-by fun]
   `(ddply* ~(exprs-to-fns group-by) ~fun $data))
  ([group-by fun data]
   `(ddply* ~(exprs-to-fns group-by) ~fun ~data)))
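The part of `ddply*` that is easiest to miss is how it normalizes the group-by argument into [name key-fn] pairs before splitting. The sketch below shows that idea end to end on plain row maps. It is not the talk's implementation: `normalize-spec`, `ply-rows`, `trades`, and `sum-volume` are illustrative names, and vectors of maps stand in for datasets.

```clojure
;; Sketch only: plain row maps stand in for Incanter datasets (an assumption).
(defn normalize-spec
  "Turn :col, [:a :b], or [[name key-fn] ...] into a seq of [name key-fn] pairs."
  [group-by-spec]
  (for [item (if (coll? group-by-spec) group-by-spec [group-by-spec])]
    (if (coll? item) item [item item])))

(defn ply-rows
  "Split rows by the spec, apply f to each group, and combine the results,
  writing the key values back under the names given in the spec."
  [group-by-spec f rows]
  (let [spec   (normalize-spec group-by-spec)
        names  (map first spec)
        key-fn (apply juxt (map second spec))]
    (vec (for [[k group] (group-by key-fn rows)
               row (f group)]
           (merge row (zipmap names k))))))

(def trades
  [{:Symbol "AAPL" :Day 1 :Volume 100}
   {:Symbol "AAPL" :Day 2 :Volume 300}
   {:Symbol "GOOG" :Day 1 :Volume 250}])

(defn sum-volume [rows]
  [{:Volume (reduce + (map :Volume rows))}])

;; Group by an existing column.
(def by-symbol (ply-rows :Symbol sum-volume trades))

;; Group by a computed key, stored under the new column name :Odd-day.
(def by-odd-day (ply-rows [[:Odd-day #(odd? (:Day %))]] sum-volume trades))
```

Here `by-symbol` holds one row per symbol (volumes 400 and 250), and `by-odd-day` holds one row for odd days (volume 350) and one for even days (volume 300).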
  • 62. Support functions - colwise
(ddply :Symbol
       (colwise :num stats/mean)
       tech-stocks)
  • 63. Support functions - transform
(ddply :Symbol
       (transform :Change = (diff0 :Close)
                  :Date =* (time-format/parse
                            (time-format/formatters :year-month-day)
                            :Date))
       tech-stocks)
  • 64. A Case Study
  • 65. A Case Study
“SpaceCurve delivers instantaneous intelligence for location-based services, commodities, defense, emergency services and other markets. The company is developing Big Data solutions that continuously store and immediately analyze massive amounts of multidimensional data.”
Performance analysis of large-scale geospatial-temporal ingest and query on the SpaceCurve multidimensional DB
  • 66. Our Sample Problem
[Diagram: six systems (10.0.1.101, 10.0.1.102, 10.0.1.107, 10.0.1.109, 10.0.1.111, 10.0.1.112), each showing cores cpu00 to cpu23; labels: 10GB/s/channel, switch, External Clients]
    ‣ CPU load data
    ‣ 6 systems
    ‣ 24 cores each
    ‣ 6 data points
    ‣ 1 sample/second
    ‣ ~38 minutes run time
Total of ~2 million data points
Small subset of the overall SpaceCurve analysis
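The "~2 million" total can be checked with a short calculation. This reads "6 data points" as six measurements per core per sample, which is an interpretation of the slide and not something it states outright. The var names are illustrative.

```clojure
;; Assumption: "6 data points" means six measurements per core per sample.
(def systems 6)
(def cores-per-system 24)
(def measurements-per-core 6)
(def samples-per-second 1)
(def run-seconds (* 38 60)) ; ~38 minutes

(def total-points
  (* systems cores-per-system measurements-per-core
     samples-per-second run-seconds))
;; total-points => 1969920, i.e. roughly 2 million
```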
  • 67. Time to see it work...
  • 74. Where to?
    • A full library implementation of Split-Apply-Combine and helpers
    • Add to Incanter?
    • Performance optimizations (mutable intermediate results, column-oriented datasets)
    • Implementation based on reducers and parallelism
    • Explore the continuum from data exploration tools (R, Incanter) to large-scale data analysis (Hadoop, Cascalog, SpaceCurve, etc.)
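As one illustration of the parallelism item, the apply step is a natural candidate because each group is processed independently. The sketch below swaps in `pmap` from `clojure.core`. It is not the reducers-based design the slide proposes, and `apply-groups-parallel` and `group-size` are hypothetical names operating on plain row maps.

```clojure
;; Sketch only: pmap stands in for a reducers-based implementation (an assumption).
(defn apply-groups-parallel
  "Like the sequential apply step, but evaluates f on the groups in parallel."
  [f grouped]
  (pmap (fn [[k rows]] [k (f rows)]) grouped))

(def cpu-groups
  {["cpu00"] [{:load 0.5} {:load 0.7}]
   ["cpu01"] [{:load 0.1}]})

(defn group-size [rows]
  [{:samples (count rows)}])

(def parallel-result
  (into {} (apply-groups-parallel group-size cpu-groups)))
```

`pmap` only pays off when `f` is expensive relative to the coordination overhead, so for small groups the sequential version may well be faster.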
  • 75. Discussion
  • 76. References
    • Source for this presentation: https://www.github.com/tomfaulhaber/split-apply-combine
    • The R Project: http://www.r-project.org
    • The plyr home page: http://plyr.had.co.nz
    • Hadley Wickham, The Split-Apply-Combine Strategy for Data Analysis, Journal of Statistical Software, April 2011, Volume 40, Issue 1
    • Incanter project: http://incanter.org
    • Eric Rochester, The Clojure Data Analysis Cookbook, Packt Publishing, 2013
    • Bruce Durling, Quick and Dirty Data Science with Incanter, talk from EuroClojure 2012, http://confreaks.com/videos/2071-euroclojure2012-quick-and-dirty-data-science-with-incanter
    • Spacecurve: http://www.spacecurve.com
Tom Faulhaber
twitter: @tomfaulhaber
github: tomfaulhaber
  • 77. Photo Credits
    • Florida Home - anoldent on flickr (http://www.flickr.com/photos/anoldent/2405722434/)
    • Midland Coal Mine - jasonwoodhead23 on flickr (http://www.flickr.com/photos/woodhead/8522679843/)
    • Paradise - Antti Simonen on flickr (http://www.flickr.com/photos/anttisimonen/6041095682/)
    • Traders on the Exchange - thetaxhaven on flickr (http://www.flickr.com/photos/83532250@N06/7651028854)
    • Louvre - dynamosquito on flickr (http://www.flickr.com/photos/25182210@N07/2802458437/)
    • Construction - Aapo Haapanen on flickr (http://www.flickr.com/photos/decade_null/214247988/)
    • Server farm - from the Spacecurve website (http://www.spacecurve.com)
    • Sailboat race - Ryk Van Toronto on flickr (http://www.flickr.com/photos/sydandsaskia/394507351)
    • Arguing Philosophers - David Schroeter on flickr (http://www.flickr.com/photos/53477785@N00/92134612/)