Clojure for Data Science

Presentation given at the Jan 2016 Singapore Clojure Users' Group

Published in: Data & Analytics

  1. Clojure for Data Science. Mike Anderson, 26 January 2016
  2. Contents: • Why Clojure for Data Science • Array Programming Essentials • core.matrix • Library Ecosystem Overview • Examples and discussion
  3. 3. 3 Why Clojure for Data Science Attribute Clojure Python R Julia Scala Haskell JavaScript Strong general purpose language ✓ ✓ ✓ ✓ ✓ Functional language ✓ ✓ ✓ JVM Ecosystem (Hadoop, Spark etc.) ✓ ✓ Near-native runtime performance ✓ ✓ ✓ ✓ Dynamic language ✓ ✓ ✓ ✓ ✓ Client side execution ✓ ✓ “Code is Data” ✓
  4. Contents: • Why Clojure for Data Science • Array Programming Essentials • core.matrix • Library Ecosystem Overview • Examples and discussion
  5. Plug-in paradigms (paradigm: exemplar language / Clojure implementation)
       Functional programming: Haskell / clojure.core
       Meta-programming: Lisp
       Logic programming: Prolog / core.logic
       Process algebras / CSP: Go / core.async
       Array programming: APL / core.matrix
  6. APL: venerable history; has its own keyboard; an interesting perspective on code readability. • Notation invented in 1957 by Ken Iverson • Implemented at IBM around 1960-64. Example: life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵}
  7. 7. 7 Modern array programming Standalone environment for statistical programming / graphics Python library for array programming A new language (2012) based on array programming principles .... and many others
  8. Design wisdom (abstraction): "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." (Alan Perlis)
  9. 9. 9 What is an array? 0 1 2 0 1 2 3 4 5 6 7 8 1 2 3 Dimensions Example Vector Matrix 3D Array (3rd order Tensor) Terminology N ND Array 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 ... ...
  10. 10. 10 Multi-dimensional array properties 0 1 2 3 4 5 6 7 8 0 1 2 0 1 2 Dimension 0 Dimension 1 Dimensions (ordered and indexed) Each of the array elements is a regular value Dimension sizes together define the shape of the array (e.g. 3 x 3)
  11. 11. 11 Arrays = data about relationships (foo :A :T) => 2 0 1 2 3 4 5 6 7 8 9 10 11 :A :B :C :R :S :T Set X Set Y Each element is a fact about a relationship between a value in Set X and a value in Set Y ND array lookup is analogous to arity-N functions! :U
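The lookup on this slide can be sketched in plain Clojure. The names `xs`, `ys`, `facts`, `x-index`, `y-index` and the body of `foo` are assumptions made for illustration (the slide only shows the call `(foo :A :T) => 2`), with Set X read as the row labels and Set Y as the column labels.

```clojure
;; Hypothetical labels for the two sets shown on the slide.
(def xs [:A :B :C])      ; Set X -> row labels
(def ys [:R :S :T :U])   ; Set Y -> column labels

;; 3 x 4 array of facts, one per (x, y) pair.
(def facts [[0 1 2 3]
            [4 5 6 7]
            [8 9 10 11]])

;; Label -> index maps, e.g. {:A 0, :B 1, :C 2}.
(def x-index (zipmap xs (range)))
(def y-index (zipmap ys (range)))

(defn foo
  "Arity-2 function backed by a lookup into a 2D array."
  [x y]
  (get-in facts [(x-index x) (y-index y)]))

(println (foo :A :T)) ; prints 2
```

Adding a third set of labels would turn `facts` into a 3D array and `foo` into an arity-3 function, which is the analogy the slide draws.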
  12. 12. 12 Why arrays instead of functions? 0 1 2 3 4 5 6 7 8 0 1 2 0 1 2 vs. (fn [i j] (+ j (* 3 i))) 1. Precomputed values with O(1) access 2. Efficient computation with optimised bulk operations 3. Data driven representation
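As a minimal sketch of the comparison on this slide, the function can be evaluated once per index pair and stored as a nested vector, after which lookups replace computation. The names `index-fn` and `table` are hypothetical; only the `fn` body comes from the slide.

```clojure
;; The function from the slide, given a name for reuse.
(def index-fn (fn [i j] (+ j (* 3 i))))

;; Precompute the function over a 3 x 3 index space.
(def table
  (mapv (fn [i]
          (mapv (fn [j] (index-fn i j)) (range 3)))
        (range 3)))

(println table)                 ; [[0 1 2] [3 4 5] [6 7 8]]
(println (get-in table [1 2]))  ; 5, the same value as (index-fn 1 2)
```

Once the values exist as data they can be inspected, serialised, or passed to bulk operations, which is the "data driven representation" point.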
  13. 13. 13 Principle of array programming: generalise operations on regular (scalar) values to multi-dimensional data (+ 1 2) => 3 (+ ) => 2
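One way to picture this principle is a helper that lifts a two-argument scalar function to nested vectors of identical shape. This is a plain-Clojure sketch and not how core.matrix is implemented; `lift` and `add` are hypothetical names, and the sketch does no shape checking or broadcasting.

```clojure
(defn lift
  "Returns a version of the binary scalar function f that also works
  elementwise on nested vectors of identical shape."
  [f]
  (fn lifted [a b]
    (if (vector? a)
      (mapv lifted a b)   ; recurse into each pair of slices
      (f a b))))          ; scalars: apply f directly

(def add (lift +))

(println (add 1 2))                              ; 3
(println (add [1 2 3] [4 5 6]))                  ; [5 7 9]
(println (add [[1 2] [3 4]] [[10 20] [30 40]]))  ; [[11 22] [33 44]]
```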
  14. Contents: • Why Clojure for Data Science • Array Programming Essentials • core.matrix • Library Ecosystem Overview • Examples and discussion
  15. 15. 15 core.matrix Array programming as a language extension for Clojure (with a Data Science focus)
  16. 16. 16 Expressivity for (int i=0; i<n; i++) { for (int j=0; j<m; j++) { for (int k=0; k<p; k++) { result[i][j][k] = a[i][j][k] + b[i][j][k]; } } } Java (mapv (fn [a b] (mapv (fn [a b] (mapv + a b)) a b)) a b) (+ a b) + core.matrix
  17. 17. 17 Elements of core.matrix Abstraction Coding with N-dimensional arrays Implementation How is everything implemented? API What can you do with arrays?
  18. API
  19. 19. 19 Equivalence to Clojure vectors Nested Clojure vectors of regular shape are arrays! 0 1 2 3 4 5 6 7 8 ↔ [[0 1 2] [3 4 5] [6 7 8]] 0 1 2 [0 1 2] ↔
  20. 20. 20 Array creation ;; Build an array from a sequence (array (range 5)) => [0 1 2 3 4] ;; ... or from nested arrays/sequences (array (for [i (range 3)] (for [j (range 3)] (str i j)))) => [["00" "01" "02"] ["10" "11" "12"] ["20" "21" "22"]]
  21. 21. 21 Shape ;; Shape of a 3 x 2 matrix (shape [[1 2] [3 4] [5 6]]) => [3 2] ;; Regular values have no shape (shape 10.0) => nil
  22. Dimensionality ;; Dimensionality = number of dimensions ;; = length of shape vector ;; = nesting level (dimensionality [[1 2] [3 4] [5 6]]) => 2 (dimensionality [1 2 3 4 5]) => 1 ;; Regular values have zero dimensionality (dimensionality "Foo") => 0
  23. Scalars vs. arrays (array? [[1 2] [3 4]]) => true (array? 12.3) => false (scalar? [1 2 3]) => false (scalar? "foo") => true Everything is either an array or a scalar. A scalar works like a 0-dimensional array.
  24. 24. 24 Indexed element access 0 1 2 3 4 5 6 7 8 0 1 2 0 1 2 Dimension 0 Dimension 1 (def M [[0 1 2] [3 4 5] [6 7 8]]) (mget M 1 2) => 5
  25. 25. 25 Slicing access 0 1 2 3 4 5 6 7 8 0 1 2 0 1 2 Dimension 0 Dimension 1 (def M [[0 1 2] [3 4 5] [6 7 8]]) (slice M 1) => [3 4 5] A slice of an array is itself an array!
  26. 26. 26 Arrays as a composition of slices (def M [[0 1 2] [3 4 5] [6 7 8]]) (slices M) => ([0 1 2] [3 4 5] [6 7 8]) (apply + (slices M)) => [9 12 15] 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 slices
  27. 27. 27 Operators (use 'clojure.core.matrix.operators) (+ [1 2 3] [4 5 6]) => [5 7 9] (* [1 2 3] [0 2 -1]) => [0 4 -3] (- [1 2] [3 4 5 6]) => RuntimeException Incompatible shapes (/ [1 2 3] 10.0) => [0.1 0.2 0.3]
  28. Broadcasting scalars. (+ [[0 1 2] [3 4 5] [6 7 8]] 1) = ? "Broadcasting" expands the scalar 1 to [[1 1 1] [1 1 1] [1 1 1]], so the result is [[0 1 2] [3 4 5] [6 7 8]] + [[1 1 1] [1 1 1] [1 1 1]] = [[1 2 3] [4 5 6] [7 8 9]]
  29. Broadcasting arrays. (+ [[0 1 2] [3 4 5] [6 7 8]] [2 1 0]) = ? "Broadcasting" expands the vector [2 1 0] to [[2 1 0] [2 1 0] [2 1 0]], so the result is [[0 1 2] [3 4 5] [6 7 8]] + [[2 1 0] [2 1 0] [2 1 0]] = [[2 2 2] [5 5 5] [8 8 8]]
  30. 30. 30 Broadcasting Rules 1. Designed for elementwise operations - other uses must be explicit 2. Extends shape vector by adding new leading dimensions • original shape [4 5] • can broadcast to any shape [x y ... z 4 5] • scalars can broadcast to any shape 3. Fills the new array space by duplication of the original array over the new dimensions 4. Smart implementations can avoid making full copies by structural sharing or clever indexing tricks
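Rules 2 and 3 above can be sketched in plain Clojure over nested vectors. This is an illustration under stated assumptions, not core.matrix's own implementation: `shape-of` and `broadcast-to` are hypothetical helpers, and only regular nested vectors and scalars are handled.

```clojure
(defn shape-of
  "Shape of a regular nested vector; a scalar has the empty shape []."
  [a]
  (if (vector? a)
    (into [(count a)] (shape-of (first a)))
    []))

(defn broadcast-to
  "Broadcasts a to target-shape by adding new leading dimensions (rule 2)
  and duplicating a over them (rule 3)."
  [a target-shape]
  (let [s     (shape-of a)
        extra (- (count target-shape) (count s))]
    ;; The original shape must match the trailing part of the target shape.
    (when (or (neg? extra)
              (not= s (vec (drop extra target-shape))))
      (throw (ex-info "Incompatible shapes" {:from s :to target-shape})))
    ;; Wrap from the innermost new dimension outwards.
    (reduce (fn [arr n] (vec (repeat n arr)))
            a
            (reverse (take extra target-shape)))))

(println (broadcast-to [2 1 0] [3 3])) ; [[2 1 0] [2 1 0] [2 1 0]]
(println (broadcast-to 1 [2 3]))       ; [[1 1 1] [1 1 1]]
```

Because Clojure vectors are immutable, each repeated slice here is the same object rather than a copy, which is a simple form of the structural sharing mentioned in rule 4.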
  31. Functional operations on sequences. map: (map inc [1 2 3 4]) => (2 3 4 5) reduce: (reduce * [1 2 3 4]) => 24 seq: (seq [1 2 3 4]) => (1 2 3 4)
  32. 32. 32 Functional operations on arrays (emap inc [[1 2] [3 4]]) => [[2 3] [4 5]] map ↔ emap “element map” (ereduce * [[1 2] [3 4]]) => 24 reduce ↔ ereduce “element reduce” (eseq [[1 2] [3 4]]) => (1 2 3 4) seq ↔ eseq “element seq”
  33. Specialised matrix constructors
        (zero-matrix 4 3): 4 x 3 matrix of zeros, rows [0 0 0] [0 0 0] [0 0 0] [0 0 0]
        (identity-matrix 4): 4 x 4 identity, rows [1 0 0 0] [0 1 0 0] [0 0 1 0] [0 0 0 1]
        (permutation-matrix [3 1 0 2]): rows [0 0 0 1] [0 1 0 0] [1 0 0 0] [0 0 1 0]
  34. Array transformations. (transpose [[0 1 2] [3 4 5]]) gives [[0 3] [1 4] [2 5]]. Transpose reverses the order of all dimensions and indexes.
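For the 2D case, the index reversal can be checked in plain Clojure. This sketch uses the common `(apply mapv vector ...)` idiom on nested vectors rather than core.matrix's `transpose`; `M` and `Mt` are names chosen for illustration.

```clojure
(def M [[0 1 2]
        [3 4 5]])

;; For a 2D nested vector, transposing turns columns into rows.
(def Mt (apply mapv vector M))

(println Mt) ; [[0 3] [1 4] [2 5]]

;; Index reversal: element (i, j) of M is element (j, i) of Mt.
(println (every? true?
                 (for [i (range 2) j (range 3)]
                   (= (get-in M [i j]) (get-in Mt [j i]))))) ; true
```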
  35. Matrix multiplication. [[9 2 7] [6 4 8]] . [[2 8] [3 4] [5 9]] = [[a b] [c d]], where a = (9 * 2) + (2 * 3) + (7 * 5) = 59. (mmul [[9 2 7] [6 4 8]] [[2 8] [3 4] [5 9]]) => [[59 143] [64 136]]
  36. 36. 36 Geometry (def π 3.141592653589793) (def τ (* 2.0 π)) (defn rot [turns] (let [a (* τ turns)] [[ (cos a) (sin a)] [(-(sin a)) (cos a)]])) (mmul (rot 1/8) [3 4]) => [4.9497 0.7071] NB: See Tau Manifesto (http://tauday.com/) regarding the use of Tau (τ) 45° = 1/8 turn
  37. Mutability?
  38. Mutability: the tradeoffs. Avoid mutability. But it's an option if you really need it.
        Pros: • Faster • Reduces GC pressure • Standard in many existing matrix libraries
        Cons: ✘ Mutability is evil ✘ Harder to maintain / debug ✘ Hard to write concurrent code ✘ Not idiomatic in Clojure ✘ Not supported by all core.matrix implementations ✘ "Place Oriented Programming"
  39. Mutability: performance benefit. Time for addition of vectors* (ns): mutable add! 28, immutable add 120, roughly a 4x performance benefit. * Length 10 double vectors, using the :vectorz implementation
  40. Mutability: syntax. A core.matrix function name ending with "!" performs mutation (usually on the first argument only). (add [1 2] 1) => [2 3] (add! [1 2] 1) => RuntimeException ...... not mutable! (def a (mutable [1 2])) ;; coerce to a mutable format => #<Vector2 [1.0,2.0]> (add! a 1) => #<Vector2 [2.0,3.0]>
  41. Implementation
  42. 42. 42 Many Matrix libraries… UJMP ojAlgo MTJ javax.vecmath
  43. 43. 43
  44. Lots of trade-offs: • Native libraries vs. pure JVM • Mutability vs. immutability • Specialized elements (e.g. doubles) vs. generalised elements (Object, Complex) • Multi-dimensional vs. 2D matrices only • Memory efficiency vs. runtime efficiency • Concrete types vs. abstraction (interfaces / wrappers) • Specified storage format vs. multiple / arbitrary storage formats • License A vs. License B • Lightweight (zero-copy) views vs. heavyweight copying / cloning
  45. What's the best data structure for a length 50 "range" vector (0 1 2 3 .. 49)? 1. Clojure Vector: [0 1 2 …. 49] 2. Java double[] array: new double[] {0, 1, 2, …. 49}; 3. Custom deftype: (deftype RangeVector [^long start ^long end]) 4. Native vector format: (org.jblas.DoubleMatrix. params)
  46. There is no spoon.
  47. Secret weapon time!
  48. 48. 48 Clojure Protocols (defprotocol PSummable "Protocol to support the summing of all elements in an array. The array must hold numeric values only, or an exception will be thrown." (element-sum [m])) clojure.core.matrix.protocols 1. Abstract Interface 2. Open Extension 3. Fast Dispatch
  49. Protocols are fast and open. Function call costs (ns), with open extension support in brackets: Multimethod* 89 (✓); Protocol call 13.8 (✓); Boxed function call 7.9 (✘); Primitive function call 1.9 (✘); Static / inlined code 1.2 (✘). * Using class of first argument as dispatch function
  50. 50. 50 Typical core.matrix call path core.matrix API (matrix.clj) (defn esum "Calculates the sum of all the elements in a numerical array." [m] (mp/element-sum m)) User Code (esum [1 2 3 4]) Impl. code (extend-protocol mp/PSummable SomeImplementationClass (element-sum [a] ………))
  51. 51. 51 Most protocols are optional MANDATORY Required for a working core.matrix implementation PImplementation PDimensionInfo PIndexedAccess PIndexedSetting PMatrixEquality PSummable PRowOperations PVectorCross PCoercion PTranspose PVectorDistance PMatrixMultiply PAddProductMutable PReshaping PMathsFunctionsMutable PMatrixRank PArrayMetrics PAddProduct PVectorOps PMatrixScaling PMatrixOps PMatrixPredicates PSparseArray ….. OPTIONAL  Everything in the API will work without these  core.matrix provides a “default implementation”  Implement for improved performance
  52. 52. 52 Default implementations (extend-protocol mp/PSummable Number (element-sum [a] a) Object (element-sum [a] (mp/element-reduce a +))) clojure.core.matrix.impl.default Protocol name - from namespace clojure.core.matrix.protocols Implementation for any Number Implementation for an arbitrary Object (assumed to be an array)
  53. 53. 53 Extending a protocol (extend-protocol mp/PSummable (Class/forName "[D") (element-sum [m] (let [^doubles m m] (areduce m i res 0.0 (+ res (aget m i)))))) Class to implement protocol for, in this case a Java array : double[] Optimised code to add up all the elements of a double[] array Add type hint to avoid reflection
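The protocol slides above can be combined into one self-contained sketch that does not need core.matrix on the classpath. It redefines a local `PSummable` that mirrors the one shown, so everything here is illustrative: the `Object` default (flatten and reduce) stands in for `mp/element-reduce`, `RangeVector` is the hypothetical deftype from the "best data structure" slide, and the double[] case uses the `extend` function with a map of functions instead of `extend-protocol`.

```clojure
(defprotocol PSummable
  "Sum of all elements in an array (local stand-in for the slide's protocol)."
  (element-sum [m]))

;; Hypothetical custom type: the integers from start (inclusive) to end (exclusive).
(deftype RangeVector [^long start ^long end])

;; Defaults: a number is its own sum; anything else is treated as a
;; (possibly nested) sequential collection of numbers.
(extend-protocol PSummable
  Number
  (element-sum [a] a)
  Object
  (element-sum [a] (reduce + (flatten a))))

;; Specialised: double[] summed with a primitive loop.
(extend (Class/forName "[D")
  PSummable
  {:element-sum (fn [arr]
                  (let [^doubles arr arr] ; type hint avoids reflection
                    (areduce arr i res 0.0 (+ res (aget arr i)))))})

;; Specialised: closed-form sum, no element is ever materialised.
(extend-type RangeVector
  PSummable
  (element-sum [v]
    (let [n (- (.-end v) (.-start v))]
      (quot (* n (+ (.-start v) (.-end v) -1)) 2))))

;; API function: user code calls this, the protocol picks the implementation.
(defn esum [m] (element-sum m))

(println (esum 5))                          ; 5
(println (esum [[1 2] [3 4]]))              ; 10
(println (esum (double-array [1 2 3 4])))   ; 10.0
(println (esum (RangeVector. 0 50)))        ; 1225
```

The caller never learns which storage format it is dealing with, which is one reading of "There is no spoon": the same `esum` works across all four representations of the range vector.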
  54. Speedup vs. default implementation: 15-20x benefit. Timing for element sum of a length 100 double array (ns): (esum v) "Specialised" 201; (reduce + v) 2859; (esum v) "Default" 3690
  55. Internal Implementations (implementation, then key features)
        :persistent-vector  • Support for Clojure vectors • Immutable • Not so fast, but great for quick testing
        :double-array  • Treats Java double[] objects as 1D arrays • Mutable, useful for accumulating results etc.
        :sequence  • Treats Clojure sequences as arrays • Mostly useful for interop / data loading
        :ndarray :ndarray-double :ndarray-long .....  • Google Summer of Code project by Dmitry Groshev • Pure Clojure • N-Dimensional arrays similar to NumPy • Support arbitrary dimensions and data types
        :scalar-wrapper :slice-wrapper :nd-wrapper  • Internal wrapper formats • Used to provide efficient default implementations for various protocols
  56. 56. 56 NDArray (deftype NDArrayDouble [^doubles data ^int ndims ^ints shape ^ints strides ^int offset]) 0 1 2 3 4 5 ? ? ? 0 1 2 ? ? 3 4 5 ? offset 0 strides[1] strides[0] data (Java array) ndims = 2 shape = [2 3]
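The fields of `NDArrayDouble` suggest the usual strided addressing scheme: the position of an element in the backing array is the offset plus the sum of each index times its stride. The sketch below assumes that scheme and reads the second layout in the diagram as offset 0 with strides [5 1]; `flat-index`, `nd-get` and `data` are hypothetical names, and `nil` stands in for the unused "?" cells.

```clojure
(defn flat-index
  "Position in the backing array of the element at indexes."
  [offset strides indexes]
  (reduce + offset (map * strides indexes)))

;; Backing storage with padding (nil marks the unused '?' cells).
(def data [0 1 2 nil nil 3 4 5 nil])

(defn nd-get
  "Looks up an element of the strided view over data."
  [offset strides indexes]
  (nth data (flat-index offset strides indexes)))

;; 2 x 3 view: rows are 5 cells apart, columns 1 cell apart.
(println (nd-get 0 [5 1] [1 2])) ; 5

;; Transposed 3 x 2 view of the same storage: swap the strides, copy nothing.
(println (nd-get 0 [1 5] [2 1])) ; 5
```

Changing only `offset`, `shape` and `strides` is what makes slices and transposes lightweight (zero-copy) views in this kind of design.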
  57. External Implementations (implementation, then key features)
        vectorz-clj  • Pure JVM (wraps the Java library Vectorz) • Very fast, especially for vectors and small-medium matrices • Most mature core.matrix implementation at present
        Clatrix  • Uses native BLAS libraries by wrapping the Jblas library • Very fast, especially for large 2D matrices • Used by Incanter
        parallel-colt-matrix  • Wraps the Parallel Colt library from Java • Support for multithreaded matrix computations
        arrayspace  • Experimental • Ideas around distributed matrix computation • Builds on ideas from Blaze, Chapel, ZPL
        image-matrix  • Treats a Java BufferedImage as a core.matrix array • Because you can?
  58. 58. 58 Switching implementations (array (range 5)) => [0 1 2 3 4] ;; switch implementations (set-current-implementation :vectorz) ;; create array with current implementation (array (range 5)) => #<Vector [0.0,1.0,2.0,3.0,4.0]> ;; explicit implementation usage (array :persistent-vector (range 5)) => [0 1 2 3 4]
  59. 59. 59 Mixing implementations (def A (array :persistent-vector (range 5))) => [0 1 2 3 4] (def B (array :vectorz (range 5))) => #<Vector [0.0,1.0,2.0,3.0,4.0]> (* A B) => [0.0 1.0 4.0 9.0 16.0] (* B A) => #<Vector [0.0,1.0,4.0,9.0,16.0]> core.matrix implementations can be mixed (but: behaviour depends on the first argument)
  60. Contents: • Why Clojure for Data Science • Array Programming Essentials • core.matrix • Library Ecosystem Overview • Examples and discussion
  61. 61. 61 Data Science Libraries for Clojure • Still not as mature as R or Python, but developing rapidly • Clojure philosophy of small libraries rather than all-encompassing frameworks • Key areas: • Interactive environments • Visualisation • Databases / data access • Realtime data processing • Machine Learning
  62. Interactive environments (library: description)
        Incanter: Fully featured analytical environment ("R-like platform")
        gorilla-repl: Notebook-style web-based environment
  63. Visualisation (library: description)
        quil: Clojure interface to the Processing library/environment for dynamic visualisations
        gyptis: Clojure + ClojureScript library for producing Vega.js graphs
        imagez: Library for generating and manipulating bitmap images
  64. Databases / data access (library: description)
        Datomic: Awesome database supporting immutable "time travel" over database history. Great scalability for reads / analytics
        java.jdbc: Clojure library for access to SQL databases. Mature workhorse
        Yesql: Arguably better way to do SQL in Clojure
        Sparkling: Clojure library for Apache Spark
        flambo: Clojure library for Apache Spark
        Cascalog: Clojure library for querying and data processing with Apache Hadoop
        many, many, more.....
  65. Realtime Data Processing (library: description)
        Storm: Mature stream processing library for highly scalable realtime computation over large distributed clusters of compute nodes
        Onyx: More modern / better designed alternative to Storm with growing traction
        core.async: "Roll your own" concurrent data processing pipelines
  66. Machine Learning (library: description)
        clj-ml: Wrapper for the popular and venerable "Weka" machine learning library for Java
        enclog: Wrapper for the "Encog" machine learning library
        Clortex / Comportex: Libraries implementing Numenta's Hierarchical Temporal Memory model
        synaptic: Basic neural networks in Clojure
        State of the art "Deep Learning" library
  67. Contents: • Why Clojure for Data Science • Array Programming Essentials • core.matrix • Library Ecosystem Overview • Examples and discussion
  68. Thank you. For more information about Datacraft, visit: www.datacraft.sg
  69. Demo
