This document provides an overview of using Clojure for data science. It discusses why Clojure is suitable for data science due to its functional programming capabilities, performance on the JVM, and rich library ecosystem. It introduces core.matrix, a Clojure library that provides multi-dimensional array programming functionality through Clojure protocols. The document covers core.matrix concepts like array creation and manipulation, element-wise operations, broadcasting, and optional support for mutability. It also discusses core.matrix implementation details like the performance benefits of using Clojure protocols.
Presentation given at the 2013 Clojure Conj on core.matrix, a library that brings multi-dimensional array and matrix programming capabilities to Clojure.
7. Modern array programming
- R – standalone environment for statistical programming / graphics
- NumPy – Python library for array programming
- Julia – a new language (2012) based on array programming principles
- .... and many others
8. Design wisdom

"It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures."
—Alan Perlis

(on the slide, "data structure" is annotated with the word "abstraction")
9. What is an array?

Terminology:

Dimensions  Example                      Terminology
1           [0 1 2]                      Vector
2           [[0 1 2] [3 4 5] [6 7 8]]    Matrix
3           a stack of 3 x 3 matrices    3D Array (3rd order Tensor)
N           ...                          ND Array
10. Multi-dimensional array properties

[[0 1 2]
 [3 4 5]
 [6 7 8]]

- Dimensions are ordered and indexed (dimension 0 = rows, dimension 1 = columns)
- Each of the array elements is a regular value
- The dimension sizes together define the shape of the array (e.g. 3 x 3)
11. Arrays = data about relationships

         :R  :S  :T  :U
    :A  [ 0   1   2   3]
    :B  [ 4   5   6   7]
    :C  [ 8   9  10  11]

(foo :A :T) => 2

- Each element is a fact about a relationship between a value in Set X (:A :B :C) and a value in Set Y (:R :S :T :U)
- ND array lookup is analogous to arity-N functions!
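The lookup analogy can be tried directly at a REPL. A minimal sketch, assuming the clojure.core.matrix library is on the classpath; the labelled sets are replaced with plain integer indices, since core.matrix arrays are indexed numerically:

```clojure
(require '[clojure.core.matrix :as m])

;; the 3 x 4 array of "facts" from the slide: rows = Set X, columns = Set Y
(def foo [[0 1 2  3]
          [4 5 6  7]
          [8 9 10 11]])

;; 2D lookup behaves like calling an arity-2 function;
;; row :A is index 0, column :T is index 2
(m/mget foo 0 2)
;; => 2
```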
12. Why arrays instead of functions?

[[0 1 2]
 [3 4 5]
 [6 7 8]]    vs.    (fn [i j]
                      (+ j (* 3 i)))

1. Precomputed values with O(1) access
2. Efficient computation with optimised bulk operations
3. Data driven representation
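The two representations of the same i,j -> 3i + j relationship can be compared directly. A sketch, assuming clojure.core.matrix is available:

```clojure
(require '[clojure.core.matrix :as m])

;; the same relationship, once as precomputed data and once as a function
(def a [[0 1 2]
        [3 4 5]
        [6 7 8]])

(defn f [i j] (+ j (* 3 i)))

(m/mget a 2 1)  ;; => 7  (O(1) lookup of a precomputed value)
(f 2 1)         ;; => 7  (computed on demand)
```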
13. Principle of array programming: generalise operations on regular (scalar) values to multi-dimensional data

(+ 1 2) => 3

(+ <array> <array>) => <array>    (the array operands and result were shown as diagrams)
14. Contents
- Why Clojure for Data Science
- Array Programming Essentials
- core.matrix
- Library Ecosystem Overview
- Examples and discussion
20. Array creation

;; Build an array from a sequence
(array (range 5))
=> [0 1 2 3 4]

;; ... or from nested arrays/sequences
(array
  (for [i (range 3)]
    (for [j (range 3)]
      (str i j))))
=> [["00" "01" "02"]
    ["10" "11" "12"]
    ["20" "21" "22"]]
21. Shape

;; Shape of a 3 x 2 matrix
(shape [[1 2]
        [3 4]
        [5 6]])
=> [3 2]

;; Regular values have no shape
(shape 10.0)
=> nil
22. Dimensionality

;; Dimensionality = number of dimensions
;;                = length of shape vector
;;                = nesting level
(dimensionality [[1 2]
                 [3 4]
                 [5 6]])
=> 2

(dimensionality [1 2 3 4 5])
=> 1

;; Regular values have zero dimensionality
(dimensionality "Foo")
=> 0
23. Scalars vs. arrays

(array? [[1 2] [3 4]])
=> true
(array? 12.3)
=> false

(scalar? [1 2 3])
=> false
(scalar? "foo")
=> true

Everything is either an array or a scalar.
A scalar behaves like a 0-dimensional array.
30. Broadcasting Rules

1. Designed for element-wise operations – other uses must be explicit
2. Extends the shape vector by adding new leading dimensions:
   - original shape [4 5]
   - can broadcast to any shape [x y ... z 4 5]
   - scalars can broadcast to any shape
3. Fills the new array space by duplicating the original array over the new dimensions
4. Smart implementations can avoid making full copies by structural sharing or clever indexing tricks
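The rules above can be seen in action at the REPL. A sketch assuming clojure.core.matrix; element-wise operations broadcast implicitly, while the broadcast function makes the shape extension explicit:

```clojure
(require '[clojure.core.matrix :as m])

;; a scalar broadcasts to any shape
(m/add [[1 2] [3 4]] 10)
;; => [[11 12] [13 14]]

;; shape [2] broadcasts to [2 2] by gaining a new leading dimension,
;; duplicating the original vector over that dimension
(m/add [[1 2] [3 4]] [10 20])
;; => [[11 22] [13 24]]

;; explicit broadcasting to a requested shape
(m/equals (m/broadcast [1 2] [3 2])
          [[1 2] [1 2] [1 2]])
;; => true
```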
38. Mutability – the tradeoffs

Avoid mutability. But it's an option if you really need it.

Pros:
- Faster
- Reduces GC pressure
- Standard in many existing matrix libraries

Cons:
- Mutability is evil
- Harder to maintain / debug
- Hard to write concurrent code
- Not idiomatic in Clojure
- Not supported by all core.matrix implementations
- "Place Oriented Programming"
39. Mutability – performance benefit

Time for addition of vectors* (ns):
- Mutable add!     28
- Immutable add   120

~4x performance benefit

* Length-10 double vectors, using the :vectorz implementation
40. Mutability – syntax

A core.matrix function name ending with "!" performs mutation (usually on the first argument only).

(add [1 2] 1)
=> [2 3]

(add! [1 2] 1)
=> RuntimeException ...... not mutable!

(def a (mutable [1 2]))  ;; coerce to a mutable format
=> #<Vector2 [1.0,2.0]>

(add! a 1)
=> #<Vector2 [2.0,3.0]>
44. Lots of trade-offs

- Native libraries vs. pure JVM
- Mutability vs. immutability
- Specialized elements (e.g. doubles) vs. generalised elements (Object, Complex)
- Multi-dimensional vs. 2D matrices only
- Memory efficiency vs. runtime efficiency
- Concrete types vs. abstraction (interfaces / wrappers)
- Specified storage format vs. multiple / arbitrary storage formats
- License A vs. License B
- Lightweight (zero-copy) views vs. heavyweight copying / cloning
45. What's the best data structure?

A length-50 "range" vector: 0 1 2 3 .. 49

1. Clojure vector
   [0 1 2 .... 49]
2. Java double[] array
   new double[] {0, 1, 2, .... 49};
3. Custom deftype
   (deftype RangeVector
     [^long start
      ^long end])
4. Native vector format
   (org.jblas.DoubleMatrix. params)
48. Clojure Protocols

;; from clojure.core.matrix.protocols
(defprotocol PSummable
  "Protocol to support the summing of all elements in
   an array. The array must hold numeric values only,
   or an exception will be thrown."
  (element-sum [m]))

1. Abstract interface
2. Open extension
3. Fast dispatch
49. Protocols are fast and open

Function call costs (ns)            Open extension
- Multimethod*             89.0        ✓
- Protocol call            13.8        ✓
- Boxed function call       7.9        ✘
- Primitive function call   1.9        ✘
- Static / inlined code     1.2        ✘

* Using the class of the first argument as the dispatch function
50. Typical core.matrix call path

User code:
(esum [1 2 3 4])

core.matrix API (matrix.clj):
(defn esum
  "Calculates the sum of all the elements in a
   numerical array."
  [m]
  (mp/element-sum m))

Implementation code:
(extend-protocol mp/PSummable
  SomeImplementationClass
  (element-sum [a]
    ………))
51. Most protocols are optional

MANDATORY – required for a working core.matrix implementation:
  PImplementation
  PDimensionInfo
  PIndexedAccess
  PIndexedSetting

OPTIONAL – everything in the API will work without these:
  PMatrixEquality, PSummable, PRowOperations, PVectorCross, PCoercion,
  PTranspose, PVectorDistance, PMatrixMultiply, PAddProductMutable,
  PReshaping, PMathsFunctionsMutable, PMatrixRank, PArrayMetrics,
  PAddProduct, PVectorOps, PMatrixScaling, PMatrixOps, PMatrixPredicates,
  PSparseArray, .....

core.matrix provides a "default implementation" for the optional protocols;
implement them for improved performance.
52. Default implementations

;; from clojure.core.matrix.impl.default
;; (protocol names come from the namespace clojure.core.matrix.protocols)
(extend-protocol mp/PSummable
  Number                      ;; implementation for any Number
  (element-sum [a] a)
  Object                      ;; implementation for an arbitrary Object
  (element-sum [a]            ;; (assumed to be an array)
    (mp/element-reduce a +)))
53. Extending a protocol

(extend-protocol mp/PSummable
  (Class/forName "[D")   ;; class to implement the protocol for,
                         ;; in this case a Java array: double[]
  (element-sum [m]
    (let [^doubles m m]  ;; type hint avoids reflection
      ;; optimised code to add up all the elements of a double[] array
      (areduce m i res 0.0 (+ res (aget m i))))))
54. Speedup vs. default implementation

Timing for element sum of a length-100 double array (ns):
- (esum v) "Default"        3690
- (reduce + v)              2859
- (esum v) "Specialised"     201

15-20x benefit
55. Internal Implementations

:persistent-vector – support for Clojure vectors; immutable; not so fast, but great for quick testing
:double-array – treats Java double[] objects as 1D arrays; mutable – useful for accumulating results etc.
:sequence – treats Clojure sequences as arrays; mostly useful for interop / data loading
:ndarray, :ndarray-double, :ndarray-long, ..... – Google Summer of Code project by Dmitry Groshev; pure Clojure; N-dimensional arrays similar to NumPy; support arbitrary dimensions and data types
:scalar-wrapper, :slice-wrapper, :nd-wrapper – internal wrapper formats, used to provide efficient default implementations for various protocols
57. External Implementations

vectorz-clj – pure JVM (wraps the Java library Vectorz); very fast, especially for vectors and small-medium matrices; the most mature core.matrix implementation at present
Clatrix – uses native BLAS libraries by wrapping the jblas library; very fast, especially for large 2D matrices; used by Incanter
parallel-colt-matrix – wraps the Parallel Colt library from Java; support for multithreaded matrix computations
arrayspace – experimental; ideas around distributed matrix computation; builds on ideas from Blaze, Chapel, ZPL
image-matrix – treats a Java BufferedImage as a core.matrix array; because you can?
59. Mixing implementations

(def A (array :persistent-vector (range 5)))
=> [0 1 2 3 4]

(def B (array :vectorz (range 5)))
=> #<Vector [0.0,1.0,2.0,3.0,4.0]>

(* A B)
=> [0.0 1.0 4.0 9.0 16.0]

(* B A)
=> #<Vector [0.0,1.0,4.0,9.0,16.0]>

core.matrix implementations can be mixed (but: behaviour depends on the first argument).
60. Contents
- Why Clojure for Data Science
- Array Programming Essentials
- core.matrix
- Library Ecosystem Overview
- Examples and discussion
61. Data Science Libraries for Clojure

- Still not as mature as R or Python, but developing rapidly
- Clojure philosophy of small libraries rather than all-encompassing frameworks
- Key areas:
  - Interactive environments
  - Visualisation
  - Databases / data access
  - Realtime data processing
  - Machine Learning
63. Visualisation

quil – Clojure interface to the Processing library/environment for dynamic visualisations
gyptis – Clojure + ClojureScript library for producing Vega.js graphs
imagez – library for generating and manipulating bitmap images
64. Databases / data access

Datomic – awesome database supporting immutable "time travel" over database history; great scalability for reads / analytics
java.jdbc – Clojure library for access to SQL databases; a mature workhorse
Yesql – arguably a better way to do SQL in Clojure
Sparkling – Clojure library for Apache Spark
flambo – Clojure library for Apache Spark
Cascalog – Clojure library for querying and data processing with Apache Hadoop
... and many, many more
65. Realtime Data Processing

Storm – mature stream processing library for highly scalable realtime computation over large distributed clusters of compute nodes
Onyx – more modern / better designed alternative to Storm with growing traction
core.async – "roll your own" concurrent data processing pipelines
66. Machine Learning

clj-ml – wrapper for the popular and venerable "Weka" machine learning library for Java
enclog – wrapper for the "Encog" machine learning library
Clortex / Comportex – libraries implementing Numenta's Hierarchical Temporal Memory model
synaptic – basic neural networks in Clojure
(also) a state of the art "Deep Learning" library
67. Contents
- Why Clojure for Data Science
- Array Programming Essentials
- core.matrix
- Library Ecosystem Overview
- Examples and discussion
When I say language extension, it is of course in the sense that Clojure seems to have this ability to absorb new paradigms just by plugging in new libraries.
Clojure already stole many good pure functional programming techniques from languages like Haskell
And of course we have the macro meta-programming capabilities from Lisp
More recently we’ve got core.logic bringing in Logic programming, inspired by Prolog and miniKanren
And core.async bringing in the Communicating Sequential Processes with some syntax similar to Go
And core.matrix is designed very much in the same way, to provide array programming capabilities. And if we want to trace the roots of array programming, we can go all the way back to this language called APL
About the same age as Lisp? First specified in 1958.
Love the fact that it has its own keyboard, with all these symbols inspired by mathematical notation.
And you get some crazy code.
Might seem like a bit of a dinosaur now, but array programming has had quite a renaissance in recent years.
This is because of the increasing importance of data science and numerical computing in many fields.
- So we've seen languages like R that provide an environment for statistical computing
Highlights the value of the paradigm – there is clearly a demand for these kinds of numerical computing capabilities.
Start off with one of my favourite quotes, because it contains a pretty important insight.
“It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures”
There is of course one error here….. (click)
We should of course be talking about an abstraction here, not a concrete data structure.
A great example of this is the sequence abstraction in Clojure – there are literally hundreds of functions that operate on Clojure sequences. Because so many functions produce and consume sequences, it gives you many different ways to compose them together.
And it’s more than just the clojure.core API: other code can build on the same abstraction, which means that the composability extends to any code you write that uses the same abstraction. It makes entire libraries composable.
In some ways I think the key to building systems using simple, composable components is about having shared abstractions.
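As a concrete illustration of that composability (standard clojure.core only, no extra libraries): each step below consumes a sequence and produces a sequence, so the functions thread together freely.

```clojure
;; many functions, one abstraction
(->> (range 10)
     (map inc)             ;; (1 2 ... 10)
     (filter even?)        ;; (2 4 6 8 10)
     (partition 2)         ;; ((2 4) (6 8)) - the trailing 10 is dropped
     (map #(reduce + %)))
;; => (6 14)
```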
We’ve taken this principle very much to heart in core.matrix, our abstraction of course is the array - more specifically the multi-dimensional array
And the rest of core.matrix is really all about giving you a powerful set of composable operations you can do with arrays
Overloaded terminology!
- Vector = 1D array (maths / array programming sense) – Also a Clojure vector
- Matrix: conventionally used to indicate a 2 dimensional numerical array,
- Array: in the sense of the N-dimensional array, but also the specific concrete example of a Java array
Dimensions: also overloaded! Here using in the sense of the number of dimensions in an array, but it’s also used to refer to the number of dimensions in a vector space, e.g. 3 dimensional Euclidean space.
If we’re lucky it should be clear from the context what we’re talking about.
Give you an idea about how general array programming can be –
An array is a way of representing a function using data
Instead of computing a value for each combination of inputs, we’re typically pre-computing all such values
Today I’m going to be talking about core.matrix, and it’s quite appropriate that I’m talking about it here today at the Clojure Conj because this project actually came about as a direct result of conversations I had with many people at last year’s Conj
The focus of those discussions was very much about how we could make numerical computing better in Clojure.
And the solution I’ve been working on over the past year along with a number of collaborators is core.matrix, which offers array programming as a language extension to Clojure
Example of adding a 3D array.
Java it’s just a big nested loop…
Clojure you can do it with nested maps, which is a bit more of a functional style, but still you’ve got this three-level nesting
With core.matrix it’s really simple. We just generalise + to arbitrary multi-dimensional arrays and it all just works
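As a sketch of what this looks like (assuming core.matrix is on the classpath; the nested-vector result shown is for the default persistent-vector implementation):

```clojure
(require '[clojure.core.matrix :as m])

;; Two 2x2x2 arrays, written as nested Clojure vectors
(def a [[[1 2] [3 4]] [[5 6] [7 8]]])
(def b [[[1 1] [1 1]] [[1 1] [1 1]]])

;; add is generalised to arbitrary dimensions:
;; no loops, no index juggling
(m/add a b)
;; => [[[2 3] [4 5]] [[6 7] [8 9]]]
```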
Does conciseness matter? Well if you’re writing a lot of code manipulating arrays it’s going to save you quite a bit of time, but more importantly it makes it much easier to avoid errors. Very easy to get off-by-one errors in this kind of code.
core.matrix gives you a nice DSL that does all the index juggling for you
Also it helps you to be mentally much closer to the problem that you are modelling. You ideally want an API that reflects the way that you think about the problem you are solving.
So today I’m going to talk about core.matrix with three different lenses
First I want to talk about the abstraction – what are these arrays?
Then I’m going to talk about the core.matrix API
Implementation: how does this all work, some of the engineering choices we’ve made
So lets talk about the core.matrix API.
This isn’t going to be an exhaustive tour, but I’m going to highlight a few of the key features to give you a taste of what is possible
One of the important API design objectives was to exploit the “natural equivalence of arrays to nested Clojure vectors”.
1D array is a Clojure vector, 2D array is like a vector of vectors
Most things in the core.matrix API work with nested Clojure vectors.
This is nice – gives a natural syntax, and great for dynamic, exploratory work at the REPL.
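For example, at the REPL (a sketch assuming core.matrix is available; results shown assume the default persistent-vector implementation), plain nested vectors work directly with the API:

```clojure
(require '[clojure.core.matrix :as m])

;; Plain nested Clojure vectors can be used directly –
;; no explicit array construction step is needed
(m/mget [[1 2] [3 4]] 0 1)      ;; => 2
(m/transpose [[1 2] [3 4]])     ;; => [[1 3] [2 4]]
```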
The most fundamental attribute of an array is probably the shape
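A quick sketch of querying shapes (assuming core.matrix is on the classpath):

```clojure
(require '[clojure.core.matrix :as m])

(m/shape 1)                   ;; a scalar has no dimensions => nil
(m/shape [1 2 3])             ;; => [3]    (1D, three elements)
(m/shape [[1 2 3] [4 5 6]])   ;; => [2 3]  (2 rows, 3 columns)
```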
Arrays are compositions of arrays!
This is one of the best signs that you have a good abstraction: if the abstraction can be recursively defined as a composition of the same abstraction.
So of course we have quite a few different functions that let you work with slices of arrays.
Most useful is probably the slices function, which cuts an array into a sequence of its slices
Pretty common to want to do this – imagine if each slice is a row in your data set
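As a sketch (assuming core.matrix is available), slicing a 2D matrix gives you its rows:

```clojure
(require '[clojure.core.matrix :as m])

;; slices cuts an array along its first dimension –
;; each slice of a 2D matrix is a row
(m/slices [[1 2] [3 4]])
;; => a sequence of the two rows: ([1 2] [3 4])
```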
We define array versions of the common mathematical operators.
These use the same names as clojure.core
You have to use the clojure.core.matrix.operators namespace if you want to use these names instead of the standard clojure.core operators
Question: what should happen if we add a scalar number to an array?
We have a feature called broadcasting, which allows a lower dimensional array to be treated as a higher dimensional array
The idea of broadcasting also generalises to arrays!
Here the semantics is the same, we just duplicate the smaller array to fill out the shape of the larger array
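Sketched in code (assuming core.matrix is on the classpath, using its operators namespace):

```clojure
(require '[clojure.core.matrix.operators :as op])

;; A scalar broadcasts to the shape of the array...
(op/+ [[1 2] [3 4]] 10)      ;; => [[11 12] [13 14]]

;; ...and a lower-dimensional array broadcasts too:
;; the row [10 20] is duplicated for each row of the matrix
(op/+ [[1 2] [3 4]] [10 20]) ;; => [[11 22] [13 24]]
```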
So we have some rules for broadcasting
Note that it only really makes sense for elementwise operations. You can broadcast arrays explicitly if you want to, but it only happens automatically for elementwise operations at present.
Can only add leading dimensions.
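Explicit broadcasting can be sketched like this (assuming core.matrix is available) – note how the new dimension is added at the front:

```clojure
(require '[clojure.core.matrix :as m])

;; Broadcasting [1 2] (shape [2]) to shape [3 2]
;; duplicates it along a new leading dimension
(m/broadcast [1 2] [3 2])
;; => [[1 2] [1 2] [1 2]]
```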
So lets talk about some higher order functions
Two of my favourite Clojure functions – map and reduce are extremely useful higher order functions
So one of the interesting observations about array programming is that you can also see it as a generalisation of sequences in multiple dimensions, so it probably isn’t too surprising that many of the sequence functions in Clojure actually have a nice array programming equivalent
emap is the equivalent of map: it maps a function over all elements of an array. The key difference is that it preserves the structure of the array, so here we’re mapping over a 2x2 matrix and therefore we get a 2x2 result
ereduce is the equivalent of reduce over all elements
eseq is a handy bridge between core.matrix arrays and regular Clojure sequences – it just returns all the elements of an array in order
Note row-major ordering of eseq and ereduce
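The three functions side by side, as a sketch (assuming core.matrix is on the classpath):

```clojure
(require '[clojure.core.matrix :as m])

(m/emap inc [[1 2] [3 4]])    ;; structure-preserving => [[2 3] [4 5]]
(m/ereduce + [[1 2] [3 4]])   ;; reduce over all elements => 10
(m/eseq [[1 2] [3 4]])        ;; row-major element seq => (1 2 3 4)
```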
Basically mutability is horrible. You should be avoiding it as much as you can
But it turns out that it is needed in some cases – performance matters for numerical work
Mutability OK for library implementers, e.g. accumulation of a result in a temporary array
Once a value is constructed, shouldn’t be mutated any more
Usually 4x performance benefit isn’t a big deal – unless it happens to be your bottleneck
There are cases where it might be important: e.g. if you are crunching through a lot of data and need to add to some sort of accumulator…
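One way this accumulator pattern might look (a sketch, assuming core.matrix with its bundled NDArray implementation; the mutating `add!` variant requires a mutable implementation):

```clojure
(require '[clojure.core.matrix :as m])

;; Accumulate rows into a mutable temporary array,
;; avoiding the allocation of an intermediate array per step
(m/set-current-implementation :ndarray)
(def acc (m/mutable (m/array [0.0 0.0 0.0])))
(doseq [row [[1 2 3] [10 20 30]]]
  (m/add! acc row))            ; add! mutates acc in place
;; acc now holds [11.0 22.0 33.0]
```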
Clearly this is insane – why so many matrix libraries?
This explains the problem. But doesn’t really help us….
The point is – there isn’t ever going to be a perfect right answer when choosing a concrete data type to implement an abstraction.
There are always going to be inherent advantages of different approaches
Luckily we have a secret weapon, and I think this is actually what really distinguishes core.matrix from all other array programming systems
Of course the secret weapon is Clojure protocols.
Here’s an example – PSummable is a very simple protocol that allows you to compute the sum of all values in an array
Three things are important to know about
First is that they define an abstract interface – which is exactly what we need to define operations that work on our array abstraction
Secondly they feature open extension: which means that we can solve the expression problem and use protocols with arbitrary types – importantly, this includes types that weren’t written with the protocol in mind – e.g. arbitrary Java classes
Third feature is really fast dispatch – which is important if we want core.matrix to be useful in high performance situations.
Protocols are really the “sweet spot” of being both fast and open
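Open extension in action might look like this sketch (assuming the core.matrix protocols namespace, where PSummable’s method is `element-sum`) – extending the protocol to a type that was never written with it in mind:

```clojure
(require '[clojure.core.matrix.protocols :as mp])

;; Open extension: make raw Java double[] arrays summable,
;; even though double[] was never written with this protocol in mind
(extend (Class/forName "[D")
  mp/PSummable
  {:element-sum
   (fn [^doubles m]
     (areduce m i acc 0.0 (+ acc (aget m i))))})

(mp/element-sum (double-array [1 2 3]))  ;; => 6.0
```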
We benchmarked a pretty wide variety of different function calls
It’s easy to make a working core.matrix implementation!
It’s more work if you want to make it perform across the whole API
But that’s OK because it can be done incrementally
So hopefully this provides a smooth development path for core.matrix implementations to integrate
The secret is having default implementations for all protocols, that get used if you haven’t extended the protocol for your particular type
Note that the default implementation delegates to another protocol call – this is generally the case, ultimately all these protocol calls have to be implemented in terms of the lower-level mandatory protocols if we want them to work on any array.
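The layering can be sketched with hypothetical protocols (these names are illustrative, not the actual core.matrix source): an optional protocol’s default implementation falls back on a mandatory lower-level protocol.

```clojure
(defprotocol PElementSeq
  (el-seq [m]))   ; "mandatory": every implementation must provide this

(defprotocol PSum
  (el-sum [m]))   ; "optional": has a default in terms of el-seq

;; The mandatory protocol, implemented for nested Clojure collections
(extend-protocol PElementSeq
  clojure.lang.Sequential
  (el-seq [m] (seq (flatten m))))

;; Default: any type that hasn't extended PSum itself
;; gets this fallback, which delegates to el-seq
(extend-protocol PSum
  Object
  (el-sum [m] (reduce + (el-seq m))))

(el-sum [[1 2] [3 4]])   ;; => 10, via the default path
```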
Value of a specialised implementation
Makes some operations very efficient
- For example if you want to transpose an NDArray, you just need to reverse the shape and reverse the strides.
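The strided-array trick can be sketched in plain Clojure (a hypothetical toy, not the NDArray source): element [i j] lives at flat offset i·si + j·sj, so transposing just reverses the shape and strides – no data is copied.

```clojure
;; A toy strided array: flat data plus shape and strides
(defn make-strided [data shape strides]
  {:data data :shape shape :strides strides})

(defn sget [{:keys [data strides]} & ix]
  (nth data (reduce + (map * strides ix))))

;; Transpose = reverse shape and strides, zero copying
(defn stranspose [a]
  (-> a
      (update :shape (comp vec reverse))
      (update :strides (comp vec reverse))))

(def a (make-strided [1 2 3 4 5 6] [2 3] [3 1]))  ; 2x3, row-major
(sget a 1 2)                ;; => 6
(sget (stranspose a) 2 1)   ;; => 6  (same element, indices swapped)
```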
vectorz-clj: probably the best choice if you want general purpose double numerics
clatrix: probably the best choice if you want linear algebra with big matrices
Not only can you switch implementation: you can also mix them!
Actually quite unique capability
How do we do this? We provide generic coercion functionality – implementations typically use this to coerce the second argument to the type of the first
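Mixing implementations might look like this sketch (assuming core.matrix with its bundled NDArray implementation): the first argument is an NDArray, the second plain nested vectors, and the operation coerces the second to the type of the first.

```clojure
(require '[clojure.core.matrix :as m])

(m/set-current-implementation :ndarray)
(def a (m/array [[1 2] [3 4]]))      ; an NDArray
(m/add a [[10 20] [30 40]])          ; second arg: plain nested vectors
;; the result has the implementation type of the first argument
```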