Slideshare hasn't imported my notes, so here's the link to the Google Presentation: https://goo.gl/Gl4Vhm
Haskell is a statically typed, non-strict, purely functional programming language. It is often talked and blogged about, but rarely used commercially. This talk starts with a brief overview of the language, then explains how Haskell is evaluated and how it deals with non-determinism and side effects using only pure functions. The suitability of Haskell for real-world data science is then discussed, along with some examples of its users, a small Haskell-powered visualization, and an overview of useful packages for data science. Finally, Accelerate is introduced, an embedded DSL for array computations on the GPU, and an ongoing attempt to use it as the basis for a deep learning package.
22. Pure functions
● Output determined only by inputs
● No side effects
=> Result independent of evaluation strategy
Impure functions
● Randomness
● File IO
● Network
● Call impure functions
● Mutations
● Hard to reason about
● Requires reasoning
23. Monads
Ordinary value:
cube :: (Floating a) => a -> a
cube x = x * x * x
Just use the value.

Monad:
cubeM :: (Monad m, Floating a) => m a -> m a
cubeM mx = mx >>= (\x -> return (x * x * x))
Just use the value (inside a function you’ve bound to the monad using >>=).
24. Various Monad >>= implementations
IO monad: after the IO is performed
Maybe monad: if the value is not Nothing
Reader, Writer, State monad: immediately
List monad: for each element

cubeM :: (Monad m, Floating a) => m a -> m a
cubeM mx = mx >>= (\x -> return (x * x * x))
25. class Monad m where
(>>=) :: m a -> (a -> m b) -> m b
return :: a -> m a
-- ...
Monads
● (In general) No way to extract value
● Result of >>= is m b, so no escape from m!
● Monads can function as tags in your source code
26. class Monad m where
(>>=) :: m a -> (a -> m b) -> m b
return :: a -> m a
-- ...
Monads
● Return representation of side effects
● Control evaluation order
● Move non-determinism away from pure code
● Tag values resulting from impure computation
● Store information between computations
27. Syntactic sugar: Imperative syntax
● Each line evaluated inside a function passed to >>=
● Evaluation order of lines guaranteed
● answer is the name bound to an argument of one of these functions. It is available to functions defined inside this function.
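A small sketch of what that desugaring looks like (the code here is mine, but the shape is standard):

-- do notation...
main :: IO ()
main = do
  answer <- getLine                  -- "answer" is in scope for the rest of the block
  putStrLn ("You said: " ++ answer)

-- ...desugars to functions passed to >>=:
main' :: IO ()
main' = getLine >>= \answer ->
        putStrLn ("You said: " ++ answer)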
30. Libraries required for data science
Fast Vectors, Arrays, Linear Algebra
Machine learning, Deep learning
Probability and statistics
Big data
Plotting, Graphs, Visualization
31. Vectors, Arrays, Linear Algebra
Vec, Linear, Repa, Accelerate
Use type level literals to encode dimensions of arrays (Repa, Accelerate)
Use type level literals to encode length of vectors (Linear, Vec)
Accelerate EDSL for running computations on the GPU!
Compatibility - Use data types from Linear on Accelerate backends
34. Big Data
No Spark library, unfortunately
Hadron
Misc Hadoop libraries
Haskell-HBase, ElasticSearch, Cassandra, MongoDB, Redis
CloudHaskell
Kafka, ZeroMQ
Various DB connectors
39. Example: Density of OpenStreetMap points
Raw OSM points: 78 GB uncompressed, 2.9 billion points.
Plot the density of these on a globe.
Use Triangular binning because it might look cool
40. Data types
Point data type. Just use the Vec library
Triangle data type. Tuple of points
A point that stores extra info, for insertion into a KD Tree
55. Many types of space leak
http://blog.ezyang.com/2011/05/space-leak-zoo/
Enough to need their own zoo!
Memory leak
Strong reference leak
Thunk leak
Live variable leak
Streaming leak
Stack overflow
Selector leak
Optimization-induced leak
Thread leak
60. Data.IntMap
Strict or Lazy variety? Persistent or Ephemeral?
“The implementation is based on big-endian patricia trees. This data structure performs especially well on binary operations like union and intersection. However, my benchmarks show that it is also (much) faster on insertions and deletions when compared to a generic size-balanced map implementation (see Data.Map).”
● Chris Okasaki and Andy Gill, "Fast Mergeable Integer Maps", Workshop on ML, September 1998, pages 77-86, http://citeseer.ist.psu.edu/okasaki98fast.html
● D. R. Morrison, "PATRICIA -- Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM, 15(4), October 1968, pages 514-534.
61. Data.IntMap is a persistent data structure!
Result => Horrendous space leak!
Fix by periodically rebuilding it.
Or, give in and use a mutable vector.
Every value is immutable, and every function is deterministic and free of side effects.
Usually compiled to native machine code, mainly by GHC. Can also be compiled to JavaScript or interpreted.
Statically typed. This gives the language a great deal of safety.
No implicit conversion between values, which can be annoying for beginners.
In a strict language, functions require their arguments to be evaluated. In Haskell, all values are lazy by default, but explicit strictness is allowed.
The Haskell wiki hints that there is a huge amount of proprietary, closed source financial code written in Haskell.
Facebook have been doing a lot of functional programming, for instance React.js, which now has several Haskell implementations or wrappers. However, from what I’ve read, the main area where they use Haskell seems to be their data science team.
BAE have used it.
There are nearly 80 packages in the package index under the section Bioinformatics, more than most other sections.
Oddly enough, Haskell is so underused that it’s easy to find top developers to work on your startup.
Google, Microsoft, Intel, and NVIDIA have all used Haskell.
Proof that it’s possible to throw something together quickly in Haskell
What follows is a whirlwind tour of Haskell, skimming through the most important bits, and ignoring the rest of it.
Putting expressions next to each other applies the left one to the right one. With multiple arguments, function application is left-associative.
This shows the type of the addition operator. The double colon denotes a type signature. Num a and anything to the left of a => is a type constraint. These allow us to be flexible about the types in our program, without compromising type safety
You don’t have to apply a function to all of its arguments. Here, we’ve partially applied addition to the number 5, and the result was a function that accepts the remaining arguments.
Higher order functions are functions that operate on or return other functions. Therefore all Haskell functions with arity > 1 are higher order functions. Map is a higher order function because it accepts a function as an argument. [a] denotes a list of type a. Here, we double all elements of a list by partially applying the multiplication operator to the number 2, then using map.
Just wanted to mention the lambda syntax. The backslash is supposed to look like a lambda.
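To make these notes concrete, a few minimal examples (addFive, double, and squares are hypothetical names of my own):

-- The type of addition, as above: (+) :: Num a => a -> a -> a

addFive :: Num a => a -> a
addFive = (+) 5                         -- partial application: still waiting for one argument

double :: Num a => [a] -> [a]
double = map (* 2)                      -- map is higher order: it takes a function as an argument

squares :: [Int]
squares = map (\x -> x * x) [1, 2, 3]   -- lambda syntax: the backslash stands in for λ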
This is equivalent to “class” in java/python or “defrecord” in clojure.
ChessGame is the type, and NotStarted, PlayerTurn, and CheckMate are its constructors.
A more convenient syntax for defining data types
firstName, lastName, and personID are automatically declared as accessor functions
I’ve also introduced a type variable s here. The type of the Person record depends on the type s. This might be useful because there are many different ways of representing a string in Haskell.
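A sketch of the declarations these notes describe (the constructor and field names follow the notes, but the exact slide code isn’t preserved here):

data ChessGame = NotStarted | PlayerTurn | CheckMate

data Person s = Person
  { firstName :: s      -- accessor generated automatically: firstName :: Person s -> s
  , lastName  :: s
  , personID  :: Int
  }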
_ = don’t care
Pattern matching is useful for branching on different constructors and values.
Pattern matching is useful for extracting fields for use in the function body
Pattern matching is powerful. It works on lists and tuples
Useful in many other places in Haskell
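For instance (reusing the hypothetical ChessGame type from above; these examples are mine):

describe :: ChessGame -> String
describe NotStarted = "not started"      -- branch on a constructor
describe CheckMate  = "checkmate"
describe _          = "in progress"      -- _ = don't care

swap :: (a, b) -> (b, a)
swap (x, y) = (y, x)                     -- extract tuple fields

firstOrZero :: [Int] -> Int
firstOrZero []    = 0                    -- match the empty list
firstOrZero (x:_) = x                    -- match the head, ignore the tail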
Lists are just ordinary data definitions. They can be constructed by making an empty list, or by consing an element onto a list of similar elements. (:) is just a constructor.
Recursive definition of map using pattern matching.
At runtime, when map is applied, xs need not have been evaluated yet. This means that xs can be an infinite list! The recursion only stops when the caller stops evaluating the results that map returns.
Since the result returned is in Weak Head Normal Form, f, and the recursive part of map are left unevaluated.
Rarely have to define functions this way in real life.
In most languages, this would be inefficient, and cause a stack overflow for lists of a certain size. In Haskell, unevaluated bindings are represented with thunks, which are lightweight and are stored on the heap.
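The definition the notes are describing is the standard one (in a real module you’d import Prelude hiding (map) to avoid the name clash):

map :: (a -> b) -> [a] -> [b]
map _ []     = []
map f (x:xs) = f x : map f xs

-- Laziness means this terminates, even though [1..] is infinite:
--   take 5 (map (* 2) [1..])  ==  [2, 4, 6, 8, 10]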
Suppose we define a data type to represent an AST for JSON….
Typeclasses allow you to write type constraints.
Their purpose is polymorphism over types.
A typeclass is a set of types for which certain functions have been defined.
The ‘instance’ declaration makes a type a member of a typeclass, and allows you to define those functions
Types can be members of multiple typeclasses
Typeclasses are open!
Much more flexible than inheritance!
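A sketch of the kind of thing described here: a JSON AST plus a hypothetical ToJSON typeclass (the names are mine, not a real library’s):

data JSON = JNull
          | JBool Bool
          | JNumber Double
          | JString String
          | JArray [JSON]
          | JObject [(String, JSON)]

class ToJSON a where          -- the set of types that can be rendered as JSON
  toJSON :: a -> JSON

instance ToJSON Bool where
  toJSON = JBool

-- Typeclasses are open: anyone can add instances later, in any module.
instance ToJSON a => ToJSON [a] where
  toJSON = JArray . map toJSON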
Expressions can be represented as graphs, where each node is a value, or an unevaluated thunk.
The next step is the application of graph reduction rules by the STG machine, which are based on the Lambda Calculus. The STG machine produces assembly language which carries out the operations required by these reductions. STG stands for Spineless Tagless G-machine, where G stands for graph. It lives in GHC and converts the lowest level of Haskell to Assembly language.
Sharing is applied to avoid recomputing the same values
The graph is reduced. Function application is a type of reduction. It’s more complicated than this in real life.
Expressions can be represented as graphs, where each node is a value, or an unevaluated thunk.
The next step is the application of graph reduction rules
If the top level of a graph is a constructor, the graph is said to be in Weak Head Normal Form. This is used as the return value, and the lower levels are evaluated lazily, as needed. This is how laziness is implemented in Haskell
This symbol is called Bottom. Bottom is an expression that can’t be evaluated, due to an infinite loop or an error. Bottom is a member of every type.
This is pronounced “bind”
In this slide I’m comparing monads to ordinary plain values. Monad is a typeclass whose instances define certain functions including >>= and return. The purpose of >>= is to allow us to provide a function that uses the plain value, and allow the Monad implementation to control how that function is called. We choose how to use the value inside the function, but the Monad’s implementation chooses how and when to call our function.
Different Monad instances implement >>= differently, and will call your function in different ways.
Can’t extract the plain value in general: there’s no way of using the a and returning anything other than an m b.
If you return a representation of your side effects, they are no longer really side-effects, they are the main return value of your function
getLine reads a line from STDIN. The value it returns does not directly contain the value it has read, so it can be considered pure.
Furthermore, chained use of >>= constructs a chain of dependent computations. This guarantees the evaluation order.
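Concretely, a chain like this can only run in one order, because each step is handed to the previous step’s >>= (a standard example, not from the slides):

main :: IO ()
main =
  getLine >>= \first ->
  getLine >>= \second ->               -- can only run after the first read
  putStrLn (first ++ " " ++ second)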
GHCI: Evaluate expressions, determine their types, inspect modules
Hoogle: Search for functions by their type signature
Cabal ~ a bit like make + package installer
Hackage is a package DB. Like most language package databases, contains a large amount of unmaintained, disused, abandonware. Unlike most language-specific package databases, this abandonware has a good chance of still working!
Fay, Haste, and GHCJS are Haskell to Javascript compilers.
Neural nets, Markov models, SVMs, Hopfield network, Restricted Boltzmann Machines, a Convnet, genetic algorithms, dynamic time warping, Many different clustering algorithms, and Kalman filters.
Unfortunately, none of the neural networks are capable of using the GPU yet, but one of my side projects is to build a deep learning library on top of Accelerate. The author of deeplearning-hs works for Facebook AI research and has experimented with Accelerate. dnngraph can generate models for Caffe and Torch.
Many different packages for probability and statistics. You can also call R from Haskell
No spark library unfortunately.
Various libraries for interacting with Hadoop, but only two libraries for running Haskell on Hadoop
Hadron. Hadron uses MapReduce streaming, and conduits. It requires Haskell to be installed on every node.
Cloud Haskell is a distributed concurrency framework.
There are also various connectors to the usual suspects like MySQL and PostgreSQL.
Haskell has many graph plotting libraries.
OpenGL and Haskell are an odd combination, but there are bindings, and they use Monads to represent the internal OpenGL state
Haskell is very good for writing declarative DSLs. There are libraries for writing HTML and CSS using the do notation. The result is checked, and it looks very much like HAML or SASS.
This generates a scatter plot and outputs it to a file. You could easily do this from GHCI
Here are some of the other plots this library is capable of. There are many other plotting libraries.
We’ll generate a spherical mesh using recursion
Octahedron. This is supposed to give you fairly even triangles. Turns out that if you start with a Tetrahedron, then after the first refinement on a given face, the middle triangle is twice the area of each of the other triangles. Also, Octahedrons are really easy to hardcode.
The do notation uses the Maybe Monad. If a line returns Nothing, then the rest of the block is skipped. We’ll find the nearest neighbor and inspect the triangles it’s part of, binning the point into the nearest one.
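The point-binning code itself isn’t reproduced in these notes, but here’s a runnable toy with the same shape (safeHead and firstSum are my own names):

safeHead :: [a] -> Maybe a
safeHead []    = Nothing
safeHead (x:_) = Just x

firstSum :: [Int] -> [Int] -> Maybe Int
firstSum xs ys = do
  x <- safeHead xs       -- if this is Nothing, the rest of the block is skipped
  y <- safeHead ys
  return (x + y)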
We’ll take a break now to look at some hacks and some problems we come across in Haskell
Here are some functions and types that are considered harmful, but have their uses. Please use them carefully.
undefined is bottom and if your program tries to evaluate it, it will crash at runtime. However, it can inhabit any type. This is useful for making dummy values to solve type errors in GHCI.
unsafePerformIO extracts a value from an IO monad by performing the IO. Using this function can introduce impurity into pure functions, resulting in undefined behaviour
IORef and MVar are mutable variables. Excessive use of these defeats the point of functional programming.
unsafeCoerce changes the type of a value without changing the value. This can cause segfaults.
trace wraps a unary function, printing out a string when it is evaluated
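For reference, trace from Debug.Trace has type String -> a -> a; a typical (illustrative) use:

import Debug.Trace (trace)

-- Prints its message whenever the result is actually demanded,
-- which also makes it a crude probe into lazy evaluation.
fib :: Int -> Int
fib n = trace ("fib " ++ show n) result
  where
    result | n < 2     = n
           | otherwise = fib (n - 1) + fib (n - 2)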
Haskell has quite a steep learning curve, because of its confusing jargon and type-system complications.
The monad tutorial fallacy: people imagine that the single explanation that finally made a concept click for them will work for everyone else, when actually it was just the last one in a long line of explanations they had read.
Despite ensuring type correctness and guarding against race conditions with immutability, runtime errors are still possible.
You can cause runtime errors by incorrectly using the FFI, or by not covering all cases in a pattern match, or simply by throwing an exception
If you don’t catch all cases in a pattern match, your program might bork at runtime. Annoyingly, this could be prevented at compile time, but only by setting a compiler flag
Haskell has Exceptions
error is essentially undefined with a message attached (in fact, undefined is defined in terms of error). It’s designed to represent a programming error, and bork with a message. Unfortunately it’s often used, so at some point you might need to catch it. Exceptions, however, are just data types, where the infrastructure required to throw and catch them is provided by a library.
fail is a method which must be implemented by all monads. In the IO monad in Haskell 98, fail calls error, but the Maybe monad has a sensible implementation. fail’s presence in the Monad class is generally regarded as a bad design decision, and it should be avoided.
There are many abstractions for dealing with different errors and exceptions, and many different monads. The monad transformer ErrorT is a good way of wrapping a monad with error and exception handling.
Maybe and Either are elegant ways of handling computations that fail. They carry a result, or alternatively, some failure information.
There are too many disparate ways of expressing and handling errors in Haskell.
If you divide a floating-point number by 0, you get Infinity rather than an error. Haskell could be safer if division returned a Maybe.
If you really wanted to, you could redefine division to explicitly handle division by 0, and use -Wall to make you handle it at compile time.
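A minimal sketch of that idea (safeDiv is a hypothetical name):

safeDiv :: Double -> Double -> Maybe Double
safeDiv _ 0 = Nothing         -- make the failure case explicit...
safeDiv x y = Just (x / y)    -- ...so callers must pattern match on the result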
Space leaks are a problem. It’s very easy to consume crazy amounts of memory in Haskell.
There are so many different types of space leak that they need their own zoo. You’ll probably come across that page at some point.
Suppose we want to find the length of a lazy list. The elegant but naive implementation on the left is slow, leaks memory, and overflows the stack.
The length of xs will be evaluated first, then the addition of 1. This causes traversal to the end of the list, building up thunks. When the end of the list is reached, the thunks must be evaluated in turn, then they can be freed. This is a lot of work just to perform simple addition.
In the implementation on the right, we recurse last rather than first. len’ is tail recursive. This means that the recursive call is the last (or outermost) computation. This can be optimised into iteration which does not build a chain of thunks.
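Reconstructed from the description (the slide code itself isn’t in these notes); note that the accumulator also needs to be strict, via BangPatterns, or the thunks just pile up in acc instead:

{-# LANGUAGE BangPatterns #-}

-- Naive: builds a chain of (1 +) thunks as long as the list.
len :: [a] -> Int
len []     = 0
len (_:xs) = 1 + len xs

-- Tail recursive with a strict accumulator: constant stack and heap.
len' :: [a] -> Int
len' = go 0
  where
    go !acc []     = acc
    go !acc (_:xs) = go (acc + 1) xs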
Haven’t bothered with benchmarks, in case they might not be meaningful.
For a given task, Haskell can be slower, because of unnecessary copying that might occur. However, due to laziness, it can skip unnecessary computation, so it’s possible that a Haskell implementation could be faster than a C/C++ implementation!
Conduits are a way of piping data around in constant memory. Sources are conduits that only produce and Sinks are conduits that only consume.
await consumes a value from the input, and yield produces a value.
The conduit recurses on itself
await consumes
yield produces
Recurse
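A sketch of that pattern against the conduit package (using the modern runConduitPure and .| spellings; older releases used $$ and =$= instead):

import Data.Conduit
import qualified Data.Conduit.List as CL

-- await consumes, yield produces, and the conduit recurses on itself:
doubler :: Monad m => ConduitT Int Int m ()
doubler = do
  mx <- await
  case mx of
    Nothing -> return ()                 -- upstream exhausted
    Just x  -> yield (x * 2) >> doubler  -- recurse

-- Runs in constant memory however long the input is:
example :: [Int]
example = runConduitPure (CL.sourceList [1 .. 10] .| doubler .| CL.consume)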
Turns out this leaks memory like a sieve, then crashes unless you have vast amounts of RAM, because of a bad choice of data structure.
Big-endian patricia trees are persistent data structures, so IntMap stores its entire state history!
The moral of the story is, it’s sometimes OK to use mutable data structures in Haskell.
Success! Triangular map of the density of heatmap points in OSM.
This is how to multiply two matrices in Accelerate, on the GPU. It works by replicating the matrices into 3d arrays, and transposing one of them, so that matrix multiplication is an elementwise product and a summation. In practice, GPU cycles are not wasted replicating the matrices, thanks to fusion that takes place in Accelerate.
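For reference, a sketch of that formulation, close to the well-known Accelerate matrix-multiply example (exact API details vary between accelerate versions):

import Data.Array.Accelerate as A

matMul :: A.Num e => Acc (Array DIM2 e) -> Acc (Array DIM2 e) -> Acc (Array DIM2 e)
matMul a b = A.fold (+) 0 (A.zipWith (*) aRepl bRepl)
  where
    Z :. rowsA :. _     = unlift (shape a) :: Z :. Exp Int :. Exp Int
    Z :. _     :. colsB = unlift (shape b) :: Z :. Exp Int :. Exp Int
    -- Replicate both matrices into 3-D arrays (fused away, so no real copies)...
    aRepl = A.replicate (lift (Z :. All   :. colsB :. All)) a
    bRepl = A.replicate (lift (Z :. rowsA :. All   :. All)) (A.transpose b)
    -- ...so the product becomes an elementwise multiply and a fold (sum).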
One of my side projects is to implement symbolic differentiation for Accelerate, so that it’s easy to implement Deep learning, and you don’t have to spend time differentiating by hand when you want to add LSTMs or GRUs to your network.