This document provides an introduction to the Julia programming language and demonstrates its use for statistical computing and analysis. It summarizes Julia's capabilities for technical computing and compares its performance to R, Python, and other languages. Examples showing Julia's speed advantages include bootstrap analysis of correlation coefficients and MCMC sampling for Bayesian logistic regression. Overall, Julia is shown to be significantly faster than R and other options for many statistical tasks.
This document discusses R and Julia for data analysis and advanced analytics. It provides an overview of R's history, how it works, performance improvements, and use in production. Julia is introduced as a new high-performance dynamic language with similarities to R but faster performance due to its just-in-time compiler and type information. Examples are given comparing the performance of Julia to other languages. The document recommends Julia for those already using C/Fortran and suggests it will be useful for R users once fully developed.
The R language is a project designed to create a free, open-source replacement for S-PLUS, the commercial version of the S language originally developed at AT&T Bell Labs and currently marketed by Insightful Corporation of Seattle, Washington. R is an open-source implementation of S and differs from S-PLUS largely in its command-line-only interface.
Topics Covered:
1. Introduction to R
2. Installing R
3. Why Learn R
4. The R Console
5. Basic Arithmetic and Objects
6. Program Example
7. Programming with Big Data in R
8. Big Data Strategies in R
9. Applications of R Programming
10. Companies Using R
11. What R is not so good at
12. Conclusion
BASIC was originally created in 1963 as a teaching language to simplify programming. It has influenced computer science education and raised the need for coding knowledge. R is a free statistical programming language used for data analysis, modeling, and visualization. It includes many statistical and machine learning methods. UNIX was developed in the late 1960s and became widely used, while Linux is an open-source OS inspired by UNIX. Both operate using commands in a terminal rather than a graphical user interface.
R originated in the 1970s at Bell Labs and has since evolved significantly. It is an open-source programming language used widely for statistical analysis and graphics. While powerful, R has some drawbacks like poor performance for large datasets and a steep learning curve. However, its key advantages including being free, having a large community of users, and extensive libraries have made it a popular tool, especially for academic research.
A presentation on the history, design, and use of R. The talk will focus on companies that use and support R, use cases, where it is going, competitors, advantages and disadvantages, and resources to learn more about R.
Speaker Bio:
Joseph Kambourakis has been the Lead Data Science Instructor at EMC for over two years. He has taught in eight countries and been interviewed by Japanese and Saudi Arabian media about his expertise in Data Science. He holds a Bachelors in Electrical and Computer Engineering from Worcester Polytechnic Institute and an MBA from Bentley University with a concentration in Business Analytics.
This document provides an introduction to R, including what R is, how to install and use it, common mistakes, and data structures. It notes that R was created by Ross Ihaka and Robert Gentleman and now has over 10,000 user-contributed packages covering topics like statistics, graphics, and data analysis. It also gives instructions for installing R from its homepage or an Italian mirror site, using the R console and RStudio interfaces, working in the workspace environment, and saving workspaces to preserve data between sessions.
This short text will quickly get you up to speed on creating visualizations with R's ggplot2 package. It was developed as part of a training for people with no prior experience in R and limited knowledge of general programming concepts. It is a useful first guide for those exploring the field of data science.
This document provides an introduction to using R Studio for statistical analysis. It discusses how to install both R and R Studio on Windows and Mac systems. It then covers creating scripts and files in R Studio, basic R syntax including assigning values to variables, vectors, and strings. The document also demonstrates how to install and load packages to access additional functions, and how to access built-in datasets to practice working with data in R.
This is a presentation on R programming, with slides that pair striking images with useful information and use transitions and animations to keep the material engaging.
Created By - Abhishek Pratap Singh (Aps)
R is a programming language and software environment for statistical analysis and graphics. It originated from S, a statistical programming language developed in the 1970s. R was first released in 1993 and has since grown in popularity due to its ability to run on Linux, Windows and Mac operating systems. It allows users to contribute additional packages to extend its functionality. Getting help in R can be obtained through manuals, online searches, and mailing lists. R has a command line interface but various graphical user interfaces and integrated development environments are also available. Everything in R is an object that has a class and methods, with common functions to define classes, create objects, and extract object elements.
R is a programming language developed as an alternative for S at AT&T Bell Laboratories. It excels at statistical computation and graphic visualization. R is free, open source, and available across platforms. It has over 3,000 packages on CRAN that extend its functionality. R has a steep learning curve and working with large datasets is limited by RAM size. Major companies use R in business.
This document provides an introduction to using R for data science and analytics. It discusses what R is, how to install R and RStudio, statistical software options, and how R can be used with other tools like Tableau, Qlik, and SAS. Examples are given of how R is used in government, telecom, insurance, finance, pharma, and by companies like ANZ bank, Bank of America, Facebook, and the Consumer Financial Protection Bureau. Key statistical concepts are also refreshed.
R is a programming language for statistical analysis and graphics. It is an open-source language developed by statisticians to allow for easy statistical analysis and visualization of data. The document provides an overview of R, discussing its origins, functionality, uses in data science, and popular packages and IDEs used with R. Examples are given of basic R syntax for vectors, matrices, data frames, plotting, and applying functions to data.
This presentation is an introduction to the R programming language. We will talk about the usage, history, data structures, and features of R.
This lecture series covers the use of the R language, its interface, and the functions required to evaluate financial risk models. It applies R to financial market data, risk measurement, modern portfolio theory, risk modeling of returns with generalized hyperbolic and lambda distributions, Value at Risk (VaR) modeling, extreme value methods and models, the class of ARCH and GARCH risk models, and portfolio optimization approaches.
This document contains a series of exercises to assess conceptual learning of the Java programming language. It includes exercises on primitive data types like short, int, double, and char. Exercises explore assigning values, arithmetic operators, trigonometry functions, converting between degrees and radians, creating and using String objects, and computing string lengths. The exercises are meant to test understanding of basic Java concepts and to find errors in programs by compiling code with invalid values or missing elements.
This document presents an Integrative Model for Parallelism (IMP) that aims to provide a unified treatment of different types of parallelism. It describes the key concepts of the IMP including the programming model using sequential semantics, the execution model using a data flow virtual machine, and the data model using distributions to describe data placement. It demonstrates the IMP concepts using a motivating example of 3-point averaging and discusses tasks, processes, and research opportunities around the IMP approach.
C & C++ Training Centre in Ambala! BATRA COMPUTER CENTRE (jatin batra)
Are you in search of C & C++ training in Ambala? Your search ends here: BATRA COMPUTER CENTRE provides training in computer basics, HTML, PHP, web designing,
web development, SEO, SMO, and many other courses.
A basic tutorial for R programming. This video covers:
- Agenda
- History
- Software paradigm
- The R interface
- Advantages of R
- Drawbacks of R
The document discusses the history and evolution of programming languages from the 1940s to present. It notes that early languages provided little abstraction from computer hardware, but that over time languages increasingly abstracted complexity and improved developer productivity. The document outlines the development of assembly languages, third generation languages like FORTRAN, and more modern paradigms like object-oriented programming. It also discusses influential ideas like structured programming and the "GOTO controversy" that aimed to improve programming practices.
This document provides an introduction to the SciPy Python library and its uses for scientific computing and data analysis. It discusses how SciPy builds on NumPy to provide functions for domains like linear algebra, integration, interpolation, optimization, statistics, and more. Examples are given of using SciPy for tasks like LU decomposition of matrices, sparse linear algebra, single and double integrals, line plots, and statistics. SciPy allows leveraging Python's simplicity for technical applications involving numerical analysis and data manipulation.
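To make the SciPy summary above concrete, here is a small, self-contained sketch of two of the tasks it mentions: LU decomposition and single/double integrals. The matrix and integrands below are illustrative choices, not taken from the original slides.

```python
import numpy as np
from scipy import integrate, linalg

# LU decomposition: factor A into permutation P, lower L, upper U so A = P @ L @ U
A = np.array([[4.0, 3.0], [6.0, 3.0]])
P, L, U = linalg.lu(A)

# Single definite integral: integral of x^2 over [0, 1] equals 1/3
val, err = integrate.quad(lambda x: x**2, 0.0, 1.0)

# Double integral of x*y over the unit square equals 1/4
# (dblquad integrates the inner variable y first; the callable takes (y, x))
val2, err2 = integrate.dblquad(lambda y, x: x * y, 0.0, 1.0, 0.0, 1.0)
```

This leans on SciPy's convention of building on NumPy arrays, which is the point the summary makes about leveraging Python's simplicity for numerical work.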
R is a widely used statistical programming language and software environment for statistical analysis and graphics. It includes over 6,700 packages and was originally based on S, which was developed in the 1970s. RStudio is a popular integrated development environment for R that provides a simpler interface. R supports object-oriented programming and there are many ways to perform the same tasks in R, such as calculating statistics, building models, and creating visualizations of data.
This document discusses effective data visualization using Microsoft R Open. It introduces why visualizing data is important, the benefits of using R and Microsoft R Open for data visualization, and how to get started. The document then provides examples of different types of graphs that can be created in R using the base graphics package as well as several other packages. It discusses choosing an appropriate graphics package and considerations for sizing and saving graphs. The examples focus on direct comparisons, distributions, trends over time, relationships, percentages, and special cases. Code and data are provided in appendices and online for reproducing the graphs.
The presentation is a brief case study of R Programming Language. In this, we discussed the scope of R, Uses of R, Advantages and Disadvantages of the R programming Language.
R is a programming language and free software environment for statistical analysis and graphics. It is widely used among statisticians and data scientists for developing statistical software and data analysis. Some key facts about R:
- It was created in the 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland.
- R can be used for statistical computing, machine learning, graphical display, and other tasks related to data analysis.
- It runs on Windows, Linux, and MacOS operating systems. Code written in R is cross-platform.
- R has a large collection of statistical and graphical techniques built-in, and users can extend its capabilities by downloading additional packages.
- Major
R is a programming language and software environment for statistical analysis and graphics. It was created by Ross Ihaka and Robert Gentleman in the early 1990s at the University of Auckland, New Zealand. Some key points:
- R can be used for statistical computing, machine learning, and data analysis. It is widely used among statisticians and data scientists.
- It runs on Windows, Mac OS, and Linux. The source code is published under the GNU GPL license.
- Popular companies like Facebook, Google, Microsoft, Uber and Airbnb use R for data analysis, machine learning, and statistical computing.
- R has a variety of data structures like vectors, matrices, arrays, lists
RStudio is an integrated development environment for R that allows users to i... (SWAROOP KUMAR K)
R is a widely used statistical programming language and software environment for statistical analysis and graphics. It includes over 6,700 packages and was originally based on S, which was developed at Bell Labs in the 1970s. RStudio is a popular integrated development environment for R that provides a simpler interface compared to using R alone. R is object-oriented and there are often multiple ways to perform the same task.
R is an open source statistical programming language developed from S at the University of Auckland in 1993. It is dynamically typed and treats vectors as first-class objects. Functions in R are also objects that can be assigned to variables. R has various options for binding scalars and vectors together into arrays and data frames for aggregate analysis. It also includes many built-in functions for numerical, statistical, and character manipulation of data.
This document discusses Julia, a programming language for technical computing that is similar to R but aims to provide both fast development and fast execution. It provides an overview of Julia's features, compares it to R, and gives an example of implementing a simple Gibbs sampler in both R and Julia. Julia code runs much faster than equivalent R code due to its just-in-time compilation, and it can also distribute computations across multiple processors. The document demonstrates Julia's syntax and shows how to define functions, templated methods, and new data types.
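The slides' Gibbs sampler code is not reproduced here, but R-vs-Julia comparisons of this kind typically used a small bivariate Gibbs sampler. As a hedged sketch, here is the same style of algorithm in Python; the target density and its full conditionals below are a standard textbook choice, not necessarily the one in the slides.

```python
import random

def gibbs(n_iter=1000, thin=10, seed=42):
    """Gibbs sampler for f(x, y) proportional to x^2 exp(-x y^2 - y^2 + 2y - 4x),
    alternating draws from the two full conditionals:
      x | y ~ Gamma(shape=3, rate=y^2 + 4)
      y | x ~ Normal(mean=1 / (1 + x), var=1 / (2 (1 + x)))
    """
    rng = random.Random(seed)
    x, y = 1.0, 1.0
    samples = []
    for _ in range(n_iter):
        for _ in range(thin):  # thinning: keep only every `thin`-th draw
            x = rng.gammavariate(3.0, 1.0 / (y * y + 4.0))  # scale = 1 / rate
            y = rng.gauss(1.0 / (1.0 + x), (0.5 / (1.0 + x)) ** 0.5)
        samples.append((x, y))
    return samples

samples = gibbs()
```

In Julia the equivalent loop is compiled to native code by the JIT, which is where the large speedups over interpreted R loops reported in the document come from.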
Without analytics on big data, companies are unable to understand their environment and customers, similar to how deer cannot see or hear approaching vehicles on a highway. Presentations are tools that can be used for lectures, reports, and more. They serve various purposes, making presentations powerful tools for convincing and teaching others. Data science uses techniques from multiple fields like mathematics, statistics, and computer science to analyze large amounts of data and extract meaningful insights for business.
This document provides an overview of the R programming language and environment. It discusses why R is useful, outlines its interface and workspace, describes how to access help and tutorials, install packages, and input/output data. The interactive nature of R is highlighted, where results from one function can be used as input for another.
Python is a language of choice for data analysis.
The aim of this slide deck is to provide a comprehensive learning path for people new to Python for data analysis, covering the steps you need to learn to use Python effectively.
R is a free and open-source programming language and software environment for statistical analysis, graphics, and statistical computing. It grew out of the S language, developed at Bell Laboratories by statistician John Chambers and colleagues. Key points about R include that it is an interpreted language, supports functional programming, and is object-oriented. R can be used for tasks like statistical analysis, data visualization, and machine learning. It has a large community of users and developers contributing packages for specialized analysis techniques.
The document discusses key concepts related to data structures and algorithms in C including:
1. Data structures allow for efficient storage and retrieval of data through logical organization and mathematical modeling.
2. Algorithms must be correct, finite, and efficient to solve problems by taking input and producing output through a defined sequence of steps.
3. Common data structures covered include arrays, stacks, queues, linked lists, trees, and graphs. Abstract data types allow separation of implementation from interface.
This document discusses R programming and compares it to Python. R is an open-source programming language commonly used for statistical analysis and visualization. It has many libraries that enable data analysis and machine learning. The document compares key aspects of R and Python, such as their creators, release years, software environments, usability, and pros and cons. It concludes that R is easy to learn and offers powerful graphics and statistical techniques through libraries, making it well-suited for data analysis applications.
The document discusses Python interview questions and answers related to Python fundamentals like data types, variables, functions, objects and classes. Some key points include:
- Python is an interpreted, interactive, and object-oriented programming language. It uses indentation to delimit code blocks rather than braces.
- Python supports dynamic typing where the type is determined at runtime. It is strongly typed meaning operations inappropriate for a type will fail with an exception.
- Common data types include lists (mutable), tuples (immutable), dictionaries, strings and numbers.
- Functions are defined with def; arguments are passed by object reference, and variables can have local or global scope.
- Classes use inheritance, polymorphism and encapsulation to create
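A few of the interview points above can be demonstrated in a short, self-contained snippet (the variable and function names are illustrative only):

```python
# Dynamic but strong typing: types are checked at runtime
x = 3
x = "three"            # rebinding a name to a different type is fine
try:
    "three" + 3        # mixing incompatible types raises instead of coercing
except TypeError:
    mixed = "TypeError"

# Mutable list vs immutable tuple
nums = [1, 2, 3]
nums.append(4)         # lists can be modified in place
point = (1, 2)
try:
    point[0] = 9       # tuples reject item assignment
except TypeError:
    immut = "TypeError"

# Dictionaries map keys to values
ages = {"ada": 36}
ages["alan"] = 41

# Functions use def; parameters may have defaults
def greet(name, punct="!"):
    return "Hello, " + name + punct
```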
This document discusses visualizing data in R using various packages and techniques. It introduces ggplot2, a popular package for data visualization that implements Wilkinson's Grammar of Graphics. Ggplot2 can serve as a replacement for base graphics in R and contains defaults for displaying common scales online and in print. The document then covers basic visualizations like histograms, bar charts, box plots, and scatter plots that can be created in R, as well as more advanced visualizations. It also provides examples of code for creating simple time series charts, bar charts, and histograms in R.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
1. Have you met Julia?
Tommaso Rigon
May 2, 2016
Tommaso Rigon Have you met Julia? May 2, 2016 1 / 25
2. Introduction
Which software are we more likely to use?
A non-comprehensive list
In statistics many programming languages can be used. One could use:
1 C / Fortran.
1 Low-level programming languages.
2 General purpose languages, but very efficient for numeric computing.
2 Python
1 Open source and general purpose language.
2 Widespread in industry and among computer scientists.
3 Matlab
1 Closed source (!)
2 Optimized for numerical computing, fast and clear linear algebra.
4 R
1 Open source: a lot of additional statistical packages are available.
2 R is developed by statisticians for statisticians.
3 Widespread among academics.
3. Introduction
A typical workflow in R
Suppose we are going to analyze a real dataset:
1 Data management. We need to read the data into R, from a text file or
from a database. We also need to arrange it in a convenient form (the
dplyr package is awesome!).
2 Data visualization. Visualize the data (See package ggplot2).
3 Statistical Modeling. First analyses are done using available packages.
4 Developing. We need to implement our new methodology.
5 Reporting. We need to communicate our results effectively, usually with
tables and graphs. (See the Markdown and knitr projects).
But...
1 The script is quickly developed in R, but it is often (very) slow. Sometimes
this precludes the use of the whole dataset.
2 The slow parts need to be written in C or Fortran and then interfaced to R
(the Rcpp package helps!)
4. Introduction
R is great but...
A vectorized language
1 The language encourages operating on whole objects (i.e. vectorized
code). However, some tasks (e.g. MCMC) cannot be easily vectorized.
2 Unvectorized R code (for and while loops) is slow.
A nested for loop compared to the same vectorized operation
system.time(for (i in 1:10^4) for (j in 1:10^3) runif(1))
# user system elapsed
# 17.689 0.000 17.528
system.time(runif(10^7))
# user system elapsed
# 0.410 0.000 0.424
5. Introduction
What is Julia?
Julia according to its developer
Julia is a high-level, high-performance dynamic programming language for
technical computing, with syntax that is familiar to users of other technical
computing environments.
Julia in a nutshell
Julia was released recently, in 2012, by Jeff Bezanson, Stefan Karpinski, Viral
Shah and Alan Edelman. The latest stable version is 0.4.5.
1 Open-source, with MIT liberal license.
2 High-level and familiar. It can work on the level of vectors, matrices, arrays.
The syntax is similar to Matlab and R and easy to read without a huge effort.
3 Technical computing. It is specifically optimized for scientific computing,
not necessarily statistics.
6. Introduction
Why Julia?
Julia in a nutshell - Technical details
1 Julia has a REPL (read–eval–print loop). Exactly as in R, it is possible to
interact with the software, facilitating debugging, testing and developing.
Conversely, languages like C usually follow an edit–compile–run cycle.
2 Based on a sophisticated JIT (just-in-time) compiler built on LLVM.
3 Julia is fast. Its compiler is designed to approach the speed of C.
4 No need to vectorize code for performance; devectorized code is fast.
5 Efficient support for Unicode, including but not limited to UTF-8. It means,
for instance, that
µ = 10; σ = 5
are legitimate assignments in Julia.
6 Designed for parallelism and distributed computation
7. Introduction
Packages available
Julia for statistics
These are some useful packages for statistical computing
1 Distributions. Probability distributions and associated functions (similar but
not equal to the d-p-q-r system in R).
2 DataFrames. For handling datasets, possibly containing missing values.
3 GLM. Generalized linear models, including linear model.
4 StatsBase. Basic descriptive functions: sample mean, median, sample
variance...
5 ...and many others!
Julia and R integration
1 rjulia. An R package that calls Julia functions and imports / exports objects
between the two environments. Currently under development, available only
on GitHub.
2 RCall. As the name itself suggests, it calls R from Julia.
8. Introduction
Speeding up computations
Do we really need such fast and powerful tools?
In many cases, we do not. Suppose our implementation is badly written and
inefficient, but it takes about 1 second to execute. Is it worth improving
the code?
Where is efficient computation really necessary?
Just to mention some areas among others:
1 In almost any procedure applied to huge datasets (even linear models!)
2 In any procedure which involves cross-validation (both Lasso and CART
often use CV for model selection).
3 In “bootstrap-like” procedures (bootstrap, bagged trees, ...).
4 In Bayesian statistics, in approximating the posterior distribution through
simulations (e.g. MCMC, Importance sampling, ABC,...).
5 A combination of the previous.
9. Bootstrap example
Bootstrap example
What is the bootstrap? A (very) brief explanation
1 It is an inferential technique which (usually!) makes use of simulation. For
instance, in a frequentist framework, it can be used to construct confidence
intervals.
2 Let θ̂(Y) be an estimator of θ, with Yi ∼ F i.i.d. random vectors for
i = 1, . . . , n. The “true” c.d.f. F is replaced with an estimate F̂. Then, we
simulate Y*r from F̂ for r = 1, . . . , R and we get

θ̂*1, . . . , θ̂*R, where θ̂*r = θ̂(Y*r),

which is a bootstrap sample of the estimator.
3 The bootstrap sample can be used to make inference on θ. This is usually
the main goal, but here we are mainly interested in simulating it quickly.
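The resampling scheme above can be sketched in a few lines. The following Python snippet is an illustration, not part of the original slides: the estimator (the sample mean) and the data vector are arbitrary placeholder choices.

```python
import numpy as np

def bootstrap(estimator, y, R, rng=None):
    """Return R bootstrap replicates of estimator(y).

    Resamples the n observations with replacement, i.e. the
    nonparametric bootstrap, where F-hat is the empirical c.d.f.
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    # For each replicate, draw n indices with replacement
    # and re-apply the estimator to the resampled data.
    return np.array([estimator(y[rng.integers(0, n, size=n)])
                     for _ in range(R)])

# Example: bootstrap distribution of the sample mean
y = np.array([2.1, 3.4, 1.9, 4.2, 2.8, 3.1])
boot = bootstrap(np.mean, y, R=1000, rng=0)
```

Each entry of `boot` is the estimator evaluated on one resampled dataset; the spread of these values approximates the sampling variability of the estimator.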
10. Bootstrap example
Inference on the correlation coefficient
An example, using the “cars” dataset
I have considered the dataset cars available in R. Suppose that Yi = (Y1i , Y2i )
are i.i.d. I would like to make inference on the correlation coefficient

ρ̂ = Ĉov(Y1, Y2) / √( V̂ar(Y1) V̂ar(Y2) ),

using the so-called nonparametric bootstrap, that is, F̂ is replaced by the empirical
distribution function. This operation can be vectorized and has been done both
in Julia and in R, for comparison.
In practice...
We need to “resample” the original data, with replacement. Then, we evaluate
the correlation coefficient for each bootstrap sample.
11. Bootstrap example
Implementation
Listing 1: Bootstrap; R implementation
rho_boot <- function(R,dataset){
n <- NROW(dataset)
# Sampling the indexes
index <- matrix(sample(1:n,R*n,replace=TRUE),R,n)
# Bootstrap correlation estimate
apply(index,1,function(x) cor(dataset[x,1],dataset[x,2]) )
}
Listing 2: Bootstrap; Julia implementation
function rho_boot(R,dataset)
n = size(dataset)[1]
# Sampling the indexes
index = rand(1:n,n,R)
out = Array(Float64,R)
for i in 1:R
# Bootstrap correlation estimate
out[i] = cor(dataset[index[:,i],:])[1,2]
end
out
end
12. Bootstrap example
Performance - Milliseconds in log-scale
[Figure: boxplot of execution times (milliseconds, log scale) for Naive_R, Julia, Boot_library and Boot_library_cor2]
13. Bootstrap example
Global Performance
And the winner is...
1 The R code is vectorized and therefore we expect a good performance.
2 Despite this, for this particular problem Julia is ≈ 10 times faster than R. This
is true even if we use the boot package.
Speeding up the R code
The bottleneck of the computation is the R cor function. It is designed for the
evaluation of an entire correlation matrix. It also checks for missing values before
performing the calculation. Therefore, we can easily improve the code by defining
the function cor2. Now, Julia is ≈ 5 times faster than R.
cor2 <- function(x,y) {
xbar <- x-mean(x)
ybar <- y-mean(y)
sum(xbar*ybar)/sqrt(sum(xbar^2)*sum(ybar^2))
}
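The same trick carries over to other languages. As a sketch (Python here rather than R, and not from the original slides), computing Pearson's correlation directly from centered sums skips the matrix machinery and missing-value checks of a general-purpose routine, while agreeing with the library result:

```python
import numpy as np

def cor2(x, y):
    """Pearson correlation of two vectors, computed directly from
    centered sums, with no missing-value checks or matrix overhead."""
    xbar = x - x.mean()
    ybar = y - y.mean()
    return np.sum(xbar * ybar) / np.sqrt(np.sum(xbar**2) * np.sum(ybar**2))

# Small placeholder vectors to check against the library routine
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
r = cor2(x, y)
```

Inside a tight bootstrap loop, shaving this per-call overhead is exactly what closed part of the gap with Julia.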
14. Bootstrap example
Bootstrap final result
[Figure: estimated bootstrap density of the correlation coefficient, roughly supported on (0.6, 0.9)]
15. Bootstrap example
Principal component analysis
Notation about PCA
Let yi = (yi1, . . . , yip), for i = 1, . . . , n, be i.i.d. realizations from a random vector
having covariance matrix Σ. Let Σ̂ be the sample covariance matrix and R the
related correlation matrix. The spectral decomposition of R is denoted as follows:

R = GΛGᵀ,  Λ = Diag(λ1, . . . , λp),  λ1 > λ2 > · · · > λp
The quantity of interest
The quantity of interest is the cumulative percentage of the “total variance”
explained by the first k principal components:

τ̂k = (λ1 + · · · + λk) / (λ1 + · · · + λp) = (λ1 + · · · + λk) / p,

where the second equality holds because the eigenvalues of a correlation matrix
sum to p.
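As a quick numerical check (a Python sketch, not from the slides; the data matrix is an arbitrary placeholder), the eigenvalues of a correlation matrix sum to p, so τ̂k can be computed with either denominator:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))         # n = 100 observations, p = 4 variables
R = np.corrcoef(X, rowvar=False)      # sample correlation matrix
lam = np.sort(np.linalg.eigvalsh(R))[::-1]  # eigenvalues, decreasing order

p = len(lam)
k = 2
# Cumulative share of "total variance" carried by the first k components
tau_k = lam[:k].sum() / lam.sum()
```

Since trace(R) = p (unit diagonal), `lam.sum()` equals `p` up to rounding, which is why dividing by p is equivalent.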
16. Bootstrap example
The iris dataset
A famous example
The iris dataset was considered just for illustrative purposes. We would like to
assess the variability of the quantity

τ̂1 = λ1 / p,

using a nonparametric bootstrap approach. The quantity τ̂1 is the relative
importance of the first principal component. Without the bootstrap, it would be
difficult to assess the variability of this estimate. Also, notice that we are not
assuming a specific parametric family of distributions for Y.
17. Bootstrap example
Implementation
function tau_est(data)
R = cor(data)
lambda = eigvals(R)
tau = (lambda/sum(lambda))[end] # Also lambda[end]/p is fine
tau
end
function pca_boot(R,data)
n = size(data)[1]
index = rand(1:n,n,R)
out = Array(Float64,R)
for i in 1:R
out[i] = tau_est(data[index[:,i],:])
end
out
end
18. Bootstrap example
Bootstrap final result
[Figure: estimated bootstrap density of the explained variance τ̂1, roughly supported on (0.70, 0.78)]
19. Bayesian statistics with Julia
A Bayesian logistic regression
The “shuttle” dataset
I have considered the famous “shuttle” dataset, having sample size n = 23. We
assume the following Bayesian logistic regression:

Yi ∼ Bin(6, θi),  θi = 1 / (1 + e^(−ηi)),  ηi = β0 + β1 xi ,

where the xi are known constants and i = 1, . . . , n. Moreover, let βj ∼ N(0, σµ²),
j = 0, 1, be the prior distributions, with σµ² a hyperparameter.
MCMC posterior computation
I have approximated the posterior distribution of β | Y using a Metropolis
algorithm. I have used a multivariate Gaussian random walk as proposal
distribution, with covariance matrix equal to the inverse of the observed
information.
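The sampler just described follows the standard random-walk Metropolis recipe. As a language-neutral sketch (Python here, not the original Julia listing; the univariate standard-normal target and the step size are placeholder choices), the accept/reject logic looks like:

```python
import numpy as np

def metropolis(logpost, start, R, step=1.0, rng=None):
    """Random-walk Metropolis: Gaussian proposal centered at the
    current state; accept with probability exp(logpost* - logpost)."""
    rng = np.random.default_rng(rng)
    x = start
    lp = logpost(x)
    out = np.empty(R)
    for r in range(R):
        x_star = x + step * rng.normal()           # random-walk proposal
        lp_star = logpost(x_star)
        if rng.uniform() < np.exp(lp_star - lp):   # Metropolis acceptance
            x, lp = x_star, lp_star                # accept; otherwise keep x
        out[r] = x                                 # store current state
    return out

# Placeholder target: standard normal log-density, up to a constant
chain = metropolis(lambda x: -0.5 * x**2, start=0.0, R=20000, rng=1)
```

The Julia version on the next slide is the same loop with a multivariate proposal and the logistic log-posterior plugged in as the target.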
20. Bayesian statistics with Julia
First step: the log-posterior
Listing 3: Julia implementation
using Distributions
# Log-likelihood
function loglik(data::Matrix, beta::Vector)
eta = beta[1] + beta[2]*data[:,3]
theta = 1./(1 + exp(- eta))
sum(data[:,2].*eta) + sum(data[:,1].*log(1-theta))
end
# Log-posterior up to an additive constant
function lpost(data::Matrix, beta::Vector, sigma_mu::Float64)
norm = Normal(0,sigma_mu)
loglik(data,beta) + logpdf(norm,beta[1]) + logpdf(norm,beta[2])
end
21. Bayesian statistics with Julia
Metropolis Algorithm
Listing 4: Julia implementation
using Optim # For numerical optimization
using ForwardDiff # For numerical derivative
# Maximum likelihood estimate
beta_hat = optimize(x -> -loglik(data,x),[0.0, 0.0], method=:l_bfgs).minimum
# Observed information matrix
Sigma = inv(ForwardDiff.hessian(x -> -loglik(data,x), beta_hat))
Listing 5: Julia implementation
function Metropolis(R::Int64, Sigma::Matrix, sigma_mu::Float64,start::Vector)
out = zeros(R,2)
beta = start #Initialization
for r in 1:R
beta_star = rand(MvNormal(beta,Sigma)) # Proposal distribution
alpha = exp(lpost(data,beta_star,sigma_mu) - lpost(data,beta,sigma_mu))
if rand(1)[1] < alpha # ‘rand’ is a pseudo random from a Uniform
beta = copy(beta_star) # Copy if accepted
end
out[r,:] = beta
end
out
end
22. Bayesian statistics with Julia
Performance - Milliseconds in log-scale
[Figure: boxplot of execution times (milliseconds, log scale) for R, Julia, OpenBUGS and STAN]
23. Bayesian statistics with Julia
Global performance
Julia now really shines!
1 For this particular problem Julia is ≈ 20 times faster than R. In fact, the for
loop is used extensively and there is no way to vectorize this operation.
2 Also, Julia is ≈ 13 times faster than OpenBUGS. However, OpenBUGS does
not necessarily use our Gaussian random walk, but tries to select the “best”
way to do MCMC according to its own criteria. Therefore, a fair comparison
should take into account, at the very least, the autocorrelation of the
sampled chain.
3 Finally, Julia is ≈ 10 times faster than STAN but, as for OpenBUGS, we should
be careful when making a direct comparison.
24. Bayesian statistics with Julia
Bayesian logistic final result
[Figure: approximate joint posterior of (β0, β1); β0 roughly in (−5, 15), β1 roughly in (−0.3, 0.0)]
25. References
Some references about Julia
1 http://julialang.org/
2 http://docs.julialang.org/en/release-0.4/
3 http://distributionsjl.readthedocs.org/en/latest/