
- 1. Have you met Julia? Tommaso Rigon, May 2, 2016
- 2. Introduction — Which software are we most likely to use? A non-comprehensive list
  In statistics, many programming languages can be used. One could use:
  1. C / Fortran: low-level programming languages; general purpose, but very efficient for numerical computing.
  2. Python: an open-source, general-purpose language, widespread in industry and among computer scientists.
  3. Matlab: closed source (!); optimized for numerical computing, with fast and clear linear algebra.
  4. R: open source, with a lot of additional statistical packages available; developed by statisticians for statisticians; widely used among academics.
- 3. Introduction — A typical workflow in R
  Suppose we are going to analyze a real dataset:
  1. Data management: read the data into R, from a text file or from a database, and arrange it in a convenient form (the dplyr package is awesome!).
  2. Data visualization: visualize the data (see the ggplot2 package).
  3. Statistical modeling: first analyses are done using available packages.
  4. Developing: implement our new methodology.
  5. Reporting: communicate our results effectively, usually with tables and graphs (see the Markdown and knitr projects).
  But...
  1. The script is quickly developed in R, but it is often (very) slow. Sometimes this precludes using the whole dataset.
  2. The slow parts need to be written in C or Fortran and then interfaced to R (the Rcpp package helps!).
- 4. Introduction — R is great but... it is a vectorized language
  1. The language encourages operating on whole objects (i.e. vectorized code). However, some tasks (e.g. MCMC) cannot be easily vectorized.
  2. Unvectorized R code (for and while loops) is slow.
  A nested for loop compared to the same vectorized operation:

  ```r
  system.time(for (i in 1:10^4) for (j in 1:10^3) runif(1))
  #   user  system elapsed
  # 17.689   0.000  17.528
  system.time(runif(10^7))
  #   user  system elapsed
  #  0.410   0.000   0.424
  ```
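For contrast, the same kind of plain, devectorized loop runs fast in Julia. A minimal sketch (the function name is illustrative and absolute timings vary by machine, so none are quoted):

```julia
# A plain nested loop: the pattern that is slow in R
function draw_loop()
    s = 0.0
    for i in 1:10^4, j in 1:10^3
        s += rand()   # 10^7 uniform draws, one at a time
    end
    s
end

t_loop = @elapsed draw_loop()       # devectorized version
t_vec  = @elapsed sum(rand(10^7))   # "vectorized" equivalent
```

On a warmed-up session the two timings are of the same order of magnitude, which is exactly the point: in Julia the loop does not need to be vectorized away.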
- 5. Introduction — What is Julia?
  Julia according to its developers: "Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments."
  Julia in a nutshell: Julia was released recently, in 2012, by Jeff Bezanson, Stefan Karpinski, Viral Shah and Alan Edelman. The latest stable version is 0.4.5.
  1. Open source, with a liberal MIT license.
  2. High-level and familiar: it can work at the level of vectors, matrices and arrays. The syntax is similar to Matlab and R, and easy to read without a huge effort.
  3. Technical computing: it is specifically optimized for scientific computing, not necessarily statistics.
- 6. Introduction — Why Julia? Julia in a nutshell: technical details
  1. Julia has a REPL (read–eval–print loop). Exactly as in R, it is possible to interact with the software, which facilitates debugging, testing and developing. Conversely, languages like C usually follow an edit–compile–run cycle.
  2. It is based on a sophisticated compiler, which is JIT (just-in-time) and LLVM-based.
  3. Julia is fast: its compiler is designed to approach the speed of C.
  4. No need to vectorize code for performance; devectorized code is fast.
  5. Efficient support for Unicode, including but not limited to UTF-8. It means, for instance, that μ = 10; σ = 5 are legitimate assignments in Julia.
  6. Designed for parallelism and distributed computation.
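Point 5 can be seen directly at the REPL; a small sketch (the variable names are just illustrative):

```julia
# Unicode identifiers are ordinary variable names in Julia
μ = 10             # typed as \mu<TAB> at the REPL
σ = 5              # typed as \sigma<TAB>
z = (12 - μ) / σ   # a standardization, written the way it is read
```

This lets code mirror the notation of the paper it implements, which is handy in statistics.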
- 7. Introduction — Packages available: Julia for statistics
  These are some useful packages for statistical computing:
  1. Distributions: probability distributions and associated functions (similar, but not identical, to the d-p-q-r system in R).
  2. DataFrames: for handling datasets, possibly with missing values.
  3. GLM: generalized linear models, including the linear model.
  4. StatsBase: basic descriptive functions: sample mean, median, sample variance...
  5. ...and many others!
  Julia and R integration:
  1. rjulia: an R package that calls Julia functions and imports/exports objects between the two environments. Currently under development, available only on GitHub.
  2. RCall: as the name suggests, it calls R from Julia.
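As a taste of Distributions, the R calls dnorm, pnorm, qnorm and rnorm correspond roughly to the following (a sketch, assuming the Distributions package is installed):

```julia
using Distributions

d = Normal(0, 1)        # a standard Gaussian, as an object
pdf(d, 0.0)             # density: like dnorm(0)
cdf(d, 1.96)            # c.d.f.: like pnorm(1.96)
quantile(d, 0.975)      # quantile: like qnorm(0.975)
rand(d, 10)             # ten random draws: like rnorm(10)
```

The difference from R's d-p-q-r naming is that the distribution is a first-class object, and the same generic functions (pdf, cdf, quantile, rand) work for every distribution in the package.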
- 8. Introduction — Speeding up computations
  Do we really need such fast and powerful tools? In many cases, we do not. Suppose our implementation is badly written and inefficient, but takes about 1 second to execute. Is it worth improving the code?
  Where are efficient computations really necessary? Just to mention some areas among others:
  1. In almost any procedure applied to huge datasets (even linear models!).
  2. In any procedure that involves cross-validation (both the lasso and CART often use CV for model selection).
  3. In "bootstrap-like" procedures (bootstrap, bagged trees, ...).
  4. In Bayesian statistics, when approximating the posterior distribution through simulation (e.g. MCMC, importance sampling, ABC, ...).
  5. A combination of the previous.
- 9. Bootstrap example — What is the bootstrap? A (very) brief explanation
  1. It is an inferential technique which (usually!) makes use of simulation. For instance, in a frequentist framework, it can be used for constructing confidence intervals.
  2. Let θ̂(Y) be an estimator of θ, with Y_i ~ F i.i.d. random vectors, for i = 1, ..., n. The "true" c.d.f. F is replaced with an estimate F̂. Then we simulate Y*ʳ from F̂ for r = 1, ..., R and obtain θ̂*_1, ..., θ̂*_R, where θ̂*_r = θ̂(Y*ʳ), which is a bootstrap sample of the estimator.
  3. The bootstrap sample can be used to make inference on θ. This is usually the main goal, but here we are mainly interested in simulating it quickly.
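The recipe in point 2 — resample from F̂, re-estimate, repeat R times — can be sketched for the simplest estimator, the sample mean (an illustration only, separate from the slides' correlation example; modern Julia syntax is used here). Under the nonparametric bootstrap, simulating from F̂ just means resampling the data with replacement:

```julia
using Statistics   # for mean (a standard library in Julia >= 0.7)

# Nonparametric bootstrap of the sample mean
function boot_mean(y::Vector{Float64}, R::Int)
    n = length(y)
    est = Array{Float64}(undef, R)    # Array(Float64, R) in Julia 0.4
    for r in 1:R
        ystar = y[rand(1:n, n)]   # Y*ʳ: n indices drawn with replacement
        est[r] = mean(ystar)      # θ̂*_r = θ̂(Y*ʳ)
    end
    est
end
```

The returned vector is the bootstrap sample θ̂*_1, ..., θ̂*_R, from which standard errors or percentile intervals can be read off.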
- 10. Bootstrap example — Inference on the correlation coefficient
  An example, using the "cars" dataset: I have considered the dataset cars, available in R. Suppose that the Y_i = (Y_1i, Y_2i) are i.i.d. I would like to make inference on the correlation coefficient
    ρ̂ = Ĉov(Y_1, Y_2) / sqrt( V̂ar(Y_1) V̂ar(Y_2) ),
  using the so-called nonparametric bootstrap, that is, F̂ is replaced by the empirical distribution function. This operation can be vectorized and has been implemented both in Julia and in R, for comparison.
  In practice... we need to "resample" the original data, with replacement. Then, we evaluate the correlation coefficient for each bootstrap sample.
- 11. Bootstrap example — Implementation

  Listing 1: Bootstrap; R implementation

  ```r
  rho_boot <- function(R, dataset) {
    n <- NROW(dataset)
    # Sampling the indexes
    index <- matrix(sample(1:n, R * n, replace = TRUE), R, n)
    # Bootstrap correlation estimate
    apply(index, 1, function(x) cor(dataset[x, 1], dataset[x, 2]))
  }
  ```

  Listing 2: Bootstrap; Julia implementation

  ```julia
  function rho_boot(R, dataset)
      n = size(dataset)[1]  # was size(data)[1]: 'data' is not defined here
      # Sampling the indexes
      index = rand(1:n, n, R)
      out = Array(Float64, R)  # Julia 0.4 syntax
      for i in 1:R
          # Bootstrap correlation estimate
          out[i] = cor(dataset[index[:, i], :])[1, 2]
      end
      out
  end
  ```
- 12. Bootstrap example — Performance (milliseconds, log scale)
  [Figure: execution-time distributions for Naive_R, Julia, Boot_library and Boot_library_cor2, roughly spanning 50–1000 ms.]
- 13. Bootstrap example — Global performance
  And the winner is...
  1. The R code is vectorized and therefore we expect good performance.
  2. Despite this, for this particular problem Julia is ≈ 10 times faster than R. This holds even if we use the boot package.
  Speeding up the R code: the bottleneck of the calculation is R's cor function. It is designed to evaluate an entire correlation matrix, and it also checks for missing values before performing the calculation. Therefore, we can easily improve the code by defining the function cor2. Now, Julia is ≈ 5 times faster than R.

  ```r
  cor2 <- function(x, y) {
    xbar <- x - mean(x)
    ybar <- y - mean(y)
    sum(xbar * ybar) / sqrt(sum(xbar^2) * sum(ybar^2))
  }
  ```
- 14. Bootstrap example — Bootstrap final result
  [Figure: estimated bootstrap density of the correlation coefficient, concentrated between roughly 0.6 and 0.9.]
- 15. Bootstrap example — Principal component analysis
  Notation about PCA: let y_i = (y_i1, ..., y_ip), for i = 1, ..., n, be i.i.d. realizations of a random vector with covariance matrix Σ. Let Σ̂ be the sample covariance matrix and R the related correlation matrix. The spectral decomposition of R is denoted as
    R = G Λ Gᵀ,  Λ = diag(λ_1, ..., λ_p),  λ_1 > λ_2 > ... > λ_p.
  The quantity of interest is the cumulative percentage of the "total variance" explained by the first k principal components:
    τ̂_k = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_p) = (1/p) (λ_1 + ... + λ_k),
  where the second equality holds because the eigenvalues of a correlation matrix sum to p.
- 16. Bootstrap example — The iris dataset
  A famous example: the iris dataset is considered just for illustrative purposes. We would like to assess the variability of the quantity τ̂_1 = λ_1 / p using a nonparametric bootstrap approach. The quantity τ̂_1 is the relative importance of the first principal component. Without the bootstrap, it would be difficult to assess the variability of this estimate. Also, notice that we are not assuming any specific parametric family of distributions for Y.
- 17. Bootstrap example — Implementation

  ```julia
  function tau_est(data)
      R = cor(data)
      lambda = eigvals(R)  # eigenvalues in ascending order
      tau = (lambda / sum(lambda))[end]  # Also lambda[end]/p is fine
      tau
  end

  function pca_boot(R, data)
      n = size(data)[1]
      index = rand(1:n, n, R)
      out = Array(Float64, R)  # Julia 0.4 syntax
      for i in 1:R
          out[i] = tau_est(data[index[:, i], :])
      end
      out
  end
  ```
- 18. Bootstrap example — Bootstrap final result
  [Figure: estimated bootstrap density of the explained variance τ̂_1, concentrated between roughly 0.70 and 0.78.]
- 19. Bayesian statistics with Julia — A Bayesian logistic regression
  The "shuttle" dataset: I have considered the famous "shuttle" dataset, with sample size n = 23. We assume the following Bayesian logistic regression:
    Y_i ~ Bin(6, θ_i),  θ_i = 1 / (1 + e^(−η_i)),  η_i = β_0 + β_1 x_i,
  where the x_i are known constants and i = 1, ..., n. Moreover, let β_j ~ N(0, σ²_μ), j = 0, 1, be the prior distributions, with σ²_μ a hyperparameter.
  MCMC posterior computation: I have approximated the posterior distribution of β | Y using a Metropolis algorithm. I have used a multivariate Gaussian random walk as proposal distribution, with covariance matrix equal to the inverse of the observed information.
- 20. Bayesian statistics with Julia — First step: the log-posterior

  Listing 3: Julia implementation

  ```julia
  using Distributions

  # Log-likelihood
  function loglik(data::Matrix, beta::Vector)
      eta = beta[1] + beta[2] * data[:, 3]
      theta = 1 ./ (1 + exp(-eta))  # elementwise, Julia 0.4 syntax
      sum(data[:, 2] .* eta) + sum(data[:, 1] .* log(1 - theta))
  end

  # Log-posterior up to an additive constant
  function lpost(data::Matrix, beta::Vector, sigma_mu::Float64)
      norm = Normal(0, sigma_mu)
      loglik(data, beta) + logpdf(norm, beta[1]) + logpdf(norm, beta[2])
  end
  ```
- 21. Bayesian statistics with Julia — Metropolis algorithm

  Listing 4: Julia implementation

  ```julia
  using Optim        # For numerical optimization
  using ForwardDiff  # For numerical derivatives

  # Maximum likelihood estimate
  beta_hat = optimize(x -> -loglik(data, x), [0.0, 0.0], method = :l_bfgs).minimum

  # Proposal covariance: inverse of the observed information matrix
  Sigma = inv(ForwardDiff.hessian(x -> -loglik(data, x), beta_hat))
  ```

  Listing 5: Julia implementation

  ```julia
  function Metropolis(R::Int64, Sigma::Matrix, sigma_mu::Float64, start::Vector)
      out = zeros(R, 2)
      beta = start  # Initialization
      for r in 1:R
          beta_star = rand(MvNormal(beta, Sigma))  # Draw from the proposal
          alpha = exp(lpost(data, beta_star, sigma_mu) - lpost(data, beta, sigma_mu))
          if rand() < alpha  # 'rand()' draws a Uniform(0, 1) pseudo-random number
              beta = copy(beta_star)  # Accept the proposal
          end
          out[r, :] = beta
      end
      out
  end
  ```
- 22. Bayesian statistics with Julia — Performance (milliseconds, log scale)
  [Figure: execution-time distributions for R, Julia, OpenBUGS and STAN, roughly spanning 100–2000 ms.]
- 23. Bayesian statistics with Julia — Global performance
  Julia now really shines!
  1. For this particular problem, Julia is ≈ 20 times faster than R. In fact, the for loop is used extensively and there is no way to vectorize this operation.
  2. Also, Julia is ≈ 13 times faster than OpenBUGS. However, OpenBUGS does not necessarily use our Gaussian random walk, but tries to select the "best" way to do MCMC according to its own criteria. Therefore, a fair comparison should take into account, at the very least, the autocorrelation of the sampled chain.
  3. Finally, Julia is ≈ 10 times faster than STAN but, as for OpenBUGS, we should be careful about making a direct comparison.
- 24. Bayesian statistics with Julia — Bayesian logistic final result
  [Figure: MCMC sample from the joint posterior of (β_0, β_1).]
- 25. References — Some references about Julia
  1. http://julialang.org/
  2. http://docs.julialang.org/en/release-0.4/
  3. http://distributionsjl.readthedocs.org/en/latest/