Renjin @ TU Dortmund

Presentation of the Renjin project to the Computer Science Department at TU Dortmund

  • Most benchmarks are from http://r.research.att.com/benchmarks/ (see the SVN repo for the source of all benchmarks @ r476). Run with 3 warmup runs (not timed) and 5 timed runs, using ATLAS BLAS libs on Ubuntu.

Renjin @ TU Dortmund Presentation Transcript

  • Renjin
      Alexander Bertram, bedatadriven
      TU Dortmund, Department of Computer Science
      20.01.2012
  • Agenda
      • Brief intro to R
      • Motivation for Renjin
      • Renjin’s Design
      • Performance
      • Optimization and the Compiler
  • Brief Intro to R
      • Lingua franca for statistical computing
      • R is used by ~250,000 analysts worldwide (some say up to 2 million)
      • Over 3,500 contributed packages in a wide variety of specializations
      • Most new statistical techniques are published with R source code
  • Motivations for Developing Renjin
      • The existing R interpreter is an excellent tool for ad hoc analysis. However:
      • It is difficult to implement certain use cases specific to our consulting business:
          ▫ Incorporate our R scripts into larger applications for clients (e.g. BI tools)
          ▫ Make our R scripts available via a web interface
          ▫ Run user-provided scripts in a sandbox
          ▫ Develop SaaS tools based on R
  • Existing Interpreter
      • Developed in C
      • Extensive use of globals; one interpreter per process
      • No layer of abstraction between data and algorithms; all code operates directly on pointers
  • Opportunities in the JVM
      • Growth of Platforms-as-a-Service: Google AppEngine, Heroku, Amazon Beanstalk
      • Big Data frameworks: Hadoop, Mahout
      • State-of-the-art VM: GC, JIT, etc.
  • Renjin Design Principles
      • Performance is best achieved through well-designed abstraction, not hand-coded assembly
      • Focus on core statistical functionality and delegate to state-of-the-art implementations in other fields, e.g.:
          ▫ Garbage collection
          ▫ Character encoding
          ▫ Database access
          ▫ Web servers
  • R developers have already revolutionized statistical computing: is it fair to expect them to develop best-of-breed garbage collectors, VMs, web servers, and database systems?
  • Renjin Implementation
      • Parser is ported directly (via Bison-Java)
      • Primitive functions (700+) mostly rewritten into natural Java/OO style
      • Extensive unit test coverage enables experimentation
  • Renjin Compared
      • Others are seeking to overcome similar shortcomings in R with other approaches:
          ▫ RevoR
          ▫ Bigmemory
      • Renjin, in contrast, attempts to fix problems in the core
          ▫ Pros: bigger opportunities in the long term
          ▫ Cons: riskier; incompatibilities with packages written in C
  • Examples of Extension Points (diagram)
      • R File Functions → Apache VFS: OS, Hadoop DFS, Amazon S3
      • Math & Data Functions → Renjin Vector API: memory-backed, rolling-buffer-backed, JDBC
  • Primitive implementations
      • Primitive implementation is the bulk of the work
      • Uses declarative annotations in combination with code generation; several advantages:
          ▫ Boilerplate is auto-generated
          ▫ Optimizations can be applied globally
          ▫ Provides information to the compiler about types
  • Primitive Annotations
      @Primitive("==")
      @Recycle
      public static boolean equalTo(double x, double y) {
        return x == y;
      }

      @Primitive("==")
      @Recycle(false)
      public static boolean equalTo(Symbol x, Symbol y) {
        return x == y;
      }

      @Primitive("==")
      @Recycle
      public static boolean equalTo(String x, String y) {
        return x.equals(y);
      }
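The slides do not show what the code generator emits for a `@Recycle` primitive. As a rough sketch only, assuming that recycling means applying the scalar function element-wise while cycling through the shorter argument (as R does for `==`), the generated boilerplate might resemble the following. The class name `RecycleSketch` and the method `equalToRecycled` are hypothetical and not part of Renjin.

```java
import java.util.Arrays;

class RecycleSketch {

  // The scalar primitive, written the way the slide shows it.
  static boolean equalTo(double x, double y) {
    return x == y;
  }

  // Hypothetical stand-in for generated boilerplate: applies the scalar
  // function element-wise, recycling the shorter argument.
  static boolean[] equalToRecycled(double[] x, double[] y) {
    if (x.length == 0 || y.length == 0) {
      return new boolean[0]; // a zero-length operand gives a zero-length result
    }
    int n = Math.max(x.length, y.length);
    boolean[] result = new boolean[n];
    for (int i = 0; i < n; i++) {
      result[i] = equalTo(x[i % x.length], y[i % y.length]);
    }
    return result;
  }

  public static void main(String[] args) {
    // Roughly c(1, 2, 1, 2) == 1 in R
    boolean[] r = equalToRecycled(new double[] {1, 2, 1, 2}, new double[] {1});
    System.out.println(Arrays.toString(r)); // prints [true, false, true, false]
  }
}
```

Keeping the scalar function free of looping logic is what would let such a wrapper, and any optimization of it, be applied uniformly to every annotated primitive.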
  • Aside: Data structures
      • R has several OO systems:
          ▫ S3 function dispatch
          ▫ S4 objects
          ▫ Rproto
      • However, most R packages simply reuse base vector & list types to organize data
          ▫ Pro: high degree of interoperability
          ▫ Con: can be difficult to organize large systems
  • Aside: R Data Structures
      • Atomic vectors: null, logical, integer, double, complex, character, raw
      • Lists: null, list, expression, pairlist, language (function call), dotexp (list of promises)
      • Other: symbol, environment* (variable storage for a lexical scope, can be reused as a map), promise
      • All R values can have attributes; some are "special":
          ▫ class: determines function dispatch
          ▫ names: gives names to list elements
          ▫ dim: lends a matrix/array shape to vectors/lists
      • * mutable
  • Aside: Example of R data structure: data.frame
      x <- 1:100
      y <- 100:1
      frame <- list(x, y)
      attr(frame, "class") <- "data.frame"
      attr(frame, "names") <- c("x", "y")
  • Aside: Data structures
      • Renjin defines these data structures as interfaces
      • Currently, the only implementations are array-backed, but this design opens the possibility of alternate implementations:
          ▫ Backed by a database cursor
          ▫ Rolling buffer over a text file, etc.
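To make the interface idea concrete, here is a minimal sketch, not Renjin's actual vector API. The names `DoubleVec`, `ArrayVec` and `RangeVec` are invented for illustration, and the computed range is only a simple stand-in for the cursor-backed or buffer-backed implementations the slide mentions.

```java
class VectorSketch {

  // Hypothetical minimal interface; a real vector API would be richer.
  interface DoubleVec {
    int length();
    double get(int index);
  }

  // Array-backed implementation: the values live in memory.
  static final class ArrayVec implements DoubleVec {
    private final double[] values;

    ArrayVec(double... values) {
      this.values = values.clone();
    }

    public int length() { return values.length; }

    public double get(int index) { return values[index]; }
  }

  // Alternate implementation: a sequence like from:(from + length - 1)
  // whose elements are computed on demand and never stored.
  static final class RangeVec implements DoubleVec {
    private final int from;
    private final int length;

    RangeVec(int from, int length) {
      this.from = from;
      this.length = length;
    }

    public int length() { return length; }

    public double get(int index) {
      if (index < 0 || index >= length) {
        throw new IndexOutOfBoundsException("index " + index);
      }
      return from + index;
    }
  }

  // An algorithm written against the interface works with either backing.
  static double mean(DoubleVec v) {
    double sum = 0;
    for (int i = 0; i < v.length(); i++) {
      sum += v.get(i);
    }
    return sum / v.length();
  }

  public static void main(String[] args) {
    System.out.println(mean(new ArrayVec(1, 2, 3))); // prints 2.0
    System.out.println(mean(new RangeVec(1, 100)));  // prints 50.5
  }
}
```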
  • Performance: runtime (% difference vs R 2.12)
      • Mean.online: -11%
      • Sorting of 7,000,000 random values: 24%
      • 2800x2800 cross-product matrix (b = a * a): 25%
      • Creation, transp., deformation of a 2500x2500 matrix: 47%
      • 2400x2400 normal distributed random matrix ^1000: 49%
      • Creation of a 3000x3000 Hilbert matrix (matrix calc): 73%
      • 3,500,000 Fibonacci numbers calculation (vector calc): 146%
      • Escoufier's method on a 45x45 matrix (mixed): 224%
  • Performance Wins – Bloom filter
      • Environment chain (diagram): ENV BASE, ENV Methods, ENV Utils, ENV grDevices, ENV Stats, ENV survey, ENV GLOBAL
      • A call to length(x) at the GLOBAL scope requires checking all parent scopes for a variable length with a function value
      • To avoid expensive lookups to the HashMap at each frame, we:
          ▫ Assign each symbol a single bit in a 32-bit integer
          ▫ Maintain an OR-d mask at each level in the tree
          ▫ Only check the HashMap if the symbol’s bit is set
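The three steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not Renjin's code: the class `EnvSketch`, the hash-based bit assignment in `bitFor`, and the `probes` counter are all hypothetical, and the sketch keeps one mask per frame, which is only one reading of "at each level in the tree". It also ignores the extra requirement that `length` resolve to a function value.

```java
import java.util.HashMap;
import java.util.Map;

class EnvSketch {
  private final EnvSketch parent;
  private final Map<String, Object> frame = new HashMap<>();
  private int mask; // OR of the bits of every symbol assigned in this frame

  static int probes = 0; // counts HashMap lookups, for demonstration only

  EnvSketch(EnvSketch parent) {
    this.parent = parent;
  }

  // Hypothetical bit assignment: reduce the name's hash to one of 32 bits.
  static int bitFor(String symbol) {
    return 1 << (symbol.hashCode() & 31);
  }

  void assign(String symbol, Object value) {
    frame.put(symbol, value);
    mask |= bitFor(symbol);
  }

  // Walks up the parent chain, touching a frame's HashMap only when the
  // frame's mask says the symbol might be there (false positives possible,
  // false negatives not).
  Object find(String symbol) {
    int bit = bitFor(symbol);
    for (EnvSketch env = this; env != null; env = env.parent) {
      if ((env.mask & bit) != 0) {
        probes++;
        if (env.frame.containsKey(symbol)) {
          return env.frame.get(symbol);
        }
      }
    }
    return null;
  }

  public static void main(String[] args) {
    EnvSketch base = new EnvSketch(null);
    EnvSketch stats = new EnvSketch(base);
    EnvSketch global = new EnvSketch(stats);
    base.assign("length", "builtin length");
    global.assign("x", 1);
    System.out.println(global.find("length")); // prints builtin length
    System.out.println("HashMap lookups: " + probes);
  }
}
```

With only 32 bits, unrelated symbols will sometimes share a bit, so the mask can only rule frames out; a set bit still requires the real HashMap lookup.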
  • Potential Sources of Performance Gains (1/2)
      • Primitives (How fast can we compute the SVD of a huge matrix?)
          ▫ Parallelization
          ▫ Algorithmic improvements
              ▪ Cache-awareness
          ▫ Byte code optimizations (cf. Soot)
          ▫ JVM-level optimization (e.g. SSE instruction sets)
  • Potential Sources of Performance Gains (2/2)
      • Performance of R language code
          ▫ Translation to JVM byte code (and benefit from the JVM’s optimizations)
          ▫ Avoiding vector-boxing of scalars
          ▫ Copy-on-write optimization
          ▫ Parallelization
  • Building the Compiler
      • Direct translation of R code to byte code yields only marginal performance gains; optimization will be required to deliver significant speedups
  • Aside: The R language from a compiler-writer’s perspective
      • Very functional
      • Lazy and impure
      • Access to calling frames
      • Multimethod dispatch
      • Computing on the language
  • R Language Fun – Very functional
      • Everything is actually a function
      double <- function(x) x*2
      g <- function(f) f(3)
      g(double)

      `if` <- function(condition, a, b) 42
      if(FALSE) 1 else 2; # evaluates to 42

      `(` <- function(x) stop("foo!")
      2 * (x + 1) # throws foo!
  • R Language Fun – Lazy and Impure
      f <- function(a, b) b + a
      x <- 1
      f(x<-2, x) # evaluates to 3

      g <- function(x) deparse(substitute(x))
      g(sin(x)) # evaluates to "sin(x)"
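Why `f(x<-2, x)` evaluates to 3 becomes clearer once arguments are modelled as promises that are evaluated only when first used. The sketch below is a hypothetical model in Java, not Renjin's promise implementation; `Promise`, `force` and `demo` are names made up for this illustration.

```java
import java.util.function.Supplier;

class PromiseSketch {

  // Hypothetical promise: an unevaluated argument, forced at most once.
  static final class Promise<T> {
    private Supplier<T> expr;
    private T value;
    private boolean forced;

    Promise(Supplier<T> expr) {
      this.expr = expr;
    }

    T force() {
      if (!forced) {
        value = expr.get(); // side effects of the argument happen here
        forced = true;
        expr = null;        // the expression is no longer needed
      }
      return value;
    }
  }

  // Models f <- function(a, b) b + a: b is used, and so forced, before a.
  static int f(Promise<Integer> a, Promise<Integer> b) {
    return b.force() + a.force();
  }

  // Models x <- 1; f(x <- 2, x)
  static int demo() {
    int[] x = {1}; // one-element array so the lambdas can mutate x
    Promise<Integer> a = new Promise<>(() -> { x[0] = 2; return x[0]; }); // x <- 2
    Promise<Integer> b = new Promise<>(() -> x[0]);                       // x
    return f(a, b);
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints 3
  }
}
```

Because `b` is forced first, it sees `x` while it is still 1; the assignment inside `a` runs afterwards and contributes 2. A compiler cannot reorder or pre-evaluate such arguments without changing the result.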
  • R Language Fun – Access to calling frames
      f <- function() assign("x", 42, envir=parent.frame())
      g <- function() {
        x <- 1
        f()
        x
      }
      g() # evaluates to 42
  • R Language Fun – Multimethod dispatch
      x <- list(2,14)
      class(x) <- "version"
      y <- list(2,12)
      class(y) <- "version"
      `<.version` <- function(a,b) (a[[1]] < b[[1]]) ||
        (a[[1]] == b[[1]] && a[[2]] < b[[2]])
      x < y # false
      y < x # true
  • R Language Fun – Computing on the language
      f <- function(a,b) a + b
      body(f)[[1]] <- `-`
      f(0,2) # -2
  • Design choices
      • How much of the language to change?
          ▫ Can’t reasonably allow developers to redefine if, (), {}, etc.
          ▫ What about missing(x), quote(x), assign()?
      • When to compile?
          ▫ AOT: more time to optimize
          ▫ JIT: much more information
      • Ask developers to provide cues/guidance?
          ▫ Maybe special blocks where only a subset of language features is supported?
          ▫ Type annotations? New syntax for typing arguments?
  • Compiling in-depth
      • Typical under-performing fragment:
      mean.online <- function(x) {
        xbar <- x[1]
        for(n in 2:length(x)) {
          xbar <- ((n - 1) * xbar + x[n]) / n
        }
        xbar
      }
  • Translation to IR
      R source:
      mean.online <- function(x) {
        xbar <- x[1]
        for(n in seq(2, length(x))) {
          xbar <- ((n - 1) * xbar + x[n]) / n
        }
        xbar
      }
      IR:
            0: xbar ← primitive<[>(x, 1.0)
            1: τ₃ ← Δ length(x)
            2: τ₄ ← Δ seq(2.0, τ₃)
            3: Λ0 ← 0
        L0  4: τ₂ ← primitive<length>(τ₄)
            5: if Λ0 >= τ₂ goto L3 else L1
        L1  6: n ← τ₄[Λ0]
            7: τ₅ ← primitive<->(n, 1.0)
            8: τ₆ ← primitive<*>(τ₅, xbar)
            9: τ₇ ← primitive<[>(x, n)
           10: τ₈ ← primitive<+>(τ₆, τ₇)
           11: xbar ← primitive</>(τ₈, n)
        L2 12: Λ0 ← increment counter Λ0
           13: goto L0
        L3 14: return xbar
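For comparison, this is roughly the shape of code that the optimizations listed earlier (translation to JVM byte code, avoiding vector-boxing of scalars) would aim for if `x` were known to be a double vector. It is a hand-written Java sketch, not output of the Renjin compiler, and `MeanOnlineSketch` is a hypothetical name.

```java
class MeanOnlineSketch {

  // Running mean over an unboxed double[]: every temporary in the loop is
  // a primitive double or int rather than a length-one R vector.
  static double meanOnline(double[] x) {
    double xbar = x[0];                        // xbar <- x[1] (R is 1-indexed)
    for (int n = 2; n <= x.length; n++) {      // for(n in 2:length(x))
      xbar = ((n - 1) * xbar + x[n - 1]) / n;  // x[n] in R is x[n - 1] here
    }
    return xbar;
  }

  public static void main(String[] args) {
    System.out.println(meanOnline(new double[] {1, 2, 3, 4})); // prints 2.5
  }
}
```

One difference from the R original: for a single-element input this loop does not run, whereas `2:length(x)` in R would count down from 2 to 1.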
  • Just-in-time Compiling
      • Implementing the full language in the IR-based interpreter is difficult
      • May stick with the AST-based interpreter and wait until we hit a big loop
      • At that point, compile to JVM byte code
      • Let’s look at the example again…