Renjin @ Purdue









  • Most benchmarks from SVN repo for source of all benchmarks @ r476. Run with 3 warmup runs (not timed) and 5 timed runs. Using ATLAS BLAS libs on Ubuntu.

Renjin @ Purdue Presentation Transcript

  • Renjin
    • Alexander Bertram, bedatadriven
    • Purdue, 24.6.2012
  • Agenda
    • About bedatadriven
    • Motivation for Renjin
    • Design
    • Progress to date
    • Future Directions
  • About bedatadriven
    • Consulting and software company based in the Netherlands, with clients primarily in emerging markets and conflict zones
    • We use R extensively in:
      ▫ Predictive analytics for telecoms in Africa and Asia
      ▫ Web-based tools for market researchers
      ▫ Highly clustered public health and opinion surveys in Afghanistan
    • Biggest datasets are on the order of 5-10 billion rows, with a small number of columns
    • Small datasets are also key when collection is expensive ($10-$20 / obs)
  • Motivations for Developing Renjin
    • Deploy R programs to AppEngine & other PaaS
    • Seamless integration with our Java codebases
    • Thread-based parallelization of machine-learning tasks
    • Simplify deployment in complex (read: chaotic) production environments of clients
    • Fascinating project!
  • Renjin Design: Goals
    • Run existing R packages without modification
    • No (required) native dependencies
    • Multithreaded
    • Security model that supports running unsafe code in webapp environments
  • Renjin Design: Approach
    • Parser is ported directly (via Bison-Java)
    • Simple AST-based interpreter closely modeled after C-R
    • R-language portion of base library reused without modification
    • Leverage existing JVM libraries (commons-math, netlib, jtransforms) when possible
    • C/Fortran-language portions ported or rewritten in Java
  • Renjin Design: Threading
    [Diagram: a single JVM process containing RenjinEngine 1 and RenjinEngine 2. Each engine is drawn with its own Base, Stats, and Global; a single Dataframe is also shown.]
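One way to read the threading diagram is that each engine keeps its own global environment while several engines run inside one JVM process. The Java sketch below illustrates only that isolation idea. `ToyEngine`, `assign`, `get`, and `columnMeansInParallel` are hypothetical names invented for this example and are not Renjin's API; the shared input array stands in for data that engines only read.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ThreadingSketch {

    /** Hypothetical stand-in for one engine: it owns a private "global" environment. */
    static final class ToyEngine {
        private final Map<String, Double> global = new HashMap<>();

        void assign(String name, double value) { global.put(name, value); }

        double get(String name) { return global.get(name); }
    }

    /** Computes one mean per column, each task using its own engine on a shared thread pool. */
    static List<Double> columnMeansInParallel(double[][] columns) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (double[] column : columns) {
                Callable<Double> task = () -> {
                    // Assumption: one engine per task, so no mutable state is shared.
                    ToyEngine engine = new ToyEngine();
                    engine.assign("total", 0.0);
                    for (double x : column) {
                        engine.assign("total", engine.get("total") + x);
                    }
                    return engine.get("total") / column.length;
                };
                futures.add(pool.submit(task));
            }
            List<Double> means = new ArrayList<>();
            for (Future<Double> future : futures) {
                means.add(future.get());
            }
            return means;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel evaluation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[][] columns = { {1, 2, 3}, {10, 20} };
        System.out.println(columnMeansInParallel(columns));
    }
}
```

Because every task writes only to its own engine's map, the tasks need no locking; the input columns are read but never modified.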
  • Renjin Design: Embedding
    [Diagram: a JVM process containing RenjinEngine 1 (Base, Stats, Global), a Virtual File System, and a Security Manager. Other labels: Hadoop FS, AppEngine BlobStore, Local File System.]
  • Renjin Design: Data
    • All primitives written against (threadsafe) interfaces rather than arrays
    • [Diagram: interface Vector, with these implementations:]
      ▫ SequenceVector (1:1e10)
      ▫ ArrayVector
      ▫ JDBCVector
      ▫ BufferedSequentialVector
      ▫ TempFileBackedVector
      ▫ DecoratingVector (inner + 1)
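The idea of writing primitives against an interface rather than an array can be illustrated with a small Java sketch. The class names echo the labels on the slide, but the method names (`length`, `get`) and every implementation detail here are assumptions made for illustration, not Renjin's actual signatures.

```java
import java.util.function.DoubleUnaryOperator;

final class VectorSketch {

    /** Hypothetical, simplified Vector interface: callers never see a backing array. */
    interface Vector {
        long length();
        double get(long index); // zero-based index
    }

    /** A range such as 1:1e10, computed on demand; no element is ever stored. */
    static final class SequenceVector implements Vector {
        private final long from;
        private final long length;

        SequenceVector(long from, long to) {
            this.from = from;
            this.length = to - from + 1;
        }

        public long length() { return length; }

        public double get(long index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            return from + index;
        }
    }

    /** Ordinary in-memory storage. */
    static final class ArrayVector implements Vector {
        private final double[] values;

        ArrayVector(double... values) { this.values = values.clone(); }

        public long length() { return values.length; }

        public double get(long index) { return values[Math.toIntExact(index)]; }
    }

    /** Wraps another vector and applies a function lazily, e.g. inner + 1. */
    static final class DecoratingVector implements Vector {
        private final Vector inner;
        private final DoubleUnaryOperator op;

        DecoratingVector(Vector inner, DoubleUnaryOperator op) {
            this.inner = inner;
            this.op = op;
        }

        public long length() { return inner.length(); }

        public double get(long index) { return op.applyAsDouble(inner.get(index)); }
    }

    public static void main(String[] args) {
        // (1:1e10) + 1 without allocating ten billion doubles.
        Vector plusOne = new DecoratingVector(new SequenceVector(1, 10_000_000_000L), x -> x + 1);
        System.out.println(plusOne.length());
        System.out.println(plusOne.get(0));
    }
}
```

In this sketch the decorated sequence costs constant memory, which is the kind of saving an interface-based design makes possible; a database-backed or file-backed vector would plug in the same way.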
  • Progress to date
    • 68% of builtins/internals implemented
    • Runs complex packages out of the box (base, survey, aspect)
    • 700+ micro tests covering edge cases related to subscripts, promises, missing arguments, etc.
    • Missing pieces:
      ▫ S4 object system
      ▫ Base graphics incomplete
      ▫ Stats package port incomplete
      ▫ No good strategy yet for native dependencies
  • Performance: Benchmark-25
    Runtime (% difference vs R2.14.2); chart axis runs from 1.00 to 3.50
    • Sorting of 7,000,000 random values: 1.17
    • 2800x2800 cross-product matrix (b = a * a): 1.01
    • Creation, transp., deformation of a 2500x2500 matrix: 1.35
    • 2400x2400 normal distributed random matrix ^1000: 1.45
    • Creation of a 3000x3000 Hilbert matrix (matrix calc): 1.33
    • 3,500,000 Fibonacci numbers calculation (vector calc): 2.38
    • Escoufier's method on a 45x45 matrix (mixed): 3.26
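The benchmark notes for this deck describe 3 warmup runs (not timed) followed by 5 timed runs. A minimal Java sketch of that measurement pattern follows. It is not the harness used for the figures above: `BenchmarkSketch`, `time`, and `meanMillis` are hypothetical names, and the sorting workload is only a stand-in that echoes the first benchmark row at a smaller size.

```java
import java.util.Arrays;
import java.util.Random;

final class BenchmarkSketch {
    static final int WARMUP_RUNS = 3; // executed but not timed
    static final int TIMED_RUNS = 5;  // each one timed individually

    /** Runs the task untimed for warmup, then returns elapsed nanoseconds per timed run. */
    static long[] time(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long[] elapsed = new long[timedRuns];
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        return elapsed;
    }

    /** Mean of the timed runs, converted to milliseconds. */
    static double meanMillis(long[] nanos) {
        return Arrays.stream(nanos).average().orElse(Double.NaN) / 1e6;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Stand-in workload; note that generating the data is included in the timing here.
        Runnable sortTask = () -> {
            double[] values = rng.doubles(1_000_000).toArray();
            Arrays.sort(values);
        };
        long[] runs = time(sortTask, WARMUP_RUNS, TIMED_RUNS);
        System.out.printf("mean of %d timed runs: %.1f ms%n", runs.length, meanMillis(runs));
    }
}
```

Warmup runs matter on the JVM in particular, since early iterations may execute before the JIT compiler has optimized the hot code.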
  • Future Directions: Short-term
    • Testing: Complete test harness to systematically evaluate completeness against CRAN packages
    • Dependency management: Augment/replace package management system with Aether to simplify management of mixed R, Java, Scala artifacts
    • Command-line and Eclipse plugins for ad hoc analysis
  • Future Directions: Longer-term
    • JIT compilation to JVM bytecode: for, lapply, while, etc.
    • Alternative backing stores for Vector interface
    • GCC-bridge to translate existing C/Fortran sources to JVM bytecode (Gimple -> Shimple -> JVM bytecode)
  • Compilation: Rely on an <- function(x) {
        xbar <- x[1]
        for(n in seq(2,length(x))) {
          xbar <- ((n - 1) * xbar + x[n]) / n
        }
        xbar
      }

          0: xbar ← primitive<[>(x, 1.0)
          1: τ₃ ← Δ length(x)
          2: τ₄ ← Δ seq(2.0, τ₃)
          3: Λ0 ← 0
          4: τ₂ ← primitive<length>(τ₄)
      L0  5: if Λ0 >= τ₂ goto L3 else L1
      L1  6: n ← τ₄[Λ0]
          7: τ₅ ← primitive<->(n, 1.0)
          8: τ₆ ← primitive<*>(τ₅, xbar)
          9: τ₇ ← primitive<[>(x, n)
         10: τ₈ ← primitive<+>(τ₆, τ₇)
         11: xbar ← primitive</>(τ₈, n)
      L2 12: Λ0 ← increment counter Λ0
         13: goto L0
      L3 14: return xbar
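The R function on this slide computes a mean incrementally, updating `xbar` as ((n - 1) * xbar + x[n]) / n for each new element. As a rough illustration of the straight-line loop that the lowered form resembles, here is a hand-written Java equivalent. It is not output from Renjin's compiler; `RunningMeanSketch` and `runningMean` are names chosen for this example, and indexing is shifted to Java's zero-based arrays.

```java
final class RunningMeanSketch {

    /** Incremental mean: after processing n elements, xbar is the mean of the first n. */
    static double runningMean(double[] x) {
        if (x.length == 0) {
            throw new IllegalArgumentException("x must not be empty");
        }
        double xbar = x[0];                          // R: xbar <- x[1]
        for (int n = 2; n <= x.length; n++) {        // R: for (n in seq(2, length(x)))
            xbar = ((n - 1) * xbar + x[n - 1]) / n;  // x[n - 1] is R's x[n]
        }
        return xbar;
    }

    public static void main(String[] args) {
        System.out.println(runningMean(new double[] {2, 4, 6, 8}));
    }
}
```

One difference worth noting: for a length-1 input the Java loop simply does not execute, whereas R's `seq(2, 1)` counts downward, so the two versions are only equivalent for inputs of length 2 or more.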
  • Contact & Thanks
    • alex@bedatadriven.com
    • Thanks to Renjin contributors:
      ▫ M. Hakan Satman of Istanbul University, Department of Econometrics
      ▫ Jamie Kingsbery, Yodle