Renjin @ Purdue

  • Most benchmarks from http://r.research.att.com/benchmarks/. See SVN repo for source of all benchmarks @ r476. Run with 3 warmup runs (not timed) and 5 timed runs. Using ATLAS BLAS libs on Ubuntu.
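The measurement protocol in the note above (3 untimed warmup runs, then 5 timed runs) can be sketched as a small Java harness. This is only an illustration under assumptions: the class and method names (`BenchmarkHarness`, `measure`, `meanMillis`) are hypothetical and are not taken from the Renjin benchmark sources, and the sort task merely stands in for one benchmark.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical harness; names are illustrative, not from the Renjin repository.
final class BenchmarkHarness {
    static final int WARMUP_RUNS = 3; // untimed, gives the JIT a chance to compile hot paths
    static final int TIMED_RUNS = 5;

    /** Runs the task WARMUP_RUNS times untimed, then returns TIMED_RUNS timings in nanoseconds. */
    static long[] measure(Runnable task) {
        for (int i = 0; i < WARMUP_RUNS; i++) {
            task.run();
        }
        long[] nanos = new long[TIMED_RUNS];
        for (int i = 0; i < TIMED_RUNS; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Mean of the timings, converted to milliseconds. */
    static double meanMillis(long[] nanos) {
        long total = 0;
        for (long n : nanos) {
            total += n;
        }
        return total / (double) nanos.length / 1e6;
    }

    public static void main(String[] args) {
        // Stand-in workload: sort one million random doubles.
        Runnable sortTask = () -> {
            double[] values = new Random(42).doubles(1_000_000).toArray();
            Arrays.sort(values);
        };
        long[] timings = measure(sortTask);
        System.out.printf("mean of %d timed runs: %.2f ms%n", timings.length, meanMillis(timings));
    }
}
```

Discarding the warmup runs matters more on the JVM than for the C interpreter, since early iterations include class loading and JIT compilation time.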

Transcript

  • 1. Renjin
      Alexander Bertram, bedatadriven
      Purdue, 24.6.2012
  • 2. Agenda
      • About bedatadriven
      • Motivation for Renjin
      • Design
      • Progress to date
      • Future Directions
  • 3. About bedatadriven
      • Consulting and software company based in the Netherlands with clients primarily in emerging markets and conflict zones
      • We use R extensively in:
          ▫ Predictive analytics for telecoms in Africa and Asia
          ▫ Web-based tools for market researchers
          ▫ Highly-clustered public health, opinion surveys in Afghanistan
      • Biggest datasets on the order of 5-10 billion rows, small number of columns
      • Small datasets are also key when collection is expensive ($10-$20 / obs)
  • 4. Motivations for Developing Renjin
      • Deploy R programs to AppEngine & other PaaS
      • Seamless integration with our Java codebases
      • Thread-based parallelization of machine-learning tasks
      • Simplify deployment in complex (read: chaotic) production environments of clients
      • Fascinating project!
  • 5. Renjin Design: Goals
      • Run existing R packages without modification
      • No (required) native dependencies
      • Multithreaded
      • Security model that supports running unsafe code in webapp environments
  • 6. Renjin Design: Approach
      • Parser is ported directly (via Bison-Java)
      • Simple AST-based interpreter closely modeled after C-R
      • R-language portion of base library reused without modification
      • Leverage existing JVM libraries (commons-math, netlib, jtransforms) when possible
      • C/Fortran-language portions ported or rewritten in Java
  • 7. Renjin Design: Threading
      [Diagram labels: JVM Process; RenjinEngine 1 (Base, Stats, Global); RenjinEngine 2 (Base, Stats, Global); Dataframe]
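The threading slide shows several engines living in one JVM process, each with its own Global environment. A minimal Java sketch of that arrangement follows. It is built on assumptions: `EngineSketch` and `ThreadingSketch` are hypothetical stand-ins and not Renjin's API, and treating the data as a read-only array shared by both engines is one possible reading of the diagram's Dataframe box, not something the slide states.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for one interpreter instance. This is NOT Renjin's API.
final class EngineSketch {
    // Each engine owns its Global environment, so assignments never leak between engines.
    private final Map<String, Object> global = new HashMap<>();
    // Assumption: data is shared read-only, so no locking is needed.
    private final double[] shared;

    EngineSketch(double[] shared) {
        this.shared = shared;
    }

    void assign(String name, double value) {
        global.put(name, value);
    }

    /** Multiplies the sum of the shared data by this engine's own "scale" variable. */
    double scaledSum() {
        double scale = (Double) global.get("scale");
        double sum = 0;
        for (double v : shared) {
            sum += v;
        }
        return scale * sum;
    }
}

final class ThreadingSketch {
    /** One unit of work: create a private engine, set its variable, evaluate. */
    static Callable<Double> job(double[] shared, double scale) {
        return () -> {
            EngineSketch engine = new EngineSketch(shared);
            engine.assign("scale", scale);
            return engine.scaledSum();
        };
    }

    /** Runs two engines on two threads against the same data. */
    static double[] runTwoEngines(double[] shared) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Double> first = pool.submit(job(shared, 1.0));
            Future<Double> second = pool.submit(job(shared, 2.0));
            return new double[] { first.get(), second.get() };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] results = runTwoEngines(new double[] {1, 2, 3});
        System.out.println(results[0] + " " + results[1]);
    }
}
```

Both engines define a variable named `scale`, yet neither sees the other's value, which is the isolation property the per-engine Global boxes suggest.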
  • 8. Renjin Design: Embedding
      [Diagram labels: JVM Process; RenjinEngine 1 (Base, Stats, Global); Virtual File System (Hadoop FS, AppEngine BlobStore, Local File System); Security Manager]
  • 9. Renjin Design: Data
      • All primitives written against (threadsafe) interfaces rather than arrays
      • [Diagram: interface Vector, with implementations:]
          ▫ SequenceVector (1:1e10)
          ▫ ArrayVector
          ▫ JDBCVector
          ▫ BufferedSequentialVector
          ▫ TempFileBackedVector
          ▫ DecoratingVector (inner + 1)
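The idea of writing primitives against a vector interface rather than arrays can be illustrated with a small Java sketch. Everything here is an assumption made for illustration: the interface `VectorSketch` and the classes below are hypothetical simplifications and not Renjin's actual Vector API. The sketch mirrors three boxes from the slide: a sequence computed from its index (as in `1:1e10`), an array-backed vector, and a decorator that computes `inner + 1` lazily.

```java
// Hypothetical, simplified interface for illustration; Renjin's actual Vector API may differ.
interface VectorSketch {
    long length();

    double get(long index); // zero-based
}

/** Like 1:1e10 in R: elements are computed from the index and never stored. */
final class SequenceVectorSketch implements VectorSketch {
    private final double from;
    private final long length;

    SequenceVectorSketch(double from, long length) {
        this.from = from;
        this.length = length;
    }

    public long length() {
        return length;
    }

    public double get(long index) {
        return from + index;
    }
}

/** Plain in-memory storage. */
final class ArrayVectorSketch implements VectorSketch {
    private final double[] values;

    ArrayVectorSketch(double... values) {
        this.values = values.clone();
    }

    public long length() {
        return values.length;
    }

    public double get(long index) {
        return values[(int) index];
    }
}

/** Like (inner + 1): wraps another vector and adds a constant on each access. */
final class PlusConstantSketch implements VectorSketch {
    private final VectorSketch inner;
    private final double constant;

    PlusConstantSketch(VectorSketch inner, double constant) {
        this.inner = inner;
        this.constant = constant;
    }

    public long length() {
        return inner.length();
    }

    public double get(long index) {
        return inner.get(index) + constant;
    }
}

final class VectorDemo {
    /** A "primitive" written against the interface works with any backing store. */
    static double sum(VectorSketch v) {
        double total = 0;
        for (long i = 0; i < v.length(); i++) {
            total += v.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Ten billion logical elements, constant memory.
        VectorSketch huge = new SequenceVectorSketch(1, 10_000_000_000L);
        System.out.println(huge.length());
        System.out.println(sum(new PlusConstantSketch(new ArrayVectorSketch(1, 2, 3), 1)));
    }
}
```

Because `sum` only sees the interface, the same code path would serve a database-backed or file-backed implementation, which is presumably the point of the JDBCVector and TempFileBackedVector boxes.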
  • 10. Progress to date
      • 68% of builtins/internals implemented
      • Runs complex packages out of the box (base, survey, aspect)
      • 700+ micro tests covering edge cases related to subscripts, promises, missing arguments, etc.
      • Missing pieces:
          ▫ S4 object system
          ▫ Base graphics incomplete
          ▫ Stats package port incomplete
          ▫ No good strategy yet for native dependencies
  • 11. Performance: Benchmark-25
      [Bar chart: Runtime (% difference vs R2.14.2)]
      • Sorting of 7,000,000 random values: 1.17
      • 2800x2800 cross-product matrix (b = a * a): 1.01
      • Creation, transp., deformation of a 2500x2500 matrix: 1.35
      • 2400x2400 normal distributed random matrix ^1000: 1.45
      • Creation of a 3000x3000 Hilbert matrix (matrix calc): 1.33
      • 3,500,000 Fibonacci numbers calculation (vector calc): 2.38
      • Escoufier's method on a 45x45 matrix (mixed): 3.26
  • 12. Future Directions: Short-term
      • Testing: Complete test harness to systematically evaluate completeness against CRAN packages
      • Dependency management: Augment/replace package management system with Aether to simplify management of mixed R, Java, Scala artifacts
      • Command-line and Eclipse plugins for ad hoc analysis
  • 13. Future Directions: Longer-term
      • JIT compilation to JVM bytecode: for, lapply, while, etc.
      • Alternative backing stores for Vector interface
      • GCC-bridge to translate existing C/Fortran sources to JVM bytecode (Gimple -> Shimple -> JVM bytecode)
  • 14. Compilation: Rely on an IR
      R source:
          mean.online <- function(x) {
            xbar <- x[1]
            for(n in seq(2, length(x))) {
              xbar <- ((n - 1) * xbar + x[n]) / n
            }
            xbar
          }
      Intermediate representation:
                0: xbar ← primitive<[>(x, 1.0)
                1: τ₃ ← Δ length(x)
                2: τ₄ ← Δ seq(2.0, τ₃)
                3: Λ0 ← 0
          L0    4: τ₂ ← primitive<length>(τ₄)
                5: if Λ0 >= τ₂ goto L3 else L1
          L1    6: n ← τ₄[Λ0]
                7: τ₅ ← primitive<->(n, 1.0)
                8: τ₆ ← primitive<*>(τ₅, xbar)
                9: τ₇ ← primitive<[>(x, n)
               10: τ₈ ← primitive<+>(τ₆, τ₇)
               11: xbar ← primitive</>(τ₈, n)
          L2   12: Λ0 ← increment counter Λ0
               13: goto L0
          L3   14: return xbar
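The recurrence computed by `mean.online` above, xbar_n = ((n - 1) * xbar_(n-1) + x_n) / n, can be written as a plain Java loop, which is roughly the shape of code a compiler targeting the JVM would aim for once the types are known. This is a hand-written sketch, not output of Renjin's compiler; the class name `OnlineMean` is hypothetical, and the empty-input check is an addition that the R version does not have.

```java
// Hand-written illustration of the running-mean recurrence; not generated by Renjin.
final class OnlineMean {
    /** Updates the mean one observation at a time. */
    static double meanOnline(double[] x) {
        if (x.length == 0) {
            throw new IllegalArgumentException("x must not be empty");
        }
        double xbar = x[0]; // R's x[1]; Java arrays are zero-based
        for (int n = 2; n <= x.length; n++) {
            // ((n - 1) * xbar + x[n]) / n, with x[n] in R being x[n - 1] here
            xbar = ((n - 1) * xbar + x[n - 1]) / n;
        }
        return xbar;
    }

    public static void main(String[] args) {
        System.out.println(meanOnline(new double[] {2, 4, 6, 8}));
    }
}
```

For the input 2, 4, 6, 8 the running values are 2, 3, 4, 5, so the final result equals the ordinary arithmetic mean.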
  • 15. Contact & Thanks
      • http://code.google.com/p/renjin
      • alex@bedatadriven.com
      Thanks to Renjin contributors:
      • M. Hakan Satman of Istanbul University, Department of Econometrics
      • Jamie Kingsbery, Yodle