Renjin @ Purdue
Published in: Technology

  • Most benchmarks from SVN repo for source of all benchmarks @ r476. Run with 3 warmup runs (not timed) and 5 timed runs. Using ATLAS BLAS libs on Ubuntu.
  • Transcript

    • 1. Renjin
      Alexander Bertram, bedatadriven
      Purdue, 24.6.2012
    • 2. Agenda
      • About bedatadriven
      • Motivation for Renjin
      • Design
      • Progress to-date
      • Future Directions
    • 3. About bedatadriven
      • Consulting and software company based in the Netherlands with clients primarily in emerging markets and conflict zones
      • We use R extensively in:
        ▫ Predictive analytics for telecoms in Africa and Asia
        ▫ Web-based tools for market researchers
        ▫ Highly-clustered public health, opinion surveys in Afghanistan
      • Biggest datasets on the order of 5-10 billion rows, small number of columns
      • Small datasets are also key when collection is expensive ($10-$20 / obs)
    • 4. Motivations for Developing Renjin
      • Deploy R programs to AppEngine & other PaaS
      • Seamless integration with our Java codebases
      • Thread-based parallelization of machine-learning tasks
      • Simplify deployment in complex (read: chaotic) production environments of clients
      • Fascinating project!
    • 5. Renjin Design: Goals
      • Run existing R packages without modification
      • No (required) native dependencies
      • Multithreaded
      • Security model that supports running unsafe code in webapp environments
    • 6. Renjin Design: Approach
      • Parser is ported directly (via Bison-Java)
      • Simple AST-based interpreter closely modeled after C-R
      • R-language portion of base library reused without modification
      • Leverage existing JVM libraries (commons-math, netlib, jtransforms) when possible
      • C/Fortran-language portions ported or rewritten in Java
    • 7. Renjin Design: Threading
      [Diagram. Labels: JVM Process; RenjinEngine 1; RenjinEngine 2; Base; Stats; Global; Dataframe]
    • 8. Renjin Design: Embedding
      [Diagram. Labels: JVM Process; RenjinEngine 1; Base; Stats; Global; Virtual File System; Hadoop FS; AppEngine BlobStore; Local File System; Security Manager]
    • 9. Renjin Design: Data
      • All primitives written against (threadsafe) interfaces rather than arrays
      [Diagram. interface Vector, with implementations:]
        ▫ SequenceVector (1:1e10)
        ▫ ArrayVector
        ▫ JDBCVector
        ▫ BufferedSequentialVector
        ▫ TempFileBackedVector
        ▫ DecoratingVector (inner + 1)
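The idea on slide 9, primitives coded against an interface instead of a raw array, can be illustrated with a minimal Java sketch. Everything here is hypothetical: `NumVector`, `PlusConstantVector`, and `VectorSketch` are made-up names, and the classes that borrow the slide's labels (`SequenceVector`, `ArrayVector`) are toy stand-ins, not Renjin's actual classes or API. The sketch assumes immutability is how thread safety is obtained, which the slide does not state.

```java
// Hypothetical sketch only: not Renjin's real Vector API.
interface NumVector {
    long length();
    double get(long index); // zero-based index
}

/** A sequence from, from+1, ... held as two numbers, so 1:1e10 needs no array. */
final class SequenceVector implements NumVector {
    private final double from;
    private final long length;
    SequenceVector(double from, long length) { this.from = from; this.length = length; }
    public long length() { return length; }
    public double get(long index) { return from + index; }
}

/** A vector backed by an in-memory array (defensively copied, never mutated). */
final class ArrayVector implements NumVector {
    private final double[] values;
    ArrayVector(double... values) { this.values = values.clone(); }
    public long length() { return values.length; }
    public double get(long index) { return values[(int) index]; }
}

/** A decorating view that computes inner + constant lazily, element by element. */
final class PlusConstantVector implements NumVector {
    private final NumVector inner;
    private final double constant;
    PlusConstantVector(NumVector inner, double constant) {
        this.inner = inner;
        this.constant = constant;
    }
    public long length() { return inner.length(); }
    public double get(long index) { return inner.get(index) + constant; }
}

final class VectorSketch {
    /** A "primitive" written against the interface: works for any backing store. */
    static double sum(NumVector v) {
        double total = 0;
        for (long i = 0; i < v.length(); i++) {
            total += v.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // (1:1e10) + 1 without allocating ten billion doubles.
        NumVector big = new PlusConstantVector(new SequenceVector(1, 10_000_000_000L), 1);
        System.out.println(big.get(9_999_999_999L));
        System.out.println(sum(new ArrayVector(1, 2, 3)));
    }
}
```

Because `sum` only sees the interface, the same code would work over a database-backed or file-backed implementation, which is the point the slide's diagram appears to make.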
    • 10. Progress to date
      • 68% of builtins/internals implemented
      • Runs complex packages out of the box (base, survey, aspect)
      • 700+ micro tests covering edge cases related to subscripts, promises, missing arguments, etc.
      • Missing pieces:
        ▫ S4 object system
        ▫ Base graphics incomplete
        ▫ Stats package port incomplete
        ▫ No good strategy yet for native dependencies
    • 11. Performance: Benchmark-25
      [Bar chart. Runtime (% difference vs R2.14.2), axis from 1.00 to 3.50]
        ▫ Sorting of 7,000,000 random values: 1.17
        ▫ 2800x2800 cross-product matrix (b = a * a): 1.01
        ▫ Creation, transp., deformation of a 2500x2500 matrix: 1.35
        ▫ 2400x2400 normal distributed random matrix ^1000: 1.45
        ▫ Creation of a 3000x3000 Hilbert matrix (matrix calc): 1.33
        ▫ 3,500,000 Fibonacci numbers calculation (vector calc): 2.38
        ▫ Escoufier's method on a 45x45 matrix (mixed): 3.26
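The slide notes describe the timing protocol as 3 untimed warmup runs followed by 5 timed runs. A minimal Java sketch of that protocol follows. It is not the harness used for the slide: `BenchmarkSketch`, `meanSeconds`, and `relativeRuntime` are hypothetical names, averaging the timed runs is an assumption (the notes do not say how the 5 runs were aggregated), and reading the plotted values as runtime ratios against R 2.14.2 is also an assumption.

```java
// Hypothetical sketch of a warmup-then-time protocol; not the original benchmark harness.
final class BenchmarkSketch {

    /** Runs the task `warmups` times untimed, then returns the mean wall-clock seconds of `timedRuns` runs. */
    static double meanSeconds(Runnable task, int warmups, int timedRuns) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // lets the JIT compile hot paths before measuring
        }
        long totalNanos = 0;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / timedRuns;
    }

    /** Ratio of candidate to baseline runtime (assumed meaning of the chart values). */
    static double relativeRuntime(double candidateSeconds, double baselineSeconds) {
        return candidateSeconds / baselineSeconds;
    }

    public static void main(String[] args) {
        java.util.Random rng = new java.util.Random(42);
        // Stand-in workload: sort one million random doubles.
        Runnable sortTask = () -> {
            double[] values = rng.doubles(1_000_000).toArray();
            java.util.Arrays.sort(values);
        };
        System.out.printf("mean seconds: %.4f%n", meanSeconds(sortTask, 3, 5));
    }
}
```

Warmup matters on the JVM in particular, because early iterations run in the bytecode interpreter and would otherwise inflate the measured time.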
    • 12. Future Directions: Short-term
      • Testing: Complete test harness to systematically evaluate completeness against CRAN packages
      • Dependency management: Augment/replace package management system with Aether to simplify management of mixed R, Java, Scala artifacts
      • Command-line and Eclipse plugins for ad hoc analysis
    • 13. Future Directions: Longer-term
      • JIT compilation to JVM bytecode: for, lapply, while, etc.
      • Alternative backing stores for Vector interface
      • GCC-bridge to translate existing C/Fortran sources to JVM bytecode (Gimple -> Shimple -> JVM bytecode)
    • 14. Compilation: Rely on an
      R source:
          <- function(x) {
            xbar <- x[1]
            for(n in seq(2, length(x))) {
              xbar <- ((n - 1) * xbar + x[n]) / n
            }
            xbar
          }
      Intermediate representation:
               0: xbar ← primitive<[>(x, 1.0)
               1: τ₃ ← Δ length(x)
               2: τ₄ ← Δ seq(2.0, τ₃)
               3: Λ0 ← 0
          L0   4: τ₂ ← primitive<length>(τ₄)
               5: if Λ0 >= τ₂ goto L3 else L1
          L1   6: n ← τ₄[Λ0]
               7: τ₅ ← primitive<->(n, 1.0)
               8: τ₆ ← primitive<*>(τ₅, xbar)
               9: τ₇ ← primitive<[>(x, n)
              10: τ₈ ← primitive<+>(τ₆, τ₇)
              11: xbar ← primitive</>(τ₈, n)
          L2  12: Λ0 ← increment counter Λ0
              13: goto L0
          L3  14: return xbar
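For readers less used to R, the loop on slide 14 is an incremental mean: after step n, xbar holds the mean of the first n elements. A hand-written Java equivalent is sketched below to show what the intermediate representation computes. This is not output of Renjin's compiler; `RunningMean` and `mean` are hypothetical names, and the input is assumed to be a non-empty `double[]`.

```java
// Hypothetical hand translation of the R loop; not generated by Renjin.
final class RunningMean {

    /** Incremental mean: xbar_n = ((n - 1) * xbar_{n-1} + x_n) / n, for a non-empty array. */
    static double mean(double[] x) {
        double xbar = x[0];                          // xbar <- x[1] (R indexes from 1)
        for (int n = 2; n <= x.length; n++) {        // for (n in seq(2, length(x)))
            xbar = ((n - 1) * xbar + x[n - 1]) / n;  // x[n] in R is x[n - 1] in Java
        }
        return xbar;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[] {1, 2, 3, 4}));
    }
}
```

One difference from the R source: when the input has length 1, R's `seq(2, 1)` counts down and yields `2 1`, whereas the Java loop simply does not execute.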
    • 15. Contact & Thanks
      • alex@bedatadriven.com
      Thanks to Renjin contributors:
      • M. Hakan Satman of Istanbul University, Department of Econometrics
      • Jamie Kingsbery, Yodle