Agenda
• About bedatadriven
• Motivation for Renjin
• Design
• Progress to date
• Future Directions
About bedatadriven
• Consulting and software company based in the Netherlands with clients primarily in emerging markets and conflict zones
• We use R extensively in:
  ▫ Predictive analytics for telecoms in Africa and Asia
  ▫ Web-based tools for market researchers
  ▫ Highly-clustered public health, opinion surveys in Afghanistan
• Biggest datasets on the order of 5-10 billion rows, small number of columns
• Small datasets are also key when collection is expensive ($10-$20 / obs)
Motivations for Developing Renjin
• Deploy R programs to AppEngine & other PaaS
• Seamless integration with our Java codebases
• Thread-based parallelization of machine-learning tasks
• Simplify deployment in complex (read: chaotic) production environments of clients
• Fascinating project!
Renjin Design: Goals
• Run existing R packages without modification
• No (required) native dependencies
• Multithreaded
• Security model that supports running unsafe code in webapp environments
Renjin Design: Approach
• Parser is ported directly (via Bison-Java)
• Simple AST-based interpreter closely modeled after C-R
• R-language portion of base library reused without modification
• Leverage existing JVM libraries (commons-math, netlib, jtransforms) when possible
• C/Fortran-language portions ported or rewritten in Java
Renjin Design: Threading
[Diagram: one JVM process containing RenjinEngine 1 and RenjinEngine 2. Labels: Base, Stats, and Global for each engine, plus Dataframe.]
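One possible reading of the threading design is that several engines live in one JVM process, each with its own global environment, while read-only data is shared between them. The Java sketch below illustrates that idea only. `EngineSketch`, `ThreadingSketch`, and `runInParallel` are hypothetical names invented for this example and are not Renjin's API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for an engine: private globals, shared read-only data. */
final class EngineSketch {
    // Each engine owns its global environment, so no locking is needed here.
    private final Map<String, Double> global = new HashMap<>();
    // The data is shared between engines and must never be mutated.
    private final List<Double> sharedData;

    EngineSketch(List<Double> sharedData) {
        this.sharedData = sharedData;
    }

    /** Computes a scaled sum over the shared data and stores it under a name. */
    double run(String name, double scale) {
        double total = 0.0;
        for (double x : sharedData) {
            total += x * scale;
        }
        global.put(name, total);
        return global.get(name);
    }
}

final class ThreadingSketch {
    /** Runs one engine per scale factor, each on its own thread. */
    static List<Double> runInParallel(List<Double> data, double... scales) {
        final List<Double> shared = Collections.unmodifiableList(new ArrayList<>(data));
        ExecutorService pool = Executors.newFixedThreadPool(scales.length);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (final double scale : scales) {
                Callable<Double> task = () -> new EngineSketch(shared).run("result", scale);
                futures.add(pool.submit(task));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> future : futures) {
                results.add(future.get()); // results come back in submission order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Double> data = new ArrayList<>();
        data.add(1.0);
        data.add(2.0);
        data.add(3.0);
        System.out.println(runInParallel(data, 1.0, 2.0));
    }
}
```

Because every engine writes only to its own map and the shared list is wrapped as unmodifiable, the threads need no synchronization in this sketch.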
Renjin Design: Embedding
[Diagram: one JVM process containing RenjinEngine 1 (Base, Stats, Global). Other labels: Virtual File System, Hadoop FS, AppEngine BlobStore, Local File System, Security Manager.]
Renjin Design: Data
All primitives written against (threadsafe) interfaces rather than arrays
[Diagram: interface Vector, shown with ArrayVector, SequenceVector (1:1e10), JDBCVector, BufferedSequentialVector, TempFileBackedVector, DecoratingVector (inner + 1).]
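To make the "interfaces rather than arrays" idea concrete, here is a minimal Java sketch. It assumes a vector needs only a length and an element accessor; `DoubleVector`, `ArrayDoubleVector`, `SequenceDoubleVector`, `PlusConstantVector`, and `VectorSketch` are hypothetical names chosen for illustration and do not reproduce Renjin's actual interfaces.

```java
/** Hypothetical vector abstraction: primitives read through this, never a raw array. */
interface DoubleVector {
    long length();

    double get(long index); // zero-based index
}

/** Ordinary in-memory storage. Immutable after construction, so safe to share. */
final class ArrayDoubleVector implements DoubleVector {
    private final double[] values;

    ArrayDoubleVector(double... values) {
        this.values = values.clone();
    }

    public long length() {
        return values.length;
    }

    public double get(long index) {
        return values[(int) index];
    }
}

/** Represents from:to without allocating storage, in the spirit of 1:1e10. */
final class SequenceDoubleVector implements DoubleVector {
    private final long from;
    private final long to;

    SequenceDoubleVector(long from, long to) {
        this.from = from;
        this.to = to;
    }

    public long length() {
        return to - from + 1;
    }

    public double get(long index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return from + index;
    }
}

/** Decorator that computes (inner + constant) lazily, one element at a time. */
final class PlusConstantVector implements DoubleVector {
    private final DoubleVector inner;
    private final double constant;

    PlusConstantVector(DoubleVector inner, double constant) {
        this.inner = inner;
        this.constant = constant;
    }

    public long length() {
        return inner.length();
    }

    public double get(long index) {
        return inner.get(index) + constant;
    }
}

final class VectorSketch {
    /** A "primitive" written against the interface: works for any backing store. */
    static double sum(DoubleVector v) {
        double total = 0.0;
        for (long i = 0; i < v.length(); i++) {
            total += v.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Ten billion elements, but no storage is allocated for them.
        DoubleVector big = new SequenceDoubleVector(1L, 10_000_000_000L);
        DoubleVector shifted = new PlusConstantVector(big, 1.0);
        System.out.println(shifted.length());
        System.out.println(shifted.get(0));
        System.out.println(sum(new PlusConstantVector(new ArrayDoubleVector(1.0, 2.0, 3.0), 1.0)));
    }
}
```

In this sketch every implementation is immutable, which is one simple way to get thread safety, and the decorator defers work until an element is actually read.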
Progress to date
• 68% of builtins/internals implemented
• Runs complex packages out of the box (base, survey, aspect)
• 700+ micro tests covering edge cases related to subscripts, promises, missing arguments, etc.
• Missing pieces:
  ▫ S4 object system
  ▫ Base graphics incomplete
  ▫ Stats package port incomplete
  ▫ No good strategy yet for native dependencies
Performance: Benchmark-25
Runtime (% difference vs R 2.14.2)
• Sorting of 7,000,000 random values: 1.17
• 2800x2800 cross-product matrix (b = a * a): 1.01
• Creation, transp., deformation of a 2500x2500 matrix: 1.35
• 2400x2400 normal distributed random matrix ^1000: 1.45
• Creation of a 3000x3000 Hilbert matrix (matrix calc): 1.33
• 3,500,000 Fibonacci numbers calculation (vector calc): 2.38
• Escoufier's method on a 45x45 matrix (mixed): 3.26
Future Directions: Short-term
• Testing: Complete test harness to systematically evaluate completeness against CRAN packages
• Dependency management: Augment/replace package management system with Aether to simplify management of mixed R, Java, Scala artifacts
• Command-line and Eclipse plugins for ad hoc analysis
Future Directions: Longer-term
• JIT compilation to JVM bytecode: for, lapply, while, etc.
• Alternative backing stores for Vector interface
• GCC-bridge to translate existing C/Fortran sources to JVM bytecode (Gimple -> Shimple -> JVM bytecode)