
Easy R


Creating new statistical algorithms easily in R

Published in: Technology

  1. 1. Creating an Optimized Algorithm in R: Version 1, October 22, 2009
  2. 2. R: Background <ul><li>Nobody owns it, yet R-related products have been created by:</li></ul><ul><li>REvolution Computing (partnering with Microsoft/Intel)</li></ul><ul><li>SAS (interface to SAS/IML)</li></ul><ul><li>SPSS (interface to SPSS, including some use of Python)</li></ul><ul><li>Blue Reference Inc (plugin for MS Office)</li></ul><ul><li>Information Focus (R GUI for data mining)</li></ul>
  3. 3. R Packages <ul><li>CRAN: 1783 packages in R 2.11</li></ul><ul><li>1977 packages in R 2.9</li></ul><ul><li>Cost: $0, but a lot of hours.</li></ul><ul><li>Question: how many people in the world know all 1977 R packages?</li></ul>
  4. 4. Some uses of R <ul><li>Citation: </li></ul><ul><li>  httP:// </li></ul><ul><li>library(maps)</li></ul><ul><li>map("state", interior = FALSE)</li></ul><ul><li>map("state", boundary = FALSE, col = "gray", add = TRUE)</li></ul><ul><li>GADM is a spatial database of the location of the world's administrative boundaries.</li></ul><ul><li>Using the spplot function (from the sp package), we can load the data for Switzerland and then plot each canton with a color denoting its primary language:</li></ul><ul><li>library(sp)</li></ul><ul><li>con <- url("")  # URL not preserved in this transcript</li></ul><ul><li>print(load(con))</li></ul><ul><li>close(con)</li></ul><ul><li>language <- c("german", "german", "german", "german", "german", "german", "french", "french", "german", "german", "french", "french", "german", "french", "german", "german", "german", "german", "german", "german", "german", "italian", "german", "french", "french", "german", "german")</li></ul><ul><li>gadm$language <- as.factor(language)</li></ul><ul><li>col = rainbow(length(levels(gadm$language)))</li></ul><ul><li>spplot(gadm, "language", col.regions = col, main = "Swiss Language Regions")</li></ul><ul><li>AnthroSpace: Download Global Administrative Areas as RData files</li></ul>
  5. 5. Seven tips for "surviving" R <ul><ul><li>Keep extensive written notes</li></ul></ul><ul><ul><li>Find a way to search for R answers</li></ul></ul><ul><ul><li>Learn to convert complex objects to canonical forms with unclass()</li></ul></ul><ul><ul><li>Learn how to find and inspect classes and methods for objects</li></ul></ul><ul><ul><li>Learn how to clear pesky attributes from objects</li></ul></ul><ul><ul><li>Swallow your pride</li></ul></ul><ul><ul><li>Learn and use R's many one-line idioms, rather than reinventing the wheel</li></ul></ul><ul><li>Citation: John Mount, Win-Vector LLC</li></ul>
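Tips 3 to 5 can be tried on any classed object. A minimal sketch using a factor; the variable names are illustrative and not taken from the slides:

```r
# A factor is an integer vector carrying "levels" and "class" attributes.
f <- factor(c("b", "a", "b"))

class(f)                                     # the class that drives method dispatch
codes <- unclass(f)                          # canonical form: integer codes plus levels
factor_methods <- methods(class = "factor")  # methods registered for this class

# Clear every attribute to get a bare integer vector.
bare <- codes
attributes(bare) <- NULL
```

Printing `codes` and `bare` side by side shows which parts of the object were data and which were attributes.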
  6. 6. Writing a Function/Algorithm in R <ul><li>Simply enough:</li></ul><ul><li>newRalgorithm <- function(x) OldAlgorithm(x)</li></ul><ul><li>E.g. do_something <- function(x, y) {</li></ul><ul><li># Function code goes here ...</li></ul><ul><li>}</li></ul><ul><li># Subset my data</li></ul><ul><li>orange_girls <- subset(crabs, sex == 'F' & sp == 'O')</li></ul><ul><li># Call my function</li></ul><ul><li>do_something(orange_girls$CW, orange_girls$C</li></ul><ul><li>Citation-</li></ul>
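A runnable version of the pattern on this slide might look as follows. This is a sketch under assumptions: the body of do_something() is hypothetical (the slide leaves it open), the crabs data is taken from the MASS package, and CL is assumed as the second column.

```r
# crabs ships with MASS, a recommended package bundled with R.
data(crabs, package = "MASS")

# Hypothetical body: the slide does not say what do_something() computes.
do_something <- function(x, y) {
  cor(x, y)
}

# Subset the data to orange females.
orange_girls <- subset(crabs, sex == "F" & sp == "O")

# Assumption: carapace length (CL) as the second argument.
result <- do_something(orange_girls$CW, orange_girls$CL)
```

Any function of two numeric vectors could be dropped in for the body without changing the calling code.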
  7. 7. Writing a new stats algorithm (in R or another language) <ul><li>Steps (basic idea):</li></ul><ul><li>Review the journals in the study area</li></ul><ul><li>Study existing algorithms for gap analysis</li></ul><ul><li>Add creativity</li></ul><ul><li>Test and iterate within the community</li></ul><ul><li>Publish</li></ul>
  8. 8. Choosing Clustering as the area of interest <ul><li>Clustering works with Big Data.</li></ul><ul><li>It can work with lots of incomplete column variables when other techniques may not be suitable.</li></ul><ul><li>It works when data cannot be used for regression models.</li></ul><ul><li>Groups of clusters can be merged and combined to make new clusters, which makes a case for parallel processing.</li></ul><ul><li>Useful for product marketing, business, medicine and finance.</li></ul>
  9. 9. K Means Clustering using R <ul><li>data("planets", package = "HSAUR")</li></ul><ul><li>library("scatterplot3d")</li></ul><ul><li>scatterplot3d(log(planets$mass), log(planets$period), log(planets$eccen), type = "h", angle = 55, pch = 16, y.ticklabs = seq(0, 10, by = 2), y.margin.add = 0.1, scale.y = 0.7)</li></ul>
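The slide plots the data but does not show the clustering call itself. A minimal sketch of k-means with base R's kmeans(), assuming the built-in iris measurements as stand-in data and three clusters; both choices are illustrative and are not the planets analysis:

```r
set.seed(42)                           # k-means starts from random centers
measurements <- scale(iris[, 1:4])     # standardize so no column dominates
fit <- kmeans(measurements, centers = 3, nstart = 20)

table(fit$cluster)                     # cluster sizes
fit$centers                            # one row of center coordinates per cluster
```

Using nstart greater than 1 reruns the algorithm from several random starts and keeps the best fit, which reduces sensitivity to the initial centers.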
  10. 10. Writing a Function/Algorithm in R 2 <ul><li>Adding loops and multiple functions</li></ul><ul><li>E.g. # Arrays of values for each type of species and sex</li></ul><ul><li>species <- unique(crabs$sp)</li></ul><ul><li>sexes <- unique(crabs$sex)</li></ul><ul><li># Loop through species ...</li></ul><ul><li>for (i in 1:length(species)) {</li></ul><ul><li># ... loop through sex ...</li></ul><ul><li>for (j in 1:length(sexes)) {</li></ul><ul><li># ... and finally call a function on each subset</li></ul><ul><li>something_else(subset(crabs, sp == species[i] & sex == sexes[j]))</li></ul><ul><li>} }</li></ul><ul><li>Citation-</li></ul>
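One caveat with loops indexed by 1:length(x): if x is ever empty, the sequence counts down from 1 to 0 and the body still runs. A small sketch of the difference, with an empty vector as an assumed edge case; seq_along() is suggested here as an alternative and does not appear in the slides:

```r
species <- character(0)      # e.g. a subset that happened to be empty

# 1:length(x) is c(1, 0) when x is empty, so the loop body runs twice.
risky_iterations <- 0
for (i in 1:length(species)) risky_iterations <- risky_iterations + 1

# seq_along(x) is an empty sequence, so the loop body is skipped.
safe_iterations <- 0
for (i in seq_along(species)) safe_iterations <- safe_iterations + 1
```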
  12. 12. More ways to write functions <ul><li>each <- function(.column, .data, .lambda) {</li></ul><ul><li># Find the column index from its name</li></ul><ul><li>column_index <- which(names(.data) == .column)</li></ul><ul><li># Find the unique values in the column</li></ul><ul><li>column_levels <- unique(.data[, column_index])</li></ul><ul><li># Loop over these values</li></ul><ul><li>for (i in 1:length(column_levels)) {</li></ul><ul><li># Subset the data and call the passed function on it</li></ul><ul><li>.lambda(.data[.data[, column_index] == column_levels[i], ])</li></ul><ul><li>} }</li></ul><ul><li>The last argument, .lambda, is an R function. Because R treats functions as objects, they can be passed as arguments to other functions.</li></ul><ul><li># Another function as the last argument to this function</li></ul><ul><li>each("sp", crabs, something_else)</li></ul><ul><li># Or create a new anonymous function ...</li></ul><ul><li>each("sp", crabs, function(x) {</li></ul><ul><li># ... and run multiple lines of code here</li></ul><ul><li>something_else(x)</li></ul><ul><li>with(x, lm(CW ~ CL))</li></ul><ul><li>})</li></ul>
  13. 13. Additionally, create new functions using plyr <ul><li>From </li></ul><ul><li>plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. It's already possible to do this with split and the apply functions, but plyr just makes it all a bit easier with:</li></ul><ul><ul><li>consistent names, arguments and outputs</li></ul></ul><ul><ul><li>input from and output to data.frames, matrices and lists</li></ul></ul><ul><ul><li>progress bars to keep track of long running operations</li></ul></ul><ul><ul><li>built-in error recovery</li></ul></ul><ul><li>a consistent and useful set of tools for solving the split-apply-combine problem.</li></ul><ul><li>library(plyr)</li></ul><ul><li># Three arguments</li></ul><ul><li># 1. The dataframe</li></ul><ul><li># 2. The name of columns to subset by</li></ul><ul><li># 3. The function to call on each subset</li></ul><ul><li>d_ply(crabs, .(sp, sex), something_else)</li></ul>
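The slide notes that the same split-apply-combine pattern is already possible with split and the apply functions. A base R sketch of that route, assuming the crabs data from MASS and a hypothetical something_else() that fits CW on CL:

```r
data(crabs, package = "MASS")

# Hypothetical stand-in for something_else(): regression coefficients per subset.
something_else <- function(d) coef(lm(CW ~ CL, data = d))

# Split by species and sex, apply the function, then combine the results.
pieces   <- split(crabs, list(crabs$sp, crabs$sex))
fits     <- lapply(pieces, something_else)
combined <- do.call(rbind, fits)   # one row per species/sex group
```

Unlike d_ply(), which runs the function for its side effects and discards the output, this version keeps the per-group results in a single matrix.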
  14. 14. Quick Recap <ul><li>We have an algorithm in mind, or we create a new algorithm (the toughest part).</li></ul><ul><li>(E.g. Genetic K-Means (GKM) or Genetic Regularized Mahalanobis (GARM) distances to compute the initial cluster parameters, with little difference in the final results. This innovation allows our algorithm to find optimal parameter estimates of complex hyperellipsoidal clusters. We develop and score the information complexity (ICOMP) criterion of Bozdogan (1994a,b, 2004) as our fitness function to choose the number of clusters present in the data sets.)</li></ul><ul><li>We created a function in R for it. We can also use this approach to rename package functions (as in a SAS R package I created).</li></ul><ul><li>We now need to create a package so that all 2 million R users may have a chance to use it.</li></ul>
  15. 15. Creating a New Package <ul><li>Citation-</li></ul><ul><li>1. Load all functions and data sets you want in the package into a clean R session, and run package.skeleton(). The objects are sorted into data and functions, skeleton help files are created for them using prompt() and a DESCRIPTION file is created. The function then prints out a list of things for you to do next.</li></ul><ul><li>This creates the package within the current working directory.</li></ul><ul><li>> package.skeleton(name="NAME_OF_PACKAGE", code_files="FILENAME.R")</li></ul><ul><li>Creating directories ...</li></ul><ul><li>Creating DESCRIPTION ...</li></ul><ul><li>Creating Read-and-delete-me ...</li></ul><ul><li>Copying code files ...</li></ul><ul><li>Making help files ...</li></ul><ul><li>Done.</li></ul><ul><li>Further steps are described in './linmod/Read-and-delete-me'.</li></ul><ul><li>Q: Where is my package?</li></ul><ul><li>A: getwd()</li></ul>
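The package.skeleton() step can be tried safely in a temporary directory. A minimal sketch, assuming a hypothetical function double_it() and a hypothetical package name demoPkg; neither comes from the slides:

```r
# Keep the objects destined for the package in their own environment.
pkg_env <- new.env()
assign("double_it", function(x) 2 * x, envir = pkg_env)  # hypothetical function

pkg_root <- tempdir()                    # build under a temporary directory
package.skeleton(name = "demoPkg",       # hypothetical package name
                 list = "double_it",
                 environment = pkg_env,
                 path = pkg_root,
                 force = TRUE)           # overwrite a previous attempt

list.files(file.path(pkg_root, "demoPkg"))
```

The generated directory should contain a DESCRIPTION file plus R and man subdirectories, ready for the editing steps listed in Read-and-delete-me.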
  16. 16. Creating a New Package <ul><li>Citation-</li></ul><ul><li>Q: What is the best step in making software?</li></ul><ul><li>A: Documenting help</li></ul><ul><li>Finally:</li></ul><ul><li>* Edit the help file skeletons in 'man', possibly combining help files for multiple functions.</li></ul><ul><li>* Put any C/C++/Fortran code in 'src'.</li></ul><ul><li>* If you have compiled code, add a .First.lib() function in 'R' to load the shared library.</li></ul><ul><li>* Run R CMD build to build the package tarball.</li></ul><ul><li>* Run R CMD check to check the package tarball.</li></ul><ul><li>Read "Writing R Extensions" for more information.</li></ul><ul><li>http://cran. r R -exts.pdf     Also see the guidelines for CRAN submission.</li></ul>
  17. 17. Next Steps <ul><li>We have new functions and a new package.</li></ul><ul><li>We now need to optimize the R package for performance using:</li></ul><ul><li>1) Parallel computing</li></ul><ul><li>2) High-performance computing</li></ul><ul><li>3) Code optimization</li></ul>
  18. 18. Optimizing Code <ul><li>Citation: </li></ul><ul><li>Dirk Eddelbuettel </li></ul><ul><li>http://dirk useR 2009 hpcTutorial .pdf </li></ul><ul><li>R already provides the basic tools for performance analysis:</li></ul><ul><ul><li>the system.time function for simple measurements</li></ul></ul><ul><ul><li>the Rprof function for profiling R code</li></ul></ul><ul><ul><li>the Rprofmem function for profiling R memory usage</li></ul></ul><ul><li>In addition, the profr and proftools packages on CRAN can be used to visualize Rprof data.</li></ul><ul><li>We use these tools to create visual images of how the algorithm is looping, in case we don't know what the algorithm we created looks like visually, and to avoid repeated calls.</li></ul>
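A minimal profiling session with Rprof might look like this. The slow_mean() workload is a hypothetical example, and the summary step is wrapped in tryCatch() because a run that finishes too quickly may record no samples:

```r
profile_file <- tempfile(fileext = ".out")

# Hypothetical workload: a deliberately loop-heavy mean of square roots.
slow_mean <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total / n
}

Rprof(profile_file, interval = 0.01)   # start sampling the call stack
result <- slow_mean(5e6)
Rprof(NULL)                            # stop profiling

# Time spent per function; NULL if the profiler collected no samples.
profile <- tryCatch(summaryRprof(profile_file)$by.self,
                    error = function(e) NULL)
```

The by.self table ranks functions by the time spent in their own code, which is usually the first place to look for a hot loop.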
  19. 19. Optimizing Code: Example <ul><li>Citation: </li></ul><ul><li>Dirk Eddelbuettel </li></ul><ul><li>http://dirk useR 2009 hpcTutorial .pdf </li></ul><ul><li>> sillysum <- function(N) { s <- 0; for (i in 1:N) s <- s + i; return(s) }</li></ul><ul><li>> system.time(print(sillysum(1e7)))</li></ul><ul><li>[1] 5e+13</li></ul><ul><li>user system elapsed</li></ul><ul><li>13.617 0.020 13.701</li></ul><ul><li>> system.time(print(sum(as.numeric(seq(1,1e7)))))</li></ul><ul><li>[1] 5e+13</li></ul><ul><li>user system elapsed</li></ul><ul><li>0.224 0.092 0.315</li></ul><ul><li>Replacing the loop yielded a gain of a factor of more than 40.</li></ul>
  20. 20. Running R Parallel <ul><li>We need a cluster (like Newton, with 1500 processors, running on the 2nd floor of SMC).</li></ul><ul><li>Several R packages execute code in parallel:</li></ul><ul><li>NWS</li></ul><ul><li>Rmpi</li></ul><ul><li>snow (using MPI, PVM, NWS or sockets)</li></ul><ul><li>papply</li></ul><ul><li>taskPR</li></ul><ul><li>multicore</li></ul>
  21. 21. Running R Parallel <ul><li>We need an HPC cluster and also queue time, i.e. how long we can run our job on the shared resource.</li></ul><ul><li>Using snow</li></ul><ul><li>A simple example:</li></ul><ul><li>cl <- makeCluster(4, "MPI")</li></ul><ul><li>print(clusterCall(cl, function() Sys.info()[c("nodename", "machine")]))</li></ul><ul><li>stopCluster(cl)</li></ul><ul><li>and</li></ul><ul><li>params <- c("A", "B", "C", "D", "E", "F", "G", "H")</li></ul><ul><li>cl <- makeCluster(8, "MPI")</li></ul><ul><li>res <- parSapply(cl, params, FUN = function(x) myNEWFunction(x))</li></ul><ul><li>This will 'unroll' the parameters params one each over the function argument given, utilising the cluster cl. In other words, we will be running eight copies of myNEWFunction() at once.</li></ul>
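Without an MPI cluster at hand, the same pattern can be tried on a single machine. The sketch below swaps in the parallel package (bundled with R since 2.14.0 and derived in part from snow and multicore) with local socket workers instead of MPI; myNEWFunction() here is a hypothetical stand-in:

```r
library(parallel)   # base R package with snow-style cluster functions

# Hypothetical stand-in for myNEWFunction() from the slide.
myNEWFunction <- function(x) paste0(x, "-done")

params <- c("A", "B", "C", "D")
cl <- makeCluster(2)                   # two local socket workers, no MPI required
res <- tryCatch(
  parSapply(cl, params, myNEWFunction),
  finally = stopCluster(cl)            # always release the workers
)
```

On a real cluster, only the makeCluster() call should need to change; the parSapply() call stays the same.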
  22. 22. Current Status <ul><li>We are writing the algorithm we have selected for optimized use on Newton.</li></ul><ul><li>We will create a package and release it with a paper once the project is over.</li></ul>