r,rstats,r language,r packages

Published in: Business, Technology

  1. 1. Creating an Optimized Algorithm in R: Version 1, October 21, 2009
  2. 2. R: Background <ul><li>Nobody owns it, yet R-related products have been created by: </li></ul><ul><li>REvolution Computing (partnering with Microsoft/Intel) </li></ul><ul><li>http://www.revolution-computing.com/industry/academic.php </li></ul><ul><li>SAS (interface to SAS/IML) </li></ul><ul><li>http://support.sas.com/rnd/app/studio/Rinterface2.html </li></ul><ul><li>SPSS (interface to SPSS, including some use of Python) </li></ul><ul><li>http://insideout.spss.com/2009/01/13/spss-statistics-and-r/ </li></ul><ul><li>Blue Reference Inc (plugin for MS Office) </li></ul><ul><li>http://inferenceforr.com/default.aspx </li></ul><ul><li>Information Builders (R GUI for data mining) </li></ul><ul><li>http://www.informationbuilders.com/products/webfocus/predictivemodeling.html </li></ul>
  3. 3. R Packages <ul><li>CRAN: 1783 packages in R 2.11 </li></ul><ul><li>1977 packages in R 2.9 </li></ul><ul><li>Cost: $0, but a lot of hours. </li></ul><ul><li>Question: </li></ul><ul><li>How many people in the world know all 1977 R packages? </li></ul>
  4. 4. Writing a Function/Algorithm in R <ul><li>Simply enough, </li></ul><ul><li>newRalgorithm <- function(x) OldAlgorithm(x) </li></ul><ul><li>Eg- do_something <- function(x, y) { </li></ul><ul><li># Function code goes here ... </li></ul><ul><li>} </li></ul><ul><li># Subset my data </li></ul><ul><li>orange_girls <- subset(crabs, sex == 'F' & sp == 'O') </li></ul><ul><li># Call my function </li></ul><ul><li>do_something(orange_girls$CW, orange_girls$C </li></ul><ul><li>Citation- http://cran.r-project.org/doc/manuals/R-exts.html#Top </li></ul><ul><li>http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/ </li></ul>
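As a minimal sketch of the pattern on the slide above, the following defines a function, subsets the data, and calls the function on the subset. It assumes the `crabs` data set comes from the MASS package. The body of `do_something` (a correlation) is a hypothetical stand-in, and the second argument `CL` is a guess, since the call on the slide is cut off.

```r
library(MASS)  # assumed source of the crabs data set used on the slides

# Hypothetical stand-in for do_something(): correlation of two measurements
do_something <- function(x, y) {
  cor(x, y)
}

# Subset the data: female crabs of the orange form
orange_girls <- subset(crabs, sex == "F" & sp == "O")

# Call the function on two columns of the subset (CL is an assumed column)
result <- do_something(orange_girls$CW, orange_girls$CL)
print(result)
```

Any function of two numeric vectors could be dropped in for the correlation without changing the rest of the script.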
  5. 5. Writing a new stats algorithm (in R or another language) <ul><li>Steps (basic idea): </li></ul><ul><li>Journal review of the study area </li></ul><ul><li>Study of existing algorithms for gap analysis </li></ul><ul><li>Add creativity </li></ul><ul><li>Test and iterate within the community </li></ul><ul><li>Publish </li></ul>
  6. 6. Writing a Function/Algorithm in R 2 <ul><li>Adding loops and multiple functions </li></ul><ul><li>Eg- # Arrays of values for each type of species and sex </li></ul><ul><li>species <- unique(crabs$sp) </li></ul><ul><li>sexes <- unique(crabs$sex) </li></ul><ul><li># Loop through species ... </li></ul><ul><li>for (i in 1:length(species)) { </li></ul><ul><li># ... loop through sex ... </li></ul><ul><li>for (j in 1:length(sexes)) { </li></ul><ul><li># ... and finally call a function on each subset </li></ul><ul><li>something_else(subset(crabs, sp == species[i] & sex == sexes[j])) </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Citation- http://cran.r-project.org/doc/manuals/R-exts.html#Top http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/ </li></ul>
  7. 7. Writing a Function/Algorithm in R 2 <ul><li>Adding loops and multiple functions </li></ul><ul><li>Eg- # Arrays of values for each type of species and sex </li></ul><ul><li>species <- unique(crabs$sp) </li></ul><ul><li>sexes <- unique(crabs$sex) </li></ul><ul><li># Loop through species ... </li></ul><ul><li>for (i in 1:length(species)) { </li></ul><ul><li># ... loop through sex ... </li></ul><ul><li>for (j in 1:length(sexes)) { </li></ul><ul><li># ... and finally call a function on each subset </li></ul><ul><li>something_else(subset(crabs, sp == species[i] & sex == sexes[j])) </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>Citation- http://cran.r-project.org/doc/manuals/R-exts.html#Top http://www.bioinformaticszen.com/r_programming/data_analysis_using_r_functions_as_objects/ </li></ul>
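The nested loop above calls a function on each subset but keeps nothing. A small sketch of the same loop that also collects one result per species/sex combination might look like this. `something_else` is a hypothetical stand-in (mean carapace width), and `crabs` is again assumed to come from MASS.

```r
library(MASS)  # assumed source of the crabs data set

# Hypothetical stand-in for something_else(): mean carapace width of a subset
something_else <- function(d) mean(d$CW)

species <- unique(crabs$sp)
sexes <- unique(crabs$sex)

# Nested loops, storing one named result per species/sex combination
results <- numeric(0)
for (i in seq_along(species)) {
  for (j in seq_along(sexes)) {
    piece <- subset(crabs, sp == species[i] & sex == sexes[j])
    key <- paste(species[i], sexes[j], sep = ".")
    results[key] <- something_else(piece)
  }
}
print(results)
```

`seq_along()` is used in place of `1:length()` so the loop body is skipped cleanly if a vector is empty.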
  8. 8. More ways to write functions <ul><li>each <- function(.column, .data, .lambda) { </li></ul><ul><li># Find the column index from its name </li></ul><ul><li>column_index <- which(names(.data) == .column) </li></ul><ul><li># Find the unique values in the column </li></ul><ul><li>column_levels <- unique(.data[, column_index]) </li></ul><ul><li># Loop over these values </li></ul><ul><li>for (i in 1:length(column_levels)) { </li></ul><ul><li># Subset the data and call the passed function on it </li></ul><ul><li>.lambda(.data[.data[, column_index] == column_levels[i], ]) </li></ul><ul><li>} </li></ul><ul><li>} </li></ul><ul><li>The last argument, .lambda, is an R function. Because R treats functions as objects, they can be passed as arguments to other functions. </li></ul><ul><li># Another function as the last argument to this function </li></ul><ul><li>each("sp", crabs, something_else) </li></ul><ul><li># Or create a new anonymous function ... </li></ul><ul><li>each("sp", crabs, function(x) { </li></ul><ul><li># ... and run multiple lines of code here </li></ul><ul><li>something_else(x) </li></ul><ul><li>with(x, lm(CW ~ CL)) </li></ul><ul><li>}) </li></ul>
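The `each` function above runs the passed function for its side effects only. One possible variant, sketched here under the assumption that the results are wanted back, returns them as a named list. The name `each_collect` and the choice of keeping the regression slope are illustrative and not part of the original deck.

```r
library(MASS)  # assumed source of the crabs data set

# Variant of each() that returns a named list of results, one per level
each_collect <- function(column, data, fn) {
  levels_found <- unique(as.character(data[[column]]))
  out <- lapply(levels_found, function(lv) fn(data[data[[column]] == lv, ]))
  names(out) <- levels_found
  out
}

# Pass an anonymous function: fit CW ~ CL per species and keep the slope
slopes <- each_collect("sp", crabs, function(x) {
  coef(lm(CW ~ CL, data = x))[["CL"]]
})
print(slopes)
```

Because the function is just another argument, swapping in a different analysis only means changing the last argument of the call.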
  9. 9. Additionally, to create new functions, use plyr <ul><li>From http://had.co.nz/plyr/ </li></ul><ul><li>plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. It's already possible to do this with split and the apply functions, but plyr just makes it all a bit easier with: </li></ul><ul><ul><li>consistent names, arguments and outputs </li></ul></ul><ul><ul><li>input from and output to data.frames, matrices and lists </li></ul></ul><ul><ul><li>progress bars to keep track of long running operations </li></ul></ul><ul><ul><li>built-in error recovery </li></ul></ul><ul><li>a consistent and useful set of tools for solving the split-apply-combine problem. </li></ul><ul><li>library(plyr) </li></ul><ul><li># Three arguments </li></ul><ul><li># 1. The dataframe </li></ul><ul><li># 2. The name of columns to subset by </li></ul><ul><li># 3. The function to call on each subset </li></ul><ul><li>d_ply(crabs, .(sp, sex), something_else) </li></ul>
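The quoted text notes that the same split-apply-combine pattern is already possible with split and the apply functions. A base R sketch of that route, assuming `crabs` comes from MASS and using mean carapace width as an illustrative summary, could look like this:

```r
library(MASS)  # assumed source of the crabs data set

# Split: one data frame per species/sex combination
pieces <- split(crabs, list(crabs$sp, crabs$sex))

# Apply: summarise each piece (the summary chosen here is illustrative)
summaries <- lapply(pieces, function(d) {
  data.frame(sp = d$sp[1], sex = d$sex[1], n = nrow(d), mean_CW = mean(d$CW))
})

# Combine: bind the per-group rows back into one data frame
combined <- do.call(rbind, summaries)
print(combined)
```

With plyr installed, the three steps should collapse into a single call along the lines of `ddply(crabs, .(sp, sex), summarise, n = length(CW), mean_CW = mean(CW))`.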
  10. 10. Quick Recap <ul><li>We have an algorithm in mind, or we create a new algorithm (the toughest part) </li></ul><ul><li>(Eg. http://en.scientificcommons.org/42572415 </li></ul><ul><li>Genetic K-Means (GKM) or Genetic Regularized Mahalanobis (GARM) distances to compute the initial cluster parameters, with little difference in the final results. This innovation allows our algorithm to find optimal parameter estimates of complex hyperellipsoidal clusters. We develop and score the information complexity (ICOMP) criterion of Bozdogan (1994a,b, 2004) as our fitness function to choose the number of clusters present in the data sets) </li></ul><ul><li>We created a function in R for it. We can also use functions to rename package functions (as in a SAS R package I created) </li></ul><ul><li>We now need to create a package so that all 2 million R users may have a chance to use it </li></ul>
  11. 11. Creating a New Package <ul><li>Citation- </li></ul><ul><li>http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf </li></ul><ul><li>1. Load all functions and data sets you want in the package into a clean R session, and run package.skeleton(). The objects are sorted into data and functions, skeleton help files are created for them using prompt() and a DESCRIPTION file is created. The function then prints out a list of things for you to do next. </li></ul><ul><li>This creates the package within the current working directory </li></ul><ul><li>> package.skeleton(name="NAME_OF_PACKAGE", code_files="FILENAME.R") </li></ul><ul><li>Creating directories ... </li></ul><ul><li>Creating DESCRIPTION ... </li></ul><ul><li>Creating Read-and-delete-me ... </li></ul><ul><li>Copying code files ... </li></ul><ul><li>Making help files ... </li></ul><ul><li>Done. </li></ul><ul><li>Further steps are described in './NAME_OF_PACKAGE/Read-and-delete-me'. </li></ul><ul><li>Q: Where is my package? </li></ul><ul><li>A: In the directory returned by getwd() </li></ul>
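A small, self-contained sketch of the package.skeleton() step is shown below. It writes into a temporary directory rather than the working directory so nothing is left behind. The package name `demoPkg` and the environment `pkg_env` are made up for illustration.

```r
# Keep the package's objects in their own environment, not the global one
pkg_env <- new.env()
pkg_env$sillysum <- function(N) {
  s <- 0
  for (i in seq_len(N)) s <- s + i
  s
}

# Write the skeleton into a temporary directory (demoPkg is a made-up name)
out_dir <- tempdir()
package.skeleton(name = "demoPkg", list = "sillysum",
                 environment = pkg_env, path = out_dir, force = TRUE)

# Inspect what was generated
pkg_files <- list.files(file.path(out_dir, "demoPkg"), recursive = TRUE)
print(pkg_files)
```

Passing `list` and `environment` avoids depending on whatever happens to be in the global workspace, which is the reason the slide recommends starting from a clean R session.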
  12. 12. Creating a New Package <ul><li>Citation- </li></ul><ul><li>http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf </li></ul><ul><li>Q: What is the most important step in making software? </li></ul><ul><li>A: Documenting it (writing the help files) </li></ul><ul><li>Finally: </li></ul><ul><li>* Edit the help file skeletons in 'man', possibly combining help files for multiple functions. </li></ul><ul><li>* Put any C/C++/Fortran code in 'src'. </li></ul><ul><li>* If you have compiled code, add a .First.lib() function in 'R' to load the shared library. </li></ul><ul><li>* Run R CMD build to build the package tarball. </li></ul><ul><li>* Run R CMD check to check the package tarball. </li></ul><ul><li>Read "Writing R Extensions" for more information. </li></ul><ul><li>http://cran.r-project.org/doc/manuals/R-exts.pdf Also see the guidelines for CRAN submission </li></ul>
  13. 13. Next Steps <ul><li>We have New functions and a new Package </li></ul><ul><li>We now need to optimize the R Package for Performance  </li></ul><ul><li>Using </li></ul><ul><li>1) Parallel Computing </li></ul><ul><li>2) High Performance Computing </li></ul><ul><li>3) Code Optimization </li></ul>
  14. 14. Optimizing Code <ul><li>Citation: </li></ul><ul><li>Dirk Eddelbuettel </li></ul><ul><li>http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf </li></ul><ul><li>R already provides the basic tools for performance analysis: </li></ul><ul><ul><li>the system.time function for simple measurements </li></ul></ul><ul><ul><li>the Rprof function for profiling R code </li></ul></ul><ul><ul><li>the Rprofmem function for profiling R memory usage </li></ul></ul><ul><li>In addition, the profr and proftools packages on CRAN can be used to visualize Rprof data. </li></ul><ul><li>We use these tools to create visual images of how the algorithm loops, in case we don't know how the algorithm we created looks visually, and to avoid repeated calls. </li></ul>
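As a rough illustration of the Rprof workflow named above, the sketch below profiles a deliberately slow loop and summarises where the time went. The function `slow_sum` and the workload size are arbitrary choices for the example. It also assumes an R build with profiling support and a run long enough to collect samples.

```r
# Deliberately slow function so the profiler has something to sample
slow_sum <- function(N) {
  s <- 0
  for (i in seq_len(N)) s <- s + sqrt(i)
  s
}

prof_file <- tempfile(fileext = ".out")

Rprof(prof_file, interval = 0.01)  # start sampling the call stack
total <- slow_sum(2e7)
Rprof(NULL)                        # stop profiling

# Time spent per function, sorted by self time
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
```

The `by.self` table is usually the first place to look, since it shows which functions consume time themselves and not through the functions they call.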
  15. 15. Optimizing Code: Example <ul><li>Citation: </li></ul><ul><li>Dirk Eddelbuettel </li></ul><ul><li>http://dirk.eddelbuettel.com/papers/useR2009hpcTutorial.pdf </li></ul><ul><li>> sillysum <- function(N) { s <- 0; for (i in 1:N) s <- s + i; return(s) } </li></ul><ul><li>> system.time(print(sillysum(1e7))) </li></ul><ul><li>[1] 5e+13 </li></ul><ul><li>user system elapsed </li></ul><ul><li>13.617 0.020 13.701 </li></ul><ul><li>> system.time(print(sum(as.numeric(seq(1,1e7))))) </li></ul><ul><li>[1] 5e+13 </li></ul><ul><li>user system elapsed </li></ul><ul><li>0.224 0.092 0.315 </li></ul><ul><li>Replacing the loop yielded a gain of a factor of more than 40. </li></ul>
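The comparison above can be reproduced with a short script like the following sketch. It uses a smaller N than the slide so it finishes quickly, and it adds the closed form N(N+1)/2 as an independent check that both versions compute the same value. Timings will differ by machine and R version, so none are quoted here.

```r
sillysum <- function(N) {
  s <- 0
  for (i in seq_len(N)) s <- s + i
  s
}

N <- 1e6  # smaller than the 1e7 on the slide, to keep the run short

# Time the explicit loop and the vectorised sum
t_loop <- system.time(loop_result <- sillysum(N))
t_vec <- system.time(vec_result <- sum(as.numeric(seq_len(N))))

# Closed form N(N+1)/2 as an independent correctness check
closed_form <- N * (N + 1) / 2

cat("loop:", t_loop[["elapsed"]], "s; vectorised:", t_vec[["elapsed"]], "s\n")
print(c(loop = loop_result, vectorised = vec_result, closed_form = closed_form))
```

Checking the result against a known formula before comparing timings guards against the common mistake of optimizing code into giving a different answer.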
  16. 16. Running R Parallel <ul><li>We need a cluster (like Newton, with 1500 processors, run on the 2nd floor of SMC) </li></ul><ul><li>Several R packages execute code in parallel: </li></ul><ul><ul><li>NWS </li></ul></ul><ul><ul><li>Rmpi </li></ul></ul><ul><ul><li>snow (using MPI, PVM, NWS or sockets) </li></ul></ul><ul><ul><li>papply </li></ul></ul><ul><ul><li>taskPR </li></ul></ul><ul><ul><li>multicore </li></ul></ul>
  17. 17. Running R Parallel <ul><li>We need an HPC cluster and also queue time, in terms of how long we can run our query on the shared resource. </li></ul><ul><li>Using SNOW </li></ul><ul><li>A simple example: </li></ul><ul><li>cl <- makeCluster(4, "MPI") </li></ul><ul><li>print(clusterCall(cl, function() </li></ul><ul><li>Sys.info()[c("nodename", "machine")])) </li></ul><ul><li>stopCluster(cl) </li></ul><ul><li>and </li></ul><ul><li>params <- c("A", "B", "C", "D", "E", "F", "G", "H") </li></ul><ul><li>cl <- makeCluster(8, "MPI") </li></ul><ul><li>res <- parSapply(cl, params, </li></ul><ul><li>FUN = function(x) myNEWFunction(x)) </li></ul><ul><li>will 'unroll' the parameters params one each over the function argument given, utilising the cluster cl. In other words, we will be running eight copies of myNEWFunction() at once. </li></ul>
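For readers without an MPI cluster, the same pattern can be tried on a single machine. The sketch below swaps in the `parallel` package that ships with R (it provides snow-style functions) and a local socket cluster of two workers in place of snow with MPI. `my_new_function` is a hypothetical stand-in for myNEWFunction().

```r
library(parallel)  # ships with R; provides snow-style cluster functions

# Hypothetical stand-in for myNEWFunction()
my_new_function <- function(x) paste0(x, "-done")

params <- c("A", "B", "C", "D")

# Local socket cluster with 2 workers (the slides use 8 MPI workers)
cl <- makeCluster(2)
res <- parSapply(cl, params, my_new_function)
stopCluster(cl)

print(res)
```

Moving from this local test to a real cluster should mostly be a matter of changing how the cluster object is created, since the `parSapply` call keeps the same shape.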
  18. 18. Current Status <ul><li>We are writing the algorithm we have selected for optimized use on Newton </li></ul><ul><li>We will create a package and release it with a paper once the project is over </li></ul>
