Data analysis with R and Julia

R is a free, open-source environment for statistical analysis and graphing. In its almost 20 years of existence, R has remained popular in both academic and business environments. The newer Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. This session outlines functional and performance differences between these two software packages. You'll see demonstrations of tips for integrating this software with Windows and walk away with guidelines for working with commercial software. A version of this presentation had 100 attendees at the PASS Business Analytics Conference in Chicago (April 2013), and 40 attendees at the PASS Virtual Business Analytics meeting (May 2013).

Published in: Business

  1. Data Analysis with R and Julia
     - Advanced Analytics and Insights
     - Mark Tabladillo Ph.D., Data Mining Scientist, MarkTab Inc.
  2. Networking
     - Interactive
  3. About MarkTab
     - Training and Consulting with http://marktab.com
     - Data Mining Resources and Blog at http://marktab.net
     - Twitter @marktabnet
  4. Outline
     - R Language: Market Analysis, Performance, Production Use
     - Julia Language: Performance
  5. The R Language
     - http://cran.r-project.org
  6. Major R Versions
     - Version 0 (1996): Initial release, University of Auckland, New Zealand
     - Version 1 (2000): Completeness and stability high enough to characterize a full statistical system, which could be put to production use
     - Version 2 (2004): Strong enhancements of the memory management subsystem as well as several major features, including Sweave (into LaTeX or LyX)
     - Version 3 (2013): The inclusion of long vectors (containing more than 2^31 - 1 elements!). Also, we now have 64-bit support on all platforms, support for parallel processing, the Matrix package
     - Source: http://www.r-project.org/
  7. How R Works
     - "As with an automobile, you can use R without worrying very much about how it works. But computing with data is more complicated than driving a car (fortunately for highway safety)."
     - John Chambers, Software for Data Analysis, page 453
  8. R works in a shell
     - Cross-platform, including Windows (32-bit or 64-bit)
     - Interactive graphical user interface (GUI) to interpret commands
     - Read: accept user input
     - Parse: interpret input using expected syntax
     - Evaluate: execute commands
     - Everything is an object
     - Data are stored in data frames, named lists
     - R implements S language grammar, with a few extensions
  9. R GUI
  10. Read-Parse-Evaluate Loop
      - Read, Parse, Evaluate
  11. R and SQL Server
      - install.packages("RODBC")
      - library(RODBC)
      - MDAC Downloads
  12. R Market Analysis
  13. Listserv Discussion
      - http://r4stats.com/articles/popularity/
  14. Estimated R Usage
      - Estimated 250,000 people use it regularly (as of 2009)
      - http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=2&_r=0
  15. General Forum Postings
      - http://r4stats.com/articles/popularity/
  16. Stack Overflow Alone
      - http://r4stats.com/articles/popularity/
  17. Academic Publications
      - http://r4stats.com/articles/popularity/
  18. Comparison of R, Matlab, SAS, Stata, SPSS
      - http://www.analyticbridge.com/group/productreviews2/forum/topics/product-reviews-comparing-r-matlab-sas-stata-spss
  19. R Performance
  20. R is Memory-Bound
      - $\text{Memory Size} / 4 = \text{Amount of R Data}$
      - Source: Joseph B. Rickert, February 14, 2013
      - $\text{64-bit Memory Size} = \text{RAM}$
      - $\text{32-bit Memory Size} = \text{User Virtual Memory} - 0.5\,\text{GB} \cong 2\,\text{GB}$
      - Source: http://cran.r-project.org/bin/windows/base/rw-FAQ.html retrieved March 1, 2013
  21. R is Memory-Bound
      - All objects in an R session are stored in memory
      - R places a limit of 2^31 - 1 bytes on all object sizes, independent of RAM
      - The Art of R Programming, Norman Matloff
  22. R Memory Management
      - Automatic, including garbage collection
      - rm() removes object assignment, but does not delete memory
      - gc() forces garbage collection with substantial computation
  23. Improving Performance
      - The Art of R Programming, Chapter 14, Norman Matloff
      - Diagram labels: Power, Simplicity
      - Methods shown: Vectorization, Byte-Code Compilation, Parallel R, C/C++
  24. Improving Performance (Method: Description)
      - C/C++: Call C programs from R
      - Vectorization: Recode for vectorization replacing slower functions
      - Byte-code compilation: cmpfun()
      - Parallel R: parallel package
      - http://cran.r-project.org/web/views/HighPerformanceComputing.html
  25. Improving Performance
      - Rprof(): measures speed of functions
      - ff: memory-efficient storage of large data on disk and fast access functions
      - bigmemory: manage massive matrices with shared memory and memory-mapped files
  26. R for Production Use
  27. Derivative Projects
      - RStudio: Integrated Development Environment (IDE)
      - Rattle: Data Mining Package
      - RExcel: (Statconn) Connection between R and Excel
      - Weka: Java-based data mining, statistical analysis by R
      - RapidMiner: Java-based Weka data mining, statistical analysis by R
      - Revolution Analytics: Scaling R for the Enterprise
      - Oracle R Enterprise: Integrated into Oracle
  28. About Statconn (as of March 2013)
      - Produces RAndFriends under noncommercial and commercial licenses
      - All the statconn tools work ONLY with 32-bit R
      - statconnDCOM
      - rcom (GPL2, but requires statconnDCOM)
      - RExcel 3.2.9 (ONLY 32-bit Office: 2003, 2007, 2010)
      - http://rcom.univie.ac.at/
  29. Sample Projects Using R
      - The Heritage Health Prize, Thomas Nguyen
      - A Direct Marketing In-flight Forecasting System, Shannon Terry & Ben Ogorek
      - Mining Twitter for Airline Consumer Sentiment, Jeffrey Breen
      - Alternative Data Sources for Measuring Market Sentiment and Events (Using R), Joe Rothermich
  30. The Julia Language
      - http://julialang.org/
  31. About Julia
      - High-level, high-performance dynamic open-source programming language for technical computing
      - Syntax similar to other technical computing environments
      - Features: sophisticated compiler, distributed parallel execution, numerical accuracy, extensive mathematical function library
      - Uses C, C++, Fortran libraries extensively
  32. Why Julia: "Because we are greedy"
      - http://julialang.org/blog/2012/04/nyc-open-stats-meetup-announcement/
  33. Julia Community
      - Hosted on GitHub
      - 550 mailing list subscribers (Google Groups)
      - 1,500 GitHub followers
      - 190 forks
      - 50 total contributors
      - As of September 2012, all contributors except the core developers had known of the language for six months or less
      - Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
  34. The Julia Manual
      - http://docs.julialang.org/en/latest/manual/
  35. Julia Mathematical Functions
      - http://docs.julialang.org/en/latest/manual/mathematical-operations/
  36. Julia Standard Library
      - http://docs.julialang.org/en/latest/stdlib/
  37. Julia Performance
  38. Key Ingredients of Julia Performance
      - Rich type information, provided naturally by multiple dispatch
      - Aggressive code specialization against run-time types
      - Julia's LLVM-based just-in-time (JIT) compiler
      - Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
  39. Julia Performance Comparison
      - http://julialang.org/
  40. Julia Performance Comparison
      - Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
  41. Julia Recommendations
      - The software is ready for people already using C or Fortran
      - The software will develop into a usable scripting language for R users
      - Wait until version one for production use
  42. Send me Your Questions
      - http://marktab.net
  43. Conclusion
      - R provides production-ready software for statistical analysis
      - Julia merits personal investment and promises high performance
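
The "R is Memory-Bound" and "R Memory Management" slides can be illustrated with a minimal R sketch. The 8 GB RAM figure and all variable names are assumptions made for illustration, not values from the talk.

```r
# Rule of thumb from the "R is Memory-Bound" slide: data budget = memory / 4.
ram_gb <- 8                       # assumed machine RAM (hypothetical)
data_budget_gb <- ram_gb / 4      # rough amount of R data this machine can hold

# The 2^31 - 1 figure is the largest 32-bit signed integer.
limit_matches <- .Machine$integer.max == 2^31 - 1

# A double-precision vector costs about 8 bytes per element.
x <- numeric(1e6)
size_bytes <- as.numeric(object.size(x))

# rm() removes the name; gc() requests a collection and returns a usage summary.
rm(x)
mem_summary <- gc()
```

Printing `mem_summary` shows the cells in use before the next allocation. The exact numbers depend on the platform and R version.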
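
Two of the methods on the "Improving Performance" slides, vectorization and byte-code compilation with cmpfun(), can be compared in a short R sketch. The function names and the sum-of-squares task are hypothetical examples chosen for illustration; they do not come from the talk.

```r
library(compiler)  # ships with R; provides cmpfun()

# Loop version: accumulates squares one element at a time.
sum_sq_loop <- function(x) {
  total <- 0
  for (v in x) {
    total <- total + v^2
  }
  total
}

# Vectorized version: a single call over the whole vector.
sum_sq_vec <- function(x) sum(x^2)

# Byte-compiled copy of the loop version.
sum_sq_cmp <- cmpfun(sum_sq_loop)

x <- as.numeric(1:100000)

# Elapsed seconds for each variant; values vary by machine.
timings <- c(
  loop       = system.time(sum_sq_loop(x))[["elapsed"]],
  compiled   = system.time(sum_sq_cmp(x))[["elapsed"]],
  vectorized = system.time(sum_sq_vec(x))[["elapsed"]]
)
```

On R releases from 3.4.0 onward the byte-code compiler runs just-in-time by default, so the explicit cmpfun() call may make little measurable difference there. The vectorized form is usually the first thing worth trying.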
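
The "Parallel R: parallel package" entry can be sketched as follows. This is a minimal example that assumes two local worker processes can be started; the worker count and the `square` function are hypothetical choices for illustration.

```r
library(parallel)  # ships with R

# Hypothetical task: square each input value.
square <- function(i) i^2

# Start two local worker processes; this form also works on Windows.
cl <- makeCluster(2)

# Distribute the inputs across the workers and collect a list of results.
res <- parLapply(cl, 1:8, square)

# Always release the workers when finished.
stopCluster(cl)

squares <- unlist(res)
```

For a task this small, the overhead of starting workers outweighs any gain. The pattern pays off only when each call does substantial work.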
