Data analysis with R and Julia

R is a free, open-source environment for statistical analysis and graphing. In its almost 20 years of existence, R has remained popular in both academic and business environments. The newer Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. This session outlines functional and performance differences between these two software packages. You’ll see demonstrations of tips for integrating this software with Windows and walk away with guidelines for working with commercial software. A version of this presentation had 100 attendees at the PASS Business Analytics Conference in Chicago (April 2013), and 40 attendees at the PASS Virtual Business Analytics meeting (May 2013).

    Presentation Transcript

    • Data Analysis with R and Julia
      Advanced Analytics and Insights
      Mark Tabladillo Ph.D., Data Mining Scientist, MarkTab Inc.
    • Networking
      Interactive
    • About MarkTab
      Training and Consulting with http://marktab.com
      Data Mining Resources and Blog at http://marktab.net
      Twitter @marktabnet
    • Outline
      R Language
        Market Analysis
        Performance
        Production Use
      Julia Language
        Performance
    • The R Language
    • Major R Versions
      Version 0 (1996): Initial release: University of Auckland, New Zealand
      Version 1 (2000): Completeness and stability high enough to characterize a full statistical system, which could be put to production use
      Version 2 (2004): Strong enhancements of the memory management subsystem as well as several major features, including Sweave (into LaTeX or LyX)
      Version 3 (2013): The inclusion of long vectors (containing more than 2^31-1 elements!). Also, we now have 64-bit support on all platforms, support for parallel processing, the Matrix package
    • How R Works
      "As with an automobile, you can use R without worrying very much about how it works. But computing with data is more complicated than driving a car (fortunately for highway safety)."
      John Chambers, Software for Data Analysis, page 453
    • R works in a shell
      Cross-platform, including 32-bit or 64-bit Windows
      Interactive graphical user interface (GUI) to interpret commands
        Read – accept user input
        Parse – interpret input using expected syntax
        Evaluate – execute commands
      Everything is an object
      Data are stored in data frames, named lists
      R implements S language grammar, with a few extensions
    • R GUI
    • Read-Parse-Evaluate Loop
      Read, Parse, Evaluate
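The read, parse, and evaluate steps can be reproduced by hand with base R's `parse()` and `eval()`. This is only a minimal sketch: the input string is an arbitrary example, and the real interpreter loop also handles prompting, errors, and printing.

```r
# Read: accept a line of input (a fixed string stands in for user typing)
input <- "sqrt(16) + 1"

# Parse: turn the text into an unevaluated R expression
parsed <- parse(text = input)

# Evaluate: execute the parsed expression
result <- eval(parsed[[1]])

print(class(parsed))  # the parsed input is itself an R object
print(result)
```

Because the parsed input is an ordinary object, it also illustrates the "everything is an object" point from the previous slide.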
    • R and SQL Server
      install.packages("RODBC")
      library(RODBC)
      MDAC Downloads
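Once RODBC is installed, a query against SQL Server might look like the sketch below. The helper name `run_query`, the connection string, and the example query are all hypothetical and not from the slides; only `odbcDriverConnect()`, `sqlQuery()`, and `odbcClose()` are RODBC functions.

```r
# Hypothetical helper: run one query against SQL Server through RODBC.
# The connection string and query are supplied by the caller.
run_query <- function(connection_string, query) {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("Install RODBC first: install.packages(\"RODBC\")")
  }
  channel <- RODBC::odbcDriverConnect(connection_string)
  if (!inherits(channel, "RODBC")) {
    stop("Could not open the ODBC connection")
  }
  on.exit(RODBC::odbcClose(channel))  # always release the connection
  RODBC::sqlQuery(channel, query)     # a data frame on success
}

# Example call (assumed server, database, and table names):
# orders <- run_query(
#   "driver={SQL Server};server=localhost;database=MyDatabase;trusted_connection=true",
#   "SELECT TOP 10 * FROM dbo.Orders"
# )
```

The example call is left commented out because it needs a reachable SQL Server instance and a matching ODBC driver.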
    • R Market Analysis
    • Listserv Discussion
    • Estimated R Usage
      Estimated 250,000 people use it regularly (as of 2009)
    • General Forum Postings
    • Stack Overflow Alone
    • Academic Publications
    • Comparison of R, Matlab, SAS, Stata, SPSS
    • R Performance
    • R is Memory-Bound
      $\dfrac{\text{Memory Size}}{4} = \text{Amount of R Data}$ (Source: Joseph B. Rickert, February 14, 2013)
      64-bit: $\text{Memory Size} = \text{RAM}$
      32-bit: $\text{Memory Size} = \text{User Virtual Memory} - 0.5\,\text{GB} \cong 2\,\text{GB}$ (Source: retrieved March 1, 2013)
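As a worked example of the rule of thumb above, the arithmetic can be written out in R. The 16 GB machine is an assumed figure for illustration, and `max_data_gb` is a hypothetical helper name.

```r
# Rule of thumb from the slide: amount of R data = memory size / 4
max_data_gb <- function(memory_size_gb) memory_size_gb / 4

# 64-bit: memory size is the installed RAM (16 GB is an assumed example)
data_64bit <- max_data_gb(16)

# 32-bit: memory size is roughly 2 GB regardless of installed RAM
data_32bit <- max_data_gb(2)

cat("64-bit, 16 GB RAM:", data_64bit, "GB of data\n")
cat("32-bit, about 2 GB:", data_32bit, "GB of data\n")
```

Under these assumptions the 64-bit machine handles about 4 GB of data comfortably, while a 32-bit session is limited to about 0.5 GB.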
    • R is Memory-Bound
      All objects in an R session are stored in memory
      R places a limit of 2^31 − 1 bytes on all object sizes, independent of RAM
      The Art of R Programming, Norman Matloff
    • R Memory Management
      Automatic, including garbage collection
      rm() removes object assignment, but does not delete memory
      gc() forces garbage collection with substantial computation
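The difference between `rm()` and `gc()` can be seen in a short session. This is a small sketch using only base R; the variable name `big_vector` and its size are arbitrary choices.

```r
big_vector <- numeric(1e6)            # one million doubles, roughly 8 MB
print(object.size(big_vector), units = "MB")

rm(big_vector)                        # removes the name from the workspace
print(exists("big_vector", inherits = FALSE))  # the binding is gone

gc_report <- gc()                     # force a garbage collection pass
print(gc_report)                      # memory usage table after collection
```

R normally runs garbage collection on its own, so an explicit `gc()` call is mainly useful for inspecting memory use.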
    • Improving Performance
      The Art of R Programming, Chapter 14, Norman Matloff
      Power versus simplicity: Vectorization, Byte-Code Compilation, Parallel R, C/C++
    • Improving Performance
      Method: Description
      C/C++: Call C programs from R
      Vectorization: Recode for vectorization, replacing slower functions
      Byte-code compilation: cmpfun()
      Parallel R: parallel package
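Two of the methods listed above, vectorization and byte-code compilation, can be sketched in a few lines. The sum-of-squares task and the function names are illustrative assumptions; `cmpfun()` comes from the `compiler` package that ships with R. Recent R versions byte-compile functions automatically, so the explicit `cmpfun()` call may make little difference there.

```r
library(compiler)  # provides cmpfun()

# Loop version: accumulate the sum of squares one element at a time
sum_squares_loop <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value^2
  }
  total
}

# Vectorized version: one call over the whole vector
sum_squares_vec <- function(x) sum(x^2)

# Byte-code compiled version of the loop
sum_squares_compiled <- cmpfun(sum_squares_loop)

x <- as.numeric(1:1000)
results <- c(loop = sum_squares_loop(x),
             vectorized = sum_squares_vec(x),
             compiled = sum_squares_compiled(x))
print(results)  # all three approaches give the same value
```

Wrapping each call in `system.time()` on a much longer vector gives a rough comparison of the three on a given machine.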
    • Improving Performance
      Rprof() – measures speed of functions
      ff – memory-efficient storage of large data on disk and fast access functions
      bigmemory – manage massive matrices with shared memory and memory-mapped files
    • R for Production Use
    • Derivative Projects
      RStudio – Integrated Development Environment (IDE)
      Rattle – Data Mining Package
      RExcel – (Statconn) Connection between R and Excel
      Weka – Java-based data mining, statistical analysis by R
      RapidMiner – Java-based Weka data mining, statistical analysis by R
      Revolution Analytics – Scaling R for the Enterprise
      Oracle R Enterprise – Integrated into Oracle
    • About Statconn (as of March 2013)
      Produces RAndFriends under noncommercial and commercial licenses
      All the statconn tools work ONLY with 32-bit R
      statconnDCOM
      rcom (GPL2, but requires statconnDCOM)
      RExcel 3.2.9 (ONLY 32-bit Office: 2003, 2007, 2010)
    • Sample Projects Using R
      The Heritage Health Prize, Thomas Nguyen
      A Direct Marketing In-flight Forecasting System, Shannon Terry & Ben Ogorek
      Mining Twitter for Airline Consumer Sentiment, Jeffrey Breen
      Alternative Data Sources for Measuring Market Sentiment and Events (Using R), Joe Rothermich
    • The Julia Language
    • About Julia
      High-level, high-performance dynamic open-source programming language for technical computing
      Syntax similar to other technical computing environments
      Features
        Sophisticated compiler
        Distributed parallel execution
        Numerical accuracy
        Extensive mathematical function library
      Uses C, C++, Fortran libraries extensively
    • Why Julia: “Because we are greedy”
    • Julia Community
      Hosted on GitHub
      550 mailing list subscribers (Google Groups)
      1,500 GitHub followers
      190 forks
      50 total contributors
      As of September 2012, all contributors except the core developers had known of the language for six months or less
      Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
    • The Julia Manual
    • Julia Mathematical Functions
    • Julia Standard Library
    • Julia Performance
    • Key Ingredients of Julia Performance
      Rich type information, provided naturally by multiple dispatch
      Aggressive code specialization against run-time types
      Julia’s LLVM-based just-in-time (JIT) compiler
      Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
    • Julia Performance Comparison
    • Julia Performance Comparison
      Julia: A Fast Dynamic Language for Technical Computing (2012), Bezanson, Karpinski, Shah, Edelman
    • Julia Recommendations
      The software is ready for people already using C or Fortran
      The software will develop into a usable scripting language for R users
      Wait until version one for production use
    • Send Me Your Questions
    • Conclusion
      R provides production-ready software for statistical analysis
      Julia merits personal investment and promises high performance