Open source analytics

how open source has helped analytics

Published in: Technology

Transcript of "Open source analytics"

  1. 1. Open Source in Analytics
  2. 2. Introduction IIML ,DCE Founder Decisionstats.com Author R for Business Analytics
  3. 3. Brief History of Analytics: SAS and SPSS led from the 1970s to the early 2000s. SAS leads the market but is very expensive. IBM bought SPSS, but it is still not open source. R, Python and Hadoop challenged this.
  4. 4. Analytics Sub Components Data Storage Data Querying Data Summarization Data Visualization Statistical Routines
  5. 5. Analytics Sub Components (Proprietary column; Open Source column not yet filled in)
     Data Storage: Oracle DBMS, SQL Server
     Data Querying: Business Objects (SAP)
     Data Summarization: SQL, SAS, Crystal Reports
     Data Visualization: Tableau
     Statistical Routines: SAS, SPSS
  6. 6. Analytics Sub Components (Proprietary vs. Open Source)
     Data Storage: Oracle DBMS, SQL Server vs. MySQL, NoSQL, Hadoop
     Data Querying: Business Objects (SAP) vs. Pentaho, Jaspersoft
     Data Summarization: SQL, SAS, Crystal Reports vs. still SQL, Pig, Hive
     Data Visualization: Tableau vs. R, Python, JavaScript
     Statistical Routines: SAS, SPSS vs. R, Python, RapidMiner
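The five sub components can be illustrated end to end with a minimal sketch that uses only the Python standard library. This is not taken from the slides: the `sales` table, the sample rows and all function names are hypothetical, and SQLite merely stands in for whichever open source store (MySQL, Hadoop and so on) is actually in use.

```python
import sqlite3
import statistics

# Hypothetical sample data: (region, revenue) rows invented for illustration.
ROWS = [
    ("north", 120.0), ("north", 80.0),
    ("south", 200.0), ("south", 100.0), ("south", 150.0),
]


def build_store(rows):
    """Data storage: load the rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def revenue_by_region(conn):
    """Data querying and summarization: count and mean per region in SQL."""
    cur = conn.execute(
        "SELECT region, COUNT(*), AVG(revenue) "
        "FROM sales GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


def revenue_spread(conn):
    """Statistical routine: sample standard deviation over all rows."""
    values = [row[0] for row in conn.execute("SELECT revenue FROM sales")]
    return statistics.stdev(values)


def text_bar_chart(summary, width=20):
    """Data visualization: text bars scaled to the largest regional mean."""
    top = max(mean for _, _, mean in summary)
    return [
        f"{region:<6} {'#' * round(width * mean / top)}"
        for region, _, mean in summary
    ]


if __name__ == "__main__":
    conn = build_store(ROWS)
    summary = revenue_by_region(conn)
    for line in text_bar_chart(summary):
        print(line)
    print("stdev:", revenue_spread(conn))
```

In practice each function would be replaced by one of the tools named in the comparison above, for example Hive for querying or R for the statistical routine.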
  7. 7. Analytics using Python
     ● pandas http://pandas.pydata.org/ High-performance, easy-to-use data structures and data analysis tools
     ● scikit-learn http://scikit-learn.org/stable/ Simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib
     ● NumPy http://www.numpy.org/
     ● SciPy http://www.scipy.org/scipylib/index.html
     ● matplotlib http://matplotlib.org/
     ● statsmodels http://statsmodels.sourceforge.net/# Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available
     ● IPython http://ipython.org/ interactive computing
  8. 8. Analytics using R http://www.r-project.org/
     ● RStudio and Revolution Analytics
     ● sqldf https://code.google.com/p/sqldf/ and RODBC http://cran.r-project.org/web/packages/RODBC/index.html
     ● ggplot2 http://ggplot2.org/ and ggmap and shiny
     ● RHadoop et al https://github.com/RevolutionAnalytics/RHadoop
     ● car, stats, forecast, sna, tm
     ● rattle and Rcommander (with plugins)
     More at http://rforanalytics.wordpress.com/
  9. 9. Analytics using R http://www.revolutionanalytics.com/
  10. 10. Analytics using R http://www.revolutionanalytics.com/
  11. 11. Analytics using R <blatant self promotion> http://www.amazon.com/R-Business-Analytics-A-Ohri/dp/1461443423 R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. </blatant self promotion>
  12. 12. Analytics using Rapid Miner: Early adopter of open source analytics. Recently moved from Germany to the USA following a PE infusion. One of the first marketplaces for analytics extensions http://marketplace.rapid-i.com/UpdateServer/ One of the best GUIs: drag and drop using flows.
  13. 13. Analytics using Rapid Miner
  14. 14. Analytics using Rapid Miner
  15. 15. Analytics using other languages Julia- faster than R http://julialang.org/ Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. The library, largely written in Julia itself, also integrates mature, best-of-breed C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing.
  16. 16. IJulia !!
  17. 17. IJulia !!
  18. 18. Analytics using other languages Clojure- for JVM http://clojure.org/ Clojure is a dynamic programming language that targets the Java Virtual Machine. It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language - it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure is a dialect of Lisp. https://bigml.com/gallery/models
  19. 19. Analytics using other languages bigml.com (using clojure) https://bigml.com/gallery/models
  20. 20. Analytics using other languages Scala- for big data analytics http://www.scala-lang.org/ ● A Scalable language ● Object-Oriented ● Functional ● Seamless Java Interop ● Functions are Objects ● Future-Proof ● Fun
  21. 21. Analytics using Jaspersoft OLAP BIG DATA (offered through cloud, mobile)
  22. 22. Analytics using Pentaho Basically Weka Reporting as well Complete BI and Analytics Stack
  23. 23. Weka
  24. 24. Hadoop http://hadoop.apache.org/
  25. 25. Hadoop- evolving ecosystem
  26. 26. Hadoop- evolving ecosystem
  27. 27. Hadoop- evolving ecosystem
  28. 28. R http://www.r-project.org/ Open Source Free 5000+ Packages Growing Faster >2 million users RAM constraints??
  29. 29. R http://www.r-project.org/ Object Oriented has GUI and IDE has Commercial offerings
  30. 30. R http://www.r-project.org/ Object Oriented has GUI and IDE has Commercial offerings
  31. 31. R - Rattle- Data Mining GUI http://www.r-project.org/ Object Oriented has GUI and IDE has Commercial offerings
  32. 32. R - R Commander http://www.r-project.org/ Object Oriented has GUI and IDE has Commercial offerings
  33. 33. R -R Studio
  34. 34. R -Revolution Analytics Free for Academics World Wide !! RevoScaleR package for Big Data Recommended Install - http://info.revolutionanalytics.com/free-academic.html
  35. 35. R -Revolution Analytics Free for Academics World Wide !! RevoScaleR package for Big Data
  36. 36. R -Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html
     ● The RHIPE package, started by Saptarshi Guha and now developed by a core team via GitHub, provides an interface between R and Hadoop for analysis of large complex data wholly from within R using the Divide and Recombine approach to big data.
     ● The rmr package by Revolution Analytics also provides an interface between R and Hadoop for a Map/Reduce programming framework.
     ● A related package, segue by Long, permits easy execution of embarrassingly parallel tasks on Elastic Map Reduce (EMR) at Amazon.
     ● The RProtoBuf package provides an interface to Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. This package can be used in R code to read data streams from other systems in a distributed MapReduce setting where data is serialized and passed back and forth between tasks.
     ● The HistogramTools package provides a number of routines useful for the construction, aggregation, manipulation, and plotting of large numbers of histograms such as those created by Mappers in a MapReduce application.
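The Map/Reduce programming model that packages such as rmr expose can be sketched in a few lines of plain Python. This is only an illustration of the model on a single machine, not the API of rmr, RHIPE or Hadoop; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["R and Hadoop", "Hadoop and R and Python"]))
```

On a real cluster the map and reduce functions run on different nodes and the shuffle moves data across the network; the user still supplies only the two functions.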
  37. 37. Terrific Data Mining using R GUI
  38. 38. Great Data Visualization using R GUI
  39. 39. So many packages- CRAN Views to the rescue http://cran.r-project.org/web/views/ Bayesian Bayesian Inference ChemPhys Chemometrics and Computational Physics ClinicalTrials Clinical Trial Design, Monitoring, and Analysis Cluster Cluster Analysis & Finite Mixture Models DifferentialEquations Differential Equations Distributions Probability Distributions Econometrics Computational Econometrics Environmetrics Analysis of Ecological and Environmental Data ExperimentalDesign Design of Experiments (DoE) & Analysis of Experimental Data Finance Empirical Finance Genetics Statistical Genetics Graphics Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization HighPerformanceComputing High-Performance and Parallel Computing with R MachineLearning Machine Learning & Statistical Learning MedicalImaging Medical Image Analysis MetaAnalysis Meta-Analysis Multivariate Multivariate Statistics NaturalLanguageProcessing Natural Language Processing
  40. 40. So many packages- CRAN Views to the rescue http://cran.r-project.org/web/views/ NumericalMathematics Numerical Mathematics OfficialStatistics Official Statistics & Survey Methodology Optimization Optimization and Mathematical Programming Pharmacokinetics Analysis of Pharmacokinetic Data Phylogenetics Phylogenetics, Especially Comparative Methods Psychometrics Psychometric Models and Methods ReproducibleResearch Reproducible Research Robust Robust Statistical Methods SocialSciences Statistics for the Social Sciences Spatial Analysis of Spatial Data SpatioTemporal Handling and Analyzing Spatio-Temporal Data Survival Survival Analysis TimeSeries Time Series Analysis WebTechnologies Web Technologies and Services gR gRaphical Models in R
  41. 41. R in the Browser http://www.r-fiddle.org/#/ http://statace.com/ http://www.rstudio.com/ide/server/
  42. 42. R -Hadoop Packages https://github.com/RevolutionAnalytics/RHadoop/wiki
     ● plyrmr - higher level plyr-like data processing for structured data, powered by rmr
     ● rmr - functions providing Hadoop MapReduce functionality in R
     ● rhdfs - functions providing file management of the HDFS from within R
     ● rhbase - functions providing database management for the HBase distributed database from within R
     http://amplab-extras.github.io/SparkR-pkg/ SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
     https://github.com/nexr/RHive RHive is an R extension facilitating distributed computing via HIVE query. RHive allows easy usage of HQL (Hive SQL) in R, and allows easy usage of R objects and R functions in Hive.
  43. 43. R - Cloud Computing http://cran.r-project.org/web/views/WebTechnologies.html
  44. 44. R -Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html Large memory and out-of-memory data
     ● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R's main memory.
     ● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
     ● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.
     ● A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table)
     ● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop.
     ● The speedglm package fits (generalised) linear models to large data.
     ● The biglars package by Seligman et al. can use ff to support larger-than-memory datasets for least-angle regression, lasso and stepwise regression.
     ● The bigrf package provides a Random Forests implementation with support for parallel execution and large memory.
     ● The MonetDB.R package allows R to access the MonetDB column-oriented, open source database system as a backend.
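The idea of "incremental computations" for data that does not fit in memory can be sketched as follows. This is a simplified illustration in Python, not biglm's own algorithm: it fits a one-predictor least-squares line from running sums that are updated chunk by chunk, so only one chunk needs to be in memory at a time. The class and method names are hypothetical.

```python
class IncrementalLinearFit:
    """Least-squares fit of y = intercept + slope * x from streamed chunks."""

    def __init__(self):
        self.n = 0
        self.sx = 0.0   # running sum of x
        self.sy = 0.0   # running sum of y
        self.sxx = 0.0  # running sum of x * x
        self.sxy = 0.0  # running sum of x * y

    def update(self, chunk):
        """Fold one chunk of (x, y) pairs into the running sums."""
        for x, y in chunk:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        """Return (intercept, slope) computed from the sums seen so far."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


if __name__ == "__main__":
    # Hypothetical data following y = 2x + 1, fed three rows at a time.
    data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    fit = IncrementalLinearFit()
    for start in range(0, len(data), 3):
        fit.update(data[start:start + 3])
    print(fit.coefficients())
```

Running sums are easy to follow but can lose precision on large or badly scaled data, so a production implementation would use a numerically stabler update.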
  45. 45. Data Scientist Tool Kit ● web scraping ● visualization ● machine learning ● data mining ● modeling ● sna ● social media analytics ● web analytics ● reproducible research ● TS forecasting ● spatial analysis ● data storage ● data querying
  46. 46. Data Scientist Programming Skills
     Java http://www.learnjavaonline.org/
     Python http://www.codecademy.com/tracks/python
     SQL http://www.w3schools.com/sql/
     R http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/ http://www.statmethods.net/
     Hadoop http://hortonworks.com/hadoop-training/
     Linux https://github.com/WilliamHackmore/linuxgems/blob/master/cheat_sheet.org.sh
  47. 47. Other places to learn: MOOCs 1 https://www.edx.org/ 2 https://www.coursera.org/ 3 https://www.udacity.com/ 4 https://www.udemy.com/ Books, Courses, Workshops
  48. 48. Summary: Open source has greatly helped cut the cost of analytics software. The benefits of analytics continue to be many. Combined with Big Data, cloud computing and MOOCs, the total cost to geeks is much lower!
  49. 49. Thanks Contact and Feedback- ohri2007@gmail.com via http://linkedin.com/in/ajayohri