Great Wide Open - Day 1

Derek Norton - Revolution Analytics

11:15 AM - Operations 2 (Big Data)

- 1. Big Data Analytics with R. Derek McCrae Norton, Senior Sales Engineer. April 2, 2014
- 2. Agenda
  - Introduction
  - Big Data
  - Analytics
  - R
  - Revolution R Enterprise
  - Synergy
  - Conclusion
  - © 2013 Revolution Analytics
- 3. Who are you anyway?
  - Statistician
    - My degrees are all in statistics.
  - Consultant
    - My experience has been mostly in Marketing Analytics, focusing on Predictive Analytics.
  - Sales Engineer
    - Still consulting, just with a much heavier emphasis on client interaction.
  - Founder/Director, Atlanta R Users Group
    - Shameless plug. Please join if interested.
    - http://www.meetup.com/R-Users-Atlanta/
  - Husband, Father, Outdoorsman, Serial Hobbyist, …
- 4. Big Data
- 5. Big Data and Big Opportunities
  - “Big data is data that exceeds the processing capability of conventional database systems” (Edd Dumbill, O’Reilly Radar*, Jan 2012)
  - [Chart: worldwide data created and replicated, in zettabytes; values shown: 1, 2, 35]
  - * radar.oreilly.com/2012/01/what-is-big-data.html
- 6. What is Big Data? Big Data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using standard statistical software. (Snijders, Matzat, & Reips, 2012)
- 7. Does Big Data Mean Hadoop? The short answer is no. The longer answer is maybe. Hadoop adoption is turning that maybe into a probably.
- 8. Analytics
- 9. What is Analytics? Analytics is the combination of mathematical, statistical, and heuristic techniques to glean useful insights from data and to implement actions derived from those insights. (Derek McCrae Norton)
- 10. Analytics
  - The current buzzword is “Data Science,” but I don’t really agree with that nomenclature.
    - What statistician, analyst, (data scientist) actually follows the scientific method?
  - That being said, the current definition of “Data Science” is a pretty good surrogate for what we are discussing.
  - Whatever descriptors you use, one thing is clear… You must use something to help you carry out the actual work.
    - R, Python, SAS, etc.
    - RDBMS, Hadoop, etc.
- 11. [No text recovered from this slide]
- 12. What is the R language?
  - A Platform…
    - A Procedural Language for Stats, Math and Data Science
    - A Complete Data Visualization Framework
    - Provided as Open Source
  - A Community…
    - 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
    - Active User Groups Across the World
  - An Ecosystem
    - CRAN: 5000+ Freely Available Packages
    - Applicable to Big Data if scaled
- 13. THE R USER COMMUNITY
- 14. A brief history of R
  - 1993: Research project in Auckland, NZ
    - Ross Ihaka and Robert Gentleman
  - 1995: Released as open-source software
    - Generally compatible with the “S” language
  - 1997: R core group formed
  - 2000: R 1.0.0 released
  - 2004: First international user conference in Vienna
  - 2013: R 3.0.0 released
- 15. R is Free
  - Open Source, licensed under GPL (like Linux!)
    - Free as in beer
    - Free as in freedom
  - Flexible
  - Open for integration
    - Data (SAS, SPSS, Excel, SQL Server, Oracle, …)
    - Systems (applications, webservers, …)
  - Broad user-base
    - De-facto standard for data analysis teaching
- 16. R is exploding in popularity & function
  - Web Site Popularity: number of links to main web site (R, SAS, SPSS, S-Plus, Stata)
  - Scholarly Activity: Google Scholar hits (’05-’09 CAGR): R 46%, SAS -11%, SPSS -27%, S-Plus 0%, Stata 10%
  - Internet Discussion: mean monthly traffic on email discussion list (R, SAS, Stata, SPSS, S-Plus)
  - Package Growth: number of R packages listed on CRAN, 4,332 as of Feb 2013
- 17. So why isn’t everyone using R? “The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.” (Bo Cowgill, Google, at SF R Meetup)
- 18. Otherwise R is Great! Right?
  - Who here has used R?
    - Thoughts?
  - Who has never seen this?
  - Who here has more than 1 core/processor?
  - Who has ever used r-help?
    - “‘They’ did write documentation that told you that Perl was needed, but ‘they’ can’t read it for you.” - Brian D. Ripley, R-help (February 2001)
    - “This is all documented in TFM. Those who WTFM don’t want to have to WTFM again on the mailing list. RTFM.” - Barry Rowlingson, R-help (October 2003)
- 19. What is Revolution R Enterprise?
- 20. Motivators
  - Big Data: In-memory bound; Hybrid memory & disk scalability; Operates on bigger volumes & factors
  - Speed of Analysis: Single threaded; Parallel threading; Shrinks analysis time
  - Enterprise Readiness: Community support; Commercial support; Delivers full service production support
  - Analytic Breadth & Depth: 5000+ innovative analytic packages; Leverage open source packages plus Big Data ready packages; Supercharges R
  - Commercial Viability: Risk of deployment of open source; Commercial license; Eliminate risk with open source
- 21. Introducing Revolution R Enterprise (RRE): The Big Data Big Analytics Platform
  - DistributedR, DevelopR, DeployR, ScaleR, ConnectR
  - Big Data Big Analytics Ready
    - Enterprise readiness
    - High performance analytics
    - Multi-platform architecture
    - Data source integration
    - Development tools
    - Deployment tools
- 22. The Platform Step by Step: R Capabilities
  - R+CRAN
    - Open source R interpreter
    - UPDATED R 3.0.2
    - Freely-available R algorithms
    - Algorithms callable by RevoR
    - Embeddable in R scripts
    - 100% Compatible with existing R scripts, functions and packages
  - RevoR
    - Performance enhanced R interpreter
    - Based on open source R
    - Adds high-performance math
  - Available on:
    - Platform™ LSF™ Linux®
    - Microsoft® HPC Clusters
    - Windows® & Linux Servers
    - Windows & Linux Workstations
    - IBM® Netezza®
    - NEW Cloudera Hadoop®
    - NEW Hortonworks Hadoop
    - NEW Teradata® Database
    - Intel® Hadoop
    - IBM BigInsights™
- 23. The Platform Step by Step: Parallelization & Data Sourcing
  - ConnectR
    - High-speed & direct connectors
    - Available for:
      - High-performance XDF
      - SAS, SPSS, delimited & fixed format text data files
      - Hadoop HDFS (text & XDF)
      - Teradata Database & Aster
      - EDWs and ADWs
      - ODBC
  - ScaleR
    - Ready-to-Use high-performance big data big analytics
    - Fully-parallelized analytics
    - Data prep & data distillation
    - Descriptive statistics & statistical tests
    - Correlation & covariance matrices
    - Predictive Models: linear, logistic, GLM
    - Machine learning
    - Monte Carlo simulation
    - NEW Tools for distributing customized algorithms across nodes
  - DistributedR
    - Distributed computing framework
    - Delivers portability across platforms
    - Available on:
      - Windows Servers
      - Red Hat and NEW SuSE Linux Servers
      - IBM Platform LSF Linux
      - Microsoft HPC Clusters
      - NEW Teradata Database
      - NEW Cloudera Hadoop
      - NEW Hortonworks Hadoop
  - A single package (RevoScaleR)
- 24. The Platform Step by Step: Tools & Deployment
  - DeployR
    - Web services software development kit for integrating analytics via Java, JavaScript or .NET APIs
    - Integrates R into application infrastructures
    - Capabilities:
      - Invokes R scripts from web services calls
      - RESTful interface for easy integration
      - Works with web & mobile apps, leading BI & visualization tools and business rules engines
  - DevelopR
    - Integrated development environment for R
    - Visual ‘step-into’ debugger
    - Available on:
      - Windows
- 25. Write Once. Deploy Anywhere. DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
  - DistributedR, ScaleR, ConnectR, DeployR
  - In the Cloud: Amazon AWS
  - Workstations & Servers: Desktop, Server
  - Clustered Systems: IBM Platform LSF, Microsoft HPC
  - EDW: Teradata
  - Hadoop: Hortonworks, Cloudera
- 26. Synergy
- 27. Put it all together
  - Talent fresh out of school knows R.
  - RRE is R plus more.
  - RRE provides a unified way of carrying out analytics (small or big).
  - RRE code is portable…
- 28. Scale and Portability
  - Set “compute context” to define hardware (one line of code)
    - Native job-scheduler handles distribution, monitoring, failover, etc.
  - Same code runs on other supported architectures
    - Just change compute context
  - [Screenshot caption: 42 seconds instead of 6 minutes on the local machine]
- 29. References
  1. Snijders, C., Matzat, U., & Reips, U.-D. (2012). ‘Big Data’: Big gaps of knowledge in the field of Internet science. International Journal of Internet Science, 7, 1-5. http://www.ijis.net/ijis7_1/ijis7_1_editorial.html
  2. Conway, D. The Data Science Venn Diagram.
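
Slide 28 says the analysis code stays the same and only the "compute context" changes. The idea can be sketched with a small analogue. This is a Python illustration of the pattern, not RevoScaleR code: the names `local_sequential`, `local_parallel`, `partial_sum`, and `chunked_mean` are hypothetical, and the thread pool only stands in for "different hardware" without making any performance claim.

```python
from concurrent.futures import ThreadPoolExecutor

# In this sketch a "compute context" is any function that applies a worker
# to every chunk and returns the per-chunk results.
# Hypothetical names throughout; this is not the RevoScaleR API.

def local_sequential(worker, chunks):
    """Context 1: process chunks one after another in this process."""
    return [worker(chunk) for chunk in chunks]

def local_parallel(worker, chunks):
    """Context 2: process chunks concurrently with a thread pool."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(worker, chunks))

def partial_sum(chunk):
    """Per-chunk summary: (sum of values, number of values)."""
    return (sum(chunk), len(chunk))

def chunked_mean(chunks, context):
    """The analysis itself: identical no matter which context runs it."""
    partials = context(partial_sum, chunks)
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Data arrives in chunks, as it would when it does not fit in memory at once.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Switching context is a one-argument change; chunked_mean is untouched.
mean_seq = chunked_mean(chunks, local_sequential)
mean_par = chunked_mean(chunks, local_parallel)
```

Both calls return the same mean because each chunk is reduced to a small summary that can be combined in any order. Combinable summaries are what let a chunked computation be moved between sequential and distributed execution without rewriting the analysis.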

