Data Science, Statistical Analysis and R... Learn what those mean, how they can help you find answers to your questions and complement the existing toolsets and processes you are currently using to make sense of data. We will explore R and the RStudio development environment, installing and using R packages, basic and essential data structures and data types, plotting graphics, manipulating data frames and how to connect R and SQL Server.

- 1. Data Analytics with R and SQL Server Stéphane Fréchette Thursday March 19, 2015
- 2. Who am I? My name is Stéphane Fréchette. SQL Server MVP | Consultant | Speaker | Data & BI Architect | Big Data | NoSQL | Data Science. Drums, good food and fine wine. I have a passion for architecting, designing and building solutions that matter. Twitter: @sfrechette Blog: stephanefrechette.com Email: stephanefrechette@ukubu.com
- 3. Topics • What is R? • Should I use R? • Data Structures • Graphics • Data Manipulation in R • Connecting to SQL Server • Demos • Resources • Q&A
- 4. DISCLAIMER This is neither a course nor a tutorial, but an introduction and walkthrough to inspire you to explore and learn more about R and statistical computing.
- 5. “Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.” - Wikipedia
- 6. What is R? • A programming language and environment for statistical computing and graphics • R has its origins in the S programming language, created in the 1970s • Best used to manipulate moderately sized datasets, do statistical analysis and produce data-centric documents and presentations • Additional functionality is distributed as packages, which any user can download to customize the R environment • Cross-platform: runs on Mac, Windows and Unix-based systems
- 7. Should I use R? Are you doing statistics? Here, “statistics” can mean machine learning, predictive analytics, data science, or anything else that falls under a rather broad umbrella. If you have data that makes sense to represent in a tabular structure, and you want to do some cool analytical or statistical work with it, R is definitely a good choice.
- 8. Downloading and Installing R http://www.r-project.org/ http://www.rstudio.com/
- 9. The IDE (RStudio) 1. View Files and Data 2. See Workspace and History 3. See Files, Plots, Packages and Help 4. Console
- 10. Installing Packages • To use a package in R, you must first install it with the install.packages function • This downloads the package from CRAN and installs it so that it is ready to use
- 11. Loading Packages • To use a package in your current R session, you must load it into the R environment with the library or require function
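The install-then-load steps from the two slides above can be sketched as follows. This is a minimal illustration, not the presenter's demo: the helper name `ensure_package` and the package name `notARealPackage123` are made up for the example.

```r
# Install a package from CRAN only if it is missing, then attach it.
# ensure_package is a hypothetical helper, not part of base R.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("stats")    # ships with R, so nothing is downloaded
# ensure_package("dplyr")  # would install dplyr first if it is missing

# library() stops with an error when a package is missing;
# require() returns FALSE instead, which is handy inside scripts.
has_pkg <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
print(has_pkg)
```

Installing is a one-time step per machine, while loading has to be repeated in every new R session.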
- 12. Common Data Structures in R To make the best of the R language, one needs a strong understanding of the basic data types and data structures and how to operate on and use them. R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists. “To understand computations in R, two slogans are helpful: • Everything that exists is an object • Everything that happens is a function call” - John Chambers, creator of the S programming language and core member of the R programming language project.
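The two slogans can be checked directly in the console. The short sketch below (variable names are arbitrary) shows that operators are ordinary functions and that functions themselves are objects.

```r
# "Everything that happens is a function call": even + and [ are functions
a <- 1 + 2
b <- `+`(1, 2)               # the same addition, written as a call
same_result <- identical(a, b)

second <- `[`(c(10, 20, 30), 2)  # equivalent to c(10, 20, 30)[2]

# "Everything that exists is an object": functions have a class too
sum_class <- class(sum)
vec_class <- class(c(1.5, 2.5))

print(same_result)
print(second)
print(sum_class)
print(vec_class)
```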
- 13. Data Structures - Vectors The simplest structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers.
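As a small illustration (the values are arbitrary), a numeric vector is built with `c()` and most operations apply to every element at once:

```r
# Build a numeric vector with c() ("combine")
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

n      <- length(x)    # number of elements
second <- x[2]         # indexing starts at 1
big    <- x[x > 6]     # logical indexing keeps elements greater than 6
scaled <- x * 2        # arithmetic is applied element by element
total  <- sum(x)

print(n)
print(second)
print(big)
print(scaled)
print(total)
```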
- 14. Data Structures - Matrices Matrices are nothing more than 2-dimensional vectors. To define a matrix, use the function matrix.
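A minimal sketch of the `matrix` function, with illustrative values. By default the matrix is filled column by column; `byrow = TRUE` fills it row by row instead.

```r
# 2 rows x 3 columns, filled column by column (the default)
m <- matrix(1:6, nrow = 2, ncol = 3)

# Same values, filled row by row
m_byrow <- matrix(1:6, nrow = 2, byrow = TRUE)

dims    <- dim(m)     # number of rows, then number of columns
element <- m[2, 3]    # row 2, column 3
tm      <- t(m)       # transpose: 3 rows x 2 columns

print(m)
print(m_byrow)
print(dims)
print(element)
```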
- 15. Data Structures - Data frames Time series are often stored in data frames. A data frame is like a matrix with named columns, except that each column can hold a different data type. This is convenient, because you can refer to a column by name without knowing its position.
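A small sketch with made-up data shows columns of different types being accessed by name:

```r
# Three columns of different types: Date, numeric and logical
df <- data.frame(
  date  = as.Date(c("2015-03-17", "2015-03-18", "2015-03-19")),
  temp  = c(2.5, 4.0, 3.5),
  rainy = c(TRUE, FALSE, FALSE)
)

temps     <- df$temp            # access a column by name
same      <- df[["temp"]]       # equivalent form
avg_temp  <- mean(df$temp)
dry_days  <- df[!df$rainy, ]    # keep only the rows where rainy is FALSE

str(df)                         # compact overview of columns and types
print(avg_temp)
print(dry_days)
```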
- 16. Data Structures - Lists An R list is an object consisting of an ordered collection of objects known as its components.
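Unlike a vector, a list can mix components of different types and lengths. A brief sketch with invented values:

```r
# Components can be of any type, including vectors and other lists
person <- list(name = "Ada", scores = c(90, 85), passed = TRUE)

who        <- person$name             # access a component by name
last_score <- person[["scores"]][2]   # [[ ]] extracts the component itself
person$age <- 36                      # add a new component
n_parts    <- length(person)

str(person)
print(who)
print(last_score)
print(n_parts)
```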
- 17. Data Structures - Date and Time Sys.time() # returns the current system date time
- 18. Data Structures - Date and Time The two main (internal) formats for date-time are POSIXct and POSIXlt • POSIXct: a compact format that stores a date-time as the number of seconds since 1970-01-01 UTC, typically used for date-time columns in a data frame • POSIXlt: a longer, list-based format from which sub-units of time (year, month, hour, minute and so on) can be extracted
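The difference between the two formats can be sketched with a fixed timestamp (chosen arbitrarily, with the time zone set to UTC so the result does not depend on the machine):

```r
# POSIXct: a single number of seconds since 1970-01-01 UTC
ct <- as.POSIXct("2015-03-19 18:30:00", tz = "UTC")

# POSIXlt: a list of components that can be extracted by name
lt <- as.POSIXlt(ct, tz = "UTC")

hour_part   <- lt$hour
minute_part <- lt$min
year_part   <- lt$year + 1900   # years are stored as an offset from 1900
month_part  <- lt$mon + 1       # months are stored as 0-11

# Arithmetic on POSIXct works in seconds
one_hour_later <- ct + 3600

print(hour_part)
print(year_part)
print(month_part)
print(format(one_hour_later, "%H:%M"))
```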
- 19. Data Structures - Others Other useful and important data types • NULL: Typically used for initializing variables. (x = NULL) creates a variable x of length zero. The function is.null() returns TRUE or FALSE and tells whether a variable is NULL or not. • NA: Used for denoting missing values. (x = NA) creates a variable x holding a missing value. The function is.na() returns TRUE or FALSE for each element and tells whether it is NA or not. • NaN: NaN stands for “Not a Number”. Some operations that produce it, such as sqrt(-1), also print a warning message in the console. The function is.nan() lets you check whether a value is NaN or not. • Inf: Inf stands for “Infinity”. (x = 10/0 ; y = -3/0) sets the value of x to Inf and y to -Inf. The function is.finite() lets you check whether a value is finite, and is.infinite() whether it is infinite.
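A short sketch of how these special values behave and how the checking functions respond (variable names are arbitrary):

```r
# NULL: an empty object of length zero
x <- NULL
x_len     <- length(x)
x_is_null <- is.null(x)

# NA: a missing value inside otherwise valid data
v          <- c(1, NA, 3)
which_na   <- is.na(v)
plain_mean <- mean(v)                 # NA, because one value is missing
clean_mean <- mean(v, na.rm = TRUE)   # ignore missing values

# NaN: result of an undefined numeric operation
n <- 0 / 0
n_is_nan <- is.nan(n)

# Inf and -Inf: division of a non-zero number by zero
pos <- 10 / 0
neg <- -3 / 0
pos_finite   <- is.finite(pos)
neg_infinite <- is.infinite(neg)

print(which_na)
print(clean_mean)
print(c(pos, neg))
```

Note that `is.na(NaN)` is also TRUE, so use `is.nan()` when the two cases need to be told apart.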
- 20. Graphics One of the main reasons data analysts and data scientists turn to R is for its strong graphic capabilities. Basic Graphs: • These include density plots (histograms and kernel density plots), dot plots, bar charts (simple, stacked, grouped), line charts, pie charts (simple, annotated, 3D), boxplots (simple, notched, violin plots, bagplots) and scatter plots (simple, with fit lines, scatterplot matrices, high density plots, and 3D plots).
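A few of the basic graph types listed above can be sketched with base R and the built-in `mtcars` dataset. The dataset choice and the output to a temporary PDF file are assumptions made so the example runs anywhere; in RStudio the plots would normally appear in the Plots pane without the `pdf()` and `dev.off()` calls.

```r
# Send the plots to a temporary PDF so the script also runs non-interactively
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Histogram of fuel consumption
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Boxplot of mpg grouped by number of cylinders
boxplot(mpg ~ cyl, data = mtcars,
        main = "mpg by cylinders", xlab = "cylinders", ylab = "mpg")

# Scatter plot with a fitted regression line
plot(mtcars$wt, mtcars$mpg,
     main = "Weight vs mpg", xlab = "weight (1000 lbs)", ylab = "mpg")
abline(lm(mpg ~ wt, data = mtcars), col = "red")

dev.off()
print(out_file)
```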
- 21. Graphics Advanced Graphs: • Graphical parameters control a graph's symbols, fonts, colors, and lines. Axis and text options customize a graph's axes and add reference lines, text annotations and a legend. Plots can also be combined, with multiple plots organized into a single graph. • The lattice package provides a comprehensive system for visualizing multivariate data, including the ability to create plots conditioned on one or more variables. The ggplot2 package offers an elegant system for generating univariate and multivariate graphs based on a grammar of graphics.
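The sketch below illustrates graphical parameters, a reference line with a legend, and two plots combined on one page, again using the built-in `mtcars` dataset as an assumed example. The ggplot2 part only runs if that package is installed.

```r
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 8, height = 4)

# Combine two plots side by side; par() returns the previous settings
old_par <- par(mfrow = c(1, 2))

# Graphical parameters: pch sets the symbol, col the color
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue",
     xlab = "weight (1000 lbs)", ylab = "mpg", main = "Weight vs mpg")
abline(h = mean(mtcars$mpg), lty = 2)             # dashed reference line
legend("topright", legend = "mean mpg", lty = 2)  # explain the line

hist(mtcars$hp, col = "grey80", main = "Horsepower", xlab = "hp")

par(old_par)  # restore the previous layout

# Grammar-of-graphics version of the scatter plot, if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point() +
    labs(x = "weight (1000 lbs)", y = "mpg", colour = "cylinders")
  print(p)
}

dev.off()
print(out_file)
```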
- 22. Data Manipulation in R dplyr is an R package for fast and easy data manipulation. Data manipulation often involves common tasks, such as selecting certain variables, filtering on certain conditions, deriving new variables from existing variables, and so forth. If we think of these tasks as “verbs”, we can define a grammar of sorts for data manipulation. In dplyr the main verbs (or functions) are: • filter: select a subset of the rows of a data frame • arrange: reorder the rows of a data frame instead of filtering or selecting them • select: select columns of a data frame • mutate: add new columns to a data frame that are functions of existing columns • summarize: collapse many values into a single summary value • group_by: describe how to break a data frame into groups of rows
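The verbs above chain together naturally with the `%>%` pipe. The following is a sketch on the built-in `mtcars` dataset, not the presenter's demo; it assumes dplyr is installed, and the derived column `kpl` is made up for the example.

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 100) %>%                 # keep rows with more than 100 hp
  select(mpg, cyl, hp, wt) %>%         # keep only these columns
  mutate(kpl = mpg * 0.425144) %>%     # derive km per litre from miles per gallon
  group_by(cyl) %>%                    # one group per cylinder count
  summarize(n = n(),                   # rows in each group
            mean_mpg = mean(mpg),
            mean_kpl = mean(kpl)) %>%
  arrange(desc(mean_mpg))              # best fuel economy first

print(result)
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline readable from top to bottom.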
- 23. Demo [dplyr – manipulating data]
- 24. Connecting R and SQL Server The RODBC package provides access to databases (including Microsoft Access and Microsoft SQL Server) through an ODBC interface. Main functions: • odbcConnect(dsn, uid = "", pwd = ""): open a connection to an ODBC database • sqlFetch(channel, sqtable): read a table from an ODBC database into a data frame • sqlQuery(channel, query): submit a query to an ODBC database and return the results • sqlSave(channel, mydf, tablename = sqtable, append = FALSE): write a data frame to a table in the ODBC database, or update it (append = TRUE) • sqlDrop(channel, sqtable): remove a table from the ODBC database • close(channel): close the connection
- 25. RODBC Example
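The original example on this slide is not part of the transcript. As a stand-in, here is a minimal sketch of the usual RODBC workflow. It assumes the RODBC package is installed and that an ODBC data source has been configured; the helper name `fetch_from_sql_server`, the DSN `MySqlServerDSN` and the table `dbo.Sales` are hypothetical.

```r
# Hypothetical helper: open a connection, run one query, always close again
fetch_from_sql_server <- function(dsn, query, uid = "", pwd = "") {
  channel <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  # odbcConnect returns -1 (not an RODBC object) when the connection fails
  if (!inherits(channel, "RODBC")) {
    stop("Could not open an ODBC connection to DSN: ", dsn)
  }
  on.exit(RODBC::odbcClose(channel), add = TRUE)  # runs even if the query fails
  RODBC::sqlQuery(channel, query, stringsAsFactors = FALSE)
}

# Usage (requires a real DSN, so it is left commented out):
# sales <- fetch_from_sql_server("MySqlServerDSN",
#                                "SELECT TOP 10 * FROM dbo.Sales")
# head(sales)
```

The result of `sqlQuery` is an ordinary data frame, so everything shown earlier for data frames and dplyr applies to it.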
- 26. Other Interfaces The RJDBC package provides access to databases through a JDBC interface (for SQL Server, this requires the JDBC driver from Microsoft).
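A rough sketch of the JDBC route, assuming the RJDBC and DBI packages are installed and the Microsoft JDBC driver jar has been downloaded. The helper name `query_via_jdbc`, the jar path, the server address and the credentials are all placeholders.

```r
# Hypothetical helper: connect through JDBC, run one query, disconnect
query_via_jdbc <- function(jar_path, url, user, password, query) {
  drv <- RJDBC::JDBC(
    driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    classPath   = jar_path              # path to the downloaded driver jar
  )
  conn <- DBI::dbConnect(drv, url, user, password)
  on.exit(DBI::dbDisconnect(conn), add = TRUE)
  DBI::dbGetQuery(conn, query)          # returns a data frame
}

# Usage (placeholders only, so it is left commented out):
# rows <- query_via_jdbc(
#   jar_path = "path/to/sqljdbc.jar",
#   url      = "jdbc:sqlserver://localhost:1433;databaseName=MyDatabase",
#   user     = "my_user",
#   password = "my_password",
#   query    = "SELECT TOP 10 * FROM dbo.Sales"
# )
```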
- 27. Demo [Let’s analyze - R and SQL Server]
- 28. Resources • The R Project for Statistical Computing http://www.r-project.org/ • RStudio http://www.rstudio.com/ • Revolution Analytics http://www.revolutionanalytics.com/ • Shiny http://shiny.rstudio.com/ • {swirl} Learn R, in R http://swirlstats.com/ • R-bloggers http://www.r-bloggers.com/ • Online R resources for Beginners http://bit.ly/1x2q6Gl • 60+ R resources to improve your data skills http://bit.ly/1BzW4ox • Stack Overflow - R http://stackoverflow.com/tags/r • Cerebral Mastication - R Resources http://bit.ly/17YhZj4 • Microsoft JDBC Drivers 4.1 and 4.0 for SQL Server http://bit.ly/1kEgJ7O
- 29. What Questions Do You Have?
- 30. Thank You for attending this session