Getting Started with R



Part of an advanced analytics course.

Published in: Education, Technology, Business

  • The other way is to attach() the ICECREAM data structure (R is case-sensitive, so the function name is lowercase). Then you can refer to the variable names directly.
  • Getting Started with R

    1. Advanced Data Analytics: Getting Started with R. Jeffrey Stanton, School of Information Studies, Syracuse University
    2. Analytics: Key Steps
        • Learn the application domain
        • Locate or develop a data source or data set
        • Clean and preprocess data: may take 60% of effort!
        • Data reduction and transformation: find useful pieces, squeeze out redundancies
        • Choose analytical approaches: summarize, visualize, organize, describe, explore, find patterns, predict, test, infer
        • Communicate the results and implications to data users
        • Deploy discovered knowledge in a system
        • Monitor and evaluate the effectiveness of the system
    3. First Example: Ice Cream Consumption
        • We all know the domain; we have all eaten ice cream
        • Public data set obtained from supplement to Verbeek’s text:
        • Let’s read the data into R and summarize it:
            ICECREAM <- read.csv("[pathname]/icecream.csv", header = TRUE)
            summary(ICECREAM)
        • What do these two R commands do? Did you get a mean of 84.6 for income? What are “Min,” “1st Qu.” and all of those other things?
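As a hint for the questions on this slide: read.csv() loads a comma-separated file into a data frame, and summary() reports the minimum, first quartile, median, mean, third quartile, and maximum of each numeric column. The sketch below shows this on a tiny made-up file written to a temporary path. The values and the names `toy`, `csv_path`, and `income_stats` are assumptions for illustration only, not the real ice cream data.

```r
# Hypothetical stand-in values, NOT the real ice cream data.
toy <- data.frame(
  cons   = c(0.30, 0.35, 0.40, 0.45, 0.50),
  income = c(70, 75, 80, 85, 90)
)

# Write the toy data to a temporary CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# header = TRUE tells read.csv() that the first row holds the column names.
ICECREAM <- read.csv(csv_path, header = TRUE)

str(ICECREAM)      # column names and types
summary(ICECREAM)  # Min, 1st Qu., Median, Mean, 3rd Qu., Max per column

# The same six numbers for a single column, computed one at a time.
income_stats <- c(
  min    = min(ICECREAM$income),
  q1     = unname(quantile(ICECREAM$income, 0.25)),
  median = median(ICECREAM$income),
  mean   = mean(ICECREAM$income),
  q3     = unname(quantile(ICECREAM$income, 0.75)),
  max    = max(ICECREAM$income)
)
print(income_stats)
```

Writing header = TRUE in full is a small safety habit: T is an ordinary variable that can be reassigned, while TRUE is a reserved word.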
    4. Metadata
        • There is a text file that goes with the CSV dataset: “icecream.txt”
        • This describes the meaning of the variables provided in the dataset; essential if we are to make sense of these data:
            Variable labels:
            cons: consumption of ice cream per head (in pints);
            income: average family income per week (in US Dollars);
            price: price of ice cream (per pint);
            temp: average temperature (in Fahrenheit);
            Time: index from 1 to 30
        • We also learn from the metadata that these are time series data with four-weekly observations from 18 March 1951 to 11 July 1953
    5. “Sanity Check” Using Histograms and Boxplots
        • Cleaning, screening, and preprocessing are essential to ensure that you understand what your data set contains and that it does not contain garbage. It is impractical to look at every data point, so we use histograms and boxplots to get an overview of our data:
            hist(ICECREAM$income)
            boxplot(ICECREAM$income)
        • What is the purpose of the “$” notation in the commands above? Is there any other way of referring to these variables?
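One possible answer to the “$” question, sketched on made-up numbers: “$” pulls a single column out of a data frame by name, and R offers several equivalent spellings. The data frame `toy` and its values are hypothetical. attach(toy) would also make `income` visible by name, but it is left out here because it changes the search path for the rest of the session; with() gives the same convenience for a single expression.

```r
# Hypothetical stand-in values, NOT the real ice cream data.
toy <- data.frame(income = c(76, 78, 79, 80, 81, 96))

by_dollar  <- toy$income         # "$" extracts one column by name
by_bracket <- toy[["income"]]    # double brackets do the same
by_index   <- toy[, "income"]    # row/column indexing, all rows of one column
by_with    <- with(toy, income)  # with() evaluates an expression inside the data frame

# All four spellings return the same numeric vector.
all_same <- identical(by_dollar, by_bracket) &&
  identical(by_dollar, by_index) &&
  identical(by_dollar, by_with)
print(all_same)

# Numeric counterparts of the two sanity-check graphics.
bins <- hist(toy$income, plot = FALSE)    # bin counts without drawing the histogram
print(bins$counts)
flagged <- boxplot.stats(toy$income)$out  # points a boxplot would draw as outliers
print(flagged)
```

In this toy column the value 96 sits far above the rest, so it is the kind of point a boxplot draws separately and that deserves a second look during cleaning.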
    6. Interpret These Graphics
    7. Explore
        • Perhaps a family with greater income can afford to purchase more ice cream:
            plot(ICECREAM$income,ICECREAM$cons)
        • How do you interpret a scatterplot?
        • Is there a pattern here?
        • Does our intuitive hypothesis fit the scatterplot?
        • What else could scatterplots show?
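When reading a scatterplot it can help to label the axes and to put a number next to the visual impression. The sketch below does both on made-up values, using cor() for the Pearson correlation. The data frame `toy` and the axis labels are assumptions for illustration, and the result says nothing about the real ice cream data.

```r
# Hypothetical stand-in values, NOT the real ice cream data.
toy <- data.frame(
  income = c(76, 78, 80, 82, 84, 86),
  cons   = c(0.40, 0.35, 0.42, 0.36, 0.41, 0.38)
)

# Same scatterplot as on the slide, with explicit axis labels.
plot(toy$income, toy$cons,
     xlab = "Weekly family income (US dollars)",
     ylab = "Ice cream consumption per head (pints)",
     main = "Consumption versus income")

# Pearson correlation: near 0 means no linear pattern,
# near +1 or -1 means a strong linear pattern.
r <- cor(toy$income, toy$cons)
print(r)
```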
    8. More Tools to Support Exploration
            results <- lm(ICECREAM$cons ~ ICECREAM$temp)
            # This is a comment line
            # The previous command calculates a line
            # that best fits the scatterplot with temp
            # on the X axis and cons on the Y axis
            plot(ICECREAM$temp, ICECREAM$cons)
            abline(results) # Plots the best fit line
            # The new data structure "results" has
            # lots of information about the analysis.
            # What does this list contain?
            results$residuals
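To see what the object returned by lm() holds, names() lists its components, while coef() and residuals() are the usual accessor functions. The sketch below runs on made-up values and uses the formula-plus-data form `lm(cons ~ temp, data = toy)`, which fits the same kind of model as the slide's command without repeating the data frame name. The data frame `toy` and its values are assumptions.

```r
# Hypothetical stand-in values, NOT the real ice cream data.
toy <- data.frame(
  temp = c(30, 40, 50, 60, 70),
  cons = c(0.30, 0.33, 0.38, 0.41, 0.46)
)

# Fit a straight line: cons = intercept + slope * temp.
results <- lm(cons ~ temp, data = toy)

print(names(results))      # components stored in the fitted-model list
print(coef(results))       # intercept and slope of the best-fit line
print(residuals(results))  # observed cons minus fitted cons, one per row

# Scatterplot with the fitted line drawn on top.
plot(toy$temp, toy$cons)
abline(results)
```

Each residual is the vertical distance from a point to the fitted line, so for a model with an intercept the residuals sum to zero up to rounding error.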
    9. What is the effect of time on these data?
            plot(ICECREAM$time,ICECREAM$temp)
            plot(ICECREAM$time,ICECREAM$cons)
        • What do these plots show? Can you explain why these are shaped the way they are?
        • Based on your answer to the previous question, how does the situation affect your strategies for understanding ice cream consumption?
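One practical caution before plotting against time: R column names are case-sensitive. The metadata slide writes the index variable as Time while the commands above use time, so it is worth checking names(ICECREAM) to see which spelling the file actually uses. The sketch below assumes a column called `Time` in a made-up data frame and stacks the two time plots for easier comparison. All values are hypothetical.

```r
# Hypothetical stand-in values, NOT the real ice cream data.
toy <- data.frame(
  Time = 1:12,
  temp = c(40, 55, 65, 70, 60, 45, 30, 25, 35, 50, 65, 72),
  cons = c(0.33, 0.38, 0.41, 0.44, 0.39, 0.34, 0.29, 0.27, 0.31, 0.36, 0.42, 0.45)
)

print(names(toy))                # check the exact spelling of each column
wrong_case <- is.null(toy$time)  # TRUE: "time" does not match the column "Time"
print(wrong_case)

# Two stacked panels sharing the same time axis.
old_par <- par(mfrow = c(2, 1))
plot(toy$Time, toy$temp, type = "b", xlab = "Time index", ylab = "temp")
plot(toy$Time, toy$cons, type = "b", xlab = "Time index", ylab = "cons")
par(old_par)                     # restore the previous plot layout
```

With type = "b" the points are joined by lines, which makes a repeating seasonal shape easier to spot than in a plain scatterplot.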
    10. Demonstrating Mastery
        • Find a small numeric dataset; try starting at the Journal of Statistics Education data website:
        • Read the dataset into R
        • Summarize the variables in that dataset
        • Use histograms and boxplots to check and understand your data; use the metadata description that came with the dataset to make sure that you know the variables
        • Explore the data using plot; look for something interesting
        • Put your findings in a slide and communicate them to me or someone else
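The exercise steps above can be rehearsed end to end before you find your own data. The sketch below uses `cars`, a small dataset that ships with R (speed and stopping distance), purely as a stand-in; choosing it and fitting a line at the end are assumptions, not part of the assignment.

```r
# Built-in dataset used as a stand-in for a dataset you find yourself.
data(cars)

# Read and summarize: structure first, then per-variable summaries.
str(cars)
print(summary(cars))

# Sanity checks: distribution of each variable.
# (The metadata for a built-in dataset is its help page: ?cars)
hist(cars$speed, main = "Speed", xlab = "speed (mph)")
boxplot(cars$dist, main = "Stopping distance", ylab = "dist (ft)")

# Explore: scatterplot plus a best-fit line.
plot(cars$speed, cars$dist, xlab = "speed (mph)", ylab = "dist (ft)")
fit <- lm(dist ~ speed, data = cars)
abline(fit)
print(coef(fit))
```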