INFX 502 Semester Project
Due Date: December 6, 11:55 pm
Description:
In this project, your task is to analyze and visualize a dataset that has at least two categorical and three numerical variables (also called columns or features). The more variables the dataset has, the richer the analyses can be. It is important to find or compile a dataset that you are truly interested in. You may choose one of the built-in R datasets. Preferably, search for datasets on the Internet or use the websites listed below. You are allowed to use MS Excel to merge different datasets and clean your data before you save it in .csv format and load it into the R environment for visual analysis. You are expected to use all applicable techniques that you have learned during the semester as well as in your past statistics course. For example:
You may plot figures of two and/or three variables to find out whether a variable is correlated with one or more other variables.
You may analyze (visualize) the summary statistics of individual variables, as well as their conditional statistics.
Related to the previous items:
You may visualize two continuous variables together to show their correlation and discuss the coefficient of determination.
You may visualize a continuous variable together with a categorical variable to show how the univariate statistics of the continuous variable change across the levels of the categorical variable. You may apply the t-test, ANOVA, or F-test to test the various hypotheses that you learned about in your STAT course.
You may compute and show the contingency table of two categorical variables and visualize it using a heatmap. Moreover, you can apply the chi-square test of independence to reveal relations between the variables.
You may detect outliers and try to explain why they exist in the dataset.
Depending on your data, you may model it using linear regression or some other regression technique, perform residual analysis, and explain the reasoning behind your model and the coefficients that you found.
If you have time series data, you may decompose your series into trend, seasonal, and random components, and then discuss those components individually or together.
You may use clustering techniques to cluster your instances based on one or more features.
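Several of the techniques listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the built-in mtcars and AirPassengers datasets, the chosen variables, and the number of clusters are stand-ins picked for the example, not requirements of the project.

```r
# Illustrative sketch only: mtcars, AirPassengers, and every variable choice
# below are assumptions made for this example.
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

# Summary statistics, overall and conditional on a categorical variable
overall_summary <- summary(cars$mpg)
mpg_by_cyl <- tapply(cars$mpg, cars$cyl, mean)

# Two continuous variables: scatter plot and coefficient of determination
plot(mpg ~ wt, data = cars)
r_squared <- cor(cars$wt, cars$mpg)^2

# Continuous vs. categorical: box plot, t-test (2 levels), ANOVA (3 levels)
boxplot(mpg ~ cyl, data = cars)
t_result <- t.test(mpg ~ am, data = cars)
anova_result <- summary(aov(mpg ~ cyl, data = cars))

# Two categorical variables: contingency table, heatmap, chi-square test
counts <- table(cyl = cars$cyl, am = cars$am)
heatmap(unclass(counts), Rowv = NA, Colv = NA, scale = "none")
# Small cell counts make R warn that the approximation may be inaccurate
chi_result <- suppressWarnings(chisq.test(counts))

# Outliers by the box plot rule (beyond 1.5 times the hinge spread)
hp_outliers <- boxplot.stats(cars$hp)$out

# Linear regression with a residual plot
fit <- lm(mpg ~ wt + cyl, data = cars)
plot(fitted(fit), resid(fit))
abline(h = 0, lty = 2)

# Time series decomposition into trend, seasonal, and random components
parts <- decompose(AirPassengers)
plot(parts)

# k-means clustering on standardized numerical features
set.seed(1)
clusters <- kmeans(scale(cars[, c("mpg", "wt", "hp")]), centers = 3, nstart = 20)
```

Printing any of the stored objects (for example t_result, chi_result, or summary(fit)) shows the statistics that a report would then interpret in prose.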
Datasets:
Resource                                                          Description
> library(help="datasets")                                        The R Datasets Package
http://www.data.gov/                                              US Federal Government Dataset Collection
https://wonder.cdc.gov/welcomet.html                              Centers for Disease Control and Prevention
http://www.loc.gov/rr/main/alcove9/statdata.html                  Statistical Databases and Data Sets
http://r-dir.com/reference/datasets.html                          R-Dir Free Datasets
http://www.r-bloggers.com/datasets-to-practice-your-data-mining/  Datasets to Practice Your Data Mining
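Before committing to a dataset, it may help to check that it has at least two categorical and three numerical variables. The sketch below is one possible way to do that. The helper name count_variable_types, the file name my_dataset.csv, and the use of mtcars as a stand-in are all assumptions made for illustration.

```r
# Hypothetical helper: counts categorical and numerical columns in a data frame
count_variable_types <- function(df) {
  is_categorical <- vapply(
    df,
    function(col) is.factor(col) || is.character(col) || is.logical(col),
    logical(1)
  )
  is_numerical <- vapply(df, is.numeric, logical(1))
  c(categorical = sum(is_categorical), numerical = sum(is_numerical))
}

# With your own cleaned file (hypothetical file name), you would load it like this:
# candidate <- read.csv("my_dataset.csv", stringsAsFactors = TRUE)

# Built-in stand-in so the sketch runs as-is; two coded columns become factors
candidate <- transform(mtcars, cyl = factor(cyl), am = factor(am))

type_counts <- count_variable_types(candidate)
meets_requirement <- type_counts[["categorical"]] >= 2 &&
  type_counts[["numerical"]] >= 3
print(type_counts)
print(meets_requirement)
```

Note that numerically coded categories (such as cyl and am here) are counted as numerical until they are converted with factor(), so the counts are only as good as the column types.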
Deliverables:
You need to write a detailed .pdf report. Your report should have a cover page with at least a report title, your name, and ULID.
Your report consists of three sections, namely Dataset, Analysis, and Summary.
1) Dataset: In the first section, you are expected to explain your dataset thoroughly. Your explanation should include at least the following:
A description of the dataset
A table with the variable (column) names in the dataset and their descriptions
Where and when you obtained the dataset
What you expect to find during your analysis
The first few lines of your dataset, obtained through the “head” command
2) Analysis: In the second section, you are expected to analyze your data in detail. You are required to use all applicable techniques covered throughout the course as well as in your past statistics courses.
3) Summary: In the summary section, you need to briefly describe your findings and state whether they match what you expected to find before the analysis.
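Two of the items requested for the Dataset section, the variable table and the first few lines, could be produced along the following lines. This is a minimal sketch: mtcars is a stand-in for your own data, and the names report_data, variable_table, and first_rows are hypothetical. The description column is left empty because the descriptions have to be written by hand.

```r
# Stand-in data; in a real report this would be your own dataset
report_data <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Skeleton of the variable table; fill in the descriptions manually
variable_table <- data.frame(
  variable    = names(report_data),
  type        = vapply(report_data, function(col) class(col)[1], character(1)),
  description = "",
  row.names   = NULL
)

# head() returns the first six rows by default
first_rows <- head(report_data)

print(variable_table)
print(first_rows)
```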