Competencies
Synthesize the application of software used in data science environments.
Explain data storage processes and database management systems.
Explain statistical techniques used in data science.
Explain the use of classification analysis in data science.
Explain the use of cluster analysis in data science.
Describe the data science project lifecycle.
Scenario
After working in the industry for a number of years, you have decided to become a full-time, self-employed consultant. William Cogswell, President of Cogswell Cogs, works with highly proprietary information, but has some sample data that he is familiar with. He requests that you perform a quick proof of concept with this sample data to showcase your skills and show William and his leadership team what you can offer them. If he and his team at Cogswell Cogs like what they see, they will likely offer you a long-term consulting contract for their data analysis business needs, at which time you would be allowed to access their proprietary data and information.
Instructions
In a comprehensive presentation to William Cogswell and the Leadership Team at Cogswell Cogs, address the following items. Include all code, screenshots, explanations, and other information necessary to prove that you will be a worthwhile hire as their consultant.
Present a statistical overview of the Sales Forecasting Data file and the following data:
Store
Dept
Date
Weekly_Sales
IsHoliday
1. Using the R programming language, complete the following tasks:
Generate the mean and standard deviation of the weekly sales.
Generate a histogram for the weekly sales.
Using the ‘cor’ function, generate individual correlations between Weekly_Sales and each of the following parameters: Store, Dept, Date (broken out by month and year), and IsHoliday.
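The task above calls for R. As a cross-check of what those R calls compute, here is a minimal Python sketch using only the standard library. The four rows are hypothetical values invented for illustration, not taken from the Sales Forecasting Data file, and the helper name pearson is likewise an assumption of this sketch. It shows the sample standard deviation (the n - 1 form that R's sd uses), the month and year breakout of Date, and one Pearson correlation per parameter (Pearson is the default method of R's cor).

```python
from datetime import date
from math import sqrt
from statistics import mean, stdev

# Hypothetical rows (Store, Dept, Date, Weekly_Sales, IsHoliday).
# These values are invented for illustration only.
rows = [
    (1, 1, date(2010, 2, 5), 100.0, False),
    (1, 2, date(2010, 2, 12), 200.0, True),
    (2, 1, date(2010, 3, 5), 300.0, False),
    (2, 2, date(2011, 3, 12), 400.0, False),
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


sales = [r[3] for r in rows]

# Break Date out into separate numeric Month and Year columns.
predictors = {
    "Store": [r[0] for r in rows],
    "Dept": [r[1] for r in rows],
    "Month": [r[2].month for r in rows],
    "Year": [r[2].year for r in rows],
    "IsHoliday": [int(r[4]) for r in rows],
}

sales_mean = mean(sales)
sales_sd = stdev(sales)  # sample standard deviation (divides by n - 1)
correlations = {name: pearson(vals, sales) for name, vals in predictors.items()}

for name, r in correlations.items():
    print(f"{name}: {r:.4f}")
```

Store and Dept are identifiers rather than quantities, so a correlation against them is a number the presentation should interpret with care.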
2. Using the “R” statistical package, complete the following task:
Perform a multiple regression modeling Weekly_Sales against the following parameters: Store, Dept, Date (broken out by month and year), and IsHoliday.
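In R this task is normally handled by the built-in lm function. To make clear what a multiple regression fit actually solves, the following Python sketch computes least-squares coefficients from the normal equations. The function name fit_ols, the choice of three predictors, and the data rows are all assumptions made for illustration; the responses were constructed as 10 + 5*Store + 2*Month + 20*IsHoliday so the recovered coefficients are easy to check.

```python
def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    X is a list of predictor rows; an intercept column of 1s is prepended.
    Returns [intercept, b1, b2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; X'X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for k in range(p):
            if k != col:
                factor = A[k][col]
                A[k] = [vk - factor * vc for vk, vc in zip(A[k], A[col])]
    return [A[i][p] for i in range(p)]


# Hypothetical predictor rows: (Store, Month, IsHoliday as 0/1).
X = [(1, 1, 0), (2, 1, 1), (1, 2, 0), (2, 2, 0), (3, 1, 0)]
# Hypothetical responses built as 10 + 5*Store + 2*Month + 20*IsHoliday.
y = [17.0, 42.0, 19.0, 24.0, 27.0]

coefficients = fit_ols(X, y)
print([round(c, 6) for c in coefficients])
```

The normal equations are used here only because they fit in a few lines; they become numerically fragile when predictors are nearly collinear.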
3. Using the rpart function in R, complete the following task:
Generate a decision tree model using Weekly_Sales and the following parameters: Store, Dept, Date (broken out by month and year), and IsHoliday. Prune the tree appropriately to support a concise description that can lead to actionable results.
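To show the idea behind a regression tree and why pruning shortens its description, here is a small Python sketch. It is not rpart's implementation: the function names, the data rows, and the stopping rule are assumptions of this sketch. Each node picks the split that most reduces the sum of squared errors, and a split is refused when its gain falls below cp times the root error, which is only loosely modeled on rpart's complexity parameter.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, target):
    """Return (gain, feature, threshold) for the split that most reduces SSE."""
    parent = sse([r[target] for r in rows])
    best = (0.0, None, None)
    for feature in rows[0]:
        if feature == target:
            continue
        # Skip the smallest value so both sides of the split are non-empty.
        for threshold in sorted({r[feature] for r in rows})[1:]:
            left = [r[target] for r in rows if r[feature] < threshold]
            right = [r[target] for r in rows if r[feature] >= threshold]
            gain = parent - sse(left) - sse(right)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best


def grow(rows, target, root_sse, cp):
    """Grow a tree, refusing any split whose gain is below cp * root_sse."""
    gain, feature, threshold = best_split(rows, target)
    if feature is None or gain < cp * root_sse:
        values = [r[target] for r in rows]
        return {"predict": sum(values) / len(values), "n": len(rows)}
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return {
        "split": (feature, threshold),
        "left": grow(left, target, root_sse, cp),
        "right": grow(right, target, root_sse, cp),
    }


# Hypothetical rows, invented for illustration only.
data = [
    {"Dept": 1, "Month": 1, "Weekly_Sales": 10},
    {"Dept": 1, "Month": 12, "Weekly_Sales": 12},
    {"Dept": 2, "Month": 1, "Weekly_Sales": 50},
    {"Dept": 2, "Month": 12, "Weekly_Sales": 54},
]

root_error = sse([r["Weekly_Sales"] for r in data])
tree = grow(data, "Weekly_Sales", root_error, cp=0.01)
print(tree)
```

With cp=0.01 the sketch keeps only the Dept split, because the Month splits below it explain too little of the remaining variation; with cp=0 it would keep splitting until every row sits in its own leaf.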
4. Using the Email Dataset, complete the following tasks:
Use the clusters.py Python module from the Programming Collective Intelligence text to build a hierarchical clustering model.
Generate a cluster representation (image). You may wish to explore a subset of your data in order to support a smaller cluster representation.
Use the same module to build a k-means clustering model. For this model you are not required to print out the cluster, only the groups of the clusters (which rows are clustered together). Again, you may use a subset of the data to produce a more tractable output.
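The two clustering models in this task can be illustrated with a self-contained Python sketch. This is not the clusters.py module from the text: the function names, the Euclidean distance, the deterministic k-means start, and the four rows of word counts are all assumptions made for illustration. The first function returns the merge order that a dendrogram image would draw; the second returns the groups of row indices that the k-means task asks for.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_merge_order(rows, distance):
    """Agglomerative clustering; returns nested tuples of row indices."""
    clusters = [(i, [float(v) for v in row]) for i, row in enumerate(rows)]
    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        pairs = (
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        i, j = min(
            pairs, key=lambda p: distance(clusters[p[0]][1], clusters[p[1]][1])
        )
        (id_i, vec_i), (id_j, vec_j) = clusters[i], clusters[j]
        # The merged cluster sits at the average of the two vectors.
        merged = ((id_i, id_j), [(x + y) / 2 for x, y in zip(vec_i, vec_j)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def kmeans_groups(rows, distance, k, max_iter=100):
    """k-means; returns k lists of row indices (which rows cluster together)."""
    # Deterministic start from the first k rows (a simplification).
    centroids = [[float(v) for v in rows[i]] for i in range(k)]
    groups = None
    for _ in range(max_iter):
        new_groups = [[] for _ in range(k)]
        for idx, row in enumerate(rows):
            nearest = min(range(k), key=lambda c: distance(centroids[c], row))
            new_groups[nearest].append(idx)
        if new_groups == groups:
            break  # assignments stopped changing
        groups = new_groups
        for c, members in enumerate(groups):
            if members:
                columns = zip(*(rows[m] for m in members))
                centroids[c] = [sum(col) / len(members) for col in columns]
    return groups


# Hypothetical word counts per email (one row per email), for illustration only.
emails = [
    [5, 0, 1],
    [4, 1, 0],
    [0, 6, 5],
    [1, 5, 6],
]

merge_order = hierarchical_merge_order(emails, euclidean)
groups = kmeans_groups(emails, euclidean, k=2)
print(merge_order)
print(groups)
```

Because the hierarchical step compares every pair of clusters on every pass, its cost grows quickly with the number of rows, which is one practical reason to work on a subset of the data.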
5. Provide a summary recommending the tools that you think best fit a complete, institutionalized data pipeline for data analysis and presentation. Frame your recommendations in terms of Big Data (extremely large data sets), as William Cogswell has said that his proprietary data sets are extremely large. Cover the following topic areas, stating the advantages and disadvantages of each package described and your recommendation. Note: packages may overlap across topic areas, as one package can support more than one need. State each advantage and disadvantage in the context of extremely large data sets (Big Data).
Programming Languages (e.g. R, Python)
Machine Learning Libraries (e.g. Anaconda)
Extract-Transform-Load Utilities (e.g. Pentaho, Alteryx)
Databases
Graphic Support/Dashboard Analytics (e.g. Tableau, QlikView)
BI Software and Big Data (Hadoop, Apache Spark)
https://learning.rasmussen.edu/bbcswebdav/pid-5855341-dt-content-rid-151629594_1/xid-151629594_1