Running head: CS688 – Data Analytics with R
CS688 – Data Analytics with R
Surendra Parimi
CS688 – Introduction to CRISP-DM and the R platform IP 1
Colorado Technical University
07/10/2019
Table of Contents
Introduction to CRISP-DM and the R Platform
Organizational Background
CRISP-DM (Cross-Industry Standard Process for Data Mining)
Data Maturity
Role of Data Analyst
How Do We Implement the R Platform
R Modeling With Regressions and Classifications (TBD)
Model Performance Evaluation (TBD)
Visualizations With R (TBD)
Machine Learning (TBD)
References
Introduction to CRISP-DM and the R Platform
Organizational Background:
The organization I currently work for, and where I plan to apply the techniques from this data analytics course, is T-Mobile USA, which offers wireless mobile phone services to over 80 million customers in the United States. It is a large enterprise with large-scale information technology systems that support T-Mobile's business. The company is seeing significant growth, both in the business itself and in the IT systems that support it. As a DevOps engineer, I deploy code to these mission-critical systems, host them, and operate them to make sure they work as expected. As the landscape of our IT systems grows, we want to identify issues in our systems in advance so that we can prevent them before they cause any outage to the business. To achieve that, our IT system logs need to be analyzed in depth to uncover critical insights about system performance and to feed that knowledge back into improving our systems.
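As a minimal sketch of the kind of log analysis we have in mind, the following R snippet summarizes a handful of hypothetical log records; the service names, column names, and values are invented for illustration and do not reflect T-Mobile's actual log schema.

```r
# Hypothetical log records; in practice these would be parsed from
# application log files rather than typed in by hand.
logs <- data.frame(
  service    = c("billing", "billing", "activation", "activation", "activation"),
  level      = c("INFO", "ERROR", "INFO", "WARN", "ERROR"),
  latency_ms = c(120, 950, 80, 300, 1100)
)

# Proportion of ERROR-level entries per service
err_rate <- tapply(logs$level == "ERROR", logs$service, mean)

# Mean response latency per service, a simple early-warning signal
avg_latency <- tapply(logs$latency_ms, logs$service, mean)

print(err_rate)
print(avg_latency)
```

Even this toy summary shows the shape of the goal: per-service health indicators computed continuously from logs, so that a rising error rate or latency can be flagged before it becomes an outage.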
CRISP-DM (Cross-Industry Standard Process for Data Mining):
CRISP-DM helps us ensure that our data analysis adheres to certain standards, and it is a proven strategy worldwide. Corporations like IBM have further enhanced and customized the standard into their own methodology, known as the 'Analytics Solutions Unified Method for Data Mining/Predictive Analytics' (ASUM-DM).
The CRISP-DM methodology involves six phases:
Business Understanding: Building knowledge about the business requirements and objectives from a functional perspective, and transforming this knowledge into a data mining objective with an implementation plan.
Data Understanding: Collecting data from diverse sources, then reviewing and exploring it to identify problems that compromise data quality and to gain an initial understanding of what the data can deliver.
Data Preparation: The data preparation phase covers all activities to build the final dataset from the initial raw data collected.
Modeling: Modeling techniques are chosen based on the objective of the problem being addressed; the model is selected to fit the problem, and the data requirements follow from the model.
Evaluation: The evaluation phase takes place once a few models have been built; the candidate models are tested and compared against one another to see which model fits the need best.
Deployment: Generally, this means deploying a code representation of the model into an operational system to score or categorize new, unseen data as it arises, and creating a mechanism for using that new information in the solution of the original business problem. Importantly, the code representation must also include all the data preparation steps leading up to modeling, so that the model treats new raw data in the same manner as during model development.
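The six phases above can be sketched, at toy scale, as an annotated R script; the iris dataset that ships with R stands in for real business data, so the phase mapping is illustrative rather than a full CRISP-DM implementation.

```r
# Business understanding: suppose the objective is to predict petal width
# from other measurements (a stand-in for a real business question).

# Data understanding: inspect the raw data for shape and quality.
data(iris)
str(iris)
summary(iris)
stopifnot(!anyNA(iris))   # quick data-quality check

# Data preparation: keep only the columns the model needs.
prepared <- iris[, c("Petal.Width", "Petal.Length", "Sepal.Length")]

# Modeling: fit a linear regression on the prepared data.
fit <- lm(Petal.Width ~ Petal.Length + Sepal.Length, data = prepared)

# Evaluation: examine fit quality before deploying anything.
r_squared <- summary(fit)$r.squared

# Deployment: in practice, the fitted model would be serialized with
# saveRDS(fit, "model.rds") and applied to new data with predict().
```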
You may well observe that there is nothing special here, and that is largely true. From today's data science perspective, this seems like common sense, and that is exactly the point: the common process is so logical that it has become embedded in all our education, training, and practice.
Data Maturity:
Data maturity in our organization has so far followed a traditional approach: applications write critical data ranging from business and transactional records to application logs, which include diverse information such as the functional health of the application and nonfunctional aspects such as performance. All of this data is written into RDBMS databases as tables and rows, and it is organized so that the data warehouse schemas logically segregate the data by business function. Over time, as the systems matured, we started to realize that there is an abundance of useful information in our data that the RDBMS design does not capture, so we applied big data techniques to include the unstructured (big) data that holds a great deal of critical information for our business.
Once we started including unstructured data in our analysis, we were able to draw better insights, but the process still was not perfect. We therefore adopted additional research methodologies to focus on the right data (data quality), which helps us derive the right insights and aids in making critical business decisions, such as better maintaining our application suite based on the insights we get from our application logs. Once we arrived at this stage, we were able to perform predictive analyses to some extent, and based on the results we are updating our data strategy so that the right data is captured in the systems from the very beginning and a robust analysis can be done to achieve critical insights through prediction, research, and classification.
A separate metadata repository has been created with logical groupings so that data organization and utilization are efficient and accurate. Metadata management has also helped us avoid redundant tasks in dealing with the data and has enabled faster analytics. All teams and stakeholders were involved in and agreed upon the metadata management policies, which were implemented across the organization to achieve consistency across systems.
Role of Data Analyst:
The role of an analyst dealing with the cutting-edge needs of data analysis becomes interesting and very dynamic. While using the data, the analyst now thinks about what kinds of data need to be collected and makes sure to call out missing data elements beforehand, so that all the critical pieces of data, both structured and unstructured, get into the data analytics system. The analyst is expected to have the ability to visualize the insights to be drawn from the data, perform relative comparisons, and develop predictive models. To cover all the diverse insights that the data has to offer, the analyst needs to think about correlation factors and about the data that is needed to correlate, compare, and apply the predictive models.
Moving the analyst's role from a typical analysis to this data-focused analysis greatly enhances the analyst's overall understanding of the organizational objectives, of how to extract the desired insights from the data, and, more importantly, of how to collaborate with all the required teams to get the right data to work on.
How Do We Implement the R Platform:
We utilize the R platform's robust data analytics capability in tandem with RStudio to develop and experiment with our planning and implementation. R being an integrated suite, we plan to utilize its computing power for statistical data manipulation and its graphical displays for analysis, presentations, and reviews. R's suite of operations for calculations over arrays of data, especially with huge datasets, will be very helpful in achieving our goal.
Linear and non-linear modeling, time-series analysis, and various other statistical techniques are offered by R as libraries, and it is easy to implement R in our organization as it lays out the inroads for data analysis.
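As one example of the library support mentioned above, R's built-in stats functions handle classical time-series analysis out of the box. The sketch below uses the AirPassengers dataset bundled with R; in our setting, monthly transaction or log volumes would take its place.

```r
# Monthly airline passenger counts, 1949-1960, bundled with R
data(AirPassengers)

# Classical decomposition into trend, seasonal, and random components
parts <- decompose(AirPassengers, type = "multiplicative")

# Holt-Winters exponential smoothing for short-term forecasting
hw <- HoltWinters(AirPassengers)
forecast_12 <- predict(hw, n.ahead = 12)   # forecast the next 12 months
```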
R Modeling With Regressions and Classifications (TBD)
Assignment 1
Develop an Enterprise Data Analytics and Machine Learning Strategy with examples. Provide the background for an organization, including the type of business, major data types, and the business processes that use these data. Describe the maturity of the data, the role of analysis versus analytics, and how the R language can be used to improve the business.
During Week 1, you will establish the foundation and shell document for your final assignment, which will be your Enterprise Data Analytics and Machine Learning Strategy. Each subsequent week, you will revise and complete an additional section.
First, you will select an organization (real or hypothetical) and apply your research to the development of an Enterprise Data Analytics and Machine Learning Strategy that would be appropriate for statistical data mining within the organization.
The project deliverables include the following:
· Organizational Background
· Provide a brief description of the organization (real or hypothetical), type of business, major data types, and the business processes that use these data.
· Data Maturity Within the Organization
· Describe the maturity of data within the organization, including data quality, master data management, use of data warehouses, and the importance of data in making business decisions.
· Describe the analyst role and his or her use of data. Elaborate on how analytics may augment or replace the common analyst role.
· Discuss how the R platform may be useful to the organization.
The draft paper should be 10–12 pages, including empty sections. It should be formatted using APA style and include at least two references. The addition of the new material shall be 3–4 pages of original content.
Assignment 2
For Week 2, you will extend your Enterprise Data Analytics and Machine Learning Strategy to include the appropriate use of regression and classification methods within your organization.
During this week, you will utilize the Iris dataset provided with R, or locate or create an example dataset that meets the assumptions of either a regression or classification model, and illustrate the application with R or RStudio. The example is intended to illustrate how these techniques may be applied within the organization. Additional discussion should occur around similar approaches with organization-specific data.
The project deliverables include the following:
· Describe regression and classification techniques, their uses, and when and why they are used.
· Utilize example data, such as the Iris data provided with R, or locate or create an example dataset that meets the assumptions for regression or classification models. Describe the data, and provide code examples for utilization of either a regression or classification approach in R or RStudio. Include screenshots where appropriate, and discuss the steps utilized.
· Discuss how these results would be communicated to a technical and nontechnical audience.
· Discuss how these techniques would be used within the organization. Use examples to reinforce your ideas.
Using the partially completed template you created last week, add 3–4 pages of new content. It should be formatted using APA style and include at least two references.
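As one sketch of the kind of classification example this assignment calls for, linear discriminant analysis from the MASS package (a "recommended" package that ships with standard R installations) can classify iris species from the four flower measurements:

```r
library(MASS)   # provides lda(); ships with standard R distributions
data(iris)

# Fit a linear discriminant model predicting species from all four measurements
fit <- lda(Species ~ ., data = iris)

# Predict on the training data and summarize performance
pred <- predict(fit, iris)$class
conf <- table(actual = iris$Species, predicted = pred)
accuracy <- mean(pred == iris$Species)

print(conf)
print(accuracy)
```

On the full dataset, LDA misclassifies only a handful of versicolor/virginica flowers. A real report would also hold out a test set rather than scoring the training rows, which is exactly what the Week 3 evaluation step addresses.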
Assignment 3
Extend the Enterprise Data Analytics and Machine Learning Strategy plan to include model performance evaluation techniques. Building upon the regression or classification technique discussed via an analysis with code examples, provide additional code that evaluates the performance of the model.
For Week 3, you will extend your Enterprise Data Analytics and Machine Learning Strategy to include the appropriate use of performance evaluation.
During this week, you will continue to utilize either the Iris dataset or an example dataset of your choosing to evaluate the performance of prior modeling. Screenshots of R or RStudio should be provided to support the research and analysis. Additional discussion should occur around similar approaches with organization-specific data.
The project deliverables must include the following:
· Describe performance evaluation for regression and classification.
· Expanding upon the modeling example in Week 2, discuss specific performance evaluation considerations for the modeling technique.
· Provide code examples and a discussion for how the model will be evaluated.
· Discuss the overall fit for use of the algorithm to make a data-driven decision as well as any risks for use.
· Discuss how these techniques would be used within the organization, specific to the available data and desired outcomes. Use examples to reinforce your ideas.
Using the partially completed template created in Week 1 and extended in Week 2, add 3–4 pages of new content. It should be formatted using APA style and include at least two references.
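A minimal sketch of such an evaluation, using a hold-out split on the iris data (the split size and random seed are arbitrary choices for illustration):

```r
data(iris)
set.seed(42)                              # make the random split reproducible
idx   <- sample(nrow(iris), size = 100)   # 100 rows for training
train <- iris[idx, ]
test  <- iris[-idx, ]                     # remaining 50 rows held out

# Fit the regression on training rows only
fit  <- lm(Petal.Width ~ Petal.Length, data = train)
pred <- predict(fit, newdata = test)

# Error metrics computed on rows the model has never seen
rmse <- sqrt(mean((test$Petal.Width - pred)^2))
mae  <- mean(abs(test$Petal.Width - pred))
```

Reporting error on held-out rows, rather than on the training data, is what guards against an over-optimistic assessment of the model.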
Assignment 4
Building upon the modeling, optimization, and validation, you will now explore how visualization can assist with these activities as well as communicate the findings of the analytics project.
For Week 4, you will extend your Enterprise Data Analytics and Machine Learning Strategy to include visualization techniques.
During this week, you will continue to utilize either the Iris dataset or an example dataset of your choosing to apply analytics visualization techniques. Screenshots of R or RStudio should be provided to support the research and analysis. Additional discussion should occur around similar approaches with organization-specific data.
The project deliverables include the following:
· Describe the benefits of visualization in an analytics project from two perspectives: interpreting models and communicating results.
· Out of the following potential visualization techniques, or others if you choose to research additional techniques, choose three, and compare and contrast their benefits and when they should be used. In this discussion, include the role of static and interactive visualizations.
· Histogram, box plot, bar or line chart, scatter plot, heat map, mosaic map, geolocation map, three-dimensional (3-D) graphs, correlogram, bubble chart, or arc graph
· Provide code examples and output examples for one of the chosen models using the Iris or example dataset.
· Discuss how these techniques would be used within the organization, specific to the available data and desired outcomes. Use examples to reinforce your ideas.
Add 3–4 pages of new content to the plan you developed over the length of this course. It should be formatted using APA style and include at least two references.
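As a sketch of two of the listed techniques using base R graphics (the output file path is an arbitrary choice so the example also runs on a headless machine):

```r
data(iris)

# Write both plots to a single PDF; pdf() needs no display device
out <- file.path(tempdir(), "iris_plots.pdf")
pdf(out, width = 8, height = 4)
par(mfrow = c(1, 2))   # two panels side by side

# Histogram: distribution of a single measurement
hist(iris$Sepal.Length, main = "Sepal length", xlab = "cm")

# Scatter plot: two measurements, colored by species
plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species), pch = 19,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)",
     main = "Petal dimensions by species")

dev.off()
```

In RStudio the same calls render interactively in the Plots pane; writing to a file is what you would do when automating a report.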
Assignment 5
Extend your Enterprise Data Analytics and Machine Learning Strategy plan to include the end-to-end predictive modeling process with the R language. Emphasis will be placed on good data management practices, automation, predictive modeling competency, and communicating the results.
The first four weeks consisted of identifying an organizational opportunity, utilizing regression and classification techniques, evaluating model performance, and creating visualizations.
For Week 5, you will extend your Enterprise Data Analytics and Machine Learning Strategy to include flowcharts as well as the use of an industry process, such as the Cross-Industry Standard Process for Data Mining (CRISP-DM).
During this week, you will continue to utilize either the Iris dataset or an example dataset of your choosing to illustrate the end-to-end use of R that aligns with the identified flowchart and industry data mining process.
The project deliverables include the following:
· Identify an industry process upon which the organization may choose to standardize. Examples could be CRISP-DM; Sample, Explore, Modify, Model, and Assess (SEMMA); or Knowledge Discovery in Databases (KDD).
· Create a workflow that illustrates the flow of data through the identified process. This should include everything from data source origination through organizational use of the findings of modeling.
· Provide R code examples or screenshots of the end-to-end process. Previous code from prior sections should be utilized; however, the end-to-end code with supporting screenshots, plots, and visualizations should be provided.
· Describe how the model would be deployed and used.
· Describe how the results of the identified modeling activity would be communicated to a technical and nontechnical audience.
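The end-to-end flow these deliverables describe can be sketched as a set of small R functions over the iris data; the function names and the saved-model path are illustrative choices, not a prescribed design.

```r
data(iris)

# Data preparation: validate quality and select the columns the model needs
prepare <- function(raw) {
  stopifnot(!anyNA(raw))                  # data-quality gate
  raw[, c("Petal.Width", "Petal.Length")]
}

# Modeling, evaluation, and scoring steps as reusable functions
train_model <- function(d) lm(Petal.Width ~ Petal.Length, data = d)
evaluate    <- function(m, d) sqrt(mean((d$Petal.Width - predict(m, d))^2))  # RMSE
score       <- function(m, new_data) predict(m, newdata = new_data)

d     <- prepare(iris)
model <- train_model(d)
rmse  <- evaluate(model, d)

# "Deployment": persist the model, reload it, and score a new observation,
# mirroring how a scheduled job would treat newly arriving data
path <- file.path(tempdir(), "petal_model.rds")
saveRDS(model, path)
reloaded <- readRDS(path)
predicted_width <- score(reloaded, data.frame(Petal.Length = 4.0))
```

Because each phase is a function, the same pipeline can be rerun unchanged as new data arrives, which is the automation emphasis this assignment asks for.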