This document covers the steps involved in the data science life cycle (DSLC): business understanding, data acquisition and understanding, modeling, deployment, and customer acceptance. It details several of these steps, including data modeling and initial data exploration, with the goal of clearly outlining the typical process and considerations for a data science project, from defining the problem to exploring the available data.
Introduction to Data Science - Week 3 - Steps involved in Data Science
1. DSA – 105 Introduction to Data Science
Week 3 – Steps involved in Data Science
Ferdin Joe John Joseph, PhD
Faculty of Information Technology
Thai-Nichi Institute of Technology
2. Week 3
Agenda
• Steps involved in Data Science
3. Process in Data Science Life Cycle (DSLC)
4. DSLC
• Business understanding
• Data acquisition and understanding
• Modeling
• Deployment
• Customer acceptance
8. Data Modelling (Contd)
Types of Data Models
• Conceptual: This data model defines WHAT the system contains. It is typically created by business stakeholders and data architects. Its purpose is to organize, scope, and define business concepts and rules.
• Logical: Defines HOW the system should be implemented, regardless of the DBMS. It is typically created by data architects and business analysts. Its purpose is to develop a technical map of rules and data structures.
• Physical: This data model describes HOW the system will be implemented using a specific DBMS. It is typically created by DBAs and developers. Its purpose is the actual implementation of the database (a minimal sketch follows this slide).
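To make the three levels concrete, here is a minimal sketch (the tables and columns are invented for illustration, not taken from the slides) of a physical model expressed as DDL for a specific DBMS, SQLite, driven from Python:

import sqlite3

# Physical model: the actual implementation for a specific DBMS (SQLite).
# Conceptually, "a Customer places Orders"; logically, each entity gets
# attributes and keys; physically, we pin down types, constraints, defaults.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total       REAL DEFAULT 0.0   -- default values are fixed at this level
);
""")
conn.close()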
9. Advantages and Disadvantages of Data Models
Advantages of a data model:
• The main goal of designing a data model is to make certain that the data objects provided by the functional team are represented accurately.
• The data model should be detailed enough to be used for building the physical database.
• The information in the data model can be used to define the relationships between tables, primary and foreign keys, and stored procedures.
• A data model helps the business communicate within and across organizations.
• A data model helps to document the data mappings in the ETL process.
• It helps to identify the correct sources of data to populate the model.
Disadvantages of a data model:
• To develop a data model, one must know the characteristics of how the data is physically stored.
• A navigational model makes application development and management complex, and it requires detailed knowledge of the underlying data structures.
• Even a small change in the structure can require modification of the entire application.
• There is no standard data manipulation language across DBMSs.
10. Data Models in a Nutshell
• Data modeling is the process of developing a data model for the data to be stored in a database.
• Data models ensure consistency in naming conventions, default values, semantics, and security, while ensuring the quality of the data.
• The data model structure helps to define the relational tables, primary and foreign keys, and stored procedures.
• There are three types of data models: conceptual, logical, and physical.
• The main aim of a conceptual model is to establish the entities, their attributes, and their relationships.
• A logical data model defines the structure of the data elements and sets the relationships between them.
• A physical data model describes the database-specific implementation of the data model.
• The main goal of designing a data model is to make certain that the data objects provided by the functional team are represented accurately.
• The biggest drawback is that even a small change in the structure can require modification of the entire application.
11. Data vs. Metadata
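The slide contrasts data with metadata. As a rough illustration (the record and fields below are invented for the example), data is the content itself, while metadata describes that content:

# Illustrative only: a record's data vs. the metadata that describes it.
data = {"customer_id": 1042, "name": "Alice", "balance": 250.75}

metadata = {
    "table": "customers",                       # where the data lives
    "columns": {"customer_id": "INTEGER",
                "name": "TEXT",
                "balance": "REAL"},             # type of each field
    "last_updated": "2024-06-01",               # when it was refreshed
    "source": "CRM nightly export",             # where it came from
}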
23. Steps in Data Science Process
24. Define the Project Objective
• Goal: Clearly and explicitly specify the model target as a sharp question that is used to drive the customer engagement.
• Responsibility: This step is customer driven, to maximize business value, with guidance from the data science team to make the end objective answerable and actionable.
• The first step towards a successful data science project is to define the question we are interested in answering. This is where we define a hypothesis we would like to test, or the objective of the project. It helps to describe what the expected end result of the engagement would be, so that we can use these results to add business value.
25. Define the Project Objective
• A key component of successful data science projects is defining the project objective with a sharp question. A sharp question is well defined and can be answered with a name or a number. Remember that data science can only be used to answer five different types of questions:
How much or how many? (regression)
Which category? (classification)
Which group? (clustering)
Is this weird? (anomaly detection)
Which option should be taken? (recommendation)
• The type (or class) of the question restricts and informs the following:
Which algorithms the data scientist can use to address the problem.
How to measure the algorithm's accuracy.
The data requirements.
A success metric is typically determined by which question is asked. The metric is defined by how we measure accuracy within that question class. Once we have an idea of the measure, we can discuss what success would look like in terms of this metric. A rough mapping from question class to algorithm and metric is sketched after this slide.
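As a non-authoritative illustration of how the question class informs the algorithm and the success metric, a mapping like the following could be sketched with scikit-learn (these pairings are common choices, not prescriptions from the slides):

# Question class -> (representative estimator, typical success metric).
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

QUESTION_CLASSES = {
    "How much or how many? (regression)": (LinearRegression, "RMSE / MAE"),
    "Which category? (classification)":   (LogisticRegression, "accuracy / F1"),
    "Which group? (clustering)":          (KMeans, "silhouette score"),
    "Is this weird? (anomaly detection)": (IsolationForest, "precision at k"),
    # Recommendation ("Which option should be taken?") typically uses
    # ranking or matrix factorization models, scored with precision@k or NDCG.
}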
26. Deliverable
• Deliverable: Project Objective. This is usually a single-page document clearly stating the question of interest and how the expected answer will look. The document should also include some criteria for customer acceptance of the final solution and an expected implementation of the solution.
• We can think of this as an initial contract that defines the customer's expectations in terms of an achievable end point for the engagement. It is often completed in collaboration between the customer and the data science team, and it proves valuable because it encourages customer engagement in the process.
27. Identifying Data Sources
Goal: Clearly specify where to find the data sources of interest. Define the machine learning target in this step and determine whether we need to bring in ancillary data from other sources.
Responsibility: Typically, the customer comes with data in hand. With a sharp question, the data science team can begin formulating an answer by locating the data required to answer that question.
Just because we have a lot of data does not mean we will use it all, or that it contains everything we need to answer the question. In addition, not all data sources are equally helpful in answering the specific question of interest. We are looking for:
• Data that is relevant to the question. Do we have measures of the target and features that are related to the target?
• Data that is an accurate measure of our model target and the features of interest.
A quick programmatic screen for these two properties is sketched after this slide.
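As a minimal sketch (the file and column names below are assumptions invented for illustration), a first programmatic pass might confirm that a candidate source actually measures the target and the candidate features:

import pandas as pd

# Hypothetical data source and column names, used only for illustration.
df = pd.read_csv("customer_churn.csv")

target = "churned"
candidate_features = ["tenure_months", "total_spend", "support_tickets"]

# Relevance: does the source contain the target and related features?
missing_cols = [c for c in [target] + candidate_features if c not in df.columns]
if missing_cols:
    print("Source lacks required columns:", missing_cols)

# Accuracy (rough proxy): how complete are the measurements we do have?
present = [c for c in [target] + candidate_features if c in df.columns]
print(df[present].isna().mean())   # share of missing values per column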
28. Identifying Data Sources
We are typically using data sources that were collected for reasons other than answering our specific question. This means we are collecting data opportunistically, so some information that could be extremely helpful in answering the question may never have been collected. We also do not control the environment of observation, which means we can only determine correlations between the collected information and the outcome of interest, not specific causal inferences.
Deliverable: Data Sources. Usually a single-page document clearly stating where the data resides. This could include one or more data sources and possibly the associated entity-relationship diagrams. This document should also include the definition of the target variable.
29. Initial Data Exploration
Goal: Determine whether the data we have can be used to answer the question. If not, we may need to collect more data.
Responsibility: The data science team begins to evaluate the data.
Once we know where to find the data, this initial pass will help us determine the quality of the data provided for answering the question. Here we are looking to determine whether the data is:
• Connected to the target.
• Large enough to move forward.
30. Initial Data Exploration
• At this point, graphical methods are extremely helpful. Have we measured the features consistently enough for them to be useful, or are there a lot of missing values in the data? Has the data been collected consistently over the time period of interest, or are there blocks of missing observations? If the data does not pass this quality check, we may need to go back to the previous step to correct the data or get more.
• We also need enough observations to build a meaningful model, and enough features for our methods to differentiate between observations. If we are trying to differentiate between groups or categories, are there enough examples of all possible outcomes?
• The initial data exploration step (step 3) is done in parallel with identifying data sources (step 2). As we determine whether the data is connected to the target and whether we have enough of it, we may need to find new data sources with more accurate or more relevant data to complete the data set initially identified in step 2. A short sketch of these quality checks follows this slide.
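A minimal sketch of these quality checks, continuing the hypothetical churn example (the date column name is another assumption):

import pandas as pd

df = pd.read_csv("customer_churn.csv", parse_dates=["signup_date"])

# Missing values per feature: measured consistently enough to be useful?
print(df.isna().mean().sort_values(ascending=False))

# Coverage over time: are there blocks of missing observations?
print(df.set_index("signup_date").resample("M").size())

# Class balance: enough examples of all possible outcomes?
print(df["churned"].value_counts())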
31. Initial Data Exploration
Deliverables: Data Exploration. This step should produce initial drafts of the following documents:
Exploratory Data Analysis Report: A document detailing the data requirements, quality (accuracy, connectedness), and relevance to the target, along with the ability to answer the question of interest. It is best to use graphical methods to clearly show data features in an understandable way. Additionally, we should have an idea of whether there is enough data to answer the question of interest with some confidence in the end result.
Analytics Architecture Diagram (initial draft): With the data sources in hand, we can start to define how the machine learning pipeline will work. How often will the data sources be updated? What actions should be taken on those updates? Is there a retraining criterion as we collect and label new observations? Documenting this now can help us define and capture the required artifacts for use in later steps.
Checkpoint Decision
Before we begin the full feature engineering and model building process, we can reevaluate the project to determine the value of continuing the effort. We may be ready to proceed, we may need to collect more data, or it is possible that the data needed to answer the question does not exist.
32. Construction of Analysis Data
Goal: Construct the analysis data set, with the associated feature engineering, for building the machine learning model.
Responsibility: The data science team, usually made up of data engineers, who are experts in getting data from disparate sources, and data scientists, who perform additional quality and quantity checks.
33. Construction of Analysis Data
The analysis data set is defined by the following:
Inclusion/exclusion criteria: Evaluate observations on multiple levels to determine whether they are part of the population of interest. Are they connected in time? Are there observations missing large chunks of information? We consider both business reasons and data quality reasons for the inclusion/exclusion criteria.
Feature engineering: This involves the inclusion, aggregation, and transformation of raw variables to create the features used in the analysis. If we want insight into what is driving the model, we need to understand how the features relate to each other and how the machine learning method will use them. This is a balancing act: include informative variables without including too many unrelated ones. Informative variables improve the result; unrelated variables introduce unnecessary noise into the model.
Avoid leakage: Leakage is caused by including variables that can perfectly predict the target. These are usually variables that may have been used to detect the target in the first place; as the target is redefined, these dependencies can be hidden from the original definition. Avoiding leakage often requires iterating between building the analysis data set and creating and evaluating a model. Leakage is a major reason data scientists get nervous when they get very good predictive results. A small feature engineering sketch, including a leakage check, follows this slide.
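A minimal feature engineering sketch under the same hypothetical churn example (all column names are assumptions); note the comment flagging a would-be leakage variable:

import numpy as np
import pandas as pd

df = pd.read_csv("customer_churn.csv")

# Inclusion/exclusion criteria: keep only the population of interest.
df = df[df["tenure_months"] >= 1]            # drop brand-new accounts

# Feature engineering: aggregate and transform raw variables.
df["spend_per_month"] = df["total_spend"] / df["tenure_months"]
df["log_spend"] = np.log1p(df["total_spend"])

# Avoid leakage: a field like "account_closed_reason" is recorded only
# after churn occurs, so it would perfectly predict the target -- drop it.
features = df.drop(columns=["churned", "account_closed_reason"],
                   errors="ignore")
target = df["churned"]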
34. Construction of Analysis Data
Deliverable: Feature Engineering
This step produces the following initial draft artifacts:
• The analysis data set itself, which will be used to train and test the machine learning
model in the next step.
• A document describing the feature engineering required to construct the analysis data
set.
• The source code to build the analysis data set, including the queries or other source code that produce the model features and the model targets. The model features should be kept separate from the target calculations for use when predicting on new observations in a production setting. This artifact will be used directly in the production pipeline of step 7.
35. Machine Learning Model
Goal: Answer the question by constructing and evaluating an informative model that predicts the target.
Responsibility: The data science team.
After a large amount of data-specific work, we are now ready to start building a model. This machine learning step is often executed in parallel with constructing the analysis data set, as information from the model can be used to build better features in the analysis data set.
36. Machine Learning Model
The process involves:
• Splitting the analysis data into training and testing sets.
• Evaluating (training and testing) a series of competing machine learning methods geared toward answering the question of interest with the data currently at hand.
• Determining the "best" solution by comparing the success metric across the alternative methods.
A minimal sketch of this loop follows this slide.
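A minimal sketch of the split-train-compare loop, assuming the features and target built in the earlier feature engineering sketch (the candidate models and metric are illustrative choices, not mandated by the slides):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Split the analysis data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)

# Evaluate competing methods against the same success metric.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, f1_score(y_test, model.predict(X_test)))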
37. Machine Learning Model
Deliverables: Machine Learning
• The machine learning model, which can be used to predict the target for new observations. This artifact will be used directly in the production pipeline of step 7.
• A document describing the model, how to use it, and the findings from the modelling process. What do the initial results look like? What do they tell us about our hypotheses and about the data we are using? Additionally, we can define visualizations of the model results here.
Checkpoint Decision
• Again, we can reevaluate here whether to move on to a production system. Does the model answer the question sufficiently, given the test data? Should we go back and collect more data (step 2), or change how the data is being used (step 4)?
38. Validation and Customer Acceptance
Goal: Finalize the machine learning deliverable by confirming the model and the evidence for accepting it.
Responsibility: Customer-focused evaluation of the project artifacts.
To get to this point, the data science team has some confidence that the project has progressed toward answering the question of interest. The answer may not be perfect, but given the data sources, the data exploration, the analysis data set, and the machine learning model, the data science team has some estimate of the model's ability and accuracy in attaining the project objective.
39. Validation and Customer Acceptance
This step formalizes the delivery of the engagement artifacts and results to the customer for final review before committing to building out the production pipeline. The customer can then determine whether the model meets the success metrics and whether the production pipeline would add business value.
Deliverable:
The following finalized documents and artifacts from each of the project milestones:
• Project Objective (step 1)
• Data Sources (step 2)
• Data Exploration (step 3)
• Feature Engineering (step 4)
• Machine Learning (step 5)
40. Validation and Customer Acceptance
Checkpoint Decision
For the most part, the customer should be familiar with all of these deliverables and be aware of the current state of the project throughout the process. The validation and customer acceptance step gives the customer a chance to evaluate the validity and value of the data science solution from a business perspective before committing to the production implementation.
41. Production Pipeline Implementation
Goal: Implement the full process that uses the model and the insights obtained from the engagement. The pipeline is the actual delivery of the business value to the customer.
Responsibility: The data science team, typically data engineers, building out the system first described in the initial data exploration step.
42. Production Pipeline Implementation
Deliverable: The deliverable here is defined by how the customer intends to use the results of the engagement. It could, and should, include delivery of the actionable insights obtained throughout the engagement. These insights can be delivered through:
• Data and machine learning visualizations.
• An operationalized data/machine learning pipeline that predicts outcomes on new observations as they become available (a minimal scoring sketch follows this slide).
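A minimal sketch of the operationalized scoring step. The file names and feature columns carry over from the earlier sketches, and it is assumed that the chosen classifier was serialized with joblib.dump after the machine learning step:

import joblib
import pandas as pd

# Load the model trained and serialized in the machine learning step.
model = joblib.load("churn_model.joblib")

# Score new observations as they arrive, then publish the results.
new_obs = pd.read_csv("new_customers.csv")
feature_cols = ["tenure_months", "spend_per_month", "log_spend"]
new_obs["churn_score"] = model.predict_proba(new_obs[feature_cols])[:, 1]
new_obs.to_csv("scored_customers.csv", index=False)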
43. Goals of Data Science Process
• The goal of this process is to continue to move a data science project
forward towards a clear engagement end point.
• We recognize that data science is a research activity and that progress
often entails an approach that moves two steps forward and one step
(or worse) backwards.
• Being able to clearly communicate this to customers can help avoid
misunderstanding and frustration for all parties involved, and increase
the odds of success.
44. Activity
• Apply the data science process to the Olympic medal tally for events after World War II.
45. Next Week…
• Tools and Technologies in Data Science