SYLLABUS
Definition of Data Science
AI, Machine Learning and Data Science
Extracting Meaningful patterns
Building representative models
Statistics, ML, and Computing
Learning Algorithms
Data Science Classification
Data Science Algorithms
Data Science process
Prior knowledge
Data Preparation, Modeling and Application.
Definition of Data Science
• Data science is the process of using scientific
methods, algorithms, and programming to
extract knowledge and insights from data. It
involves collecting, analyzing, and interpreting
large amounts of data to find hidden patterns
and trends
Data science and artificial intelligence are not the same.
“Data science and artificial intelligence are two technologies that are
transforming the world. While artificial intelligence powers data science
operations, data science is not completely dependent on AI. Data Science is
leading the fourth industrial revolution. ”
Data science also requires machine learning algorithms, which results in a dependency on AI.
Comparison Between AI and Data Science
• Data science jobs require knowledge of programming languages like R and Python to perform various data operations, along with computer science expertise.
• Data science uses more tools apart from AI. This is because data science involves multiple steps to analyze data and generate insights.
• Data science models are built for statistical insights whereas AI is
used to build models that mimic cognition and human
understanding.
Comparison Between AI and Data Science
• Today's industries require both data science and artificial intelligence.
• Data science will help them make necessary data-driven decisions and assess their performance in the market, while artificial intelligence will help industries work with smarter devices and software that minimize workload and optimize processes for improved innovation.
• They are not the “same thing”
• Big data = crude oil
• Big data is about extracting “crude oil”,
transporting it in “mega tankers”, siphoning it
through “pipelines”, and storing it in “massive
silos”
• Data science is about refining the “crude oil”
(Carlos Samohano, Founder, Data Science London)
DATA SCIENCE AND BIG DATA
A. Programming Languages
• Python
• R
B. Data Visualization Tools
• Tableau
• Power BI
C. Machine Learning
Libraries
• Scikit-learn
• TensorFlow
D. Big Data Frameworks
• Apache Hadoop
• Apache Spark
DATA SCIENCE TOOLS AND TECHNOLOGIES
Extracting Meaningful patterns
A pattern is a set of objects, phenomena, or concepts whose elements are similar to one another in certain ways or aspects.
A pattern is an entity that could be given a name.
• Examples: fingerprint image, handwritten word, human face, speech signal, DNA sequence, etc.
PATTERN RECOGNITION PROCESS
Data acquisition and sensing:
Measurements of physical variables.
Important issues: bandwidth, resolution, etc.
Pre-processing:
Removal of noise in data.
Isolation of patterns of interest from the background.
Feature extraction:
Finding a new representation in terms of features.
Classification:
Using features and learned models to assign a pattern
to a category.
Post-processing:
Evaluation of confidence in decisions.
PATTERN RECOGNITION MODEL
Statistical model: Pattern recognition systems are based on statistics and probabilities.
Syntactic model: Structural models for pattern recognition are based on the relations between features; here the patterns are represented by structures.
Template matching model: In this model, a template or a prototype of the pattern to be recognized is available.
Neural network model: An artificial neural network (ANN) is a self-adaptive, trainable process that is able to learn and resolve complex problems based on available knowledge.
PATTERN CLASS
A pattern class is a set of patterns sharing common attributes.
A collection of “similar” (not necessarily identical) objects.
During recognition, given objects are assigned to prescribed classes.
APPLICATIONS OF PATTERN RECOGNITION
Input: Images with characters (normally contaminated with noise)
Output: The identified character string
Useful in scenarios such as automatic license plate recognition (ALPR), optical character recognition (OCR), etc.
CHARACTER RECOGNITION
APPLICATIONS OF PATTERN RECOGNITION
Input: Acoustic signal (sound waves, etc.)
Output: Contents of the speech
Useful in scenarios such as speech-to-text (STT), voice command and control, etc.
SPEECH RECOGNITION
APPLICATIONS OF PATTERN RECOGNITION
FINGERPRINT RECOGNITION
Input: Fingerprint of some person
Output: The person's identity
Useful in scenarios such as computerized access control, criminal pursuit, etc.
Building Representative Models / Data Science Life Cycle
• The Data Science lifecycle is mainly designed for Big
Data problems and data science projects.
• The cycle is iterative to represent a real project.
• To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.
• Exploratory Data Analysis (EDA) is a method used by data scientists to analyze and investigate data sets. It's a critical initial step in the data science workflow. EDA involves:
• Summarizing: describing the main characteristics of the data
• Visualizing: plotting distributions and relationships
• Identifying: spotting patterns and anomalies
• Exploring: examining relationships between variables
• Outlier detection
• Hypothesis testing
DATA SCIENCE LIFECYCLE (contd.)
Phase 1: Understanding the Problem –
The data science team learns about and investigates the problem.
Develop context and understanding.
Come to know about the data sources needed and available for the project.
The team formulates an initial hypothesis that can later be tested with data.
Hypothesis: an idea that is suggested as the possible explanation for an observation and that can be tested with data.
DATA SCIENCE LIFECYCLE (contd.)
Phase 2: Gathering Relevant Data –
Steps to explore, preprocess, and condition data prior to modeling and analysis.
It requires the presence of an analytic sandbox; the team performs extract, load, and transform (ELT) operations to get data into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
DATA SCIENCE LIFECYCLE (contd.)
Phase 3: Data Preparation –
The team explores the data to learn about relationships between variables and subsequently selects key variables and the most suitable models.
In this phase, the data science team develops data sets for training, testing, and production purposes.
The team builds and executes models based on the work done in the model planning phase.
Several tools commonly used for this phase are MATLAB and Statistica.
DATA SCIENCE LIFECYCLE (contd.)
Phase 4: Feature Engineering and Feature Extraction –
The team develops datasets for testing, training, and production purposes.
The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing models.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – MATLAB, Statistica.
DATA SCIENCE LIFECYCLE (contd.)
Phase 5: Model Building and Deployment –
After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account warnings and assumptions.
The team should identify key findings, quantify business value, and develop a narrative to summarize and convey findings to stakeholders.
DATA SCIENCE LIFECYCLE (contd.)
Learning Algorithms
Below is the list of the top 10 commonly used Machine Learning Algorithms:
• Linear regression
• Logistic regression
• Decision tree
• SVM algorithm
• Naive Bayes algorithm
• KNN algorithm
• K-means
• Random forest algorithm
• Dimensionality reduction algorithms
• Gradient boosting algorithm and AdaBoost algorithm
1. Linear Regression
• To understand the working of Linear Regression, imagine how you would arrange random logs of wood in increasing order of their weight. There is a catch, however: you cannot weigh each log. You have to guess its weight just by looking at the height and girth of the log (visual analysis) and arrange the logs using a combination of these visible parameters. This is what linear regression in machine learning is like.
• In this process, a relationship is established between the independent and dependent variables by fitting them to a line. This line is known as the regression line and is represented by the linear equation Y = a*X + b.
• In this equation:
• Y – Dependent variable
• a – Slope
• X – Independent variable
• b – Intercept
• The coefficients a and b are derived by minimizing the sum of squared differences between the data points and the regression line.
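As a minimal illustrative sketch (not part of the original slides), the line Y = a*X + b can be fitted by least squares with scikit-learn; the small dataset below is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical measurements: X = girth of a log, Y = its weight
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # independent variable
Y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])            # dependent variable

model = LinearRegression()
model.fit(X, Y)  # least-squares fit of Y = a*X + b

print("slope a     :", model.coef_[0])
print("intercept b :", model.intercept_)
print("prediction  :", model.predict([[6.0]]))  # estimated Y for a new X
```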
2. Logistic Regression
• Logistic Regression is used to estimate discrete values (usually binary values like 0/1) from a set of independent variables. It helps predict the probability of an event by fitting data to a logit function. It is also called logit regression.
• The methods listed below are often used to help improve logistic regression models:
• include interaction terms
• eliminate features
• use regularization techniques
• use a non-linear model
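A hedged sketch of the idea, using scikit-learn on synthetic data (not from the slides); the C parameter illustrates the regularization mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-labelled data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2 regularization strength is controlled by C (smaller C = stronger penalty)
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("P(class = 1) for the first test row:", clf.predict_proba(X_test[:1])[0, 1])
```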
3. Decision Tree
• The Decision Tree algorithm is one of the most popular machine learning algorithms in use today; it is a supervised learning algorithm used for classification problems. It works well for both categorical and continuous dependent variables. This algorithm divides the population into two or more homogeneous sets based on the most significant attributes/independent variables.
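As one possible illustration (assumptions: scikit-learn and its bundled Iris dataset), a decision tree classifier splits the data on the most informative attributes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris: a small multiclass dataset with continuous features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth limits how many times the population is split into subsets
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```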
4. SVM (Support Vector Machine) Algorithm
• The SVM algorithm is a classification method in which you plot raw data as points in an n-dimensional space (where n is the number of features you have). The value of each feature is then tied to a particular coordinate, making it easy to classify the data.
Lines called classifiers can be used to split the
data and plot them on a graph.
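A minimal sketch with scikit-learn on synthetic data (illustrative only): features are scaled first and a linear-kernel SVM learns a separating hyperplane in the n-dimensional feature space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each sample is a point in an n-dimensional space (here n = 4 features)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling usually helps SVMs; the linear kernel fits a separating hyperplane
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)

print("test accuracy:", svm.score(X_test, y_test))
```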
5. Naive Bayes Algorithm
• A Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence
of any other feature.
• Even if these features are related to each other, a Naive
Bayes classifier would consider all of these properties
independently when calculating the probability of a
particular outcome.
• A Naive Bayesian model is easy to build and useful for massive datasets. It is simple and can sometimes outperform even highly sophisticated classification methods.
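A hedged example using Gaussian Naive Bayes from scikit-learn on its bundled Wine dataset (a choice made only for illustration); each feature is treated as conditionally independent given the class.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# GaussianNB treats every feature as independent given the class label
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)

print("test accuracy:", nb.score(X_test, y_test))
```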
6. KNN (K-Nearest Neighbors) Algorithm
• This algorithm can be applied to both classification and regression problems. Within the data science industry, it is more widely used to solve classification problems. It is a simple algorithm that stores all available cases and classifies any new case by taking a majority vote of its k neighbors. The case is then assigned to the class with which it has the most in common. A distance function performs this measurement.
• KNN can be easily understood by comparing it to real life. For example, if you want
information about a person, it makes sense to talk to his or her friends and
colleagues!
• Things to consider before selecting K Nearest Neighbours Algorithm:
• KNN is computationally expensive
• Variables should be normalized, or else higher range variables can bias the
algorithm
• Data still needs to be pre-processed.
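A sketch of the points above (scikit-learn and its bundled breast-cancer dataset are assumptions for illustration): variables are normalized first, then each new case is classified by a majority vote of its k nearest neighbours.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features so higher-range variables do not bias the distance function,
# then classify by a majority vote of the k = 5 nearest neighbours.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

print("test accuracy:", knn.score(X_test, y_test))
```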
7. K-Means
• It is an unsupervised learning algorithm that solves clustering problems. Data sets are classified into a particular number of clusters (let's call that number K) in such a way that the data points within a cluster are homogeneous and heterogeneous with respect to the data in other clusters.
• How K-means forms clusters:
• The K-means algorithm picks k points, called centroids, one for each cluster.
• Each data point forms a cluster with the closest centroid, i.e., K clusters.
• It then creates new centroids based on the existing cluster members.
• With these new centroids, the closest distance for each data point is determined. This process is repeated until the centroids do not change.
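A minimal sketch with scikit-learn on synthetic data (illustrative, not from the slides): fit() repeats the assign-points / recompute-centroids loop described above until the centroids stop changing.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups, standing in for real observations
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters = K; the algorithm alternates between assigning points to the
# closest centroid and recomputing centroids until they no longer change.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("centroids:\n", kmeans.cluster_centers_)
```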
8. Random Forest Algorithm
• A collective of decision trees is called a Random Forest. To classify a new object based on its attributes, each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
• Each tree is planted and grown as follows:
• If the number of cases in the training set is N, then a sample of N cases is taken at random. This sample will be the training set for growing the tree.
• If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant during this process.
• Each tree is grown to the largest extent possible. There is no pruning.
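A hedged scikit-learn sketch (the dataset choice is an assumption): each of the 100 trees is grown on a bootstrap sample with a random subset of features per split, and the forest predicts by majority vote.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of N cases and, at every node, a random
# subset of m features (max_features="sqrt"); votes are aggregated at predict time.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```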
9. Dimensionality Reduction Algorithms
• In today's world, vast amounts of data are being
stored and analyzed by corporates, government
agencies, and research organizations. As a data
scientist, you know that this raw data contains a
lot of information - the challenge is to identify
significant patterns and variables.
• Dimensionality reduction algorithms like Decision
Tree, Factor Analysis, Missing Value Ratio, and
Random Forest can help you find relevant details.
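The slide lists tree-based and statistical techniques; as one additional, commonly used option, here is a hedged sketch of Principal Component Analysis (PCA, mentioned later under Mathematics) with scikit-learn. The dataset choice is an assumption made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reduce 30 original features to 2 principal components
X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("reduced shape:", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```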
10. Gradient Boosting Algorithm and AdaBoost Algorithm
• The Gradient Boosting Algorithm and the AdaBoost Algorithm are boosting algorithms used when massive loads of data have to be handled to make predictions with high accuracy. Boosting is an ensemble learning technique that combines the predictive power of several base estimators to improve robustness.
• In short, it combines multiple weak or average predictors to build a strong predictor. These boosting algorithms often work well in data science competitions like Kaggle, AV Hackathon, and CrowdAnalytix, and are among the most widely used machine learning algorithms today. Use them, along with Python and R code, to achieve accurate outcomes.
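A minimal comparison sketch with scikit-learn on synthetic data (illustrative only): both ensembles combine many weak learners into a strong predictor.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both methods combine many weak learners (shallow trees) into a strong predictor
models = {
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```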
Data Science Classification
• In data science, classification is a type of
supervised machine learning where the goal is
to predict the category or label of new
observations based on past data.
• It involves mapping input data to discrete
output classes.
Key Concepts in Classification
• Supervised Learning:
– The training dataset contains both input features and corresponding
class labels.
– The model learns to map input features to labels.
• Classes (Categories):
– Binary Classification: Two classes (e.g., spam vs. not spam).
– Multiclass Classification: More than two classes (e.g., classifying
emails into spam, promotions, and social).
• Features:
– Independent variables or attributes used to predict class labels.
• Target Variable:
– The dependent variable representing the class label.
Common Classification Algorithms
• Logistic Regression: For binary classification
problems.
• Decision Trees: Tree-like structure for decision
making.
• Random Forest: Ensemble of multiple decision trees.
• Support Vector Machine (SVM): Finds the optimal
hyperplane for separation.
• k-Nearest Neighbors (k-NN): Classifies based on the
majority class of nearest data points.
Evaluation Metrics for Classification
• Accuracy: Overall correctness of predictions.
• Precision: True positives divided by all predicted
positives.
• Recall (Sensitivity): True positives divided by all actual
positives.
• F1-Score: Harmonic mean of precision and recall.
• ROC-AUC Curve: Measures the trade-off between
sensitivity and specificity.
• Confusion Matrix: Visualization of model performance.
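As a hedged illustration of these metrics with scikit-learn, using small hypothetical label and score vectors:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # uses scores, not hard labels
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```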
Applications of Classification in Data Science
• Email Spam Detection: Classify emails as spam or
not.
• Customer Churn Prediction: Identify customers
likely to leave a service.
• Medical Diagnosis: Classify diseases based on
symptoms.
• Sentiment Analysis: Determine if a review is
positive, negative, or neutral.
• Fraud Detection: Identify fraudulent transactions.
Prior knowledge
• Before diving into the field of Data Science, it's essential to have a solid foundation in certain subjects.
• This prior knowledge will help you better understand data, choose appropriate algorithms, and interpret results. Here's a breakdown of the key areas to focus on:
• 1. Mathematics
• 2. Programming & Tools
• 3. Machine Learning and Algorithms
• 4. Data Engineering and Databases
• 5. Data Wrangling and Preprocessing
• 6. Communication and Storytelling
• 7. Soft Skills
1. Mathematics
• Linear Algebra:
– Vectors, matrices, and matrix operations (addition,
multiplication, inversion).
– Eigenvalues and eigenvectors (important in dimensionality
reduction algorithms like PCA).
• Probability & Statistics:
– Probability theory: Bayes' Theorem, distributions (e.g., normal,
binomial), and random variables.
– Regression analysis: Linear regression, correlation, and
multivariate regression.
– Sampling techniques: Random sampling, stratified sampling, and
sampling distribution.
2. Programming & Tools
• Python:
– Python is the most widely used language in data science due to its simplicity and vast ecosystem of libraries like Pandas, NumPy, SciPy, Matplotlib, Scikit-learn, TensorFlow, and PyTorch.
• R: Another popular language, particularly for statistical analysis and data visualization.
• SQL (Structured Query Language):
– Essential for extracting, manipulating, and querying data from
relational databases (e.g., MySQL, PostgreSQL).
• Data Visualization Tools:
– Tools like Matplotlib, Seaborn, and Plotly (in Python)
– Tableau or Power BI for interactive dashboards.
3. Machine Learning and Algorithms
• Supervised Learning:
– Classification (e.g., Logistic Regression, Decision Trees, k-NN, SVM)
and Regression (e.g., Linear Regression, Ridge/Lasso).
– Performance metrics: Accuracy, precision, recall, F1-score, ROC-AUC,
etc.
• Unsupervised Learning:
– Clustering (e.g., K-means, Hierarchical Clustering, DBSCAN).
– Association Rules (e.g., Apriori, Eclat).
• Deep Learning:
– Neural Networks: Perceptrons, Feedforward Neural Networks.
– Convolutional Neural Networks (CNNs) for image data.
– Recurrent Neural Networks (RNNs) and LSTMs for sequence data.
4. Data Engineering and Databases
• Databases and Data Warehouses:
– Understanding relational databases and data warehousing concepts.
• Big Data Technologies:
– Familiarity with frameworks like Hadoop and Spark for processing large datasets efficiently.
5. Data Wrangling and Preprocessing
• Handling missing data: Techniques like
imputation, removal, or filling with default
values.
• Feature engineering: Creating new features from
raw data to improve model performance.
• Data transformation: Normalization, scaling,
encoding categorical variables (e.g., one-hot
encoding).
6. Communication and Storytelling
• Data Visualization: Being able to visualize data
insights clearly using charts and graphs.
• Reporting: Summarizing the analysis,
explaining methodology, and presenting
results in a way that non-technical
stakeholders can understand.
• Storytelling: Presenting a narrative around the
data to support decision-making.
7. Soft Skills
• Critical Thinking: Ability to approach problems
from different angles and think analytically.
• Problem-Solving: Tackling complex business
problems with a data-driven approach.
Data collection is a fundamental step in the data analytics and
visualization process.
The quality and relevance of the collected data significantly impact the
insights and decisions derived from the analysis.
Effective data collection and visualization strategies are essential for
extracting valuable insights and empowering data-driven decision-
making.
It's a dynamic process that requires continuous refinement based on
user feedback and changing business needs.
Data Collection/preparation Strategies
Define Clear Objectives
Identify Relevant Data Sources
Data Quality Assessment
Consider Structured and Unstructured Data
Real-time Data Collection
Data Privacy and Ethics
Sampling Techniques
Surveys and Questionnaires
Collaboration with Stakeholders
Data Integration
Feedback loops
Data collection/preparation strategies in the context of data
analytics and visualization
Clearly outline the goals and objectives
of your data analytics and visualization
project.
Understand the questions you want to
answer and the insights you aim to
derive.
Knowing what insights you aim to gain
will guide your data collection efforts.
1. Define Clear Objectives
Determine the sources of data that are
relevant to your objectives.
This can include databases, spreadsheets, APIs (Application Programming Interfaces), external datasets, or a combination of these.
2. Identify Relevant Data Sources
Determine the key performance indicators (KPIs)
and metrics relevant to your analysis and
visualization goals.
These metrics will drive the selection of data
sources and variables.
Choose data sources that align with your objectives.
Assess the quality of available data. Check
for completeness, accuracy, consistency,
and relevance.
Cleaning and pre-processing may be
necessary to address any issues.
Prioritize data quality over quantity.
Ensure that the data collected is accurate, complete,
and reliable.
Address issues such as missing values, outliers, and
inconsistencies early in the process
3. Data Quality Assessment
Depending on your objectives,
collect both structured data
(e.g., databases) and
unstructured data (e.g., text,
images) for a more
comprehensive analysis.
4. Consider Structured and Unstructured Data
If your analysis requires real-time
insights, consider implementing
systems for collecting and
processing data in real-time.
This is especially important for
dynamic datasets.
Depending on the nature of your
analysis, consider implementing
real-time data collection
processes.
5. Real-time and Automated Data Collection
Ensure compliance with data privacy
regulations.
Obtain necessary permissions for data
collection, especially when dealing with
personal or sensitive information.
Adhere to data security and privacy regulations.
Protect sensitive information and ensure
compliance with legal requirements to maintain
the trust of stakeholders
6. Data Privacy and Ethics
Use sampling methods if working with large datasets.
This involves selecting a representative subset of data
for analysis, which can save time and resources.
If working with large datasets, consider using sampling
techniques to extract a representative subset for analysis.
This can save computational resources and speed up the
analysis.
7. Sampling Techniques
Design and deploy surveys or
questionnaires to gather specific
information directly from users or
relevant stakeholders.
Ensure that the questions align
with your objectives.
8. Surveys and Questionnaires
Collaborate with domain experts and
stakeholders to gain insights into the
context of the data.
Their input can help refine data
collection strategies.
Foster collaboration between teams responsible for data collection
and those performing analytics and visualization.
Clear communication ensures that the collected data meets the
analytical requirements.
9. Collaboration with Stakeholders
Integrate data from different
sources to create a unified
dataset.
Ensure compatibility and
consistency when combining data
from various platforms.
10. Data Integration
Establish feedback loops to
continuously improve data collection
processes.
Monitor the performance of analytics
and visualizations and adjust data
collection strategies as needed.
11. Feedback Loops
Remember to consider ethical and privacy considerations during data collection, and ensure data quality through proper cleaning and pre-processing before analysis and visualization.
Gather specific information through structured sets of questions.
Collect subjective data, opinions, and preferences.
Qualitative data collection, understanding perspectives.
Study natural behaviour, processes, and situations.
Collect large volumes of diverse data from the internet.
Common in healthcare, manufacturing, and environmental monitoring.
Social Media Monitoring: monitor public opinions, brand sentiment, and identify trends.
Transaction Records: identify patterns, trends, and anomalies in business processes.
Focus Groups: obtain diverse opinions and insights on specific topics.
Summary
Effective data collection is the foundation of
meaningful data analytics and visualization.
Data pre-processing/modelling is a crucial step
in the data analytics and visualization process.
It involves cleaning, transforming, and
organizing raw data into a format that can be
effectively utilized for analysis and
visualization
Data pre-processing/modelling overview in Data
science & visualization
• Data cleaning is the process of detecting corrupt
data and inaccurate records from a record set or
database table.
• The main purpose of the cleaning step is to detect incomplete, inaccurate, inconsistent, and irrelevant data and to apply techniques to modify or delete this useless data.
DATA CLEANING
• Data integration focuses on unification of data
residing in different sources and presenting a
unified view of these data.
• Data with different representations are put together
and any conflicts resulting from it are resolved.
• This process becomes vital in a number of scientific
and commercial applications. With increasing
volume and exponential growth of data, integrating
it becomes even more significant.
DATA INTEGRATION
• Data reduction is the process of transforming digital information into an ordered and simplified form.
• This data is generally derived through empirical
and experimental means.
• It involves reducing large amounts of data into
smaller and meaningful fragments.
DATA REDUCTION
• Data discretization is an important concept
when you have a large amount of numeric
data, but only want to classify it based on
nominal values.
• In this scenario, the continuous data is split
into discrete forms and the values of these
discrete sets are said to be the nominal value.
It is basically a process of converting
continuous data attributes into a finite set of
intervals with minimal loss of information.
DATA DISCRETIZATION
• Data in the real world is dirty
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation = “ ”
– Noisy: containing errors or outliers, e.g., Salary = “-10”
– Inconsistent: containing discrepancies in codes or names, e.g., Age = “42”, Birthday = “03/07/2005”
DATA PREPROCESSING
• Incomplete data comes from
– n/a data value when collected
– different consideration between the time when the
data was collected and when it is analyzed.
– human/hardware/software problems
• Noisy data comes from the process of data
– collection
– entry
– transmission
• Inconsistent data comes from
– Different data sources
– Functional dependency violation
DIRTY DATA COMES FROM
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause
incorrect or even misleading statistics.
– Data warehouse needs consistent integration of
quality data
• Data extraction, cleaning, and transformation
comprises the majority of the work of building
a data warehouse. —Bill Inmon
IMPORTANCE OF DATA PREPROCESSING
A. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
B. Data integration
– Integration of multiple databases, data cubes, or files
C. Data transformation
– Normalization and aggregation
D. Data reduction
– Obtains reduced representation in volume but produces the
same or similar analytical results
E. Data discretization
– Part of data reduction but with particular importance,
especially for numerical data
MAJOR TASKS IN DATA PREPROCESSING
• Importance
– “Data cleaning is one of the three biggest problems in data processing” —Ralph Kimball
– “Data cleaning is the number one problem in data processing” —DCI survey
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
A. DATA CLEANING
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
DATA CLEANING TASKS
• Data can be in a DBMS
• ODBC, JDBC protocols
• Data in a flat file
• Fixed-column format
• Delimited format: tab, comma “,”, other
• E.g., C4.5 and Weka “arff” use comma-delimited data
• Attention: Convert field delimiters inside strings
• Verify the number of fields before and after
1. DATA CLEANING: ACQUISITION
• Field types:
– binary, nominal (categorical), ordinal, numeric, …
– For nominal fields: tables translating codes to full
descriptions
• Field role:
– input : inputs for modelling
– target : output
– id/auxiliary : keep, but not use for modelling
– ignore : don’t use for modelling
– weight : instance weight
– …
• Field descriptions
DATA CLEANING: METADATA
• Convert data to a standard format (e.g. arff or
csv)
• Missing values
• Unified date format
• Binning of numeric data
• Fix errors and outliers
• Convert nominal fields whose values have
order to numeric.
DATA CLEANING: REFORMATTING
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– Equipment malfunction
– Inconsistent with other recorded data and thus deleted
– Data not entered due to misunderstanding
– Certain data may not be considered important at the time of
entry
– Not register history or changes of the data
• Missing data may need to be inferred.
2. FILLING MISSING VALUES
• Ignore the tuple: usually done when class label is missing
(assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
• Imputation: Use the attribute mean to fill in the missing
value, or use the attribute mean for all samples belonging
to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
HANDLING MISSING DATA
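A minimal pandas sketch of the options above (the table and column names are hypothetical): dropping tuples, filling with a constant, filling with the attribute mean, and filling with the per-class mean.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing income values
df = pd.DataFrame({
    "income": [50000, np.nan, 62000, np.nan, 48000],
    "class":  ["A", "A", "B", "B", "A"],
})

dropped   = df.dropna()                # ignore the tuple
constant  = df.fillna({"income": -1})  # global constant
mean_fill = df.assign(income=df["income"].fillna(df["income"].mean()))  # attribute mean

# Attribute mean for all samples belonging to the same class
class_mean = df.assign(
    income=df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
)
print(class_mean)
```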
• We want to transform all dates to the same format
internally
• Some systems accept dates in many formats
– e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc
– dates are transformed internally to a standard
value
• Frequently, just the year (YYYY) is sufficient
• For more details, we may need the month, the day,
the hour, etc
• Representing date as YYYYMM or YYYYMMDD
can be OK, but has problems
3. UNIFIED DATE FORMAT
• To preserve intervals, we can use
– Unix system date: Number of seconds since
1970
– Number of days since Jan 1, 1960 (SAS)
• Problem:
– values are non-obvious
– don’t help intuition and knowledge
discovery
– harder to verify, easier to make an error
UNIFIED DATE FORMAT OPTIONS
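A hedged pandas sketch of unifying date formats (it assumes pandas 2.x for format="mixed"; the example strings mirror the slide):

```python
import pandas as pd

# Dates arriving in several formats
raw = pd.Series(["Sep 24, 2003", "9/24/03", "2003-09-24"])

# Parse every string into one internal datetime representation
dates = pd.to_datetime(raw, format="mixed")  # pandas 2.x option for mixed formats

print(dates.dt.strftime("%Y%m%d").tolist())           # e.g. ['20030924', ...]
print((dates - pd.Timestamp("1970-01-01")).dt.days)   # days since the Unix epoch
```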
• Some tools can deal with nominal values
internally
• Other methods (neural nets, regression,
nearest neighbor) require only numeric inputs
• To use nominal fields in such methods, we need to convert them to a numeric value
– Q: Why not ignore nominal fields altogether?
– A: They may contain valuable information
• Different strategies for binary, ordered, and multi-valued nominal fields
4. CONVERSION: NOMINAL TO NUMERIC
• Binary fields
– E.g. Gender = M, F
• Convert to Field_0_1 with 0, 1 values
– e.g. Gender = M → Gender_0_1 = 0
– Gender = F → Gender_0_1 = 1
CONVERSION: BINARY TO NUMERIC
• Ordered attributes (e.g. Grade) can be converted to
numbers preserving natural order, e.g.
– A → 4.0
– A- → 3.7
– B+ → 3.3
– B → 3.0
• Q: Why is it important to preserve natural
order?
• A: To allow meaningful comparisons, e.g. Grade
> 3.5
CONVERSION: ORDERED TO NUMERIC
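A small pandas sketch of the two conversions above, binary and ordered fields (the Series values are illustrative):

```python
import pandas as pd

# Binary field: Gender = M/F converted to 0/1
gender = pd.Series(["M", "F", "F", "M"])
gender_0_1 = gender.map({"M": 0, "F": 1})

# Ordered field: grades mapped to numbers that preserve the natural order
grades = pd.Series(["A", "B+", "A-", "B"])
numeric = grades.map({"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0})

print(gender_0_1.tolist())        # [0, 1, 1, 0]
print(numeric.tolist())           # [4.0, 3.3, 3.7, 3.0]
print((numeric > 3.5).tolist())   # meaningful comparison, e.g. Grade > 3.5
```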
• Noise: random error or variance in a
measured variable
• Incorrect attribute values may be due to
– Faulty data collection instruments
– Data entry problems
– Data transmission problems
– Technology limitation
– Inconsistency in naming convention
• Other data problems which requires data
cleaning
– Duplicate records
– Incomplete data
5. NOISY DATA
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
• Regression
– smooth by fitting the data into regression functions
6. HANDLING NOISY DATA
• Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size:
uniform grid
• if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
• The most straightforward
• But outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth(frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
SIMPLE DISCRETIZATION METHODS: BINNING
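A hedged pandas sketch of both partitioning schemes and of smoothing by bin means (the sample values are illustrative):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])  # sorted sample values

# Equal-width partitioning: N = 3 intervals of width W = (B - A) / N
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: roughly the same count per bin
equal_depth = pd.qcut(prices, q=3)

# Smoothing by bin means: replace each value with the mean of its equal-depth bin
smoothed = prices.groupby(equal_depth, observed=True).transform("mean")

print(pd.DataFrame({"value": prices, "width_bin": equal_width,
                    "depth_bin": equal_depth, "smoothed": smoothed}))
```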
• Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
DATA SMOOTHING - REGRESSION
• Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.”
• Intuitively, values that fall outside of the set of clusters may be considered outliers.
DATA SMOOTHING - OUTLIER ANALYSIS
Healthcare
Finance
Marketing
Retail
Education
Transportation and Logistics
Telecommunications
Government and Public
Services
Environmental Science
Applications of Data Science in
various fields
Predictive Analytics
Predict disease outbreaks and patient
admission rates.
Clinical Research
Analyse patient data for drug discovery
and treatment effectiveness.
Personalized Medicine
Tailor treatments based on individual
patient data.
1. Healthcare
Algorithmic Trading
Develop trading algorithms based on historical
market data.
Fraud Detection
Identify fraudulent transactions and activities.
Credit Scoring
Assess creditworthiness of individuals and
businesses.
2. Finance
Customer Segmentation
Divide customers into groups for targeted
marketing.
Recommendation Systems
Provide personalized product or content
recommendations.
Market Basket Analysis
Understand customer purchasing patterns.
3. Marketing
Inventory Management
Optimize stock levels to prevent overstocking or
shortages.
Demand Forecasting
Predict product demand to optimize pricing and stock
levels.
Customer Analytics
Analyse customer behaviour to enhance shopping
experience.
4. Retail
Adaptive Learning
Customize educational content based on student
performance data.
Dropout Prediction
Identify students at risk of dropping out and provide
support.
Learning Analytics
Analyze student data to improve teaching methods and
curriculum.
5. Education
Route Optimization
Find the most efficient routes for transportation and
delivery.
Predictive Maintenance
Anticipate when vehicles or machinery will need
maintenance.
Supply Chain Analytics
Optimize the supply chain for cost efficiency.
6. Transportation and Logistics
Network Optimization
Improve network performance based on usage patterns.
Churn Prediction
Predict and reduce customer churn by identifying at-risk
customers.
Fraud Detection
Identify and prevent fraudulent activities on the network.
7. Telecommunications
Crime Prediction
Predict high-risk areas for crime and
allocate resources accordingly.
Social Welfare
Identify eligible recipients for social
welfare programs using income and
demographic data.
Traffic Management
8. Government and Public Services
Climate Modelling
Analyze climate data to predict and understand
climate patterns.
Natural Disaster Prediction
Predict natural disasters based on environmental data.
Resource Conservation
Optimize resource usage and conservation strategies.
9. Environmental Science
Performance Analysis
Analyze player performance and strategize game
plans.
Injury Prevention
Predict and prevent injuries by analyzing player
health data.
Fan Engagement
Analyze fan data to enhance the fan experience.
10. Sports Analytics
Recruitment
Analyze resumes and job descriptions for optimal
candidate matching.
Employee Retention
Identify factors contributing to employee turnover and
implement retention strategies.
Workforce Planning
Predict future workforce needs based on historical
data.
11. Human Resources
LINEAR AND NONLINEAR REGRESSION
❖ It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
❖ Linear regression attempts to find the mathematical relationship between variables.
❖ If the outcome is a straight line, it is considered a linear model; if it is a curved line, it is a nonlinear model.
❖ The relationship with the dependent variable is given by a straight line, and there is only one independent variable.
Y = a + b X
❖ The model 'Y' is a linear function of 'X'.
❖ The value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.
MULTIPLE LINEAR REGRESSION
❖ Multiple linear regression is an extension of linear regression analysis.
❖ It uses two or more independent variables to predict an outcome and a single continuous dependent variable.
Y = a0 + a1X1 + a2X2 + ... + akXk + e
where,
'Y' is the response variable,
X1, X2, ..., Xk are the independent predictors,
'e' is the random error, and
a0, a1, a2, ..., ak are the regression coefficients.
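A minimal scikit-learn sketch of the equation above with two hypothetical predictors X1 and X2 (the numbers are synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: columns are X1 and X2, target is Y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
Y = np.array([6.1, 6.9, 12.2, 13.1, 17.0])

model = LinearRegression()
model.fit(X, Y)  # estimates a0 (intercept) and a1, a2 (coefficients)

print("a0 (intercept):", model.intercept_)
print("a1, a2        :", model.coef_)
print("prediction at (6, 6):", model.predict([[6.0, 6.0]]))
```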
LOGISTIC REGRESSION
❖ Logistic Regression was used in the biological sciences in the early twentieth century. It was then used in many social science applications. Logistic Regression is used when the dependent variable (target) is categorical.
❖ For example,
➢ To predict whether an email is spam (1) or not (0)
➢ Whether a tumor is malignant (1) or not (0)