CO-01
DATA SCIENCE & VISUALIZATION
SYLLABUS
Definition of Data Science
AI, Machine Learning and Data Science
Extracting Meaningful patterns
Building representative models
Statistics, ML and Computing
Learning Algorithms
Data Science Classification
Data Science Algorithms
Data Science process
Prior knowledge
Data Preparation, Modeling and Application.
Definition of Data Science
• Data science is the process of using scientific
methods, algorithms, and programming to
extract knowledge and insights from data. It
involves collecting, analyzing, and interpreting
large amounts of data to find hidden patterns
and trends
AI, Machine Learning and Data Science
Data science and artificial intelligence are not the same.
“Data science and artificial intelligence are two technologies that are
transforming the world. While artificial intelligence powers data science
operations, data science is not completely dependent on AI. Data Science is
leading the fourth industrial revolution. ”
Data science also requires machine learning algorithms,
which results in dependency on AI.
Comparison Between AI and Data Science
• Data science jobs require the knowledge of ML languages like R and
Python to perform various data operations and computer science
expertise.
• Data science uses more tools apart from AI. This is because data
science involves multiple steps to analyze data and generate
insights.
• Data science models are built for statistical insights whereas AI is
used to build models that mimic cognition and human
understanding.
Comparison Between AI and Data Science
• Today’s industries require both, data science and
artificial intelligence.
• Data science will help them make necessary data-
driven decisions and assess their performance in
the market, while artificial intelligence will help
industries work with smarter devices and
software that will minimize workload and
optimize all the processes for improved
innovation.
[Figure: overlap of Data Science, Artificial Intelligence, and ML/DM/Analytics]
“Data science produces insights.
Machine learning produces predictions”
DATA SCIENCE AND ARTIFICIAL INTELLIGENCE
Comparison Between AI and Data Science
• They are not the “same thing”
• Big data = crude oil
• Big data is about extracting “crude oil”,
transporting it in “mega tankers”, siphoning it
through “pipelines”, and storing it in “massive
silos”
• Data science is about refining the “crude oil”
— Carlos Samohano, Founder, Data Science London
DATA SCIENCE AND BIG DATA
[Figure: Data Science as a unifier, drawing on Humanities, Machine/Statistical Learning, Application Domain Expertise, Visualization, Mathematics, Social Science, Law, and Data Management]
DATA SCIENCE AS A UNIFIER
Data Science is a science which uses Computer Science, Machine Learning, Statistics, Visualization, and Human-Computer Interaction to collect, clean, integrate, analyze, visualize, and interact with data to create data products.
A. Programming Languages
• Python
• R
B. Data Visualization Tools
• Tableau
• Power BI
C. Machine Learning
Libraries
• Scikit-learn
• TensorFlow
D. Big Data Frameworks
• Apache Hadoop
• Apache Spark
DATA SCIENCE TOOLS AND TECHNOLOGIES
Extracting Meaningful patterns
A pattern is a set of objects, phenomena, or concepts where the elements of the set are similar to one another in certain ways or aspects.
A pattern is an entity that can be given a name.
• Example: fingerprint image, handwritten word, human face, speech signal, DNA sequence, etc.
PATTERN RECOGNITION SYSTEM
PATTERN RECOGNITION PROCESS
Data acquisition and sensing:
Measurements of physical variables.
Important issues: bandwidth, resolution, etc.
Pre-processing:
Removal of noise in data.
Isolation of patterns of interest from the background.
Feature extraction:
Finding a new representation in terms of features.
Classification
Using features and learned models to assign a pattern
to a category.
Post-processing
Evaluation of confidence in decisions.
PATTERN RECOGNITION MODEL
Statistical model: Pattern recognition systems are based on statistics and
probabilities.
Syntactic model: Structural models for pattern recognition based on the relations between features; here the patterns are represented by structures.
Template matching model: In this model, a template or a prototype of
the pattern to be recognized is available.
Neural network model: An artificial neural network (ANN) is a self-adaptive
trainable process that is able to learn and resolve complex problems based on available
knowledge.
PATTERN CLASS
A pattern class is a set of patterns sharing common attributes.
A collection of “similar” (not necessarily identical) objects.
During recognition, given objects are assigned to prescribed classes.
APPLICATIONS OF PATTERN RECOGNITION
Input: Images with characters (normally contaminated with noise)
Output: The identified character string
Useful in scenarios such as automatic license plate recognition (ALPR), optical character recognition (OCR), etc.
CHARACTER RECOGNITION
APPLICATIONS OF PATTERN RECOGNITION
Input: Acoustic signal (sound waves, etc.)
Output: Contents of the speech
Useful in scenarios such as speech-to-text (STT), voice command and control etc.
SPEECH RECOGNITION
APPLICATIONS OF PATTERN RECOGNITION
FINGERPRINT RECOGNITION
Input: Fingerprint of some person
Output: The person's identity.
Useful in scenarios such as computerized access control, criminal pursuit, etc.
Building representative models/Data science
Life Cycle
• The Data Science lifecycle is mainly designed for Big
Data problems and data science projects.
• The cycle is iterative, to represent a real project.
• To address the distinct requirements for performing
analysis on Big Data, a step-by-step methodology is
needed to organize the activities and tasks involved with
acquiring, processing, analyzing, and repurposing data.
DATA SCIENCE LIFECYCLE …….contd..
• Exploratory Data Analysis (EDA) is a method used by data
scientists to analyze and investigate data sets. It's a
critical initial step in the data science workflow. EDA
involves:
• Summarizing
• Visualizing
• Identifying
• Exploring
• Outlier detection
• Hypothesis testing
DATA SCIENCE LIFECYCLE …….contd..
Phase 1: Understanding the problem –
 The data science team learns and investigates the problem.
 Develops context and understanding.
 Comes to know about the data sources needed and available for the project.
 The team formulates initial hypotheses that can later be tested with data.
 Hypothesis: an idea that is suggested as a possible explanation and can be tested against the data.
DATA SCIENCE LIFECYCLE …….contd..
Phase 2: Gathering Relevant Data –
 Steps to explore, preprocess, and condition data prior to modeling and analysis.
 It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.
 Data preparation tasks are likely to be performed multiple times and not in a predefined order.
 Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
DATA SCIENCE LIFECYCLE …….contd..
Phase 3: Data Preparation –
 The team explores the data to learn about relationships between variables and subsequently selects the key variables and the most suitable models.
 In this phase, the data science team develops data sets for training, testing, and production purposes.
 The team builds and executes models based on the work done in the model planning phase.
 Several tools commonly used for this phase are MATLAB and STATISTICA.
DATA SCIENCE LIFECYCLE …….contd..
Phase 4: Feature Engineering and Feature Extraction –
 The team develops datasets for testing, training, and production purposes.
 The team also considers whether its existing tools will suffice for running the models or if it needs a more robust environment for executing them.
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – MATLAB, STATISTICA.
DATA SCIENCE LIFECYCLE …….contd..
Phase 5: Model Building and Deployment –
 After executing the models, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
 The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account warnings and assumptions.
 The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
DATA SCIENCE LIFECYCLE …….contd..
Statistics, ML and Computing
• Statistical Foundations for ML (e.g.,
probability, hypothesis testing, Bayesian
inference)
• Machine Learning Algorithms (e.g.,
supervised/unsupervised learning, neural
networks)
• Computational Techniques (e.g., optimization
algorithms, numerical methods)
Statistics, ML and Computing
ML VS. DL
Learning Algorithms
Below is the list of the top 10 commonly used Machine Learning Algorithms:
• Linear regression
• Logistic regression
• Decision tree
• SVM algorithm
• Naive Bayes algorithm
• KNN algorithm
• K-means
• Random forest algorithm
• Dimensionality reduction algorithms
• Gradient boosting algorithm and Ada-Boosting algorithm
1. Linear Regression
• To understand the working functionality of Linear Regression, imagine how you would
arrange random logs of wood in increasing order of their weight. There is a catch;
however – you cannot weigh each log. You have to guess its weight just by looking at the
height and girth of the log (visual analysis) and arranging them using a combination of
these visible parameters. This is what linear regression in machine learning is like.
• In this process, a relationship is established between independent and dependent
variables by fitting them to a line. This line is known as the regression line and is
represented by a linear equation Y= a *X + b.
• In this equation:
• Y – Dependent Variable
• a – Slope
• X – Independent variable
• b – Intercept
• The coefficients a & b are derived by minimizing the sum of the squared difference of
distance between data points and the regression line.
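The scikit-learn sketch below is a minimal illustration of estimating the slope a and intercept b described above; the X/Y arrays are invented purely for demonstration.

```python
# Minimal linear-regression sketch (illustrative data, not from the slides)
import numpy as np
from sklearn.linear_model import LinearRegression

# X = visible parameter (e.g., girth of the log), Y = quantity to predict (e.g., weight)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([2.1, 4.3, 6.2, 8.1, 10.3])

model = LinearRegression()          # minimizes the sum of squared residuals
model.fit(X, Y)

print("slope a:", model.coef_[0])
print("intercept b:", model.intercept_)
print("prediction for X=6:", model.predict([[6.0]])[0])
```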
2. Logistic Regression
• Logistic Regression is used to estimate discrete values
(usually binary values like 0/1) from a set of independent
variables. It helps predict the probability of an event by fitting
data to a logit function. It is also called logit regression.
• These methods listed below are often used to help improve
logistic regression models:
• include interaction terms
• eliminate features
• apply regularization techniques
• use a non-linear model
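A hedged sketch of the idea, assuming scikit-learn's LogisticRegression with its L2 penalty standing in for the regularization technique mentioned above; the toy data is invented.

```python
# Logistic regression on a toy binary problem (illustrative only)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])            # binary target (0/1)

clf = LogisticRegression(penalty="l2", C=1.0)   # L2 regularization
clf.fit(X, y)

print(clf.predict([[2.9]]))          # predicted class
print(clf.predict_proba([[2.9]]))    # probability from the logit (sigmoid) function
```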
3. Decision Tree
• The Decision Tree algorithm in machine learning is
one of the most popular algorithms in use
today; it is a supervised learning algorithm
that is used for classification problems. It works
well in classifying both categorical and
continuous dependent variables. This
algorithm divides the population into two or
more homogeneous sets based on the most
significant attributes/independent variables.
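A minimal sketch of training a decision tree classifier with scikit-learn; the age/income features and labels are made up for illustration.

```python
# Decision tree classifier on a tiny invented dataset
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income]; labels: 0 = "no", 1 = "yes" (made up for illustration)
X = [[25, 30000], [35, 60000], [45, 80000], [20, 20000], [50, 90000], [30, 40000]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))  # the learned splits
print(tree.predict([[40, 70000]]))
```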
4. SVM (Support Vector Machine) Algorithm
• SVM algorithm is a method of a classification
algorithm in which you plot raw data as points
in an n-dimensional space (where n is the
number of features you have). The value of
each feature is then tied to a particular
coordinate, making it easy to classify the data.
Lines called classifiers can be used to split the
data and plot them on a graph.
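A small sketch, assuming scikit-learn's SVC with a linear kernel, showing the separating-hyperplane idea on invented 2-D points.

```python
# Linear SVM: find the separating hyperplane for 2-D toy points
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]   # n = 2 features
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")   # linear kernel -> a straight separating line
clf.fit(X, y)

print(clf.support_vectors_)  # the points that define the margin
print(clf.predict([[4, 4]]))
```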
5. Naive Bayes Algorithm
• A Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence
of any other feature.
• Even if these features are related to each other, a Naive
Bayes classifier would consider all of these properties
independently when calculating the probability of a
particular outcome.
• A Naive Bayesian model is easy to build and useful for
massive datasets. It is simple and can sometimes outperform
even highly sophisticated classification methods.
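A minimal sketch using scikit-learn's GaussianNB; the feature values and labels are invented.

```python
# Gaussian Naive Bayes: features are treated as conditionally independent
from sklearn.naive_bayes import GaussianNB

X = [[180, 80], [175, 75], [160, 55], [155, 50]]   # e.g., [height, weight]
y = [1, 1, 0, 0]                                   # invented class labels

nb = GaussianNB()
nb.fit(X, y)

print(nb.predict([[170, 70]]))
print(nb.predict_proba([[170, 70]]))   # per-class probabilities
```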
6.KNN (K- Nearest Neighbors) Algorithm
• This algorithm can be applied to both classification and regression problems.
Apparently, within the Data Science industry, it's more widely used to solve
classification problems. It’s a simple algorithm that stores all available cases and
classifies any new cases by taking a majority vote of its k neighbors. The case is then
assigned to the class with which it has the most in common. A distance function
performs this measurement.
• KNN can be easily understood by comparing it to real life. For example, if you want
information about a person, it makes sense to talk to his or her friends and
colleagues!
• Things to consider before selecting K Nearest Neighbours Algorithm:
• KNN is computationally expensive
• Variables should be normalized, or else higher range variables can bias the
algorithm
• Data still needs to be pre-processed.
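A short sketch, assuming scikit-learn, that also normalizes the variables first (per the caution above about higher-range variables); the data is invented.

```python
# k-NN with normalization, since unscaled ranges can bias the distance function
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = [[1, 20000], [2, 25000], [3, 80000], [4, 90000]]   # mixed-scale features
y = [0, 0, 1, 1]

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)

print(knn.predict([[2.5, 50000]]))   # majority vote of the 3 nearest neighbours
```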
7. K-Means
• It is an unsupervised learning algorithm that solves clustering problems.
Data sets are classified into a particular number of clusters (let's call that
number K) in such a way that the data points within a cluster are
homogeneous and heterogeneous with respect to the data in other clusters.
• How K-means forms clusters:
• The K-means algorithm picks k number of points, called centroids, for each
cluster.
• Each data point forms a cluster with the closest centroids, i.e., K clusters.
• It now creates new centroids based on the existing cluster members.
• With these new centroids, the closest distance for each data point is
determined. This process is repeated until the centroids do not change.
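A minimal K-means sketch with scikit-learn on invented 2-D points; the assign-and-recompute iteration described above happens inside fit().

```python
# K-means clustering on toy 2-D points; K is chosen up front
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0.5],     # one blob
              [8, 8], [8.5, 9], [9, 8]])      # another blob

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)                      # iterates: assign points, recompute centroids

print(km.cluster_centers_)     # final centroids
print(km.labels_)              # cluster index for each point
```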
8.Random Forest Algorithm
• A collective of decision trees is called a Random Forest. To classify a new
object based on its attributes, each tree gives a classification, and we say the
tree “votes” for that class. The forest chooses the classification having the
most votes (over all the trees in the forest).
• Each tree is planted & grown as follows:
• If the number of cases in the training set is N, then a sample of N cases is
taken at random. This sample will be the training set for growing the tree.
• If there are M input variables, a number m<<M is specified such that at
each node, m variables are selected at random out of the M, and the best
split on this m is used to split the node. The value of m is held constant
during this process.
• Each tree is grown to the most substantial extent possible. There is no
pruning.
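A hedged sketch using scikit-learn's RandomForestClassifier on invented data; max_features here plays the role of the m << M variables sampled at each split.

```python
# Random forest: an ensemble of decision trees voting on the class
from sklearn.ensemble import RandomForestClassifier

X = [[25, 30000], [35, 60000], [45, 80000], [20, 20000], [50, 90000], [30, 40000]]
y = [0, 1, 1, 0, 1, 0]

# n_estimators = number of trees; max_features corresponds to m << M
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)

print(rf.predict([[40, 70000]]))       # class with the most tree votes
print(rf.feature_importances_)
```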
9. Dimensionality Reduction Algorithms
• In today's world, vast amounts of data are being
stored and analyzed by corporates, government
agencies, and research organizations. As a data
scientist, you know that this raw data contains a
lot of information - the challenge is to identify
significant patterns and variables.
• Dimensionality reduction algorithms like Decision
Tree, Factor Analysis, Missing Value Ratio, and
Random Forest can help you find relevant details.
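As one concrete illustration, the sketch below uses PCA (a technique mentioned later under the Linear Algebra prerequisites rather than in the list above) to reduce invented 10-feature data to 3 components.

```python
# Dimensionality reduction with PCA on synthetic, partly redundant features
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                              # 100 samples, 10 features
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)     # make one feature redundant

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)        # keep the 3 most informative directions

print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # variance captured by each component
```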
10.Gradient Boosting Algorithm and AdaBoosting Algorithm
• Gradient Boosting Algorithm and AdaBoosting Algorithm are
boosting algorithms used when massive loads of data have to
be handled to make predictions with high accuracy. Boosting is
an ensemble learning algorithm that combines the predictive
power of several base estimators to improve robustness.
• In short, it combines multiple weak or average predictors to
build a strong predictor. These boosting algorithms always
work well in data science competitions like Kaggle, AV
Hackathon, and CrowdAnalytix. These are among the most preferred
machine learning algorithms today. Use them, along with
Python and R code, to achieve accurate outcomes.
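A minimal sketch comparing scikit-learn's GradientBoostingClassifier and AdaBoostClassifier on a synthetic dataset; the dataset and split are illustrative only.

```python
# Boosting: combine many weak learners into a stronger predictor
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100).fit(X_tr, y_tr)
ada = AdaBoostClassifier(n_estimators=100).fit(X_tr, y_tr)

print("gradient boosting accuracy:", gb.score(X_te, y_te))
print("AdaBoost accuracy:", ada.score(X_te, y_te))
```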
Learning Algorithms-classes
Data Science Classification
• In data science, classification is a type of
supervised machine learning where the goal is
to predict the category or label of new
observations based on past data.
• It involves mapping input data to discrete
output classes.
Key Concepts in Classification
• Supervised Learning:
– The training dataset contains both input features and corresponding
class labels.
– The model learns to map input features to labels.
• Classes (Categories):
– Binary Classification: Two classes (e.g., spam vs. not spam).
– Multiclass Classification: More than two classes (e.g., classifying
emails into spam, promotions, and social).
• Features:
– Independent variables or attributes used to predict class labels.
• Target Variable:
– The dependent variable representing the class label.
Common Classification Algorithms
• Logistic Regression: For binary classification
problems.
• Decision Trees: Tree-like structure for decision
making.
• Random Forest: Ensemble of multiple decision trees.
• Support Vector Machine (SVM): Finds the optimal
hyperplane for separation.
• k-Nearest Neighbors (k-NN): Classifies based on the
majority class of nearest data points.
Evaluation Metrics for Classification
• Accuracy: Overall correctness of predictions.
• Precision: True positives divided by all predicted
positives.
• Recall (Sensitivity): True positives divided by all actual
positives.
• F1-Score: Harmonic mean of precision and recall.
• ROC-AUC Curve: Measures the trade-off between
sensitivity and specificity.
• Confusion Matrix: Visualization of model performance.
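The sketch below computes each of these metrics with scikit-learn on a small set of invented true/predicted labels.

```python
# Computing the classification metrics listed above for a toy set of predictions
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual labels (invented)
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                    # predicted labels
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]    # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
```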
Applications of Classification in Data Science
• Email Spam Detection: Classify emails as spam or
not.
• Customer Churn Prediction: Identify customers
likely to leave a service.
• Medical Diagnosis: Classify diseases based on
symptoms.
• Sentiment Analysis: Determine if a review is
positive, negative, or neutral.
• Fraud Detection: Identify fraudulent transactions.
Prior knowledge
• Before diving into the field of Data Science,
it's essential to have a solid foundation in
certain subjects.
• This prior knowledge will help you better
understand data, choose appropriate
algorithms, and interpret results. Here's a
breakdown of the key areas to focus on
• 1. Mathematics
• 2. Programming & Tools
• 3. Machine Learning and Algorithms
• 4. Data Engineering and Databases
• 5. Data Wrangling and Preprocessing
• 6. Communication and Storytelling
• 7. Soft Skills
1. Mathematics
• Linear Algebra:
– Vectors, matrices, and matrix operations (addition,
multiplication, inversion).
– Eigenvalues and eigenvectors (important in dimensionality
reduction algorithms like PCA).
• Probability & Statistics:
– Probability theory: Bayes' Theorem, distributions (e.g., normal,
binomial), and random variables.
– Regression analysis: Linear regression, correlation, and
multivariate regression.
– Sampling techniques: Random sampling, stratified sampling, and
sampling distribution.
2. Programming & Tools
• Python:
– Python is the most widely used language in data science due to its
simplicity and vast ecosystem of libraries like Pandas, NumPy,
SciPy, Matplotlib, Scikit-learn, TensorFlow, PyTorch.
• R: Another popular language, particularly for statistical analysis
and data visualization.
• SQL (Structured Query Language):
– Essential for extracting, manipulating, and querying data from
relational databases (e.g., MySQL, PostgreSQL).
• Data Visualization Tools:
– Tools like Matplotlib, Seaborn, and Plotly (in Python)
– Tableau or Power BI for interactive dashboards.
3.Machine Learning and Algorithms
• Supervised Learning:
– Classification (e.g., Logistic Regression, Decision Trees, k-NN, SVM)
and Regression (e.g., Linear Regression, Ridge/Lasso).
– Performance metrics: Accuracy, precision, recall, F1-score, ROC-AUC,
etc.
• Unsupervised Learning:
– Clustering (e.g., K-means, Hierarchical Clustering, DBSCAN).
– Association Rules (e.g., Apriori, Eclat).
• Deep Learning:
– Neural Networks: Perceptrons, Feedforward Neural Networks.
– Convolutional Neural Networks (CNNs) for image data.
– Recurrent Neural Networks (RNNs) and LSTMs for sequence data.
4. Data Engineering and Databases
• Databases and Data Warehouses:
– Understanding relational databases.
• Big Data Technologies:
– Familiarity with frameworks like Hadoop and
Spark for processing large datasets efficiently.
5. Data Wrangling and Preprocessing
• Handling missing data: Techniques like
imputation, removal, or filling with default
values.
• Feature engineering: Creating new features from
raw data to improve model performance.
• Data transformation: Normalization, scaling,
encoding categorical variables (e.g., one-hot
encoding).
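A short pandas/scikit-learn sketch of the wrangling steps above (imputation, one-hot encoding, scaling); the toy DataFrame is invented.

```python
# Typical wrangling steps: imputation, one-hot encoding, and scaling (toy data)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age":    [25, None, 35, 45],          # missing value to impute
    "city":   ["Pune", "Delhi", "Pune", "Mumbai"],
    "income": [30000, 40000, 60000, 80000],
})

df["age"] = df["age"].fillna(df["age"].mean())        # imputation
df = pd.get_dummies(df, columns=["city"])             # one-hot encoding
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])  # scaling

print(df)
```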
6. Communication and Storytelling
• Data Visualization: Being able to visualize data
insights clearly using charts and graphs.
• Reporting: Summarizing the analysis,
explaining methodology, and presenting
results in a way that non-technical
stakeholders can understand.
• Storytelling: Presenting a narrative around the
data to support decision-making.
7. Soft Skills
• Critical Thinking: Ability to approach problems
from different angles and think analytically.
• Problem-Solving: Tackling complex business
problems with a data-driven approach.
Data Preparation
 Data collection is a fundamental step in the data analytics and
visualization process.
 The quality and relevance of the collected data significantly impact the
insights and decisions derived from the analysis.
 Effective data collection and visualization strategies are essential for
extracting valuable insights and empowering data-driven decision-
making.
 It's a dynamic process that requires continuous refinement based on
user feedback and changing business needs.
Data Collection/preparation Strategies
 Define Clear Objectives
 Identify Relevant Data Sources
 Data Quality Assessment
 Consider Structured and Unstructured Data
 Real-time Data Collection
 Data Privacy and Ethics
 Sampling Techniques
 Surveys and Questionnaires
 Collaboration with Stakeholders
 Data Integration
 Feedback loops
Data collection/preparation strategies in the context of data
analytics and visualization
Clearly outline the goals and objectives
of your data analytics and visualization
project.
 Understand the questions you want to
answer and the insights you aim to
derive.
 Knowing what insights you aim to gain
will guide your data collection efforts.
1.Define Clear Objectives
Determine the sources of data that are
relevant to your objectives.
This can include databases, spreadsheets,
APIs (Application Programming Interfaces),
external datasets, or a combination of these.
2.Identify Relevant Data Sources
 Determine the key performance indicators (KPIs)
and metrics relevant to your analysis and
visualization goals.
 These metrics will drive the selection of data
sources and variables.
Choose data sources that align with your objectives and KPIs.
Assess the quality of available data. Check
for completeness, accuracy, consistency,
and relevance.
 Cleaning and pre-processing may be
necessary to address any issues.
Prioritize data quality over quantity.
Ensure that the data collected is accurate, complete,
and reliable.
Address issues such as missing values, outliers, and
inconsistencies early in the process
3.Data Quality Assessment
Depending on your objectives,
collect both structured data
(e.g., databases) and
unstructured data (e.g., text,
images) for a more
comprehensive analysis.
4.Consider Structured and Unstructured Data
If your analysis requires real-time
insights, consider implementing
systems for collecting and
processing data in real-time.
This is especially important for
dynamic datasets.
Depending on the nature of your
analysis, consider implementing
real-time data collection
processes.
5.Real-time and Automated Data Collection
Ensure compliance with data privacy
regulations.
Obtain necessary permissions for data
collection, especially when dealing with
personal or sensitive information.
Adhere to data security and privacy regulations.
Protect sensitive information and ensure
compliance with legal requirements to maintain
the trust of stakeholders
6.Data Privacy and Ethics
 Use sampling methods if working with large datasets.
 This involves selecting a representative subset of data
for analysis, which can save time and resources.
 If working with large datasets, consider using sampling
techniques to extract a representative subset for analysis.
 This can save computational resources and speed up the
analysis.
7.Sampling Techniques
Design and deploy surveys or
questionnaires to gather specific
information directly from users or
relevant stakeholders.
Ensure that the questions align
with your objectives.
8.Surveys and Questionnaires
Collaborate with domain experts and
stakeholders to gain insights into the
context of the data.
Their input can help refine data
collection strategies.
 Foster collaboration between teams responsible for data collection
and those performing analytics and visualization.
 Clear communication ensures that the collected data meets the
analytical requirements.
9.Collaboration with Stakeholders
Integrate data from different
sources to create a unified
dataset.
Ensure compatibility and
consistency when combining data
from various platforms.
10.Data Integration
Establish feedback loops to
continuously improve data collection
processes.
 Monitor the performance of analytics
and visualizations and adjust data
collection strategies as needed.
11.Feedback Loops
 Remember to consider ethical and privacy considerations during
data collection, and ensure data quality through proper cleaning
and pre-processing before analysis and visualization.
 Gather specific information through structured sets of questions; collect subjective data, opinions, and preferences (surveys and questionnaires).
 Qualitative data collection for understanding perspectives; study natural behaviour, processes, and situations.
 Collect large volumes of diverse data from the internet; sensor-based collection is common in healthcare, manufacturing, and environmental monitoring.
 Social Media Monitoring: monitor public opinions and brand sentiment, and identify trends.
 Transaction Records: identify patterns, trends, and anomalies in business processes.
 Focus Groups: obtain diverse opinions and insights on specific topics.
Summary
Data Modelling
 Effective data collection is the foundation of
meaningful data analytics and visualization.
 Data pre-processing/modelling is a crucial step
in the data analytics and visualization process.
 It involves cleaning, transforming, and
organizing raw data into a format that can be
effectively utilized for analysis and
visualization
Data pre-processing/modelling overview in Data
science & visualization
Data Pre-processing Overview
• Data cleaning is the process of detecting corrupt
data and inaccurate records from a record set or
database table.
• The main purpose of the cleaning step is to detect
incomplete, inaccurate, inconsistent, and irrelevant
data and to apply techniques to modify or delete
this useless data.
DATA CLEANING
• Data Integration focuses on unification of data
residing in different sources and presenting a
unified view of these data.
• Data with different representations are put together
and any conflicts resulting from it are resolved.
• This process becomes vital in a number of scientific
and commercial applications. With increasing
volume and exponential growth of data, integrating
it becomes even more significant.
DATA INTEGRATION
• Data reduction is the process of transforming
digital info into ordered and simplified form.
• This data is generally derived through empirical
and experimental means.
• It involves reducing large amounts of data into
smaller and meaningful fragments.
DATA REDUCTION
• Data discretization is an important concept
when you have a large amount of numeric
data, but only want to classify it based on
nominal values.
• In this scenario, the continuous data is split
into discrete forms and the values of these
discrete sets are said to be the nominal value.
It is basically a process of converting
continuous data attributes into a finite set of
intervals with minimal loss of information.
DATA DISCRETIZATION
DATA CLEANING
• Data in the real world is dirty
– Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate
data e.g., occupation=“ ”
– Noisy: containing errors or outliers e.g.,
Salary=“-10”
– Inconsistent: containing discrepancies in codes or
names e.g., Age=“42” Birthday=“03/07/2005”
DATA PREPROCESSING
• Incomplete data comes from
– n/a data value when collected
– different consideration between the time when the
data was collected and when it is analyzed.
– human/hardware/software problems
• Noisy data comes from the process of data
– collection
– entry
– transmission
• Inconsistent data comes from
– Different data sources
– Functional dependency violation
DIRTY DATA COMES FROM
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause
incorrect or even misleading statistics.
– Data warehouse needs consistent integration of
quality data
• Data extraction, cleaning, and transformation
comprises the majority of the work of building
a data warehouse. —Bill Inmon
IMPORTANCE OF DATA PREPROCESSING
A. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
B. Data integration
– Integration of multiple databases, data cubes, or files
C. Data transformation
– Normalization and aggregation
D. Data reduction
– Obtains reduced representation in volume but produces the
same or similar analytical results
E. Data discretization
– Part of data reduction but with particular importance,
especially for numerical data
MAJOR TASKS IN DATA PREPROCESSING
FORMS OF DATA PREPROCESSING
FORMS OF DATA PREPROCESSING (CONT…)
• Importance
– “Data cleaning is one of the three biggest
problems in data Processing”—Ralph Kimball
– “Data cleaning is the number one problem in data
Processing”—DCI survey
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
A. DATA CLEANING
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
DATA CLEANING TASKS
• Data can be in DBMS
• ODBC, JDBC protocols
• Data in a flat file
• Fixed-column format
• Delimited format: tab, comma “,”, other
• E.g. C4.5 and Weka “arff” use comma-delimited
data
• Attention: Convert field delimiters inside strings
• Verify the number of fields before and after
1. DATA CLEANING: ACQUISITION
• Field types:
– binary, nominal (categorical),ordinal, numeric, …
– For nominal fields: tables translating codes to full
descriptions
• Field role:
– input : inputs for modelling
– target : output
– id/auxiliary : keep, but not use for modelling
– ignore : don’t use for modelling
– weight : instance weight
– …
• Field descriptions
DATA CLEANING: METADATA
• Convert data to a standard format (e.g. arff or
csv)
• Missing values
• Unified date format
• Binning of numeric data
• Fix errors and outliers
• Convert nominal fields whose values have
order to numeric.
DATA CLEANING: REFORMATTING
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– Equipment malfunction
– Inconsistent with other recorded data and thus deleted
– Data not entered due to misunderstanding
– Certain data may not be considered important at the time of
entry
– Not register history or changes of the data
• Missing data may need to be inferred.
2.FILLING MISSING VALUES
• Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when
the percentage of missing values per attribute varies
considerably.
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
• Imputation: Use the attribute mean to fill in the missing
value, or use the attribute mean for all samples belonging
to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
HANDLING MISSING DATA
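A minimal pandas sketch of two of the imputation options above, the attribute mean and the per-class attribute mean; the columns and values are invented.

```python
# Two imputation strategies: global attribute mean and per-class attribute mean
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 34.0],     # missing values (toy data)
})

# Global attribute mean
df["income_global"] = df["income"].fillna(df["income"].mean())

# Mean of samples belonging to the same class (usually smarter)
df["income_by_class"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(df)
```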
• We want to transform all dates to the same format
internally
• Some systems accept dates in many formats
– e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc
– dates are transformed internally to a standard
value
• Frequently, just the year (YYYY) is sufficient
• For more details, we may need the month, the day,
the hour, etc
• Representing date as YYYYMM or YYYYMMDD
can be OK, but has problems
3. UNIFIED DATE FORMAT
• To preserve intervals, we can use
– Unix system date: Number of seconds since
1970
– Number of days since Jan 1, 1960 (SAS)
• Problem:
– values are non-obvious
– don’t help intuition and knowledge
discovery
– harder to verify, easier to make an error
UNIFIED DATE FORMAT OPTIONS
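A small pandas sketch, reusing the example date strings above, that parses mixed formats into one internal representation and also shows the Unix-seconds option.

```python
# Unifying mixed date strings into one internal representation
import pandas as pd

raw = ["Sep 24, 2003", "9/24/03", "2003-09-24"]

# Parse each string individually, then store everything as a single datetime type
dates = pd.Series([pd.to_datetime(s) for s in raw])

print(dates.dt.strftime("%Y%m%d").tolist())      # unified YYYYMMDD strings

# Unix-style representation: seconds since 1970 (preserves intervals, less readable)
unix_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
print(unix_seconds.tolist())
```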
• Some tools can deal with nominal values
internally
• Other methods (neural nets, regression,
nearest neighbor) require only numeric inputs
• To use nominal fields in such methods need
to convert them to a numeric value
– Q: Why not ignore nominal fields altogether?
– A: They may contain valuable information
• Different strategies for binary, ordered, multi-
valued nominal fields
4.CONVERSION: NOMINAL TO NUMERIC
101
• Binary fields
–E.g. Gender = M, F
• Convert to Field_0_1 with 0, 1 values
–e.g. Gender = M → Gender_0_1 = 0
–Gender = F → Gender_0_1 = 1
CONVERSION: BINARY TO NUMERIC
• Ordered attributes (e.g. Grade) can be converted to
numbers preserving natural order, e.g.
– A → 4.0
– A- → 3.7
– B+ → 3.3
– B → 3.0
• Q: Why is it important to preserve natural
order?
• A: To allow meaningful comparisons, e.g. Grade
> 3.5
CONVERSION: ORDERED TO NUMERIC
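A short pandas sketch of the binary and ordered conversions above; the column names are illustrative.

```python
# Binary and ordered (ordinal) fields mapped to numbers, as in the examples above
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "F", "M"],
                   "Grade":  ["A", "B+", "A-", "B"]})

df["Gender_0_1"] = df["Gender"].map({"M": 0, "F": 1})            # binary field
df["Grade_num"]  = df["Grade"].map({"A": 4.0, "A-": 3.7,         # order-preserving
                                    "B+": 3.3, "B": 3.0})

print(df)
print(df[df["Grade_num"] > 3.5])     # meaningful comparison, e.g. Grade > 3.5
```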
• Noise: random error or variance in a
measured variable
• Incorrect attribute values may be due to
– Faulty data collection instruments
– Data entry problems
– Data transmission problems
– Technology limitation
– Inconsistency in naming convention
• Other data problems which require data
cleaning
– Duplicate records
– Incomplete data
5. NOISY DATA
• Binning method:
– first sort data and partition into (equi-depth) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
• Regression
– smooth by fitting the data into regression functions
6. HANDLING NOISY DATA
• Equal-width(distance) partitioning:
• It divides the range into N intervals of equal size:
uniform grid
• if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B-A)/N.
• The most straightforward
• But outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth(frequency) partitioning:
• It divides the range into N intervals, each containing
approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
SIMPLE DISCRETIZATION METHODS: BINNING
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21,
24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
– -Bin 1: 4, 8, 9, 15
– -Bin 2: 21, 21, 24, 25
– -Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– -Bin 1: 9, 9, 9, 9
– -Bin 2: 23, 23, 23, 23
– -Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– -Bin 1: 4, 4, 4, 15
– -Bin 2: 21, 21, 25, 25
– -Bin 3: 26, 26, 26, 34
BINNING METHODS FOR DATA SMOOTHING
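A small NumPy sketch that reproduces the equi-depth bins and the two smoothing results shown above for the same price data.

```python
# Equi-depth binning and smoothing of the price data shown above
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # already sorted
bins = prices.reshape(3, 4)          # 3 equi-depth bins of 4 values each

# Smoothing by bin means: every value replaced by its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)), 4)

# Smoothing by bin boundaries: each value snaps to the nearest bin edge
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means.astype(int))     # [ 9  9  9  9 23 23 23 23 29 29 29 29]
print(by_bounds.ravel())        # [ 4  4  4 15 21 21 25 25 26 26 26 34]
```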
• Linear regression
involves finding the
“best” line to fit two
attributes (or
variables), so that
one attribute can be
used to predict the
other.
• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
DATA SMOOTHING - REGRESSION
• Outliers may be
detected by
clustering, for
example, where
similar values are
organized into
groups, or
“clusters.”
• Intuitively, values that fall outside of the set of clusters may be considered outliers.
DATA SMOOTHING - OUTLIER ANALYSIS
APPLICATIONS OF DATA SCIENCE
 Healthcare
 Finance
 Marketing
 Retail
 Education
 Transportation and Logistics
 Telecommunications
 Government and Public
Services
 Environmental Science
Applications of Data Science in
various fields
Predictive Analytics
Predict disease outbreaks and patient
admission rates.
Clinical Research
Analyse patient data for drug discovery
and treatment effectiveness.
Personalized Medicine
Tailor treatments based on individual
patient data.
1. Healthcare
 Algorithmic Trading
Develop trading algorithms based on historical
market data.
 Fraud Detection
Identify fraudulent transactions and activities.
 Credit Scoring
Assess creditworthiness of individuals and
businesses.
2. Finance
 Customer Segmentation
Divide customers into groups for targeted
marketing.
 Recommendation Systems
Provide personalized product or content
recommendations.
 Market Basket Analysis
 Understand customer purchasing patterns.
3.Marketing
 Inventory Management
Optimize stock levels to prevent overstocking or
shortages.
 Demand Forecasting
Predict product demand to optimize pricing and stock
levels.
 Customer Analytics
Analyse customer behaviour to enhance shopping
experience.
4.Retail
 Adaptive Learning
Customize educational content based on student
performance data.
 Dropout Prediction
Identify students at risk of dropping out and provide
support.
 Learning Analytics
Analyze student data to improve teaching methods and
curriculum.
5.Education
 Route Optimization
Find the most efficient routes for transportation and
delivery.
 Predictive Maintenance
Anticipate when vehicles or machinery will need
maintenance.
 Supply Chain Analytics
Optimize the supply chain for cost efficiency.
6.Transportation and Logistics
 Network Optimization
Improve network performance based on usage patterns.
 Churn Prediction
Predict and reduce customer churn by identifying at-risk
customers.
 Fraud Detection
Identify and prevent fraudulent activities on the network.
7.Telecommunications
Crime Prediction
Predict high-risk areas for crime and
allocate resources accordingly.
Social Welfare
Identify eligible recipients for social
welfare programs using income and
demographic data.
Traffic Management
8.Government and Public Services
 Climate Modelling
Analyze climate data to predict and understand
climate patterns.
 Natural Disaster Prediction
Predict natural disasters based on environmental data.
 Resource Conservation
Optimize resource usage and conservation strategies.
9.Environmental Science
 Performance Analysis
Analyze player performance and strategize game
plans.
 Injury Prevention
Predict and prevent injuries by analyzing player
health data.
 Fan Engagement
Analyze fan data to enhance the fan experience.
10.Sports Analytics
 Recruitment
Analyze resumes and job descriptions for optimal
candidate matching.
 Employee Retention
Identify factors contributing to employee turnover and
implement retention strategies.
 Workforce Planning
 Predict future workforce needs based on historical
data.
11. Human Resources
THANK YOU
Team – DSV
ASSIGNMENT-01
1) SUPERVISED:
• Linear regression-sales forecasting(Prediction)
• Decision Tree- used for both classification and regression
Linear regression-sales forecasting (Prediction)
• 1) WHAT IS LINEAR REGRESSION?
LINEAR AND NONLINEAR REGRESSION
❖ It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
❖ Linear regression attempts to find the mathematical relationship between variables.
❖ If the outcome is a straight line then it is considered a linear model, and if it is a curved line, then it is a non-linear model.
❖ The relationship of the dependent variable is given by a straight line, and there is only one independent variable.
Y = a + b X
❖ The model 'Y' is a linear function of 'X'.
❖ The value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.
MULTIPLE LINEAR REGRESSION
❖ Multiple linear regression is an extension of linear regression analysis.
❖ It uses two or more independent variables to predict an outcome and a single continuous
dependent variable.
Y = a0 + a1X1 + a2X2 + ... + akXk + e
where,
'Y' is the response variable.
X1, X2, ..., Xk are the independent predictors.
'e' is the random error.
a0, a1, a2, ..., ak are the regression coefficients.
LOGISTIC REGRESSION
❖ Logistic Regression was used in the biological sciences in the early twentieth century. It was then
used in many social science applications. Logistic Regression is used when the dependent
variable (target) is categorical.
❖ For example,
➢ To predict whether an email is spam (1) or not (0)
➢ Whether the tumor is malignant (1) or not (0)
Decision Tree- used for both classification and
regression
C0-01 OEAD0002.pptx ,msbxkasbdkbakwdbkawdka

C0-01 OEAD0002.pptx ,msbxkasbdkbakwdbkawdka

  • 1.
    CO-01 DATA SCIENCE &VISUALIZATION
  • 2.
    SYLLABUS Definition of DataScience AI, Machine Learning and Data Science Extracting Meaningful patterns Building representative models Statistics ML and Computing Learning Algorithms Data Science Classification Data Science Algorithms Data Science process Prior knowledge Data Preparation, Modeling and Application.
  • 3.
    Definition of DataScience • Data science is the process of using scientific methods, algorithms, and programming to extract knowledge and insights from data. It involves collecting, analyzing, and interpreting large amounts of data to find hidden patterns and trends
  • 4.
    AI, Machine Learningand Data Science
  • 5.
    Data science andartificial intelligence are not the same. “Data science and artificial intelligence are two technologies that are transforming the world. While artificial intelligence powers data science operations, data science is not completely dependent on AI. Data Science is leading the fourth industrial revolution. ”
  • 6.
    Data science alsorequires machine learning algorithms, which results in dependency on AI.
  • 7.
    Comparison Between AIand Data Science • Data science jobs require the knowledge of ML languages like R and Python to perform various data operations and computer science expertise. • Data science uses more tools apart from AI. This is because data science involves multiples steps to analyze data and generate insights. • Data science models are built for statistical insights whereas AI is used to build models that mimic cognition and human understanding.
  • 8.
    Comparison Between AIand Data Science • Today’s industries require both, data science and artificial intelligence. • Data science will help them make necessary data- driven decisions and assess their performance in the market, while artificial intelligence will help industries work with smarter devices and software that will minimize workload and optimize all the processes for improves innovation.
  • 9.
    Data Science Artifici al Intellige nce ML/ DM/ Analyt ics 1 0 “Data science producesinsights. Machine learning produces predictions” DATA SCIENCE AND ARTIFICIAL INTELLIGENCE
  • 10.
    Comparison Between AIand Data Science
  • 11.
    9 • They arenot the “same thing” • Big data = crude oil • Big data is about extracting “crude oil”, transporting it in “mega tankers”, siphoning it through “pipelines”, and storing it in “massive silos” • Data science is about refining the “crude oil” Carlos Samohano Founder, Data Science London DATA SCIENCE AND BIG DATA
  • 12.
  • 13.
    Computer Science Machine Learning Statistics Data Science isa science which uses : Visualization To collect ,clean, Integrate, analyze, visualize, interact with data to create data products. Human Computer Interaction
  • 14.
    A. Programming Languages •Python • R B. Data Visualization Tools • Tableau • Power BI C. Machine Learning Libraries • Scikit-learn • TensorFlow D. Big Data Frameworks • Apache HadooP • Apache Spark DATA SCIENCE TOOLS AND TECHNOLOGIES
  • 15.
    Extracting Meaningful patterns Apattern is a set of objects or phenomena or concepts where the elements of the set are similar to one another in certain ways or aspects. A pattern is an entity , that could be given a name . • Example : Fingerprint Image, handwritten word , human face , speech signal , DNA sequence etc
  • 16.
  • 17.
  • 18.
    18 PATTERN RECOGNITION PROCESS Dataacquisition and sensing: Measurements of physical variables. Important issues: bandwidth, resolution , etc. Pre-processing: Removal of noise in data. Isolation of patterns of interest from the background. Feature extraction: Finding a new representation in terms of features. Classification Using features and learned models to assign a pattern to a category. Post-processing Evaluation of confidence in decisions.
  • 19.
    19 PATTERN RECOGNITION MODEL Statisticalmodel: Pattern recognition systems are based on statistics and probabilities. Syntactic model: Structural models for pattern recognition and are based on the relation between features. Here the patterns are represented by structures . Template matching model: In this model, a template or a prototype of the pattern to be recognized is available. Neural network model: An artificial neural network (ANN) is a self-adaptive trainable process that is able to learn and resolve complex problems based on available knowledge.
  • 20.
    PATTERN CLASS 20 A Patternclass is a set of patterns sharing common attributes . A collection of “Similar” ( not necessarily identical ) objects. During recognition given objects are assigned to prescribed classes.
  • 21.
    21 APPLICATIONS OF PATTERNRECOGNITION Input: Images with characters (normally contaminated with noise) Output: The identified character string Useful in scenarios such as automatic license plate recognition (ALPR), optical character recognition(OCR) ,etc. CHARACTER RECOGNITION
  • 22.
    22 APPLICATIONS OF PATTERNRECOGNITION Input: Acoustic signal (Sound waves etc) Output: Contents of the speech Useful in scenarios such as speech-to-text (STT), voice command and control etc. SPEECH RECOGNITION
  • 23.
    23 APPLICATIONS OF PATTERNRECOGNITION FINGERPRINT RECOGNITION Input: Fingerprint of some person Output: The persons identity. Useful in scenarios such as computerized access control , criminal pursuit, etc.
  • 24.
    Building representative models/Datascience Life Cycle • The Data Science lifecycle is mainly designed for Big Data problems and data science projects. • The cycle is iterative to represent real project. • To address the distinct requirements for performing analysis on Big Data, step – by – step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.
  • 25.
    DATA SCIENCE LIFECYCLE…….contd..
  • 26.
    • Exploratory DataAnalysis (EDA) is a method used by data scientists to analyze and investigate data sets. It's a critical initial step in the data science workflow. EDA involves: • Summarizing: • Visualizing: • Identifying: • Exploring: • Outlier detection: • Hypothesis testing: DATA SCIENCE LIFECYCLE …….contd..
  • 27.
    Phase 1: Understandingthe problem –  The data science team learn and investigate the problem.  Develop context and understanding.  Come to know about data sources needed and available for the project.  The team formulates initial hypothesis that can be later tested with data.  Hypothesis: An idea that is suggested as the possible DATA SCIENCE LIFECYCLE …….contd..
  • 28.
    Phase 2: GatheringRelevant Data–  Steps to explore, preprocess, and condition data prior to modeling and analysis.  It requires the presence of an analytic sandbox, the team execute, load, and transform, to get data into the sandbox.  Data preparation tasks are likely to be performed multiple times and not in predefined order.  Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine, etc. DATA SCIENCE LIFECYCLE …….contd..
  • 29.
    Phase 3: DataPreparation–  Team explores data to learn about relationships between variables and subsequently, selects key variables and the most suitable models.  In this phase, data science team develop data sets for training, testing, and production purposes.  Team builds and executes models based on the work done in the model planning phase.  Several tools commonly used for this phase are – Matlab, STASTICA. DATA SCIENCE LIFECYCLE …….contd..
  • 30.
    Phase 4: FeatureEngineering and Feature Extraction –  Team develops datasets for testing, training, and production purposes.  Team also considers whether its existing tools will suffice for running the models or if they need more robust environment for executing models.  Free or open-source tools – Rand PL/R, Octave, WEKA.  Commercial tools – Matlab , STASTICA. DATA SCIENCE LIFECYCLE …….contd..
  • 31.
    Phase 5: ModelBuilding and Deployment–  After executing model team need to compare outcomes of modeling to criteria established for success and failure.  Team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account warning, assumptions.  Team should identify key findings, quantify business value, and develop narrative to summarize and convey findings to stakeholders. DATA SCIENCE LIFECYCLE …….contd..
  • 32.
    Statistics , MLand Computing • Statistical Foundations for ML (e.g., probability, hypothesis testing, Bayesian inference) • Machine Learning Algorithms (e.g., supervised/unsupervised learning, neural networks) • Computational Techniques (e.g., optimization algorithms, numerical methods)
  • 33.
    Statistics , MLand Computing
  • 34.
  • 35.
    Learning Algorithms Below isthe list of the top 10 commonly used Machine Learning Algorithms: • Linear regression • Logistic regression • Decision tree • SVM algorithm • Naive Bayes algorithm • KNN algorithm • K-means • Random forest algorithm • Dimensionality reduction algorithms • Gradient boosting algorithm and Ada-Boosting algorithm
  • 36.
    1. Linear Regression •To understand the working functionality of Linear Regression, imagine how you would arrange random logs of wood in increasing order of their weight. There is a catch; however – you cannot weigh each log. You have to guess its weight just by looking at the height and girth of the log (visual analysis) and arranging them using a combination of these visible parameters. This is what linear regression in machine learning is like. • In this process, a relationship is established between independent and dependent variables by fitting them to a line. This line is known as the regression line and is represented by a linear equation Y= a *X + b. • In this equation: • Y – Dependent Variable • a – Slope • X – Independent variable • b – Intercept • The coefficients a & b are derived by minimizing the sum of the squared difference of distance between data points and the regression line.
  • 37.
    2. Logistic Regression •Logistic Regression is used to estimate discrete values (usually binary values like 0/1) from a set of independent variables. It helps predict the probability of an event by fitting data to a logit function. It is also called logit regression. • These methods listed below are often used to help improve logistic regression models: • include interaction terms • eliminate features • regularize techniques • use a non-linear model
  • 38.
    3. Decision Tree •Decision Tree algorithm in machine learning is one of the most popular algorithm in use today; this is a supervised learning algorithm that is used for classifying problems. It works well in classifying both categorical and continuous dependent variables. This algorithm divides the population into two or more homogeneous sets based on the most significant attributes/ independent variables.
  • 39.
    4. SVM (SupportVector Machine) Algorithm • SVM algorithm is a method of a classification algorithm in which you plot raw data as points in an n-dimensional space (where n is the number of features you have). The value of each feature is then tied to a particular coordinate, making it easy to classify the data. Lines called classifiers can be used to split the data and plot them on a graph.
  • 40.
    5. Naive BayesAlgorithm • A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. • Even if these features are related to each other, a Naive Bayes classifier would consider all of these properties independently when calculating the probability of a particular outcome. • A Naive Bayesian model is easy to build and useful for massive datasets. It's simple and is known to outperform even highly sophisticated classification methods.
  • 41.
    6.KNN (K- NearestNeighbors) Algorithm • This algorithm can be applied to both classification and regression problems. Apparently, within the Data Science industry, it's more widely used to solve classification problems. It’s a simple algorithm that stores all available cases and classifies any new cases by taking a majority vote of its k neighbors. The case is then assigned to the class with which it has the most in common. A distance function performs this measurement. • KNN can be easily understood by comparing it to real life. For example, if you want information about a person, it makes sense to talk to his or her friends and colleagues! • Things to consider before selecting K Nearest Neighbours Algorithm: • KNN is computationally expensive • Variables should be normalized, or else higher range variables can bias the algorithm • Data still needs to be pre-processed.
  • 42.
    7. K-Means • Itis an unsupervised learning algorithm that solves clustering problems. Data sets are classified into a particular number of clusters (let's call that number K) in such a way that all the data points within a cluster are homogenous and heterogeneous from the data in other clusters. • How K-means forms clusters: • The K-means algorithm picks k number of points, called centroids, for each cluster. • Each data point forms a cluster with the closest centroids, i.e., K clusters. • It now creates new centroids based on the existing cluster members. • With these new centroids, the closest distance for each data point is determined. This process is repeated until the centroids do not change.
  • 43.
    8.Random Forest Algorithm •A collective of decision trees is called a Random Forest. To classify a new object based on its attributes, each tree is classified, and the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest). • Each tree is planted & grown as follows: • If the number of cases in the training set is N, then a sample of N cases is taken at random. This sample will be the training set for growing the tree. • If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M, and the best split on this m is used to split the node. The value of m is held constant during this process. • Each tree is grown to the most substantial extent possible. There is no pruning.
  • 44.
    9. Dimensionality ReductionAlgorithms • In today's world, vast amounts of data are being stored and analyzed by corporates, government agencies, and research organizations. As a data scientist, you know that this raw data contains a lot of information - the challenge is to identify significant patterns and variables. • Dimensionality reduction algorithms like Decision Tree, Factor Analysis, Missing Value Ratio, and Random Forest can help you find relevant details.
  • 45.
    10.Gradient Boosting Algorithmand AdaBoosting Algorithm • Gradient Boosting Algorithm and AdaBoosting Algorithm are boosting algorithms used when massive loads of data have to be handled to make predictions with high accuracy. Boosting is an ensemble learning algorithm that combines the predictive power of several base estimators to improve robustness. • In short, it combines multiple weak or average predictors to build a strong predictor. These boosting algorithms always work well in data science competitions like Kaggle, AV Hackathon, CrowdAnalytix. These are the most preferred machine learning algorithms today. Use them, along with Python and R Codes, to achieve accurate outcomes
  • 46.
  • 47.
  • 48.
    Data Science Classification •In data science, classification is a type of supervised machine learning where the goal is to predict the category or label of new observations based on past data. • It involves mapping input data to discrete output classes.
  • 49.
    Key Concepts inClassification • Supervised Learning: – The training dataset contains both input features and corresponding class labels. – The model learns to map input features to labels. • Classes (Categories): – Binary Classification: Two classes (e.g., spam vs. not spam). – Multiclass Classification: More than two classes (e.g., classifying emails into spam, promotions, and social). • Features: – Independent variables or attributes used to predict class labels. • Target Variable: – The dependent variable representing the class label.
Common Classification Algorithms
• Logistic Regression: for binary classification problems.
• Decision Trees: tree-like structure for decision making.
• Random Forest: ensemble of multiple decision trees.
• Support Vector Machine (SVM): finds the optimal hyperplane for separation.
• k-Nearest Neighbors (k-NN): classifies based on the majority class of the nearest data points.
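A minimal sketch fitting several of the listed classifiers on one dataset; the Wine dataset, the scaling pipeline, and the hyperparameters are illustrative assumptions.

# Minimal sketch: trying several of the listed classifiers on the Wine dataset.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", round(model.score(X_test, y_test), 3))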
Evaluation Metrics for Classification
• Accuracy: overall correctness of predictions.
• Precision: true positives divided by all predicted positives.
• Recall (Sensitivity): true positives divided by all actual positives.
• F1-Score: harmonic mean of precision and recall.
• ROC-AUC Curve: measures the trade-off between sensitivity and specificity.
• Confusion Matrix: tabular visualization of model performance.
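A minimal sketch computing the listed metrics with scikit-learn on hypothetical binary predictions.

# Minimal sketch of the listed metrics on hypothetical binary predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                    # actual labels (illustrative)
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions
y_score = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1, 0.7, 0.3]    # predicted probabilities for class 1

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc-auc  :", roc_auc_score(y_true, y_score))
print("confusion matrix:")
print(confusion_matrix(y_true, y_pred))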
Applications of Classification in Data Science
• Email Spam Detection: classify emails as spam or not.
• Customer Churn Prediction: identify customers likely to leave a service.
• Medical Diagnosis: classify diseases based on symptoms.
• Sentiment Analysis: determine whether a review is positive, negative, or neutral.
• Fraud Detection: identify fraudulent transactions.
Prior Knowledge
• Before diving into the field of data science, it is essential to have a solid foundation in certain subjects.
• This prior knowledge will help you better understand data, choose appropriate algorithms, and interpret results. Here is a breakdown of the key areas to focus on.
• 1. Mathematics
• 2. Programming & Tools
• 3. Machine Learning and Algorithms
• 4. Data Engineering and Databases
• 5. Data Wrangling and Preprocessing
• 6. Communication and Storytelling
• 7. Soft Skills
1. Mathematics
• Linear Algebra:
– Vectors, matrices, and matrix operations (addition, multiplication, inversion).
– Eigenvalues and eigenvectors (important in dimensionality reduction algorithms like PCA).
• Probability & Statistics:
– Probability theory: Bayes' Theorem, distributions (e.g., normal, binomial), and random variables.
– Regression analysis: linear regression, correlation, and multivariate regression.
– Sampling techniques: random sampling, stratified sampling, and sampling distributions.
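A minimal NumPy sketch of the eigenvalue/eigenvector idea mentioned above; the small 2x2 matrix is illustrative.

# Minimal sketch: eigenvalues and eigenvectors of a small symmetric matrix (illustrative values).
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

values, vectors = np.linalg.eig(A)
print("eigenvalues :", values)       # e.g. [3. 1.]
print("eigenvectors:")
print(vectors)                       # columns are the eigenvectors

# Check the defining relation A @ v = lambda * v for the first pair
v = vectors[:, 0]
print(np.allclose(A @ v, values[0] * v))   # True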
2. Programming & Tools
• Python:
– The most widely used language in data science due to its simplicity and vast ecosystem of libraries such as Pandas, NumPy, SciPy, Matplotlib, Scikit-learn, TensorFlow, and PyTorch.
• R:
– Another popular language, particularly for statistical analysis and data visualization.
• SQL (Structured Query Language):
– Essential for extracting, manipulating, and querying data from relational databases (e.g., MySQL, PostgreSQL).
• Data Visualization Tools:
– Libraries like Matplotlib, Seaborn, and Plotly (in Python).
– Tableau or Power BI for interactive dashboards.
3. Machine Learning and Algorithms
• Supervised Learning:
– Classification (e.g., Logistic Regression, Decision Trees, k-NN, SVM) and Regression (e.g., Linear Regression, Ridge/Lasso).
– Performance metrics: accuracy, precision, recall, F1-score, ROC-AUC, etc.
• Unsupervised Learning:
– Clustering (e.g., K-means, Hierarchical Clustering, DBSCAN).
– Association Rules (e.g., Apriori, Eclat).
• Deep Learning:
– Neural Networks: perceptrons, feedforward neural networks.
– Convolutional Neural Networks (CNNs) for image data.
– Recurrent Neural Networks (RNNs) and LSTMs for sequence data.
4. Data Engineering and Databases
• Databases and Data Warehouses:
– Understanding relational databases.
• Big Data Technologies:
– Familiarity with frameworks like Hadoop and Spark for processing large datasets efficiently.
5. Data Wrangling and Preprocessing
• Handling missing data: techniques like imputation, removal, or filling with default values.
• Feature engineering: creating new features from raw data to improve model performance.
• Data transformation: normalization, scaling, and encoding categorical variables (e.g., one-hot encoding).
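A minimal pandas/scikit-learn sketch of the three wrangling steps above; the column names and values are hypothetical.

# Minimal wrangling sketch: imputation, a derived feature, scaling, and one-hot encoding.
# Column names and values are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age":    [25, None, 40, 31],
    "income": [40000, 52000, None, 61000],
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai"],
})

# Handling missing data: fill numeric gaps with the column mean
df[["age", "income"]] = df[["age", "income"]].fillna(df[["age", "income"]].mean())

# Feature engineering: a derived income-per-year-of-age feature
df["income_per_age"] = df["income"] / df["age"]

# Data transformation: scale numeric columns to [0, 1] and one-hot encode the city
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
df = pd.get_dummies(df, columns=["city"])
print(df)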
6. Communication and Storytelling
• Data Visualization: being able to present data insights clearly using charts and graphs.
• Reporting: summarizing the analysis, explaining the methodology, and presenting results in a way that non-technical stakeholders can understand.
• Storytelling: presenting a narrative around the data to support decision-making.
7. Soft Skills
• Critical Thinking: the ability to approach problems from different angles and think analytically.
• Problem-Solving: tackling complex business problems with a data-driven approach.
Data Collection/Preparation Strategies
• Data collection is a fundamental step in the data analytics and visualization process.
• The quality and relevance of the collected data significantly impact the insights and decisions derived from the analysis.
• Effective data collection and visualization strategies are essential for extracting valuable insights and empowering data-driven decision-making.
• It is a dynamic process that requires continuous refinement based on user feedback and changing business needs.
Data Collection/Preparation Strategies in the Context of Data Analytics and Visualization
• Define Clear Objectives
• Identify Relevant Data Sources
• Data Quality Assessment
• Consider Structured and Unstructured Data
• Real-time Data Collection
• Data Privacy and Ethics
• Sampling Techniques
• Surveys and Questionnaires
• Collaboration with Stakeholders
• Data Integration
• Feedback Loops
1. Define Clear Objectives
• Clearly outline the goals and objectives of your data analytics and visualization project.
• Understand the questions you want to answer and the insights you aim to derive.
• Knowing what insights you aim to gain will guide your data collection efforts.
2. Identify Relevant Data Sources
• Determine the sources of data that are relevant to your objectives. These can include databases, spreadsheets, APIs (Application Programming Interfaces), external datasets, or a combination of these.
• Determine the key performance indicators (KPIs) and metrics relevant to your analysis and visualization goals. These metrics will drive the selection of data sources and variables.
• Choose data sources that align with your objectives.
3. Data Quality Assessment
• Assess the quality of the available data. Check for completeness, accuracy, consistency, and relevance. Cleaning and pre-processing may be necessary to address any issues.
• Prioritize data quality over quantity. Ensure that the data collected is accurate, complete, and reliable. Address issues such as missing values, outliers, and inconsistencies early in the process.
4. Consider Structured and Unstructured Data
• Depending on your objectives, collect both structured data (e.g., databases) and unstructured data (e.g., text, images) for a more comprehensive analysis.
5. Real-time and Automated Data Collection
• If your analysis requires real-time insights, consider implementing systems for collecting and processing data in real time.
• This is especially important for dynamic datasets.
6. Data Privacy and Ethics
• Ensure compliance with data privacy regulations. Obtain necessary permissions for data collection, especially when dealing with personal or sensitive information.
• Adhere to data security and privacy regulations. Protect sensitive information and ensure compliance with legal requirements to maintain the trust of stakeholders.
7. Sampling Techniques
• Use sampling methods when working with large datasets. This involves selecting a representative subset of data for analysis.
• Sampling can save computational resources and speed up the analysis.
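A minimal pandas sketch of simple random and stratified sampling; the DataFrame, the segment split, and the 10% fraction are hypothetical.

# Minimal sampling sketch: a simple random sample and a stratified sample with pandas.
import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 1001),
    "segment": ["retail"] * 700 + ["corporate"] * 300,
})

# Simple random sample of 10% of the rows
random_sample = df.sample(frac=0.10, random_state=0)

# Stratified sample: 10% from each segment, preserving the 70/30 split
stratified_sample = df.groupby("segment").sample(frac=0.10, random_state=0)

print(random_sample["segment"].value_counts())
print(stratified_sample["segment"].value_counts())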
8. Surveys and Questionnaires
• Design and deploy surveys or questionnaires to gather specific information directly from users or relevant stakeholders.
• Ensure that the questions align with your objectives.
9. Collaboration with Stakeholders
• Collaborate with domain experts and stakeholders to gain insights into the context of the data. Their input can help refine data collection strategies.
• Foster collaboration between the teams responsible for data collection and those performing analytics and visualization. Clear communication ensures that the collected data meets the analytical requirements.
10. Data Integration
• Integrate data from different sources to create a unified dataset. Ensure compatibility and consistency when combining data from various platforms.
11. Feedback Loops
• Establish feedback loops to continuously improve data collection processes.
• Monitor the performance of analytics and visualizations and adjust data collection strategies as needed.
Summary
• Remember to consider ethical and privacy considerations during data collection, and ensure data quality through proper cleaning and pre-processing before analysis and visualization.
• Surveys and Questionnaires: gather specific information through structured sets of questions; collect subjective data, opinions, and preferences.
• Qualitative and Observational Collection: understand perspectives; study natural behaviour, processes, and situations.
• Web and Sensor Data Collection: collect large volumes of diverse data from the internet; common in healthcare, manufacturing, and environmental monitoring.
• Social Media Monitoring: monitor public opinions and brand sentiment, and identify trends.
• Transaction Records: identify patterns, trends, and anomalies in business processes.
• Focus Groups: obtain diverse opinions and insights on specific topics.
Data Pre-processing/Modelling Overview in Data Science & Visualization
• Effective data collection is the foundation of meaningful data analytics and visualization.
• Data pre-processing/modelling is a crucial step in the data analytics and visualization process.
• It involves cleaning, transforming, and organizing raw data into a format that can be effectively utilized for analysis and visualization.
DATA CLEANING
• Data cleaning is the process of detecting corrupt data and inaccurate records in a record set or database table.
• The main purpose of the cleaning step is to detect incomplete, inaccurate, inconsistent and irrelevant data, and to apply techniques to modify or delete this useless data.
DATA INTEGRATION
• Data integration focuses on the unification of data residing in different sources and presenting a unified view of these data.
• Data with different representations are put together and any conflicts resulting from this are resolved.
• This process is vital in a number of scientific and commercial applications. With the increasing volume and exponential growth of data, integrating it becomes even more significant.
DATA REDUCTION
• Data reduction is the process of transforming digital information into an ordered and simplified form.
• This data is generally derived through empirical and experimental means.
• It involves reducing large amounts of data into smaller, meaningful fragments.
DATA DISCRETIZATION
• Data discretization is an important concept when you have a large amount of numeric data but only want to classify it based on nominal values.
• In this scenario, the continuous data is split into discrete intervals, and the values of these discrete sets are treated as the nominal values. It is basically a process of converting continuous data attributes into a finite set of intervals with minimal loss of information.
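A minimal pandas sketch of discretization into equal-width and equal-frequency intervals; the ages, bin counts, and labels are illustrative.

# Minimal discretization sketch: equal-width and equal-frequency binning with pandas.
import pandas as pd

ages = pd.Series([22, 25, 31, 35, 41, 47, 52, 58, 63, 70])

# Equal-width bins: the age range is split into 3 intervals of the same width
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency (quantile) bins: each bin gets roughly the same number of values
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "width_bin": equal_width, "freq_bin": equal_freq}))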
DATA PREPROCESSING
• Data in the real world is dirty:
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g., occupation = “ ”.
– Noisy: containing errors or outliers, e.g., Salary = “-10”.
– Inconsistent: containing discrepancies in codes or names, e.g., Age = “42” while Birthday = “03/07/2005”.
WHERE DIRTY DATA COMES FROM
• Incomplete data comes from:
– “n/a” values recorded at collection time
– different conventions between the time the data was collected and the time it is analyzed
– human, hardware, or software problems
• Noisy data comes from errors in data:
– collection
– entry
– transmission
• Inconsistent data comes from:
– different data sources
– functional dependency violations
IMPORTANCE OF DATA PREPROCESSING
• No quality data, no quality mining results!
– Quality decisions must be based on quality data, e.g., duplicate or missing data may cause incorrect or even misleading statistics.
– A data warehouse needs consistent integration of quality data.
• “Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.” (Bill Inmon)
MAJOR TASKS IN DATA PREPROCESSING
A. Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
B. Data integration
– Integration of multiple databases, data cubes, or files.
C. Data transformation
– Normalization and aggregation.
D. Data reduction
– Obtains a reduced representation in volume that produces the same or similar analytical results.
E. Data discretization
– Part of data reduction but of particular importance, especially for numerical data.
FORMS OF DATA PREPROCESSING
FORMS OF DATA PREPROCESSING (CONT…)
A. DATA CLEANING
• Importance
– “Data cleaning is one of the three biggest problems in data processing.” (Ralph Kimball)
– “Data cleaning is the number one problem in data processing.” (DCI survey)
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
DATA CLEANING TASKS
1. Data acquisition and metadata
2. Fill in missing values
3. Unified date format
4. Converting nominal to numeric
5. Identify outliers and smooth out noisy data
6. Correct inconsistent data
1. DATA CLEANING: ACQUISITION
• Data can be in a DBMS
– ODBC, JDBC protocols
• Data in a flat file
– Fixed-column format
– Delimited format: tab, comma “,”, or other
– E.g., C4.5 and Weka “arff” use comma-delimited data
• Attention: convert field delimiters inside strings
• Verify the number of fields before and after
DATA CLEANING: METADATA
• Field types: binary, nominal (categorical), ordinal, numeric, …
– For nominal fields: tables translating codes to full descriptions
• Field role:
– input: inputs for modelling
– target: output
– id/auxiliary: keep, but do not use for modelling
– ignore: do not use for modelling
– weight: instance weight
– …
• Field descriptions
DATA CLEANING: REFORMATTING
• Convert data to a standard format (e.g., arff or csv)
• Missing values
• Unified date format
• Binning of numeric data
• Fix errors and outliers
• Convert nominal fields whose values have an order to numeric
2. FILLING MISSING VALUES
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
– Equipment malfunction
– Inconsistency with other recorded data, leading to deletion
– Data not entered due to misunderstanding
– Certain data not being considered important at the time of entry
– History or changes of the data not being registered
• Missing data may need to be inferred.
HANDLING MISSING DATA
• Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious and often infeasible.
• Use a global constant to fill in the missing value: e.g., “unknown”, or a new class.
• Imputation: use the attribute mean to fill in the missing value, or use the attribute mean for all samples belonging to the same class: smarter.
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree.
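A minimal pandas sketch of the imputation options above (overall mean vs. per-class mean); the data are hypothetical.

# Minimal imputation sketch: overall-mean and per-class-mean fills (hypothetical data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, np.nan, 50.0, 54.0, np.nan],
})

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples belonging to the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))

print(df)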
3. UNIFIED DATE FORMAT
• We want to transform all dates to the same format internally
• Some systems accept dates in many formats
– e.g., “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
– dates are transformed internally to a standard value
• Frequently, just the year (YYYY) is sufficient
• For more detail, we may need the month, the day, the hour, etc.
• Representing a date as YYYYMM or YYYYMMDD can be OK, but has problems
UNIFIED DATE FORMAT OPTIONS
• To preserve intervals, we can use
– Unix system date: number of seconds since 1970
– Number of days since Jan 1, 1960 (SAS)
• Problem:
– values are non-obvious
– they do not help intuition and knowledge discovery
– harder to verify, easier to make an error
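A minimal pandas sketch that unifies the example date formats and derives an interval-preserving, Unix-style representation; the input strings mirror the examples above.

# Minimal sketch: unifying several date formats with pandas.
import pandas as pd

# Each representation is parsed with an explicit format string
parts = {
    "%b %d, %Y": ["Sep 24, 2003"],
    "%m/%d/%y":  ["9/24/03"],
    "%d.%m.%y":  ["24.09.03"],
}
parsed = pd.concat(
    [pd.to_datetime(pd.Series(v), format=f) for f, v in parts.items()],
    ignore_index=True)

# One standard rendering (YYYYMMDD) and an interval-preserving seconds-since-1970 value
print(parsed.dt.strftime("%Y%m%d").tolist())                              # ['20030924', '20030924', '20030924']
print((parsed - pd.Timestamp("1970-01-01")).dt.total_seconds().tolist())  # Unix-style seconds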
4. CONVERSION: NOMINAL TO NUMERIC
• Some tools can deal with nominal values internally
• Other methods (neural nets, regression, nearest neighbour) require only numeric inputs
• To use nominal fields in such methods, they need to be converted to a numeric value
– Q: Why not ignore nominal fields altogether?
– A: They may contain valuable information
• Different strategies exist for binary, ordered, and multi-valued nominal fields
CONVERSION: BINARY TO NUMERIC
• Binary fields
– E.g., Gender = M, F
• Convert to a Field_0_1 with 0, 1 values
– e.g., Gender = M → Gender_0_1 = 0
– Gender = F → Gender_0_1 = 1
CONVERSION: ORDERED TO NUMERIC
• Ordered attributes (e.g., Grade) can be converted to numbers preserving the natural order, e.g.
– A → 4.0
– A- → 3.7
– B+ → 3.3
– B → 3.0
• Q: Why is it important to preserve the natural order?
• A: To allow meaningful comparisons, e.g., Grade > 3.5
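A minimal pandas sketch covering both conversions, binary and ordered fields to numeric; the records are hypothetical.

# Minimal sketch: converting binary and ordered fields to numeric (hypothetical records).
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "grade":  ["A", "B+", "A-", "B"],
})

# Binary field -> 0/1
df["gender_0_1"] = df["gender"].map({"M": 0, "F": 1})

# Ordered field -> numbers that preserve the natural order
grade_points = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}
df["grade_numeric"] = df["grade"].map(grade_points)

print(df[df["grade_numeric"] > 3.5])   # the comparison made meaningful by the ordering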
5. NOISY DATA
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– Faulty data collection instruments
– Data entry problems
– Data transmission problems
– Technology limitations
– Inconsistency in naming conventions
• Other data problems which require data cleaning
– Duplicate records
– Incomplete data
6. HANDLING NOISY DATA
• Binning method:
– first sort the data and partition it into (equi-depth) bins
– then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and have them checked by a human
• Regression
– smooth by fitting the data to regression functions
SIMPLE DISCRETIZATION METHODS: BINNING
• Equal-width (distance) partitioning:
– Divides the range into N intervals of equal size: a uniform grid
– If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
– The most straightforward method
– But outliers may dominate the presentation
– Skewed data is not handled well
• Equal-depth (frequency) partitioning:
– Divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
BINNING METHODS FOR DATA SMOOTHING
• Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
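A minimal NumPy sketch that reproduces the equi-depth binning and both smoothing variants shown above.

# Minimal sketch reproducing the equi-depth binning and smoothing example above.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)   # three equi-depth bins of four sorted values each

# Smoothing by bin means: every value in a bin is replaced by the (rounded) bin mean
by_means = np.repeat(np.rint(bins.mean(axis=1)).astype(int), 4).reshape(3, 4)

# Smoothing by bin boundaries: each value moves to the nearest bin boundary
lo, hi = bins[:, [0]], bins[:, [-1]]
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)        # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_boundaries)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]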
DATA SMOOTHING - REGRESSION
• Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
DATA SMOOTHING - OUTLIER ANALYSIS
• Outliers may be detected by clustering, for example, where similar values are organized into groups, or “clusters.”
• Intuitively, values that fall outside of the set of clusters may be considered outliers.
Applications of Data Science in Various Fields
• Healthcare
• Finance
• Marketing
• Retail
• Education
• Transportation and Logistics
• Telecommunications
• Government and Public Services
• Environmental Science
• Sports Analytics
• Human Resources
1. Healthcare
• Predictive Analytics: predict disease outbreaks and patient admission rates.
• Clinical Research: analyse patient data for drug discovery and treatment effectiveness.
• Personalized Medicine: tailor treatments based on individual patient data.
2. Finance
• Algorithmic Trading: develop trading algorithms based on historical market data.
• Fraud Detection: identify fraudulent transactions and activities.
• Credit Scoring: assess the creditworthiness of individuals and businesses.
3. Marketing
• Customer Segmentation: divide customers into groups for targeted marketing.
• Recommendation Systems: provide personalized product or content recommendations.
• Market Basket Analysis: understand customer purchasing patterns.
4. Retail
• Inventory Management: optimize stock levels to prevent overstocking or shortages.
• Demand Forecasting: predict product demand to optimize pricing and stock levels.
• Customer Analytics: analyse customer behaviour to enhance the shopping experience.
5. Education
• Adaptive Learning: customize educational content based on student performance data.
• Dropout Prediction: identify students at risk of dropping out and provide support.
• Learning Analytics: analyze student data to improve teaching methods and the curriculum.
6. Transportation and Logistics
• Route Optimization: find the most efficient routes for transportation and delivery.
• Predictive Maintenance: anticipate when vehicles or machinery will need maintenance.
• Supply Chain Analytics: optimize the supply chain for cost efficiency.
7. Telecommunications
• Network Optimization: improve network performance based on usage patterns.
• Churn Prediction: predict and reduce customer churn by identifying at-risk customers.
• Fraud Detection: identify and prevent fraudulent activities on the network.
8. Government and Public Services
• Crime Prediction: predict high-risk areas for crime and allocate resources accordingly.
• Social Welfare: identify eligible recipients for social welfare programs using income and demographic data.
• Traffic Management
9. Environmental Science
• Climate Modelling: analyze climate data to predict and understand climate patterns.
• Natural Disaster Prediction: predict natural disasters based on environmental data.
• Resource Conservation: optimize resource usage and conservation strategies.
10. Sports Analytics
• Performance Analysis: analyze player performance and strategize game plans.
• Injury Prevention: predict and prevent injuries by analyzing player health data.
• Fan Engagement: analyze fan data to enhance the fan experience.
11. Human Resources
• Recruitment: analyze resumes and job descriptions for optimal candidate matching.
• Employee Retention: identify factors contributing to employee turnover and implement retention strategies.
• Workforce Planning: predict future workforce needs based on historical data.
ASSIGNMENT-01
1) SUPERVISED:
• Linear regression: sales forecasting (prediction)
• Decision Tree: used for both classification and regression
Linear Regression: Sales Forecasting (Prediction)
• 1) What is linear regression?
LINEAR AND NONLINEAR REGRESSION
❖ Linear regression is the simplest form of regression. It attempts to model the relationship between two variables by fitting a linear equation to the observed data.
❖ Linear regression attempts to find the mathematical relationship between variables.
❖ If the outcome is a straight line, the model is considered linear; if it is a curved line, the model is non-linear.
❖ The relationship with the dependent variable is given by a straight line, and there is only one independent variable:
Y = a + b X
❖ The model 'Y' is a linear function of 'X'.
❖ The value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.
MULTIPLE LINEAR REGRESSION
❖ Multiple linear regression is an extension of linear regression analysis.
❖ It uses two or more independent variables to predict an outcome and a single continuous dependent variable.
Y = a0 + a1 X1 + a2 X2 + ... + ak Xk + e
where 'Y' is the response variable; X1, X2, ..., Xk are the independent predictors; 'e' is the random error; and a0, a1, a2, ..., ak are the regression coefficients.
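A minimal scikit-learn sketch of multiple linear regression in the spirit of the sales-forecasting assignment; the feature names and numbers are hypothetical.

# Minimal multiple linear regression sketch (hypothetical sales data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [advertising spend, number of stores]; target: monthly sales
X = np.array([[10, 2], [15, 3], [20, 3], [25, 4], [30, 5]], dtype=float)
y = np.array([120, 160, 190, 230, 270], dtype=float)

model = LinearRegression().fit(X, y)
print("a0 (intercept):", model.intercept_)
print("a1, a2        :", model.coef_)

# Forecast sales for a new month with spend 28 and 4 stores
print("forecast:", model.predict([[28, 4]])[0])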
LOGISTIC REGRESSION
❖ Logistic regression was first used in the biological sciences in the early twentieth century and was later applied in many social science applications. Logistic regression is used when the dependent variable (target) is categorical.
❖ For example:
➢ To predict whether an email is spam (1) or not (0)
➢ To predict whether a tumour is malignant (1) or not (0)
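A minimal scikit-learn sketch of logistic regression for a spam-style binary target; the two features and their values are hypothetical.

# Minimal logistic regression sketch for a binary (spam vs. not spam) style problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [number of links, count of the word "free"]; label: 1 = spam, 0 = not spam
X = np.array([[0, 0], [1, 0], [5, 3], [7, 4], [2, 1], [8, 6]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression().fit(X, y)

new_email = [[6, 2]]
print("predicted class     :", clf.predict(new_email)[0])          # 0 or 1
print("probability of spam :", clf.predict_proba(new_email)[0, 1])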
Decision Tree: used for both classification and regression
