PRESCRIPTIVE ANALYTICS
SUB CODE:322BAB01
UNIT I
Prescriptive analytics Definition
Prescriptive analytics is a process that analyzes data and provides instant
recommendations on how to optimize business practices to suit multiple predicted
outcomes.
Prescriptive analytics is the third and final tier in modern, computerized data
processing. These three tiers include:
•Descriptive analytics: Descriptive analytics acts as an initial catalyst to clear and
concise data analysis. It is the “what we know” (current user data, real-time
data, previous engagement data, and big data).
•Predictive analytics: Predictive analytics applies mathematical models to the
current data to inform (predict) future behavior. It is the “what could happen."
•Prescriptive analytics: Prescriptive analytics utilizes similar modeling
structures to predict outcomes and then utilizes a combination of machine
learning, business rules, artificial intelligence, and algorithms to simulate various
approaches to these numerous outcomes. It then suggests the best possible
actions to optimize business practices. It is the “what should happen.”
Data Mining Definition
Data mining is the process of searching and analyzing a large batch of raw data in order
to identify patterns and extract useful information.
Companies use data mining software to learn more about their customers.
It can help them to develop more effective marketing strategies, increase sales, and
decrease costs.
Data mining relies on effective data collection, warehousing, and computer processing.
How Data Mining Works
Data mining involves exploring and analyzing large blocks of
information to glean meaningful patterns and trends. It is used in credit
risk management, fraud detection, and spam filtering. It also is a
market research tool that helps reveal the sentiment or opinions of a
given group of people. The data mining process breaks down into four
steps:
•Data is collected and loaded into data warehouses on-site or on a
cloud service.
•Business analysts, management teams, and information technology
professionals access the data and determine how they want to organize
it.
•Custom application software sorts and organizes the data.
•The end user presents the data in an easy-to-share format, such as a
graph or table.
The Data Mining Process
To be most effective, data analysts generally follow a certain flow of tasks along the data mining
process.
Step 1: Understand the Business
Before any data is touched, extracted, cleaned, or analyzed, it is important to
understand the underlying entity and the project at hand.
Step 2: Understand the Data
Once the business problem has been clearly defined, it's time to start thinking about
data.
This step also includes determining the limits of the data, storage, security, and
collection and assesses how these constraints will affect the data mining process.
Step 3: Prepare the Data
Data is gathered, uploaded, extracted, or calculated. It is then cleaned, standardized,
scrubbed for outliers, assessed for mistakes, and checked for reasonableness.
Step 4: Build the Model
With our clean data set in hand, it's time to crunch the numbers. Data scientists apply the
appropriate data mining techniques to search for relationships, trends, associations, or
sequential patterns.
Step 5: Evaluate the Results
The data-centered aspect of data mining concludes by assessing the findings of the data model or
models.
Step 6: Implement Change and Monitor
The data mining process concludes with management taking steps in response to the
findings of the analysis.
Text mining
Text mining is the process of turning natural language into something that can be
manipulated, stored, and analysed by machines.
It’s all about giving computers, which have historically worked with numerical data,
the ability to work with linguistic data – by turning it into something with a structured
format.
Quantitative and qualitative data
To really understand text mining, we need to establish some key concepts, such as the
difference between quantitative and qualitative data.
Qualitative data
Most of the human language we find in everyday life is qualitative data. It describes the
characteristics of things – their qualities – and expresses a person’s reasoning, emotion,
Quantitative data
The opposite of qualitative data is quantitative data. Quantitative data is numerical – it tells you
about quantity.
Text analysis takes qualitative textual data and turns it into quantitative, numerical data.
It does things like counting the number of times a theme, topic or phrase is included in a large
corpus of textual data, in order to determine the importance or prevalence of a topic.
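As a minimal sketch of this quantification step (assuming scikit-learn is available; the tiny corpus below is invented for illustration), term counts can be produced like this:

# Minimal sketch: turning qualitative text into quantitative term counts.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the delivery was fast and the product was great",
    "slow delivery but great product",
    "terrible product, never again",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)          # document-term matrix (sparse)

# Total frequency of each term across the corpus
totals = counts.sum(axis=0).A1
for term, total in sorted(zip(vectorizer.get_feature_names_out(), totals),
                          key=lambda x: -x[1]):
    print(term, total)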
Process of Text Mining
•Gathering unstructured information from various sources available in various
document formats, for example, plain text, web pages, PDF records, etc.
•Pre-processing and data cleansing tasks are performed to distinguish and eliminate
inconsistency in the data. The data cleansing process makes sure to capture the
genuine text; it involves removing stop words and stemming (the process of
identifying the root of a word) before indexing the data.
•Processing and controlling tasks are applied to review and further clean the data set.
•Pattern analysis is implemented in Management Information System.
•Information processed in the above steps is utilized to extract important and
applicable data for a powerful and convenient decision-making process and trend
analysis.
Procedures for Analyzing Text Mining
•Text Summarization: To extract its partial content and reflect its whole content automatically.
•Text Categorization: To assign a category to the text among categories predefined by users.
•Text Clustering: To segment texts into several clusters, depending on the substantial
relevance.
Web Mining Definition
Web mining is the process of using data mining techniques and algorithms to extract information directly
from the Web, through Web documents, Web services, hyperlinks, and server logs. The
goal of Web mining is to look for patterns in Web data by collecting and analyzing information
in order to gain insights into trends, the industry, and users in general.
Web mining
Web mining is the process of extracting valuable data from a website. This data can be used for a
variety of reasons, including marketing, customer analysis, and security purposes.
Types of web mining:
•Content web mining: The process of extracting useful information from the
contents of Web pages and Web documents, which are mostly text, images, audio, and video.
•Web Structure Mining: The process of analyzing the structure of nodes and the
connections of a website through the use of graph theory. Two kinds of structure are
examined: the structure of a website in terms of how it connects to other sites, and the
structure within the site itself, i.e., how its pages connect to one another.
•Web Usage Mining: The process of extracting patterns and information from server logs to gain
insights into user activity, including where users come from, how many users have clicked on an
item on the site, and the types of activities taking place on the site.
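A minimal web usage mining sketch (assuming pandas is available; the log records below are invented) that counts users per page and traffic per referrer:

import pandas as pd

log = pd.DataFrame([
    {"user": "u1", "referrer": "google.com", "page": "/product/42"},
    {"user": "u2", "referrer": "newsletter",  "page": "/product/42"},
    {"user": "u1", "referrer": "google.com", "page": "/checkout"},
    {"user": "u3", "referrer": "google.com", "page": "/product/7"},
])

print(log.groupby("page")["user"].nunique())      # unique users per page
print(log["referrer"].value_counts())             # where traffic comes from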
Process Mining
Process mining is an emerging data science technique that involves analyzing event logs to extract
information about an organization’s underlying operational processes. Here’s how it works.
In process mining, we use algorithms to analyze event data and reveal details about the
activities performed by people and machines.
Process mining has a wide range of applications across various disciplines
including finance, healthcare, manufacturing and logistics.
It’s an interdisciplinary field that combines techniques from data mining, machine learning and
process management to discover, monitor and optimize real-world business processes.
With process mining, organizations can increase efficiency and ensure widespread compliance.
PROCESS MINING DEFINITION
Process mining involves taking log data from different enterprise systems and analyzing it to
understand how to improve various processes. With process mining tools, teams can transform
data into visualizations to locate bottlenecks and adjust workflows accordingly.
Types of Process Mining
3 TYPES OF PROCESS MINING
1.Discovery
2.Conformance
3.Enhancement
1. DISCOVERY
In some cases, an organization may not have a proper process model definition. In such cases,
we would use process mining to develop a process model.
Alpha-algorithm, heuristic-mining algorithm and genetic-process-mining algorithm are among
the algorithms we use to extract process models from the event logs. Process discovery
algorithms scan through all the events in the log to develop the process model.
We can then develop graphical representations of the process models using industry-standard
notations such as directly-follows graphs, petri-nets, and BPMN 2.0 (Business Process Model
and Notation).
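As a minimal illustration of the discovery step (pure Python; the event log below is invented), a directly-follows graph can be built by counting how often one activity immediately follows another within each case:

from collections import Counter

event_log = {
    "case1": ["register", "check", "approve", "pay"],
    "case2": ["register", "check", "reject"],
    "case3": ["register", "check", "approve", "pay"],
}

dfg = Counter()
for trace in event_log.values():
    for a, b in zip(trace, trace[1:]):
        dfg[(a, b)] += 1        # how often activity b directly follows activity a

for (a, b), freq in dfg.most_common():
    print(f"{a} -> {b}: {freq}")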
2. CONFORMANCE
Organizations often have an ideal process model in place that defines how a process is supposed
to run. Conformance checking compares the actual behavior recorded in the event logs against
this ideal model to detect and measure deviations.
3. ENHANCEMENT
Once we develop the process model and identify the areas for improvement, process mining enhancement
involves redesigning the process to optimize its efficiency and effectiveness. This process includes
eliminating unnecessary steps, automating repetitive tasks and reallocating resources, all while improving
team communication and collaboration.
Why Process Mining Is Important
Company leaders and executives may think they know the ins and outs of a business, but
lack the data to back up their claims.
Process mining removes any guesswork and false assumptions by translating log data
into visual models and representations.
With a more accurate view of daily operations, leaders can understand what’s truly
working for the company and what processes need adjusting.
Teams can use these insights to reallocate resources and better apply employees’ time
and energy.
By making these adjustments, company leaders can raise teams’ performance, improve
the customer experience, cut down on unnecessary costs and boost revenue streams.
Benefits of Process Mining
PROCESS FLOW, VARIATIONS AND EXCEPTIONS
The process model clearly shows the process steps and their sequences as well as process variations and exceptions (if
there are any). With this information, you can identify inefficiencies, bottlenecks and opportunities for operational
improvement.
PROCESS PERFORMANCE METRICS AND RESOURCE UTILIZATION
Process mining helps you see the process performance metrics and resource utilization more clearly. This means you
can get a better idea of how organizations use resources including people, machines and materials. This information
helps organizations optimize resource allocation and improve overall efficiency.
COMPLIANCE AND REGULATORY REQUIREMENTS
Process models demonstrate deviations from compliance and regulatory requirements. Organizations use this
information to ensure that the process is aligned with legal and regulatory standards.
Process Mining Use Cases
 CUSTOMER SERVICE
 INSURANCE
 SALES
 E-COMMERCE
 LOGISTICS
 IT
 MANUFACTURING
 HEALTHCARE
 EDUCATION
 FINANCE
Data Warehouse Meaning
A data warehouse, or “enterprise data warehouse” (EDW), is a central repository system in which
businesses store valuable information, such as customer and sales data, for analytics and reporting
purposes.
Data warehouse Benefits
Data warehouses provide many benefits to businesses. Some of the most common benefits
include:
•Provide a stable, centralized repository for large amounts of historical data
•Improve business processes and decision-making with actionable insights
•Increase a business’s overall return on investment (ROI)
•Improve data quality
•Enhance BI performance and capabilities by drawing on multiple sources
•Provide access to historical data business-wide
•Use AI and machine learning to improve business analytics
Meaning of Data Mart
A data mart is a subset of a data warehouse focused on a particular line of business, department, or
subject area.
For example, many companies may have a data mart that aligns with a specific department in
the business, such as finance, sales, or marketing.
UNIT II
MEANING OF DATA MINING
Data mining is the process of extracting and discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems.
KDD Process
KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and
potentially valuable information from large datasets.
KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, new
data can be integrated and transformed in order to get different and more appropriate results.
Preprocessing of databases consists of Data cleaning and Data Integration.
Advantages of KDD
1.Improves decision-making: KDD provides valuable insights and knowledge that can help
organizations make better decisions.
2.Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready
for analysis, which saves time and money.
3.Better customer service: KDD helps organizations gain a better understanding of their customers’
needs and preferences, which can help them provide better customer service.
4.Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and
anomalies in the data that may indicate fraud.
5.Predictive modeling: KDD can be used to build predictive models that can forecast future trends and
patterns.
CRoss-Industry Standard Process for Data Mining [CRISP-DM]
The Cross-Industry Standard Process for Data Mining, known as CRISP-DM, is an open
standard process model that describes common approaches used by data mining experts. It is the most
widely used analytics model.
6 Major Phases of CRISP-DM
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
SEMMA
SEMMA is an acronym that stands for Sample, Explore, Modify, Model, and Assess. It is a list of sequential
steps developed by SAS Institute, one of the largest producers of statistics and business
intelligence software.
SEMMA AND DOMAIN-SPECIFIC CONSIDERATIONS
When applying the SEMMA framework in the healthcare domain, there are several domain-specific
considerations and challenges to keep in mind. Healthcare data is often complex and sensitive, and the
objectives may be unique to the field. Here's how SEMMA can be adapted for healthcare:
1.Sample:
1. Patient Data Selection: In healthcare, the selection of a representative sample often involves
choosing patient data. The sample should include diverse patient demographics, medical
conditions, and treatment histories.
2.Explore:
1. Clinical Knowledge Integration: Healthcare professionals and data analysts need to work closely
to interpret and understand the data. Domain-specific knowledge is crucial for meaningful
exploratory data analysis. Visualization techniques can help identify patterns and anomalies in
patient data.
3.Modify:
1. Data Cleaning and Anomaly Detection: Given the sensitivity of healthcare data, it's important to
carefully handle and de-identify patient information to maintain privacy and comply with
healthcare regulations (e.g., HIPAA in the United States).
2. Feature Engineering: Healthcare data often requires specialized feature engineering to extract
meaningful variables from electronic health records (EHRs) or medical images.
4.Model:
1. Clinical Algorithms: Healthcare modeling may involve the development of clinical prediction
models, risk scores, or diagnostic algorithms. It's essential to integrate medical knowledge and
guidelines into the modeling process.
2. Patient Risk Stratification: Models in healthcare may be used for patient risk stratification to
identify individuals at higher risk for certain conditions or adverse events.
5.Assess:
1. Clinical Validation: The assessment of models in healthcare often includes clinical validation,
which involves testing the models on real patients and assessing their clinical utility.
2. Ethical Considerations: Ethical and regulatory considerations, such as patient consent and data
security, play a significant role in assessing and deploying models in healthcare.
In healthcare, the SEMMA framework should be applied with a deep understanding of medical concepts,
clinical practice, and healthcare regulations. It's also essential to involve healthcare professionals, such
as physicians and nurses, in the process to ensure that the developed models are clinically relevant and
safe for patients. Additionally, the use of healthcare-specific data standards and terminologies (e.g.,
SNOMED CT, ICD-10, CPT) is critical for data integration and interoperability in the healthcare
domain.
Classification and Prediction in Data Mining
Classification is the task of identifying the category or class label of a new observation. First, a set of data is used as
training data.
Root-Mean-Square Error
•The Root-Mean-Square Error (RMSE) is one of the methods to determine the
accuracy of our model in predicting the target values.
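A minimal sketch of the RMSE calculation (assuming NumPy is available; the actual and predicted values below are invented):

import numpy as np

actual = np.array([10.0, 12.0, 15.0, 11.0])
predicted = np.array([9.5, 13.0, 14.0, 11.5])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))   # root of the mean squared error
print(round(rmse, 3))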
Mean absolute deviation (MAD)
"MAD" in data mining most commonly refers to one of two related measures of dispersion:
1.Median Absolute Deviation (MAD): a statistic used in data analysis related to measures of data
dispersion. It is the median of the absolute differences between data points and the median of the
dataset. It is often used to identify and quantify outliers because it is robust to extreme values.
2.Mean Absolute Deviation (MAD): similar to the Median Absolute Deviation, this measure calculates
the average of the absolute differences between data points and the mean of the dataset. It is another
way to quantify the spread or dispersion of data.
Some data mining algorithms or tools also use "MAD" as their own acronym, so the exact meaning
depends on the context in which it appears.
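Both variants can be computed in a few lines (assuming NumPy; the sample below is invented and includes one outlier):

import numpy as np

x = np.array([4.0, 5.0, 5.0, 6.0, 7.0, 40.0])     # 40 is an outlier

median_ad = np.median(np.abs(x - np.median(x)))   # Median Absolute Deviation (robust)
mean_ad = np.mean(np.abs(x - np.mean(x)))         # Mean Absolute Deviation

print(median_ad, round(mean_ad, 2))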
MAP (Mean Average Precision) and MAPE (Mean Absolute Percentage Error) are both metrics used to
evaluate the performance of models, often in the context of information retrieval or predictive modeling,
but they measure different things.
1.MAP (Mean Average Precision): MAP is typically used in the context of information retrieval or
ranking systems, such as search engines.
It assesses the quality of a ranked list of items by measuring the average precision across different levels
of recall.
It takes into account not just whether relevant items are retrieved but also their order in the ranked list.
The formula for MAP is: \( \text{MAP} = \frac{1}{N}\sum_{i=1}^{N} AP_i \), where:
1. N is the number of queries or situations.
2. AP_i is the Average Precision for the i-th query or situation.
2.MAPE (Mean Absolute Percentage Error): MAPE is typically used in the context of predictive
modeling and forecasting.
It measures the accuracy of predictions by calculating the percentage difference between predicted and
actual values for a set of data points.
It's a way to express the error as a percentage of the actual values, making it easier to interpret.
The formula for MAPE is: \( \text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| \times 100 \), where:
1. n is the number of data points.
2. Y_i is the actual value for the i-th data point.
3. \hat{Y}_i is the predicted value for the i-th data point.
In summary, MAP assesses the quality of ranked lists in information retrieval, while MAPE measures the
accuracy of predictions in forecasting and predictive modeling. They are different metrics used for
different purposes, and their calculation methods and interpretations are distinct.
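A minimal MAPE sketch matching the formula above (assuming NumPy; the values are invented):

import numpy as np

actual = np.array([100.0, 120.0, 80.0, 90.0])
predicted = np.array([110.0, 115.0, 70.0, 95.0])

mape = np.mean(np.abs((actual - predicted) / actual)) * 100
print(round(mape, 2), "%")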
A confusion matrix is a technique for summarizing the performance of a classification algorithm.
Classification accuracy alone can be misleading if you have an unequal number of observations in each
class or if you have more than two classes in your dataset.
Calculating a confusion matrix can give you a better idea of what your classification model is getting
right and what types of errors it is making.
Key points about the confusion matrix:
•What the confusion matrix is and why you need to use it.
•How to calculate a confusion matrix for a 2-class classification problem from scratch.
•How to create a confusion matrix in Weka, Python and R.
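A from-scratch sketch for the 2-class case (pure Python; the labels below are invented, with 1 = positive and 0 = negative):

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print("            predicted 1   predicted 0")
print(f"actual 1    {tp:<13} {fn}")
print(f"actual 0    {fp:<13} {tn}")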
RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
Receiver Operating Characteristic (ROC) Curve is a way to compare diagnostic tests. It is a plot
of the true positive rate against the false positive rate.
• The trade-off between sensitivity and specificity: as the decision threshold moves, a decrease in
sensitivity is accompanied by an increase in specificity, and vice versa.
• Test accuracy; the closer the graph is to the top and left-hand borders, the more accurate the
test. Likewise, the closer the graph to the diagonal, the less accurate the test. A perfect
test would go straight from zero up to the top-left corner and then straight across the
horizontal.
• The likelihood ratio; given by the derivative at any particular cutpoint.
 A ROC curve showing two tests. The red test is closer to the diagonal
and is therefore less accurate than the green test.
AUC CURVE
AUC, short for area under the ROC (receiver operating
characteristic) curve, is a relatively straightforward
metric that is useful across a range of use-cases.
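A minimal sketch of computing the ROC points and AUC (assuming scikit-learn is available; the labels and scores below are invented):

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # points along the ROC curve
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))         # area under that curve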
DATA VALIDATION TECHNIQUES
Data validation and testing tools are software applications that can help you
automate and simplify your data validation and testing process. These tools can
measure and report the quality of your data based on various dimensions, such as
accuracy, completeness, consistency, timeliness, and uniqueness. Additionally,
they can apply techniques and methods to enhance the quality of your data, such
as data cleansing, data transformation, data integration, and data enrichment.
Furthermore, they can track and alert the changes and issues in your data quality
over time.
Examples of data validation and testing tools include
 DataCleaner (open source),
 Talend Data Quality (commercial), and
 KNIME (open source) –
all of which can help you profile, clean, transform, and monitor your data,
create and execute data mining workflows, and validate and test your data and models.
HOLD-OUT & CROSS-VALIDATION
 Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set. The
training set is what the model is trained on, and the test set is used to see how
well that model performs on unseen data. A common split when using the hold-
out method is using 80% of data for training and the remaining 20% of the data
for testing.
 Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly
split up into ‘k’ groups. One of the groups is used as the test set and the rest are
used as the training set. The model is trained on the training set and scored on
the test set. Then the process is repeated until each unique group has been used
as the test set.
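Both procedures in a short sketch (assuming scikit-learn is available; the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold-out: a single 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# k-fold cross-validation: k train-test splits, one score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold accuracies:", scores, "mean:", scores.mean())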
HOLD-OUT VS. CROSS-VALIDATION
 Cross-validation is usually the preferred method because it gives your
model the opportunity to train on multiple train-test splits. This gives you
a better indication of how well your model will perform on unseen data.
Hold-out, on the other hand, is dependent on just one train-test split.
That makes the hold-out method score dependent on how the data is
split into train and test sets.
 The hold-out method is good to use when you have a very large
dataset, you’re on a time crunch, or you are starting to build an initial
model in your data science project. Keep in mind that because cross-
validation uses multiple train-test splits, it takes more computational
power and time to run than using the holdout method.
LOOCV (LEAVE-ONE-OUT CROSS-VALIDATION)
 LOOCV is a cross-validation method where the model is trained on all data
points except one, and the performance is evaluated on the excluded data point.
This process is repeated for each data point in the dataset.
 Use Cases:
• Commonly used in machine learning and statistics to assess how well a model
is expected to perform on new, unseen data.
• Useful for model selection, hyperparameter tuning, and comparing different
models.
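A minimal LOOCV sketch (assuming scikit-learn; each of the dataset's points is the test set exactly once):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("number of fits:", len(scores))      # one per data point
print("LOOCV accuracy:", scores.mean())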
RANDOM SUBSAMPLING
• Random subsampling is a variation of the holdout method in which the holdout
procedure is repeated K times.
• Each repetition randomly splits the data into a training set and a test set.
• The model is trained on the training set, and the mean square error (MSE) is
obtained from the predictions on the test set.
• Because the MSE depends on how the data happens to be split, a single split is not
reliable: a new split can give a new MSE. Repeating the split K times and averaging
reduces this dependence.
• The overall error is calculated as \( E = \frac{1}{K}\sum_{i=1}^{K} E_i \).
BOOTSTRAPPING
• Bootstrapping is a technique used to make estimations from data by taking an
average of the estimates from smaller data samples.
• The bootstrapping method involves the iterative resampling of a dataset with
replacement.
• On resampling, instead of estimating the statistics only once on the complete data,
we can do it many times.
• Repeating this multiple times helps to obtain a vector of estimates.
• Bootstrapping can compute variance, expected value, and other relevant
statistics of these estimates.
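A short bootstrap sketch (assuming NumPy; the sample is synthetic) that resamples with replacement and studies the resulting estimates of the mean:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)      # invented sample

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # resample with replacement
    for _ in range(1000)
])

print("bootstrap estimate of the mean:", boot_means.mean())
print("bootstrap variance of the estimate:", boot_means.var())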
Unit 3
PREDICTION TECHNIQUES
Data visualization
 Data visualization is the graphical representation of data to help people
understand the significance of data by placing it in a visual context. It involves
the creation and study of visual representations of data, such as charts, graphs,
and maps. Effective data visualization can simplify complex information,
highlight patterns and trends, and facilitate decision-making
Types of Data Visualizations
• Bar Charts and Column Charts
• Line Charts.
• Pie Charts
• Scatter Plots
• Heatmaps
• Bubble Charts
• Treemaps
Visualization Tools
• Tableau
• Microsoft Power BI
• Matplotlib and Seaborn (Python libraries)
• D3.js
• Excel
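As a small illustration with Matplotlib, one of the tools listed above (the monthly sales figures are invented), a line chart and a bar chart can be produced like this:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 165, 158]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(months, sales, marker="o")     # line chart: trend over time
ax1.set_title("Line chart: sales trend")
ax2.bar(months, sales)                  # bar chart: comparison across categories
ax2.set_title("Bar chart: sales by month")
plt.tight_layout()
plt.show()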
Time series
 Time series prediction involves forecasting future values based on historical time-
ordered data. Various techniques are used to model and predict time series data.
 Examples of time series analysis in action include:
• Weather data
• Rainfall measurements
• Temperature readings
• Heart rate monitoring (EKG)
• Brain monitoring (EEG)
• Quarterly sales
• Stock prices
• Automated stock trading
• Industry forecasts
• Interest rates
Time Series Prediction Techniques
ARIMA
 ARIMA, which stands for Autoregressive Integrated Moving Average, is a widely used
statistical method for time series forecasting. It is a class of models that captures
different aspects of time series data, including autoregression, differencing, and moving
averages. ARIMA models are particularly useful for predicting future values based on
past observations and identifying patterns in time-dependent data.
 Applications
1. Financial Forecasting
2. Stock Price Prediction
3. Demand Forecasting
4. Sales Forecasting
5. Economic Modeling
6. Energy Consumption Prediction
7. Call Volume Prediction in Call Centers
8. Traffic Flow Prediction
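A minimal ARIMA sketch (assuming statsmodels is available; the monthly series below is simulated, and the (1, 1, 1) order is chosen only for illustration):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
index = pd.date_range("2020-01-01", periods=60, freq="MS")
series = pd.Series(100 + np.cumsum(rng.normal(0.5, 2.0, 60)), index=index)

model = ARIMA(series, order=(1, 1, 1))       # (p, d, q): AR terms, differencing, MA terms
result = model.fit()
print(result.forecast(steps=6))              # forecast the next 6 months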
Holt-Winters time series method
The Holt-Winters method is a time series forecasting technique that extends exponential
smoothing to handle trends and seasonality. It is particularly effective for data that exhibits
both trend and seasonality over time. The method was developed by Charles Holt and his student
Peter Winters: simple exponential smoothing models the level, Holt's method adds a trend
component, and the Winters extension adds a seasonal component.
Applications
1.Finance
2.Supply Chain Management
3.Sales and Marketing
4.Energy Consumption
5.Economics
6.Healthcare
7.Meteorology
8.Manufacturing
9.Retail
10.Transportation
11.Telecommunications
12.Human Resources
13.Education
14.Real Estate
15.Environmental Monitoring
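A minimal Holt-Winters sketch (assuming statsmodels; the monthly series below is simulated with an additive trend and a yearly seasonal pattern):

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(1)
index = pd.date_range("2019-01-01", periods=48, freq="MS")
seasonal = 10 * np.sin(2 * np.pi * np.arange(48) / 12)          # yearly pattern
series = pd.Series(200 + 0.8 * np.arange(48) + seasonal + rng.normal(0, 2, 48),
                   index=index)

model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()
print(fit.forecast(12))       # forecast one full seasonal cycle ahead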
Vector Autoregressive Models
 The vector autoregressive (VAR) model is a workhorse multivariate time series
model that relates current observations of a variable with past observations of itself
and past observations of other variables in the system.
For example, we could use a VAR model to show how real GDP is a function of
policy rate and how policy rate is, in turn, a function of real GDP.
 Advantages of VAR models
• A systematic but flexible approach for capturing complex real-world behavior.
• Better forecasting performance.
• Ability to capture the intertwined dynamics of time series data.
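A minimal VAR sketch (assuming statsmodels; the two related series below are simulated to mimic the GDP/policy-rate example):

import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(2)
n = 100
gdp_growth = rng.normal(2.0, 0.5, n)
policy_rate = 1.0 + 0.3 * np.roll(gdp_growth, 1) + rng.normal(0, 0.2, n)
data = pd.DataFrame({"gdp_growth": gdp_growth, "policy_rate": policy_rate})

results = VAR(data).fit(maxlags=2)           # each variable regressed on lags of both
lag_order = results.k_ar
forecast = results.forecast(data.values[-lag_order:], steps=4)
print(forecast)                              # next 4 periods for both variables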
Multivariate regression analysis
Multivariate regression analysis in the context of time series involves modeling a response
variable as a linear combination of multiple predictor variables, considering the temporal nature
of the data. This is useful when multiple time-dependent variables influence the outcome.
Multivariate Regression Models Applications
1. Economics and Finance
2. Marketing and Business
3. Healthcare
4. Social Sciences
5. Environmental Science
6. Engineering
7. Operations Research
8. Psychology and Behavioral Sciences
9. Climate Science
10. Education
11. Biostatistics
12. Quality Control and Manufacturing
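A minimal sketch of a multivariate regression with a lagged (time-dependent) predictor (assuming pandas and scikit-learn; the data and variable names are invented):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 120
df = pd.DataFrame({
    "ad_spend": rng.uniform(10, 50, n),
    "price": rng.uniform(5, 15, n),
})
df["sales"] = 100 + 2.0 * df["ad_spend"] - 4.0 * df["price"] + rng.normal(0, 5, n)
df["sales_lag1"] = df["sales"].shift(1)          # temporal structure: last period's sales
df = df.dropna()

X = df[["ad_spend", "price", "sales_lag1"]]
y = df["sales"]
model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))), round(model.intercept_, 2))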
UNIT-4
CLASSIFICATION AND
CLUSTERING TECHNIQUES
CLASSIFICATION
• Classification is a supervised learning technique used to categorize data
into predefined classes or categories.
• The model learns from labeled data to predict the class or category of
unlabeled data.
• Algorithms:
1 Logistic Regression
2 Decision Trees
3 Support Vector Machines (SVM)
4 Naive Bayes
5 K-nearest Neighbors (KNN)
CLUSTERING TECHNIQUES
• Clustering is an unsupervised learning technique used to group data points
into clusters or segments based on their similarities without predefined
classes.
• Algorithms:
1.K-Means Clustering
2.Hierarchical Clustering
3.Density-Based Spatial Clustering of Applications with
Noise(DBSCAN)
4.Gaussian Mixture Models (GMM)
5.Mean Shift
DECISION TREE
Meaning:
Decision tree is a hierarchical structure resembling an inverted tree, consisting of
nodes and branches, used in decision analysis and machine learning. It's a
supervised learning method primarily employed in classification and regression
tasks.
Structure of a Decision Tree:
Root Node: Represents the topmost decision point based on the entire dataset.
Internal Nodes: Depict decisions or criteria based on specific features.
Branches: Show the possible outcomes or decisions at each node.
Leaf Nodes: Terminal nodes indicating the final outcome or class.
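A minimal decision tree sketch (assuming scikit-learn; the depth limit and feature names are chosen only for readability of the printed rules):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned root, internal nodes, branches, and leaves as text rules
print(export_text(tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))      # classify one new flower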
ADVANTAGES AND DISADVANTAGES OF
DECISION TREE
Advantages:
1.Interpretability
2.No Assumptions about Data
3.Handles Non-linearity
4.Feature Selection
5.Handles Missing Data
Disadvantages:
1.Overfitting:
2.Instability
3.Biased to Dominant Classes
4.Complexity
5.Limited to Rectangular Partitioning
K Nearest Neighbors
• K Nearest Neighbors (KNN) is a simple, non-parametric, and intuitive algorithm
used for both classification and regression tasks in machine learning.
• In KNN, the classification or prediction of a new data point is determined by the
majority of its "k" nearest neighbors in the feature space.
Importance of parameter 'k':
• Choosing the right 'k' is crucial.
• A smaller 'k' can lead to overfitting; a larger 'k' might cause underfitting.
How KNN Works
• Classification based on majority voting of 'k' nearest neighbors.
• Distance calculation - typically Euclidean or Manhattan distance.
• Selection of 'k' neighbors based on shortest distances.
Steps of KNN
Step-by-step breakdown of the algorithm:
• Calculate distances from the new point to all other data points.
• Select 'k' nearest neighbors based on these distances.
• For classification: majority vote to determine the new point's class.
• For regression: average or weighted average of the 'k' neighbors' values.
KNN Applications
• Pattern recognition
• Recommendation systems
• Medical diagnosis
• Anomaly detection
Pros and Cons
Advantages:
• Simple and easy to implement.
• Works well with smaller datasets.
Disadvantages:
• Computationally expensive for larger datasets.
• Sensitivity to noisy or irrelevant data.
LOGISTIC REGRESSION
MEANING:
Logistic regression is a statistical method used for analyzing a dataset in which there
are one or more independent variables that determine an outcome. It's primarily
employed for binary classification problems, predicting the probability of an event
occurring (yes/no, 0/1) based on given independent variables. Despite its name, it's a
method for classification, not regression.
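A minimal logistic regression sketch for a binary problem (assuming scikit-learn; the scaler is added only to help the solver converge):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
print("P(class=1) for first test sample:", model.predict_proba(X_test[:1])[0, 1])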
ADVANTAGES AND DISADVANTAGES:
Advantages:
1.Simple and Fast
2.Interpretability
3.Flexibility
4.Works well with Linearly Separable Data
5.Less Prone to Overfitting
Disadvantages:
1.Assumption of Linearity
2.Sensitive to Outliers
3.Limited Outcome
4.Inability to Capture Complex Relationships
5.Potential Impact of Irrelevant Features
DISCRIMINANT ANALYSIS
• Discriminant Analysis is a statistical technique used for classification and
dimensionality reduction. It's particularly valuable when the goal is to predict
group membership or categories for instances based on their features.
• Types of Discriminant Analysis:
1.Linear Discriminant Analysis (LDA)
2.Quadratic Discriminant Analysis (QDA)
Linear Discriminant Analysis (LDA):
• LDA finds the linear combinations of features that characterize or
separate two or more classes. It aims to represent the differences
between the classes by projecting the data into a lower-dimensional
space while preserving as much discriminatory information as
possible.
Quadratic Discriminant Analysis (QDA):
• QDA, unlike LDA, does not assume equal covariance among classes.
It allows each class to have its own covariance matrix. This can make
QDA more flexible in capturing the differences among classes,
especially when the classes have different variances or covariances.
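A minimal LDA vs. QDA sketch (assuming scikit-learn; the same data is fit under the two different covariance assumptions):

from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

lda = LinearDiscriminantAnalysis()                 # shared covariance across classes
qda = QuadraticDiscriminantAnalysis()              # per-class covariance
print("LDA accuracy:", cross_val_score(lda, X, y, cv=5).mean())
print("QDA accuracy:", cross_val_score(qda, X, y, cv=5).mean())

# LDA can also project the data into a lower-dimensional space:
X_2d = lda.fit(X, y).transform(X)
print("projected shape:", X_2d.shape)              # (n_samples, n_classes - 1)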
MARKET BASKET ANALYSIS
Market Basket Analysis (MBA) is a data mining technique used by retailers to
uncover associations between products purchased by customers. It explores the
relationships between items that are frequently bought together, revealing patterns
and affinities within transactional data. This analysis is commonly employed in
retail, e-commerce, and even in the study of website navigation patterns or content
consumption.
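A minimal market basket sketch (pure Python; the transactions are invented) computing support and confidence for one candidate rule:

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: {bread} -> {milk}
sup_rule = support({"bread", "milk"})
conf_rule = sup_rule / support({"bread"})
print("support({bread, milk}) =", sup_rule)
print("confidence(bread -> milk) =", round(conf_rule, 2))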
UNIT-5
MACHINE LEARNING AND AI
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING
Artificial Intelligence encompasses the broader concept of machines
carrying out tasks in a way that we would consider "smart." It includes
everything from rule-based systems to deep learning and neural networks
Machine Learning is a subset of AI that involves training machines to learn
from data. Instead of programming explicit instructions, you provide data
to a machine learning algorithm, allowing it to learn patterns and make
decisions or predictions based on that data.
GENETIC ALGORITHM
The genetic algorithm is based on the genetic structure and behavior of the
chromosome of the population. The following things are the foundation of
genetic algorithms.
•Each chromosome indicates a possible solution. Thus the population is a
collection of chromosomes.
•A fitness function characterizes each individual in the population; the greater the
fitness, the better the solution.
•Out of the available individuals in the population, the best individuals are
used to reproduce the next generation of offspring.
•The offspring produced have features of both parents and may also be changed
by mutation. A mutation is a small change in the gene structure.
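A toy sketch of these steps, selection of the fittest, crossover, and mutation (pure Python; the fitness function is invented and has its maximum at x = 3):

import random

random.seed(0)

def fitness(x):
    return -(x - 3) ** 2            # best possible solution is x = 3

population = [random.uniform(-10, 10) for _ in range(20)]

for generation in range(50):
    # Selection: keep the fittest half as parents
    parents = sorted(population, key=fitness, reverse=True)[:10]
    offspring = []
    while len(offspring) < 10:
        a, b = random.sample(parents, 2)
        child = (a + b) / 2                         # crossover: blend of both parents
        child += random.gauss(0, 0.3)               # mutation: small random change
        offspring.append(child)
    population = parents + offspring

best = max(population, key=fitness)
print(round(best, 3))               # should be close to 3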
NEURAL NETWORK
• Neural networks are a fundamental component of modern machine
learning and artificial intelligence.
• Inspired by the structure and functioning of the human brain, neural
networks are a set of algorithms, modeled loosely after the human brain,
that are designed to recognize patterns.
• They interpret sensory data through a kind of machine perception, labeling
or clustering raw input.
FUZZY LOGIC
Fuzzy logic is a form of many-valued logic in which truth values are degrees between 0
and 1 rather than strictly true or false. Fuzzy logistic regression combines fuzzy logic
with logistic regression, allowing for uncertainty in classification tasks by considering
partial membership rather than strict categorization. It extends traditional logistic
regression by accommodating vague or uncertain data, enabling more nuanced
decision-making in classification problems.
SUPPORT VECTOR MACHINES
• Support vector machine (SVM) is a supervised machine
learning algorithm.
• SVM's purpose is to predict the classification of a query sample by
relying on labeled input data which are separated into two classes by
using a margin.
• Specifically, the data is transformed into a higher dimension, and a
support vector classifier is used as a threshold (or hyperplane) to
separate the two classes with minimum error.
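A minimal SVM sketch (assuming scikit-learn; the data is synthetic and the RBF kernel and C value are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0)          # margin-based classifier in a transformed space
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)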
OPTIMIZATION TECHNIQUES:
• Optimization techniques are a collection of methods and algorithms used to find
the best possible solution from all feasible solutions.
• They aim to either minimize or maximize an objective function while considering
certain constraints or conditions.
• These techniques are applied across diverse fields, including mathematics,
engineering, economics, computer science, and beyond.
• They serve various purposes such as cost minimization, resource allocation,
scheduling, and maximizing efficiency.
ANT COLONY:
• Ant colony optimization is a metaheuristic technique inspired by the foraging behavior of ants.
It’s used to solve computational problems by simulating the way ants search for the best paths
to food sources.
• This method involves a population of artificial ants that move through a problem space,
leaving behind pheromone trails.
• Ants preferentially follow paths with stronger pheromone concentrations, which grow stronger
as more ants travel along them.
• Over time, the paths with higher pheromone levels become more attractive to other ants.
• As ants repeatedly explore and reinforce the most efficient paths, the system converges
towards an optimal or near-optimal solution.
PARTICLE SWARM:
• Particle swarm optimization (PSO) is a computational method inspired by the
social behavior of bird flocking or fish schooling.
• In PSO, a population of candidate solutions, called particles, moves around in
the search space, adjusting their positions based on their own experience and
the experiences of their neighbors.
• Each particle adjusts its position based on its current velocity and the best
solution it has encountered.
• The algorithm uses these individual experiences and the experiences of the
entire swarm to guide the search towards the most promising areas of the
solution space.
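A minimal PSO sketch (assuming NumPy; the objective is the invented sphere function with its minimum at the origin, and the inertia/attraction weights are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    return np.sum(x ** 2)           # minimum at the origin

n_particles, dim = 30, 2
pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros((n_particles, dim))
pbest = pos.copy()                                  # each particle's best position
pbest_val = np.array([objective(p) for p in pos])
gbest = pbest[np.argmin(pbest_val)]                 # swarm's best position

w, c1, c2 = 0.7, 1.5, 1.5                           # inertia and attraction weights
for _ in range(100):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([objective(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)]

print(gbest, objective(gbest))      # should be close to [0, 0]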
DATA ENVELOPMENT ANALYSIS.
• DEA stands for Data Envelopment Analysis.
• It's a non-parametric method used to assess the relative efficiency of
decision-making units, such as companies, organizations, or even individuals,
by comparing their inputs and outputs.
• This technique evaluates the performance of multiple units that use multiple
inputs to produce multiple outputs.
• It helps in understanding how efficiently inputs are being used to generate
outputs.
• The core idea is to find the most efficient units, which can then serve as
benchmarks for others to improve their performance.

More Related Content

Similar to Prescriptive Analytics-1.pptx

Introductions to Business Analytics
Introductions to Business Analytics Introductions to Business Analytics
Introductions to Business Analytics Venkat .P
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learningSandeep Garg
 
Tips --Break Down the Barriers to Better Data Analytics
Tips --Break Down the Barriers to Better Data AnalyticsTips --Break Down the Barriers to Better Data Analytics
Tips --Break Down the Barriers to Better Data AnalyticsAbhishek Sood
 
Moh.Abd-Ellatif_DataAnalysis1.pptx
Moh.Abd-Ellatif_DataAnalysis1.pptxMoh.Abd-Ellatif_DataAnalysis1.pptx
Moh.Abd-Ellatif_DataAnalysis1.pptxAbdullahEmam4
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business IntelligenceSukirti Garg
 
Study of Data Mining Methods and its Applications
Study of  Data Mining Methods and its ApplicationsStudy of  Data Mining Methods and its Applications
Study of Data Mining Methods and its ApplicationsIRJET Journal
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & ApplicationsFazle Rabbi Ador
 
km ppt neew one
km ppt neew onekm ppt neew one
km ppt neew oneSahil Jain
 
Predictive Modelling Analytics through Data Mining
Predictive Modelling Analytics through Data MiningPredictive Modelling Analytics through Data Mining
Predictive Modelling Analytics through Data MiningIRJET Journal
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data AnalyticsUtkarsh Sharma
 

Similar to Prescriptive Analytics-1.pptx (20)

Introductions to Business Analytics
Introductions to Business Analytics Introductions to Business Analytics
Introductions to Business Analytics
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learning
 
Tips --Break Down the Barriers to Better Data Analytics
Tips --Break Down the Barriers to Better Data AnalyticsTips --Break Down the Barriers to Better Data Analytics
Tips --Break Down the Barriers to Better Data Analytics
 
Seminario Big Data - 27/11/2017
Seminario Big Data - 27/11/2017Seminario Big Data - 27/11/2017
Seminario Big Data - 27/11/2017
 
Seminario Big Data
Seminario Big DataSeminario Big Data
Seminario Big Data
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
Internship Presentation.pdf
Internship Presentation.pdfInternship Presentation.pdf
Internship Presentation.pdf
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
Big data overview
Big data overviewBig data overview
Big data overview
 
Moh.Abd-Ellatif_DataAnalysis1.pptx
Moh.Abd-Ellatif_DataAnalysis1.pptxMoh.Abd-Ellatif_DataAnalysis1.pptx
Moh.Abd-Ellatif_DataAnalysis1.pptx
 
Business Intelligence
Business IntelligenceBusiness Intelligence
Business Intelligence
 
1-210217184339.pptx
1-210217184339.pptx1-210217184339.pptx
1-210217184339.pptx
 
Study of Data Mining Methods and its Applications
Study of  Data Mining Methods and its ApplicationsStudy of  Data Mining Methods and its Applications
Study of Data Mining Methods and its Applications
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & Applications
 
km ppt neew one
km ppt neew onekm ppt neew one
km ppt neew one
 
data analysis-mining
data analysis-miningdata analysis-mining
data analysis-mining
 
Predictive Modelling Analytics through Data Mining
Predictive Modelling Analytics through Data MiningPredictive Modelling Analytics through Data Mining
Predictive Modelling Analytics through Data Mining
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 

Recently uploaded

8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCRashishs7044
 
RE Capital's Visionary Leadership under Newman Leech
RE Capital's Visionary Leadership under Newman LeechRE Capital's Visionary Leadership under Newman Leech
RE Capital's Visionary Leadership under Newman LeechNewman George Leech
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...lizamodels9
 
Call Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any TimeCall Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any Timedelhimodelshub1
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessSeta Wicaksana
 
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCRashishs7044
 
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdfNewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdfKhaled Al Awadi
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCRashishs7044
 
2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis UsageNeil Kimberley
 
BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,noida100girls
 
Annual General Meeting Presentation Slides
Annual General Meeting Presentation SlidesAnnual General Meeting Presentation Slides
Annual General Meeting Presentation SlidesKeppelCorporation
 
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCRashishs7044
 
FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607dollysharma2066
 
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu MenzaYouth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menzaictsugar
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfJos Voskuil
 
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCRashishs7044
 
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfIntro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfpollardmorgan
 
India Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample ReportIndia Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample ReportMintel Group
 
(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCR(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCRsoniya singh
 
Marketing Management Business Plan_My Sweet Creations
Marketing Management Business Plan_My Sweet CreationsMarketing Management Business Plan_My Sweet Creations
Marketing Management Business Plan_My Sweet Creationsnakalysalcedo61
 

Recently uploaded (20)

8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
8447779800, Low rate Call girls in New Ashok Nagar Delhi NCR
 
RE Capital's Visionary Leadership under Newman Leech
RE Capital's Visionary Leadership under Newman LeechRE Capital's Visionary Leadership under Newman Leech
RE Capital's Visionary Leadership under Newman Leech
 
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
Call Girls In Sikandarpur Gurgaon ❤️8860477959_Russian 100% Genuine Escorts I...
 
Call Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any TimeCall Girls Miyapur 7001305949 all area service COD available Any Time
Call Girls Miyapur 7001305949 all area service COD available Any Time
 
Organizational Structure Running A Successful Business
Organizational Structure Running A Successful BusinessOrganizational Structure Running A Successful Business
Organizational Structure Running A Successful Business
 
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR8447779800, Low rate Call girls in Tughlakabad Delhi NCR
8447779800, Low rate Call girls in Tughlakabad Delhi NCR
 
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdfNewBase  19 April  2024  Energy News issue - 1717 by Khaled Al Awadi.pdf
NewBase 19 April 2024 Energy News issue - 1717 by Khaled Al Awadi.pdf
 
8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR8447779800, Low rate Call girls in Rohini Delhi NCR
8447779800, Low rate Call girls in Rohini Delhi NCR
 
2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage2024 Numerator Consumer Study of Cannabis Usage
2024 Numerator Consumer Study of Cannabis Usage
 
BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
BEST Call Girls In Old Faridabad ✨ 9773824855 ✨ Escorts Service In Delhi Ncr,
 
Annual General Meeting Presentation Slides
Annual General Meeting Presentation SlidesAnnual General Meeting Presentation Slides
Annual General Meeting Presentation Slides
 
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
8447779800, Low rate Call girls in Uttam Nagar Delhi NCR
 
FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607FULL ENJOY Call girls in Paharganj Delhi | 8377087607
FULL ENJOY Call girls in Paharganj Delhi | 8377087607
 
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu MenzaYouth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
Youth Involvement in an Innovative Coconut Value Chain by Mwalimu Menza
 
Digital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdfDigital Transformation in the PLM domain - distrib.pdf
Digital Transformation in the PLM domain - distrib.pdf
 
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
8447779800, Low rate Call girls in Shivaji Enclave Delhi NCR
 
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdfIntro to BCG's Carbon Emissions Benchmark_vF.pdf
Intro to BCG's Carbon Emissions Benchmark_vF.pdf
 
India Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample ReportIndia Consumer 2024 Redacted Sample Report
India Consumer 2024 Redacted Sample Report
 
(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCR(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCR
(8264348440) 🔝 Call Girls In Mahipalpur 🔝 Delhi NCR
 
Marketing Management Business Plan_My Sweet Creations
Marketing Management Business Plan_My Sweet CreationsMarketing Management Business Plan_My Sweet Creations
Marketing Management Business Plan_My Sweet Creations
 

Prescriptive Analytics-1.pptx

  • 2. UNIT I Prescriptive analytics Definition Prescriptive analytics is a process that analyzes data and provides instant recommendations on how to optimize business practices to suit multiple predicted outcomes. Prescriptive analytics is the third and final tier in modern, computerized data processing. These three tiers include: •Descriptive analytics: Descriptive analytics acts as an initial catalyst to clear and and concise data analysis. It is the “what we know” (current user data, real-time data, previous engagement data, and big data). •Predictive analytics: Predictive analytics applies mathematical models to the current data to inform (predict) future behavior. It is the “what could happen." •Prescriptive analytics: Prescriptive analytics utilizes similar modeling structures structures to predict outcomes and then utilizes a combination of machine learning, business rules, artificial intelligence, and algorithms to simulate various approaches to these numerous outcomes. It then suggests the best possible actions to optimize business practices. It is the “what should happen.”
  • 3. Data Mining Definition Data mining is the process of searching and analyzing a large batch of raw data in order to identify patterns and extract useful information. Companies use data mining software to learn more about their customers. It can help them to develop more effective marketing strategies, increase sales, and decrease costs. Data mining relies on effective data collection, warehousing, and computer processing.
  • 4. Data Mining Works Data mining involves exploring and analyzing large blocks of information to glean meaningful patterns and trends. It is used in credit risk management, fraud detection, and spam filtering. It also is a market research tool that helps reveal the sentiment or opinions of a given group of people. The data mining process breaks down into four steps: •Data is collected and loaded into data warehouses on-site or on a cloud service. •Business analysts, management teams, and information technology professionals access the data and determine how they want to organize it. •Custom application software sorts and organizes the data. •The end user presents the data in an easy-to-share format, such as a graph or table.
  • 5. The Data Mining Process To be most effective, data analysts generally follow a certain flow of tasks along the data mining process. Step 1: Understand the Business Before any data is touched, extracted, cleaned, or analyzed, it is important to understand the underlying entity and the project at hand. Step 2: Understand the Data Once the business problem has been clearly defined, it's time to start thinking about data. This step also includes determining the limits of the data, storage, security, and collection and assesses how these constraints will affect the data mining process. Step 3: Prepare the Data Data is gathered, uploaded, extracted, or calculated. It is then cleaned, standardized, scrubbed for outliers, assessed for mistakes, and checked for reasonableness Step 4: Build the Model With our clean data set in hand, it's time to crunch the numbers. Data scientists use the types of data mining above to search for relationships, trends, associations, or sequential patterns
  • 6. Step 5: Evaluate the Results The data-centered aspect of data mining concludes by assessing the findings of the data model or models. Step 6: Implement Change and Monitor The data mining process concludes with management taking steps in response to the findings of the analysis. Text mining Text mining is the process of turning natural language into something that can be manipulated, stored, and analysed by machines. It’s all about giving computers, which have historically worked with numerical data, the ability to work with linguistic data – by turning it into something with a structured format. Quantitative and qualitative data To really understand text mining, we need to establish some key concepts, such as the difference between quantitative and qualitative data. Qualitative data Most of the human language we find in everyday life is qualitative data. It describes the characteristics of things – their qualities – and expresses a person’s reasoning, emotion,
• 7. Quantitative data The opposite of qualitative data is quantitative data. Quantitative data is numerical – it tells you about quantity. Text analysis takes qualitative textual data and turns it into quantitative, numerical data. It does things like counting the number of times a theme, topic or phrase is included in a large corpus of textual data, in order to determine the importance or prevalence of a topic. Process of Text Mining •Gathering unstructured information from various sources available in various document formats, for example, plain text, web pages, PDF records, etc. •Pre-processing and data cleansing tasks are performed to identify and eliminate inconsistency in the data. The data cleansing process makes sure to capture the genuine text; it removes stop words, applies stemming (the process of identifying the root of a word), and indexes the data. •Processing and controlling tasks are applied to review and further clean the data set. •Pattern analysis is implemented, often within a Management Information System. •Information processed in the above steps is utilized to extract important and applicable data for a powerful and convenient decision-making process and trend analysis.
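To make the pre-processing step concrete, here is a minimal Python sketch of tokenization, stop-word removal and stemming. The stop-word list and the suffix-stripping "stemmer" are deliberately crude, illustrative stand-ins; a real project would typically use a library such as NLTK or spaCy.

```python
# Toy text pre-processing: tokenize, remove stop words, stem.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "for"}  # illustrative subset

def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    tokens = re.findall(r"[a-z]+", document.lower())     # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [crude_stem(t) for t in tokens]               # stem

print(preprocess("The analysts are mining the customer reviews for trends."))
# -> ['analyst', 'min', 'customer', 'review', 'trend']
```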
• 9. Procedures for Analyzing Text •Text Summarization: To automatically produce a condensed version of a text that reflects its overall content. •Text Categorization: To assign a category to a text from among categories predefined by users. •Text Clustering: To segment texts into several clusters based on the similarity of their content.
• 10. Definition: Web mining is the process of using data mining techniques and algorithms to extract information directly from the Web – from Web documents and Web services, hyperlinks, and server logs. The goal of Web mining is to look for patterns in Web data by collecting and analyzing information in order to gain insights into trends, the industry, and users in general. Web mining Web mining is the process of extracting valuable data from a website. This data can be used for a variety of reasons, including marketing, customer analysis, and security purposes. Types of web mining: •Web content mining: The process of extracting useful information from the contents of Web pages and Web documents, which are mostly text and images. •Web structure mining: The process of analyzing the structure of nodes and the connections of a website through the use of graph theory. Structure can be examined at two levels: how a website links to other sites, and how the pages within the site link to one another.
• 11. •Web usage mining: The process of extracting patterns and information from server logs to gain insights into user activity – where users come from, how many users have clicked on an item on the site, and the types of activities taking place on the site.
  • 15. Process Mining Process mining is an emerging data science technique that involves analyzing event logs to extract information about an organization’s underlying operational processes. Here’s how it works. In process mining, we use algorithms to analyze event data and reveal details about the activities performed by people and machines. Process mining has a wide range of applications across various disciplines including finance, healthcare, manufacturing and logistics. It’s an interdisciplinary field that combines techniques from data mining, machine learning and process management to discover, monitor and optimize real-world business processes. With process mining, organizations can increase efficiency and ensure widespread compliance. PROCESS MINING DEFINITION Process mining involves taking log data from different enterprise systems and analyzing it to understand how to improve various processes. With process mining tools, teams can transform data into visualizations to locate bottlenecks and adjust workflows accordingly.
• 16. Types of Process Mining 3 TYPES OF PROCESS MINING 1.Discovery 2.Conformance 3.Enhancement 1. DISCOVERY In some cases, an organization may not have a proper process model definition. In such cases, we would use process mining to develop a process model. The alpha algorithm, heuristic-mining algorithm and genetic-process-mining algorithm are among the algorithms we use to extract process models from the event logs. Process discovery algorithms scan through all the events in the log to develop the process model. We can then develop graphical representations of the process models using industry-standard notations such as directly-follows graphs, Petri nets, and BPMN 2.0 (Business Process Model and Notation). 2. CONFORMANCE Organizations often have an ideal process model in place that defines how a process is supposed to run; conformance checking compares the event log against this model to detect and explain deviations.
• 17. 3. ENHANCEMENT Once we develop the process model and identify the areas for improvement, process mining enhancement involves redesigning the process to optimize its efficiency and effectiveness. This process includes eliminating unnecessary steps, automating repetitive tasks and reallocating resources, all while improving team communication and collaboration. Why Process Mining Is Important Company leaders and executives may think they know the ins and outs of a business, but lack the data to back up their claims. Process mining removes any guesswork and false assumptions by translating log data into visual models and representations. With a more accurate view of daily operations, leaders can understand what’s truly working for the company and what processes need adjusting. Teams can use these insights to reallocate resources and better apply employees’ time and energy. By making these adjustments, company leaders can raise teams’ performance, improve the customer experience, cut down on unnecessary costs and boost revenue streams.
• 18. Benefits of Process Mining PROCESS FLOW, VARIATIONS AND EXCEPTIONS The process model clearly shows the process steps and their sequences as well as process variations and exceptions (if there are any). With this information, you can identify inefficiencies, bottlenecks and opportunities for operational improvement. PROCESS PERFORMANCE METRICS AND RESOURCE UTILIZATION Process mining helps you see the process performance metrics and resource utilization more clearly. This means you can get a better idea of how organizations use resources including people, machines and materials. This information helps organizations optimize resource allocation and improve overall efficiency. COMPLIANCE AND REGULATORY REQUIREMENTS Process models demonstrate deviations from compliance and regulatory requirements. Organizations use this information to ensure that the process is aligned with legal and regulatory standards. Process Mining Use Cases: customer service, insurance, sales, e-commerce, logistics, IT, manufacturing, healthcare, education, and finance.
  • 19. Data Warehouse Meaning A data warehouse, or “enterprise data warehouse” (EDW), is a central repository system in which businesses store valuable information, such as customer and sales data, for analytics and reporting purposes. Data warehouse Benefits Data warehouses provide many benefits to businesses. Some of the most common benefits include: •Provide a stable, centralized repository for large amounts of historical data •Improve business processes and decision-making with actionable insights •Increase a business’s overall return on investment (ROI) •Improve data quality •Enhance BI performance and capabilities by drawing on multiple sources •Provide access to historical data business-wide •Use AI and machine learning to improve business analytics
  • 21. Meaning of Data Mart A data mart is a subset of a data warehouse focused on a particular line of business, department, or subject area. For example, many companies may have a data mart that aligns with a specific department in the business, such as finance, sales, or marketing.
  • 22. UNIT II MEANING OF DATA MINING Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. KDD Process KDD (Knowledge Discovery in Databases) is a process that involves the extraction of useful, previously unknown, and potentially valuable information from large datasets.
  • 23. KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, new data can be integrated and transformed in order to get different and more appropriate results. Preprocessing of databases consists of Data cleaning and Data Integration. Advantages of KDD 1.Improves decision-making: KDD provides valuable insights and knowledge that can help organizations make better decisions. 2.Increased efficiency: KDD automates repetitive and time-consuming tasks and makes the data ready for analysis, which saves time and money. 3.Better customer service: KDD helps organizations gain a better understanding of their customers’ needs and preferences, which can help them provide better customer service. 4.Fraud detection: KDD can be used to detect fraudulent activities by identifying patterns and anomalies in the data that may indicate fraud. 5.Predictive modeling: KDD can be used to build predictive models that can forecast future trends and patterns.
• 24. CRoss-Industry Standard Process for Data Mining [CRISP-DM] The Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely used analytics model. The 6 major phases of CRISP-DM are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
  • 25. SEMMA is an acronym that stands for Sample, Explore, Modify, Model, and Assess. It is a list of sequential steps developed by SAS Institute, one of the largest producers of statistics and business intelligence software.
• 27. SEMMA AND DOMAIN-SPECIFIC APPLICATION When applying the SEMMA framework in the healthcare domain, there are several domain-specific considerations and challenges to keep in mind. Healthcare data is often complex and sensitive, and the objectives may be unique to the field. Here's how SEMMA can be adapted for healthcare: 1. Sample – Patient Data Selection: In healthcare, the selection of a representative sample often involves choosing patient data. The sample should include diverse patient demographics, medical conditions, and treatment histories. 2. Explore – Clinical Knowledge Integration: Healthcare professionals and data analysts need to work closely to interpret and understand the data. Domain-specific knowledge is crucial for meaningful exploratory data analysis. Visualization techniques can help identify patterns and anomalies in patient data. 3. Modify – Data Cleaning and Anomaly Detection: Given the sensitivity of healthcare data, it's important to carefully handle and de-identify patient information to maintain privacy and comply with healthcare regulations (e.g., HIPAA in the United States). Feature Engineering: Healthcare data often requires specialized feature engineering to extract meaningful variables from electronic health records (EHRs) or medical images.
• 28. 4. Model – Clinical Algorithms: Healthcare modeling may involve the development of clinical prediction models, risk scores, or diagnostic algorithms. It's essential to integrate medical knowledge and guidelines into the modeling process. Patient Risk Stratification: Models in healthcare may be used for patient risk stratification to identify individuals at higher risk for certain conditions or adverse events. 5. Assess – Clinical Validation: The assessment of models in healthcare often includes clinical validation, which involves testing the models on real patients and assessing their clinical utility. Ethical Considerations: Ethical and regulatory considerations, such as patient consent and data security, play a significant role in assessing and deploying models in healthcare. In healthcare, the SEMMA framework should be applied with a deep understanding of medical concepts, clinical practice, and healthcare regulations. It's also essential to involve healthcare professionals, such as physicians and nurses, in the process to ensure that the developed models are clinically relevant and safe for patients. Additionally, the use of healthcare-specific data standards and terminologies (e.g., SNOMED CT, ICD-10, CPT) is critical for data integration and interoperability in the healthcare domain.
• 29. Classification and Prediction in Data Mining Classification is the task of identifying the category or class label of a new observation. First, a set of data is used as training data.
  • 32. Root-Mean-Square Error •The Root-Mean-Square Error (RMSE) is one of the methods to determine the accuracy of our model in predicting the target values.
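RMSE is defined as RMSE = sqrt((1/n) Σ (y_i − ŷ_i)²). Below is a minimal NumPy sketch of the calculation; the actual and predicted values are made up for illustration.

```python
# Root-mean-square error on a tiny illustrative example.
import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 3))  # squared errors 0.25, 0, 2.25, 1 -> mean 0.875 -> RMSE ~ 0.935
```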
• 34. Mean absolute deviation (MAD) In data analysis, "MAD" usually refers to one of two related measures of dispersion: 1. Median Absolute Deviation (MAD): the median of the absolute differences between each data point and the median of the dataset. Because it is based on medians, it is robust and is often used to identify and quantify outliers. 2. Mean Absolute Deviation (MAD): the average of the absolute differences between each data point and the mean of the dataset. It is another way to quantify the spread or dispersion of the data. Some data mining algorithms and tools also use "MAD" as their own abbreviation, so the exact meaning depends on context.
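A short NumPy sketch contrasting the two variants on the same made-up sample:

```python
# Mean absolute deviation vs. median absolute deviation.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_abs_dev   = np.mean(np.abs(x - np.mean(x)))      # deviations from the mean
median_abs_dev = np.median(np.abs(x - np.median(x)))  # deviations from the median

print(mean_abs_dev, median_abs_dev)  # 1.5 and 0.5 for this sample
```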
• 35. MAP (Mean Average Precision) and MAPE (Mean Absolute Percentage Error) are both metrics used to evaluate the performance of models, often in the context of information retrieval or predictive modeling, but they measure different things. 1. MAP (Mean Average Precision): MAP is typically used in the context of information retrieval or ranking systems, such as search engines. It assesses the quality of a ranked list of items by measuring the average precision across different levels of recall. It takes into account not just whether relevant items are retrieved but also their order in the ranked list. The formula for MAP is: MAP = (1/N) Σ_{i=1}^{N} AP_i, where N is the number of queries or situations and AP_i is the Average Precision for the i-th query or situation. 2. MAPE (Mean Absolute Percentage Error): MAPE is typically used in the context of predictive modeling and forecasting. It measures the accuracy of predictions by calculating the percentage difference between predicted and actual values for a set of data points. It's a way to express the error as a percentage of the actual values, making it easier to interpret.
• 36. The formula for MAPE is: MAPE = (100/n) Σ_{i=1}^{n} |(Y_i − Ŷ_i) / Y_i|, where n is the number of data points, Y_i is the actual value for the i-th data point, and Ŷ_i is the predicted value for the i-th data point. In summary, MAP assesses the quality of ranked lists in information retrieval, while MAPE measures the accuracy of predictions in forecasting and predictive modeling. They are different metrics used for different purposes, and their calculation methods and interpretations are distinct.
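A minimal MAPE sketch in NumPy; the values are made up, and the actual values must be non-zero because each error is expressed relative to the actual value.

```python
# Mean absolute percentage error on an illustrative example.
import numpy as np

actual    = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 190.0, 360.0])

mape = np.mean(np.abs((actual - predicted) / actual)) * 100
print(round(mape, 2))  # (10% + 5% + 10%) / 3 = 8.33
```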
• 37. A confusion matrix is a technique for summarizing the performance of a classification algorithm. Classification accuracy alone can be misleading if you have an unequal number of observations in each class or if you have more than two classes in your dataset. Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making. In this section, you will discover the confusion matrix for use in machine learning. After reading this section you will know: •What the confusion matrix is and why you need to use it. •How to calculate a confusion matrix for a 2-class classification problem from scratch. •How to create a confusion matrix in Weka, Python and R.
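Following the "from scratch" idea above, here is a small Python sketch that counts the four cells of a 2-class confusion matrix on made-up labels (scikit-learn's confusion_matrix would give the same counts).

```python
# 2-class confusion matrix from scratch (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(f"TP={tp} FP={fp}")  # TP=4 FP=1
print(f"FN={fn} TN={tn}")  # FN=1 TN=4
```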
  • 38. RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE Receiver Operating Characteristic (ROC) Curve is a way to compare diagnostic tests. It is a plot of the true positive rate against the false positive rate. • The relationship between sensitivity and specificity. For example, a decrease in sensitivity results in an increase in specificity. • Test accuracy; the closer the graph is to the top and left-hand borders, the more accurate the test. Likewise, the closer the graph to the diagonal, the less accurate the test. A perfect test would go straight from zero up to the top-left corner and then straight across the horizontal. • The likelihood ratio; given by the derivative at any particular cutpoint.
  • 39.  A ROC curve showing two tests. The red test is closer to the diagonal and is therefore less accurate than the green test.
• 40. AUC AUC, short for area under the ROC (receiver operating characteristic) curve, is a relatively straightforward metric that is useful across a range of use-cases.
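A brief sketch of computing the ROC curve and AUC with scikit-learn (assumed available); the true labels and scores below are invented for illustration.

```python
# ROC curve points and AUC on made-up prediction scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]  # predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print("AUC =", roc_auc_score(y_true, y_score))      # 0.9375 for these scores
# Plotting fpr vs. tpr (e.g., with matplotlib) draws the ROC curve itself.
```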
  • 41. DATA VALIDATION TECHNIQUES Data validation and testing tools are software applications that can help you automate and simplify your data validation and testing process. These tools can measure and report the quality of your data based on various dimensions, such as accuracy, completeness, consistency, timeliness, and uniqueness. Additionally, they can apply techniques and methods to enhance the quality of your data, such as data cleansing, data transformation, data integration, and data enrichment. Furthermore, they can track and alert the changes and issues in your data quality over time. Examples of data validation and testing tools include  DataCleaner (open source),  Talend Data Quality (commercial), and  KNIME (open source) – all of which can help you profile, clean, transform, monitor your data, create/execute data mining workflows, validate/test your data/models.
• 42. HOLD-OUT & CROSS-VALIDATION  Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. A common split when using the hold-out method is using 80% of data for training and the remaining 20% of the data for testing.  Cross-validation or ‘k-fold cross-validation’ is when the dataset is randomly split up into ‘k’ groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set.
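A compact sketch of both approaches with scikit-learn; the iris dataset and logistic regression model are illustrative choices only.

```python
# Hold-out split vs. 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold-out: a single 80/20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# 5-fold cross-validation: five different train-test splits, one score per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("cross-validation accuracy:", scores.mean())
```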
• 43. HOLD-OUT VS. CROSS-VALIDATION  Cross-validation is usually the preferred method because it gives your model the opportunity to train on multiple train-test splits. This gives you a better indication of how well your model will perform on unseen data. Hold-out, on the other hand, is dependent on just one train-test split. That makes the hold-out method score dependent on how the data is split into train and test sets.  The hold-out method is good to use when you have a very large dataset, you’re on a time crunch, or you are starting to build an initial model in your data science project. Keep in mind that because cross-validation uses multiple train-test splits, it takes more computational power and time to run than using the hold-out method.
  • 44. LOOCV (LEAVE-ONE-OUT CROSS-VALIDATION)  LOOCV is a cross-validation method where the model is trained on all data points except one, and the performance is evaluated on the excluded data point. This process is repeated for each data point in the dataset.  Use Cases: • Commonly used in machine learning and statistics to assess how well a model is expected to perform on new, unseen data. • Useful for model selection, hyperparameter tuning, and comparing different models.
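LOOCV is simply k-fold cross-validation with k equal to the number of samples. A short scikit-learn sketch (the dataset and classifier are illustrative):

```python
# Leave-one-out cross-validation: n models, each tested on the single held-out sample.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # 150 single-sample evaluations and their average
```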
• 45. RANDOM SUBSAMPLING • Random subsampling is a variation of the hold-out method in which the hold-out procedure is repeated K times. • Each repetition randomly splits the data into a training set and a test set. • The model is trained on the training set, and the mean square error (MSE) is obtained from the predictions on the test set. • Because the MSE depends on the particular split, a single split is not a reliable estimate on its own – a new split can give a different MSE. • The overall error is calculated as the average over the K repetitions: E = (1/K) Σ_{i=1}^{K} E_i.
• 46. BOOTSTRAPPING • Bootstrapping is a technique used to make estimations from the data by averaging the estimates obtained from smaller data samples. • The bootstrapping method involves iteratively resampling a dataset with replacement. • Instead of estimating a statistic only once on the complete data, resampling lets us estimate it many times. • Repeating this multiple times yields a vector of estimates. • From these estimates, bootstrapping can compute the variance, expected value, and other relevant statistics.
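A small NumPy sketch of the idea, estimating the mean and its variability by resampling with replacement; the data values and the number of resamples are illustrative.

```python
# Bootstrap estimate of the mean and its standard error.
import numpy as np

rng = np.random.default_rng(42)
data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4])

boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]  # 1000 resamples with replacement

print("bootstrap mean estimate:", np.mean(boot_means))
print("bootstrap std. error   :", np.std(boot_means))
```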
• 48. Data visualization  Data visualization is the graphical representation of data to help people understand the significance of data by placing it in a visual context. It involves the creation and study of visual representations of data, such as charts, graphs, and maps. Effective data visualization can simplify complex information, highlight patterns and trends, and facilitate decision-making.
  • 49. Types of Data Visualizations • Bar Charts and Column Charts • Line Charts. • Pie Charts • Scatter Plots • Heatmaps • Bubble Charts • Treemaps
• 50. Visualization Tools • Tableau • Microsoft Power BI • Matplotlib and Seaborn (Python libraries) • D3.js • Excel
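As an illustration of the Python libraries listed above, here is a minimal Matplotlib sketch producing a bar chart and a line chart; the sales figures are made up.

```python
# Bar chart (category comparison) and line chart (trend over time) side by side.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
sales    = [120, 150, 90, 180]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(quarters, sales)                # bar chart: compare categories
ax2.plot(quarters, sales, marker="o")   # line chart: show a trend over time
ax1.set_title("Quarterly sales (bar)")
ax2.set_title("Quarterly sales (line)")
plt.tight_layout()
plt.show()
```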
  • 51. Time series  Time series prediction involves forecasting future values based on historical time- ordered data. Various techniques are used to model and predict time series data.  Examples of time series analysis in action include: • Weather data • Rainfall measurements • Temperature readings • Heart rate monitoring (EKG) • Brain monitoring (EEG) • Quarterly sales • Stock prices • Automated stock trading • Industry forecasts • Interest rates
  • 52. Time Series Prediction Techniques ARIMA  ARIMA, which stands for Autoregressive Integrated Moving Average, is a widely used statistical method for time series forecasting. It is a class of models that captures different aspects of time series data, including autoregression, differencing, and moving averages. ARIMA models are particularly useful for predicting future values based on past observations and identifying patterns in time-dependent data.  Applications 1. Financial Forecasting 2. Stock Price Prediction 3. Demand Forecasting 4. Sales Forecasting 5. Economic Modeling 6. Energy Consumption Prediction 7. Call Volume Prediction in Call Centers 8. Traffic Flow Prediction
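A hedged ARIMA sketch using statsmodels (assumed available). The synthetic series and the order (1, 1, 1) are illustrative; in practice the (p, d, q) order is usually chosen from ACF/PACF plots or information criteria.

```python
# Fit ARIMA(1,1,1) to a synthetic trending series and forecast five steps ahead.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0.5, 1.0, size=100))  # random walk with drift

model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=5))  # next five predicted values
```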
• 53. Holt-Winters time series method The Holt-Winters method is a time series forecasting technique that extends exponential smoothing to handle trends and seasonality. It is particularly effective for data that exhibits both trend and seasonality over time. The method was developed by Charles Holt and Peter Winters: Holt's method extends simple exponential smoothing to handle trend, and Winters' extension adds the handling of seasonality. Applications: 1.Finance 2.Supply Chain Management 3.Sales and Marketing 4.Energy Consumption 5.Economics 6.Healthcare 7.Meteorology 8.Manufacturing 9.Retail 10.Transportation 11.Telecommunications 12.Human Resources 13.Education 14.Real Estate 15.Environmental Monitoring
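A hedged Holt-Winters sketch with statsmodels (assumed available), using additive trend and seasonality on a synthetic monthly series with period 12; all values and settings are illustrative.

```python
# Holt-Winters (triple exponential smoothing) on a synthetic seasonal series.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

t = np.arange(48)
series = 50 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12)  # trend + yearly seasonality

fit = ExponentialSmoothing(series, trend="add", seasonal="add",
                           seasonal_periods=12).fit()
print(fit.forecast(12))  # forecast for the next twelve periods
```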
• 54. Vector Autoregressive Models  The vector autoregressive (VAR) model is a workhorse multivariate time series model that relates current observations of a variable with past observations of itself and past observations of other variables in the system. For example, we could use a VAR model to show how real GDP is a function of the policy rate and how the policy rate is, in turn, a function of real GDP.  Advantages of VAR models • A systematic but flexible approach for capturing complex real-world behavior. • Better forecasting performance. • Ability to capture the intertwined dynamics of time series data.
• 55. Multivariate regression analysis Multivariate regression analysis in the context of time series involves modeling a response variable as a linear combination of multiple predictor variables, considering the temporal nature of the data. This is useful when you have multiple time-dependent variables influencing a response. Applications of Multivariate Regression Models: 1. Economics and Finance 2. Marketing and Business 3. Healthcare 4. Social Sciences 5. Environmental Science 6. Engineering 7. Operations Research 8. Psychology and Behavioral Sciences 9. Climate Science 10. Education 11. Biostatistics 12. Quality Control and Manufacturing
  • 57. CLASSIFICATION • Classification is a supervised learning technique used to categorize data into predefined classes or categories. • The model learns from labeled data to predict the class or category of unlabeled data. • Algorithms: 1 Logistic Regression 2 Decision Trees 3 Support Vector Machines (SVM) 4 Naive Bayes 5 K-nearest Neighbors (KNN)
  • 58. CLUSTERING TECHNIQUES • Clustering is an unsupervised learning technique used to group data points into clusters or segments based on their similarities without predefined classes. • Algorithms: 1.K-Means Clustering 2.Hierarchical Clustering 3.Density-Based Spatial Clustering of Applications with Noise(DBSCAN) 4.Gaussian Mixture Models (GMM) 5.Mean Shift
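A short k-means sketch with scikit-learn; the toy points and the choice of k = 2 clusters are assumptions made purely for illustration.

```python
# K-means clustering on two well-separated groups of 2-D points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],      # one group of points
              [8, 8],  [9, 9],     [8.5, 7.5]])  # another group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # coordinates of the two centroids
```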
  • 59. DECISION TREE Meaning: Decision tree is a hierarchical structure resembling an inverted tree, consisting of nodes and branches, used in decision analysis and machine learning. It's a supervised learning method primarily employed in classification and regression tasks. Structure of a Decision Tree: Root Node: Represents the topmost decision point based on the entire dataset. Internal Nodes: Depict decisions or criteria based on specific features. Branches: Show the possible outcomes or decisions at each node. Leaf Nodes: Terminal nodes indicating the final outcome or class.
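A minimal decision tree sketch with scikit-learn; the iris dataset and the depth limit are illustrative choices. The printed rules correspond to the root, internal, and leaf nodes described above.

```python
# Train a small decision tree classifier and print its structure as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))    # the learned split conditions and leaf classes
print(tree.predict(X[:3]))  # predicted classes for the first three samples
```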
• 61. ADVANTAGES AND DISADVANTAGES OF DECISION TREES Advantages: 1.Interpretability 2.No Assumptions about Data 3.Handles Non-linearity 4.Feature Selection 5.Handles Missing Data Disadvantages: 1.Overfitting 2.Instability 3.Biased to Dominant Classes 4.Complexity 5.Limited to Rectangular Partitioning
• 62. K Nearest Neighbors • K Nearest Neighbors (KNN) is a simple, non-parametric, and intuitive algorithm used for both classification and regression tasks in machine learning. • In KNN, the classification or prediction of a new data point is determined by the majority of its "k" nearest neighbors in the feature space. Importance of parameter 'k': • "Choosing the right 'k' is crucial." • "A smaller 'k' can lead to overfitting; a larger 'k' might cause underfitting."
• 63. How KNN Works • "Classification based on majority voting of 'k' nearest neighbors." • "Distance calculation - typically Euclidean or Manhattan distance." • "Selection of 'k' neighbors based on shortest distances."
  • 64. Steps of KNN Step-by-step breakdown of the algorithm: • "Calculate distances from the new point to all other data points." • "Select 'k' nearest neighbors based on these distances." • "For classification: Majority vote to determine the new point's class." • "For regression: Average or weighted average of 'k' neighbors' values."
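A small scikit-learn sketch following the steps above; the iris dataset and k = 3 are illustrative assumptions, and Euclidean distance is the library default.

```python
# KNN classification: distances, 3 nearest neighbours, majority vote.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)  # Euclidean distance by default
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```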
  • 65. KNN Application • Applications of KNN: • "Pattern recognition" • "Recommendation systems" • "Medical diagnosis" • "Anomaly detection"
  • 66. Pros and Cons • Advantages: • "Simple and easy to implement." • "Works well with smaller datasets." • Disadvantages: • "Computationally expensive for larger datasets." • "Sensitivity to noisy or irrelevant data."
  • 67. LOGISTIC REGRESSION MEANING: Logistic regression is a statistical method used for analyzing a dataset in which there are one or more independent variables that determine an outcome. It's primarily employed for binary classification problems, predicting the probability of an event occurring (yes/no, 0/1) based on given independent variables. Despite its name, it's a method for classification, not regression.
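A brief binary logistic regression sketch with scikit-learn; the breast cancer dataset and split are illustrative choices, and the model outputs a probability of the positive class as described above.

```python
# Binary classification with logistic regression.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
print("P(class 1) for first test sample:", clf.predict_proba(X_test[:1])[0, 1])
```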
• 68. ADVANTAGES AND DISADVANTAGES: Advantages: 1.Simple and Fast 2.Interpretability 3.Flexibility 4.Works well with Linearly Separable Data 5.Less Prone to Overfitting Disadvantages: 1.Assumption of Linearity 2.Sensitive to Outliers 3.Limited Outcome 4.Inability to Capture Complex Relationships 5.Potential Impact of Irrelevant Features
• 69. DISCRIMINANT ANALYSIS • Discriminant Analysis is a statistical technique used for classification and dimensionality reduction. It's particularly valuable when the goal is to predict group membership or categories for instances based on their features. • Types of Discriminant Analysis: 1.Linear Discriminant Analysis (LDA) 2.Quadratic Discriminant Analysis (QDA)
  • 70. Linear Discriminant Analysis (LDA): • LDA finds the linear combinations of features that characterize or separate two or more classes. It aims to represent the differences between the classes by projecting the data into a lower-dimensional space while preserving as much discriminatory information as possible.
  • 71. Quadratic Discriminant Analysis (QDA): • QDA, unlike LDA, does not assume equal covariance among classes. It allows each class to have its own covariance matrix. This can make QDA more flexible in capturing the differences among classes, especially when the classes have different variances or covariances.
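A compact sketch comparing LDA and QDA with scikit-learn; the wine dataset and 5-fold scoring are illustrative assumptions.

```python
# Cross-validated accuracy of LDA vs. QDA on the same data.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
for name, model in [("LDA", LinearDiscriminantAnalysis()),     # shared covariance
                    ("QDA", QuadraticDiscriminantAnalysis())]:  # per-class covariance
    print(name, cross_val_score(model, X, y, cv=5).mean())
```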
  • 72. MARKET BASKET ANALYSIS Market Basket Analysis (MBA) is a data mining technique used by retailers to uncover associations between products purchased by customers. It explores the relationships between items that are frequently bought together, revealing patterns and affinities within transactional data. This analysis is commonly employed in retail, e-commerce, and even in the study of website navigation patterns or content consumption.
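A from-scratch sketch of the two core market basket measures, support and confidence, computed over a handful of made-up transactions (libraries such as mlxtend provide full Apriori implementations).

```python
# Support and confidence for item pairs in toy transaction data.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

n = len(transactions)
pair_counts = Counter()
item_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

for (a, b), count in pair_counts.items():
    support = count / n                   # share of baskets containing both items
    confidence = count / item_counts[a]   # confidence of the rule a -> b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```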
  • 75. ARTIFICIAL INTELLIGENCE & MACHINE LEARNING Artificial Intelligence encompasses the broader concept of machines carrying out tasks in a way that we would consider "smart." It includes everything from rule-based systems to deep learning and neural networks Machine Learning is a subset of AI that involves training machines to learn from data. Instead of programming explicit instructions, you provide data to a machine learning algorithm, allowing it to learn patterns and make decisions or predictions based on that data.
• 76. GENETIC ALGORITHM The genetic algorithm is based on the genetic structure and behavior of the chromosomes of a population. The following ideas are the foundation of genetic algorithms. •Each chromosome represents a possible solution; thus the population is a collection of chromosomes. •A fitness function characterizes each individual in the population; the greater the fitness, the better the solution. •Out of the available individuals in the population, the best individuals are used to reproduce the offspring of the next generation. •The offspring produced will have features of both parents and may also be altered by mutation. A mutation is a small change in the gene structure.
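A toy genetic algorithm sketch in plain Python that evolves 8-bit strings toward the all-ones string; the population size, mutation rate and number of generations are arbitrary illustrative choices.

```python
# Minimal genetic algorithm: fitness, selection, crossover, mutation.
import random

random.seed(1)
GENES, POP, GENERATIONS, MUTATION = 8, 20, 30, 0.05

def fitness(chrom):                 # greater fitness = better solution
    return sum(chrom)

def crossover(p1, p2):              # offspring carries features of both parents
    cut = random.randint(1, GENES - 1)
    return p1[:cut] + p2[cut:]

def mutate(chrom):                  # small random change in the gene structure
    return [1 - g if random.random() < MUTATION else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]            # the best individuals reproduce
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(max(population, key=fitness))  # typically [1, 1, 1, 1, 1, 1, 1, 1]
```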
  • 78. NEURAL NETWORK • Neural networks are a fundamental component of modern machine learning and artificial intelligence. • Inspired by the structure and functioning of the human brain, neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. • They interpret sensory data through a kind of machine perception, labeling or clustering raw input.
• 80. FUZZY LOGIC Fuzzy logic is a form of many-valued logic in which truth values range between 0 and 1, representing degrees of membership rather than a strict true/false. Fuzzy logistic regression combines fuzzy logic with logistic regression, allowing for uncertainty in classification tasks by considering partial membership rather than strict categorization. It extends traditional logistic regression by accommodating vague or uncertain data, enabling more nuanced decision-making in classification problems.
• 82. SUPPORT VECTOR MACHINES • Support vector machine (SVM) is a supervised machine learning algorithm. • SVM's purpose is to predict the classification of a query sample by relying on labeled input data which are separated into two classes by a margin. • Specifically, the data is transformed into a higher dimension, and a support vector classifier is used as a threshold (or hyperplane) to separate the two classes with minimum error.
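A minimal SVM sketch with scikit-learn; the breast cancer dataset, scaling step and RBF kernel are illustrative assumptions, with the kernel standing in for the transformation to a higher-dimensional space mentioned above.

```python
# SVM classification with feature scaling and an RBF kernel.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("accuracy:", svm.score(X_test, y_test))
```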
  • 83. OPTIMIZATION TECHNIQUES: • Optimization techniques are a collection of methods and algorithms used to find the best possible solution from all feasible solutions. • They aim to either minimize or maximize an objective function while considering certain constraints or conditions. • These techniques are applied across diverse fields, including mathematics, engineering, economics, computer science, and beyond. • They serve various purposes such as cost minimization, resource allocation, scheduling, and maximizing efficiency.
  • 85. ANT COLONY: • Ant colony optimization is a metaheuristic technique inspired by the foraging behavior of ants. It’s used to solve computational problems by simulating the way ants search for the best paths to food sources. • This method involves a population of artificial ants that move through a problem space, leaving behind pheromone trails. • Ants preferentially follow paths with stronger pheromone concentrations, which grow stronger as more ants travel along them. • Over time, the paths with higher pheromone levels become more attractive to other ants. • As ants repeatedly explore and reinforce the most efficient paths, the system converges towards an optimal or near-optimal solution.
• 86. PARTICLE SWARM: • Particle swarm optimization (PSO) is a computational method inspired by the social behavior of bird flocking or fish schooling. • In PSO, a population of candidate solutions, called particles, moves around in the search space, adjusting their positions based on their own experience and the experiences of their neighbors. • Each particle adjusts its position based on its current velocity and the best solution it has encountered. • The algorithm uses these individual experiences and the experiences of the entire swarm to guide the search towards the most promising areas of the solution space.
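A bare-bones PSO sketch in plain Python minimizing f(x, y) = x² + y²; the swarm size, inertia weight and acceleration coefficients are illustrative choices, not tuned values.

```python
# Minimal particle swarm optimization of a 2-D quadratic bowl.
import random

random.seed(0)

def f(p):                                      # objective to minimize
    return p[0] ** 2 + p[1] ** 2

W, C1, C2, N, STEPS = 0.7, 1.5, 1.5, 20, 100   # inertia, cognitive, social, swarm, iterations
pos   = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(N)]
vel   = [[0.0, 0.0] for _ in range(N)]
pbest = [p[:] for p in pos]                    # each particle's own best position
gbest = min(pbest, key=f)                      # best position found by the swarm

for _ in range(STEPS):
    for i in range(N):
        for d in range(2):
            r1, r2 = random.random(), random.random()
            vel[i][d] = (W * vel[i][d]
                         + C1 * r1 * (pbest[i][d] - pos[i][d])   # pull toward own best
                         + C2 * r2 * (gbest[d] - pos[i][d]))     # pull toward swarm best
            pos[i][d] += vel[i][d]
        if f(pos[i]) < f(pbest[i]):
            pbest[i] = pos[i][:]
            if f(pbest[i]) < f(gbest):
                gbest = pbest[i][:]

print(gbest)  # should be close to the optimum at (0, 0)
```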
  • 87. DATA ENVELOPMENT ANALYSIS. • DEA stands for Data Envelopment Analysis. • It's a non-parametric method used to assess the relative efficiency of decision-making units, such as companies, organizations, or even individuals, by comparing their inputs and outputs. • This technique evaluates the performance of multiple units that use multiple inputs to produce multiple outputs. • It helps in understanding how efficiently inputs are being used to generate outputs. • The core idea is to find the most efficient units, which can then serve as benchmarks for others to improve their performance.