Restaurant Data Analysis
Course Name: Software Development Project – II
Course Code : ICT-3112
3/3/2024 Presented by: Sadika, Noor & Rakib 2
Team members
No Name ID
01 Sadika Khatun Jhinu IT20029
02 Gazi Md. Noor Hossain IT20030
03 Rakibul Islam IT20031
Supervisor
Md. Tanvir Rahman
Assistant Professor
Dept. of ICT
MBSTU
Contents
• Dataset
• Data Mining
• Python Programming Language
• Binary & Discrete Classification
• Euclidean Distance
• Minkowski Distance
• Regression Analysis
• Linear Regression
• Covariance
• Deviation
• Prediction Using SVM
• ROC Curve
Project Proposal
Our aim is to:
• Collect a dataset from Kaggle
• Apply the knowledge we learned in the Data Mining course
• Implement the analysis in the Python programming language
Dataset
• We collected this restaurant dataset from Kaggle, a popular online platform for data science competitions, machine learning challenges, and datasets, founded in 2010.
• It contains customer details, their personal ratings, and their payment method.
• It is mostly numerical; the payment attribute is categorical (cash or VISA in the sample rows).
• It contains 2000 records for analysis.
• The dataset file is in .csv (comma-separated values) format, which stores data in tabular form.
• The attributes of this file:
1. CustomerID
2. Height
3. Weight
4. Age
5. annual_income
6. ratings
7. Price
8. Payment
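As a minimal sketch of how a CSV file with these attributes can be read, the example below parses a small in-memory sample with Python's standard `csv` module. The three rows and the lowercase header are copied from the sample table shown later in this presentation. The `SAMPLE_CSV` string, the `NUMERIC_COLUMNS` tuple, and the `load_rows` helper are illustrative assumptions, since the slides do not give the real file name or loading code.

```python
import csv
import io

# Hypothetical stand-in for the CSV file: three rows copied from the
# sample table in this presentation. The real file name is not given.
SAMPLE_CSV = """CustomerID,height,weight,age,annual_income,rate,price,payment
1,65,112,19,15000,3.4,1325,cash
2,71,136,21,35000,3.9,1600,cash
3,69,153,20,86000,3.7,1850,VISA
"""

NUMERIC_COLUMNS = ("height", "weight", "age", "annual_income", "rate", "price")


def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column in NUMERIC_COLUMNS:
            record[column] = float(record[column])
        rows.append(record)
    return rows


rows = load_rows(SAMPLE_CSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

With pandas, the same step is usually a single `pd.read_csv` call on the file path.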
Data Mining
Data mining is the process of extracting meaningful patterns, trends, and insights from large volumes of data. It uses algorithms and statistical techniques to discover hidden relationships within datasets.
Key features of Data Mining:
• Classification and Clustering: Classification assigns data to predefined groups. Clustering groups similar data points together without predefined categories.
• Anomaly Detection: It can identify unusual or anomalous data points. This feature is valuable for fraud detection, outlier identification, and quality control.
• Regression Analysis: This involves estimating the relationships between variables.
• Association Rule Mining: It identifies relationships between different items in a dataset.
• Predictive Modeling: Data mining enables the creation of predictive models that can forecast future trends or outcomes based on historical data.
Python Programming Language
• Python is a high-level, versatile, dynamically typed programming language known for its simplicity, readability, and extensive standard library.
• Python is used in web development, machine learning applications, and many other areas of the software industry.
• These qualities have made it a favored language across a wide range of industries and applications, from web development to scientific research and artificial intelligence.
Applications of Python
• Python can be used on a server to create web applications.
• Python can be used alongside other software to create workflows.
• Python can connect to database systems and can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.
Classification
Classification is a process of categorizing data or objects into predefined classes
or categories based on their features or attributes. In machine learning,
classification is a type of supervised learning technique where an algorithm is
trained on a labeled dataset to predict the class or category of new, unseen data.
Classification is of two types:
1. Binary Classification: In binary classification, the goal is to classify the input into one of two
classes or categories.
2. Multiclass Classification: In multi-class classification, the goal is to classify the input into one
of several classes or categories.
Binarization
• A simple technique to binarize a categorical attribute is the following: if there are m categorical values, uniquely assign each original value to an integer in the interval [0, m-1], and then convert each integer to a binary number.
• Here, we split the weight attribute of the dataset into three ranges and create one binary (0/1) indicator column per range:
# data is the pandas DataFrame loaded from the dataset CSV
condition1 = data['weight'] < 30
condition2 = (data['weight'] >= 30) & (data['weight'] <= 60)
condition3 = data['weight'] > 60
# Convert each boolean mask to a 0/1 indicator column
data['Below_30'] = condition1.astype(int)
data['Between_30_and_60'] = condition2.astype(int)
data['Above_60'] = condition3.astype(int)
print(data)
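The integer-assignment technique described for categorical attributes can be sketched in plain Python as follows. The five quality levels are an illustrative assumption and are not part of the restaurant dataset. The helper names `encode_categories` and `to_binary_digits` are hypothetical as well.

```python
def encode_categories(values):
    """Assign each distinct value an integer in [0, m-1], in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def to_binary_digits(code, width):
    """Write an integer code as a list of 0/1 digits, most significant bit first."""
    return [(code >> shift) & 1 for shift in range(width - 1, -1, -1)]


# Illustrative categorical attribute with m = 5 values (not from the dataset)
levels = ["awful", "poor", "OK", "good", "great"]
mapping = encode_categories(levels)

# Number of binary digits needed to represent the largest code, m - 1
width = max(1, (len(mapping) - 1).bit_length())
encoded = {value: to_binary_digits(code, width) for value, code in mapping.items()}

for value, bits in encoded.items():
    print(value, mapping[value], bits)
```

Five categories need three binary digits, so each value becomes three 0/1 attributes.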
Discretization
Discretization is typically applied to attributes that are used in classification or association analysis. Transforming a continuous attribute into a categorical attribute involves two subtasks: deciding how many categories, n, to have, and determining how to map the values of the continuous attribute to these categories.
Here, with n = 3 bins, we split the weight attribute into three equal-width categories:
import pandas as pd
num_bins = 3
bin_labels = ['Less', 'Medium', 'More']
# data is the pandas DataFrame loaded from the dataset CSV
# With an integer bins argument, pd.cut creates equal-width bins over the column's range
data['New Weight'] = pd.cut(data['weight'], bins=num_bins, labels=bin_labels)
print(data)
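To make the mapping step concrete, here is a plain-Python sketch of equal-width binning, the same idea `pd.cut` applies when `bins` is an integer. Edge handling is simplified here and is an assumption, not a copy of pandas' exact behaviour. The weights are the nine values in the weight column of the sample rows shown later in this presentation, and `equal_width_bins` is a hypothetical helper.

```python
def equal_width_bins(values, labels):
    """Map each value to a label using len(labels) equal-width bins.

    Simplified sketch: bins are right-inclusive and the minimum value
    falls in the first bin. Assumes max(values) > min(values).
    """
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    result = []
    for value in values:
        # Advance to the first bin whose right edge is not below the value
        index = 0
        while index < len(labels) - 1 and value > low + (index + 1) * width:
            index += 1
        result.append(labels[index])
    return result


# Weight column of the sample rows
weights = [112, 136, 153, 142, 144, 123, 141, 136, 112]
categories = equal_width_bins(weights, ["Less", "Medium", "More"])

for weight, category in zip(weights, categories):
    print(weight, category)
```

The range 112 to 153 is split at roughly 125.7 and 139.3, so 123 is "Less", 136 is "Medium", and 141 is "More".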
Euclidean Distance
The Euclidean distance is a measure of the straight-line distance between two points in Euclidean space. It is the most commonly used distance metric in geometry and machine learning.
Properties:
1. It is always non-negative (d ≥ 0).
2. It is symmetric: the distance from point A to point B is the same as from point B to point A.
3. It satisfies the triangle inequality: d(A, C) ≤ d(A, B) + d(B, C), so a detour through an intermediate point is never shorter than the straight line.
Euclidean distance: $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
import numpy as np
# Treat the weight and height columns as two n-dimensional points
point1 = data['weight']
point2 = data['height']
distance = np.linalg.norm(point1 - point2)
Euclidean distance: 2698.051
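A small worked example may help check the formula and the three properties by hand. The points below are illustrative (a 3-4-5 right triangle) and are not taken from the dataset. `euclidean_distance` is a hypothetical helper written with the standard library only.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


a, b, c = (0, 0), (3, 4), (3, 0)

d_ab = euclidean_distance(a, b)  # sqrt(3^2 + 4^2) = 5.0
d_ac = euclidean_distance(a, c)  # 3.0
d_cb = euclidean_distance(c, b)  # 4.0

print("d(a, b) =", d_ab)
print("symmetric:", d_ab == euclidean_distance(b, a))
print("triangle inequality holds:", d_ab <= d_ac + d_cb)
```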
Minkowski Distance
The Minkowski distance is a metric used to measure the distance between two points in a multidimensional space. It is a generalization of other distance metrics such as the Euclidean distance and the Manhattan distance.
Minkowski distance: $d = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Some properties of the Minkowski distance:
1. When p = 1, it is called the Manhattan distance or L1 norm.
2. When p = 2, it is called the Euclidean distance or L2 norm.
3. As p approaches infinity, the Minkowski distance approaches the Chebyshev distance.
import numpy as np
point1 = data['weight']
point2 = data['height']
p = 2
distance = np.power(np.sum(np.abs(point1 - point2) ** p), 1/p)
Minkowski distance (p=2): 2698.0517
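The three special cases listed above can be checked with a short standard-library sketch. The two points are illustrative, and `minkowski_distance` and `chebyshev_distance` are hypothetical helper names.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length sequences."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)


def chebyshev_distance(x, y):
    """Largest absolute coordinate difference (the limit of Minkowski as p grows)."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))


x, y = (0, 0), (3, 4)

manhattan = minkowski_distance(x, y, 1)   # |3| + |4| = 7
euclid = minkowski_distance(x, y, 2)      # sqrt(9 + 16) = 5
large_p = minkowski_distance(x, y, 50)    # already very close to 4
chebyshev = chebyshev_distance(x, y)      # max(3, 4) = 4

print(manhattan, euclid, large_p, chebyshev)
```

Even at p = 50 the result is already within a tiny fraction of the Chebyshev distance, which illustrates the limiting behaviour in property 3.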
Regression Analysis
Regression analysis is a statistical method that shows the relationship between two or more variables.
• Usually expressed in a graph, the method tests the relationship between a dependent variable and one or more independent variables.
• Typically, the dependent variable changes with the independent variable(s), and regression analysis attempts to answer which factors matter most to that change.
• Generally, regression analysis is used to:
  • Try to explain a phenomenon
  • Predict future events
  • Optimize manufacturing and delivery processes
  • Resolve errors
  • Provide new insights
Linear Regression
• Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between a dependent variable and one or more independent features.
• The equation for linear regression: y = a + bx
Here, x is the independent variable
y is the dependent variable
a = intercept of the regression line
b = slope of the regression line
Again,
$b = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$
and $a = \bar{y} - b\,\bar{x}$
from sklearn.linear_model import LinearRegression
# X: 2-D array of the independent feature(s); Y: the dependent variable
model = LinearRegression()
model.fit(X, Y)
slope = model.coef_[0]
intercept = model.intercept_
Slope (Coefficient): 2.889
Intercept: -68.252
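The slope and intercept formulas can also be evaluated directly, which is a useful cross-check on a library fit. The sketch below uses a small synthetic data set that lies exactly on y = 1 + 2x. The data and the `fit_line` helper are illustrative assumptions, not the project's data or code.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x, using the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (sum(xy) - sum(x)*sum(y)/n) / (sum(x^2) - (sum(x))^2/n)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: mean(y) - b * mean(x)
    a = sum_y / n - b * sum_x / n
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # synthetic points on the line y = 1 + 2x

intercept, slope = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```

Here the sums are Σx = 15, Σy = 35, Σxy = 125, and Σx² = 55, so b = (125 − 105) / (55 − 45) = 2 and a = 7 − 2·3 = 1.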
Covariance
Covariance is a measure of the relationship between two random variables and the extent to which they change together. A positive covariance means the two variables tend to increase or decrease together; a negative covariance means one tends to increase as the other decreases.
Sample covariance formula:
$\mathrm{Cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
import numpy as np
X = data['weight']
Y = data['height']
mean_X = np.mean(X)
mean_Y = np.mean(Y)
# Sample covariance: divide by n - 1
covariance = np.sum((X - mean_X) * (Y - mean_Y)) / (len(X) - 1)
Covariance of Height and Weight: 11.17
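A tiny worked example shows how the sign of the sample covariance reflects the direction of the relationship. The numbers are illustrative and `sample_covariance` is a hypothetical helper that divides by n − 1, as in the code on this slide.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with xs
falling = [10, 8, 6, 4, 2]   # decreases as xs increases

cov_rising = sample_covariance(xs, rising)    # (8 + 2 + 0 + 2 + 8) / 4 = 5.0
cov_falling = sample_covariance(xs, falling)  # -5.0

print(cov_rising, cov_falling)
```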
Standard Deviation
Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data points. It provides a way to understand how spread out the values in a dataset are around the mean.
Standard deviation: $\sigma = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
import numpy as np
X = data['height']
Y = data['weight']
mean_X = np.mean(X)
mean_Y = np.mean(Y)
# Population standard deviation: square root of the mean squared deviation
std_dev_X = np.sqrt(np.mean((X - mean_X)**2))
std_dev_Y = np.sqrt(np.mean((Y - mean_Y)**2))
Standard Deviation of Height: 1.97
Standard Deviation of Weight: 11.50
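The code on this slide divides by n (the population form). As a hedged illustration of how that differs from the sample form, which divides by n − 1, the sketch below computes both on a small illustrative list and compares them with Python's `statistics` module. `population_std` is a hypothetical helper.

```python
import math
import statistics


def population_std(values):
    """Square root of the mean squared deviation from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data: mean 5, squared deviations sum to 32

pop_std = population_std(values)         # sqrt(32 / 8) = 2.0
sample_std = statistics.stdev(values)    # sqrt(32 / 7), slightly larger

print("population:", pop_std, "sample:", sample_std)
```

For large datasets the two values are nearly identical, but the distinction matters when comparing results from different tools.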
Prediction Algorithm
Prediction refers to the process of estimating or forecasting future events, outcomes, or values based on existing data and patterns.
Key points:
• Methodology: Predictions are made using various techniques and models. These may include statistical methods, machine learning algorithms, regression analysis, time series analysis, and more.
• Training Data: To make accurate predictions, models are typically trained on historical or existing data, so that the relationships or patterns they learn can be applied to new or unseen data.
• Accuracy and Performance: The accuracy of predictions is a critical metric. Models are evaluated based on how well they generalize to new data.
• Applications: Prediction is widely used across various domains. For instance, in finance, predictions are made about stock prices; in healthcare, about disease progression; in weather forecasting, about future weather conditions.
Support Vector Machine (SVM)
• Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification, regression, and even outlier detection tasks.
• SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection.
• drop: In pandas, drop is a DataFrame method (it is not a Python built-in or part of the standard library). It is used here for removing columns.
Sample rows of the dataset:
CustomerID  height  weight  age  annual_income  rate  price  payment
1           65      112     19   15000          3.4   1325   cash
2           71      136     21   35000          3.9   1600   cash
3           69      153     20   86000          3.7   1850   VISA
4           68      142     23   59000          2.7   2075   VISA
5           67      144     31   38000          2.8   1600   VISA
6           68      123     22   58000          3.4   2075   VISA
7           69      141     35   31000          4.1   1650   VISA
8           70      136     23   84000          2.8   2075   VISA
9           67      112     64   97000          3.2   1650   cash
• Label Encoder: Label encoding is a technique used to convert categorical columns into numerical ones so that they can be fitted by machine learning models that only take numerical data. It is an important pre-processing step in a machine learning project.
• fillna: fillna is a pandas method for filling missing values in a DataFrame or Series. It is a common operation when working with data, as missing values can cause issues when performing calculations or visualizing data.
• mean: the mean is the average of a set of numbers.
mean = sum(numbers) / len(numbers)   # plain Python
mean = np.mean(numbers)              # NumPy equivalent
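Filling missing values with the column mean can be sketched without any libraries. This is a plain-Python stand-in for the idea behind pandas' `fillna`, not the project's code. `None` marks a missing value, the ratings are illustrative, and `fill_missing_with_mean` is a hypothetical helper.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot compute a mean: every value is missing")
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


ratings = [3.0, None, 4.0, 5.0, None]  # illustrative column with two gaps
filled = fill_missing_with_mean(ratings)

print(filled)  # the two gaps become the mean of 3.0, 4.0 and 5.0
```

The mean is computed only from the values that are present, so the gaps do not distort it.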
Test Data & Training Data:
In machine learning and statistical modeling, datasets are typically divided into two main subsets: training data and test data. These subsets serve distinct purposes in developing and evaluating predictive models:
• Training Data: The training data is used to build the predictive model, teaching it how to make predictions or classifications.
• Test Data: The test data is used to evaluate the model's performance and assess how well it generalizes to new, unseen data.
Here, 20% of the data is used for testing, with a random state of 57 so that the split is reproducible.
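The effect of an 80/20 split with a fixed seed can be illustrated with the standard library alone. This is a simplified stand-in for a library splitter such as scikit-learn's `train_test_split`. It does not reproduce that function's exact shuffling, so the same seed gives different rows here. `split_train_test` is a hypothetical helper, and the integer records stand in for the 2000 dataset rows.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=57):
    """Shuffle a copy of rows with a fixed seed, then cut off the test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


records = list(range(2000))  # stand-in for the 2000 dataset rows
train, test = split_train_test(records)

print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed means the same rows land in the test set on every run, which keeps the reported metrics comparable between runs.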
• Accuracy: This is the ratio of correctly predicted instances (both true positives and true negatives) to the total instances in the dataset.
Accuracy: 0.615
• Precision: Also known as Positive Predictive Value, it is the ratio of true positives to the sum of true positives and false positives. It measures the accuracy of the positive predictions.
Precision: 1.0
• Recall: Also known as Sensitivity, Hit Rate, or True Positive Rate, it is the ratio of true positives to the sum of true positives and false negatives. It measures the sensitivity to detect the positive class.
Recall: 0.615
• F1-measure: The harmonic mean of precision and recall. It provides a balance between precision and recall and is particularly useful when dealing with imbalanced datasets.
F1-measure: 0.761
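The four definitions above translate directly into arithmetic on confusion-matrix counts. The counts below are hypothetical, chosen only to make the arithmetic easy to follow; they are not the project's results. `classification_metrics` is a hypothetical helper that assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes tp + fp > 0, tp + fn > 0 and precision + recall > 0.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 70/100, precision is 40/50, recall is 40/60, and F1 is 80/110.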
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical representation
that illustrates the diagnostic ability of a binary classification model. It plots the
True Positive Rate against the False Positive Rate for different classification
thresholds.
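To show how the curve is built, the sketch below computes one (False Positive Rate, True Positive Rate) point per threshold from a handful of illustrative labels and scores, then estimates the area under the curve with the trapezoidal rule. The data and the helper names `roc_points` and `area_under_curve` are assumptions for illustration, not the project's code.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from the highest threshold down."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]               # 1 = positive class (illustrative)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # classifier scores (illustrative)

points = roc_points(labels, scores)
auc = area_under_curve(points)

print(points)
print("AUC:", auc)
```

Lowering the threshold moves the point up when it admits a true positive and to the right when it admits a false positive, tracing the curve from (0, 0) to (1, 1).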
3/3/2024 Presented by: Sadika, Noor & Rakib 26
3/3/2024 Presented by: Sadika, Noor & Rakib 27

More Related Content

Similar to Data Mining Theory and Python Project.pptx

IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsIJERA Editor
 
Comparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorizationComparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorizationeSAT Journals
 
Classification Techniques: A Review
Classification Techniques: A ReviewClassification Techniques: A Review
Classification Techniques: A ReviewIOSRjournaljce
 
Novel algorithms for Knowledge discovery from neural networks in Classificat...
Novel algorithms for  Knowledge discovery from neural networks in Classificat...Novel algorithms for  Knowledge discovery from neural networks in Classificat...
Novel algorithms for Knowledge discovery from neural networks in Classificat...Dr.(Mrs).Gethsiyal Augasta
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET Journal
 
Discretization methods for Bayesian networks in the case of the earthquake
Discretization methods for Bayesian networks in the case of the earthquakeDiscretization methods for Bayesian networks in the case of the earthquake
Discretization methods for Bayesian networks in the case of the earthquakejournalBEEI
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesianAhmad Amri
 
Classifiers
ClassifiersClassifiers
ClassifiersAyurdata
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESIRJET Journal
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdfMariaKhan905189
 
AlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxAlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxPerumalPitchandi
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET Journal
 

Similar to Data Mining Theory and Python Project.pptx (20)

fINAL ML PPT.pptx
fINAL ML PPT.pptxfINAL ML PPT.pptx
fINAL ML PPT.pptx
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining Algorithms
 
Comparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorizationComparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorization
 
Classification Techniques: A Review
Classification Techniques: A ReviewClassification Techniques: A Review
Classification Techniques: A Review
 
Novel algorithms for Knowledge discovery from neural networks in Classificat...
Novel algorithms for  Knowledge discovery from neural networks in Classificat...Novel algorithms for  Knowledge discovery from neural networks in Classificat...
Novel algorithms for Knowledge discovery from neural networks in Classificat...
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
 
K means report
K means reportK means report
K means report
 
Discretization methods for Bayesian networks in the case of the earthquake
Discretization methods for Bayesian networks in the case of the earthquakeDiscretization methods for Bayesian networks in the case of the earthquake
Discretization methods for Bayesian networks in the case of the earthquake
 
Expandable bayesian
Expandable bayesianExpandable bayesian
Expandable bayesian
 
Visualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLABVisualization of Crisp and Rough Clustering using MATLAB
Visualization of Crisp and Rough Clustering using MATLAB
 
similarities-knn.pptx
similarities-knn.pptxsimilarities-knn.pptx
similarities-knn.pptx
 
Classifiers
ClassifiersClassifiers
Classifiers
 
61_Empirical
61_Empirical61_Empirical
61_Empirical
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf
 
AlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxAlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptx
 
IRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms ComparisonIRJET- Supervised Learning Classification Algorithms Comparison
IRJET- Supervised Learning Classification Algorithms Comparison
 

Recently uploaded

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 

Recently uploaded (20)

VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Coefficient of Thermal Expansion and their Importance.pptx
Data Mining Theory and Python Project.pptx

  • 2. Course Name: Software Development Project – II
    Course Code: ICT-3112
    3/3/2024 Presented by: Sadika, Noor & Rakib
    Team members
        No  Name                   ID
        01  Sadika Khatun Jhinu    IT20029
        02  Gazi Md. Noor Hossain  IT20030
        03  Rakibul Islam          IT20031
    Supervisor: Md. Tanvir Rahman, Assistant Professor, Dept. of ICT, MBSTU
  • 3. Contents
    - Dataset
    - Data Mining
    - Python Programming Language
    - Binary & Discrete Classification
    - Euclidean Distance
    - Minkowski Distance
    - Regression Analysis
    - Linear Regression
    - Covariance
    - Deviation
    - Prediction Using SVM
    - ROC Curve
  • 4. Project Proposal
    Our aim is to:
    - Collect a dataset from Kaggle
    - Apply the knowledge that we learnt in the Data Mining course
    - Implement the analysis using the Python programming language
  • 5. Dataset
    - We collected this restaurant dataset from Kaggle, a popular online platform for data science competitions, machine learning challenges, and datasets, founded in 2010.
    - It contains customer details, their personal ratings, and their payment method.
    - It is a largely numerical dataset; only the Payment attribute is categorical.
    - It contains 2000 records for analysis.
    - The dataset file is in .csv (comma-separated values) format, which stores data in tabular form.
    - The attributes of this file: 1. CustomerID 2. Height 3. Weight 4. Age 5. annual_income 6. ratings 7. Price 8. Payment
  • 6. Data Mining
    Data mining is the process of extracting meaningful patterns, trends, and insights from large volumes of data. It uses advanced algorithms and statistical techniques to discover hidden relationships within datasets.
    Key features of Data Mining:
    - Classification and Clustering: Classification categorizes data into distinct, predefined groups. Clustering groups similar data points together without predefined categories.
    - Anomaly Detection: It can identify unusual or anomalous data points. This feature is valuable for fraud detection, outlier identification, and quality control.
    - Regression Analysis: This involves the estimation of relationships between variables.
    - Association Rule Mining: It identifies relationships between different items in a dataset.
    - Predictive Modeling: Data mining enables the creation of predictive models that can forecast future trends or outcomes based on historical data.
  • 7. Python Programming Language
    - Python is a high-level, versatile, dynamically typed programming language known for its simplicity, readability, and extensive standard library.
    - Python is used in web development, machine learning applications, and many other areas of the software industry.
    - These qualities have made it a favored language across a wide range of industries and applications, from web development to scientific research and artificial intelligence.
    Applications of Python:
    - Python can be used on a server to create web applications.
    - Python can be used alongside other software to create workflows.
    - Python can connect to database systems, and it can read and modify files.
    - Python can be used to handle big data and perform complex mathematics.
  • 8. Classification
    Classification is the process of categorizing data or objects into predefined classes based on their features or attributes. In machine learning, classification is a supervised learning technique in which an algorithm is trained on a labeled dataset to predict the class of new, unseen data.
    Classification is of two types:
    1. Binary Classification: the goal is to classify the input into one of two classes.
    2. Multiclass Classification: the goal is to classify the input into one of several classes.
  • 9. Binarization
    A simple technique to binarize a categorical attribute is the following: if there are m categorical values, uniquely assign each original value to an integer in the interval [0, m-1].
    Here, we split the weight attribute into three ranges and create one 0/1 indicator column per range:
        condition1 = data['weight'] < 30
        condition2 = (data['weight'] >= 30) & (data['weight'] <= 60)
        condition3 = data['weight'] > 60
        data['Below_30'] = condition1.astype(int)
        data['Between_30_and_60'] = condition2.astype(int)
        data['Above_60'] = condition3.astype(int)
        print(data)
  • 10. Binarization
  • 11. Discretization
    Discretization is typically applied to attributes that are used in classification or association analysis. Transforming a continuous attribute into a categorical attribute involves two subtasks: deciding how many categories, n, to have, and determining how to map the values of the continuous attribute to these categories.
    Here, with n = 3, we split the weight attribute into 3 categories of equal width:
        import pandas as pd
        # data is the DataFrame loaded from the dataset .csv file
        num_bins = 3
        bin_labels = ['Less', 'Medium', 'More']
        data['New Weight'] = pd.cut(data['weight'], bins=num_bins, labels=bin_labels)
        print(data)
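The equal-width mapping described above can be sketched without pandas. This is only an illustration: the function name `equal_width_bins` is hypothetical, the sample weights are the nine rows shown on the slide with the data table, and behaviour at exact bin edges may differ slightly from `pd.cut`.

```python
def equal_width_bins(values, labels):
    """Assign each value to one of len(labels) equal-width, right-closed bins."""
    lo, hi = min(values), max(values)
    n = len(labels)
    width = (hi - lo) / n
    binned = []
    for v in values:
        for i in range(n):
            # Upper edge of bin i; a value equal to the edge stays in bin i.
            if v <= lo + (i + 1) * width:
                binned.append(labels[i])
                break
        else:
            # Guard against floating-point rounding at the top edge.
            binned.append(labels[-1])
    return binned


# Weights taken from the nine sample rows of the dataset table.
weights = [112, 136, 153, 142, 144, 123, 141, 136, 112]
print(equal_width_bins(weights, ['Less', 'Medium', 'More']))
```

With three bins over the range 112 to 153, each bin is about 13.7 units wide, so the category depends only on where a value falls in that range and not on how many values share a bin.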
  • 12. Discretization
  • 13. Euclidean Distance
    The Euclidean distance is a measure of the straight-line distance between two points in Euclidean space. It is the most commonly used distance metric in geometry and machine learning.
    Properties:
    1. It is always non-negative (d ≥ 0).
    2. It is symmetric: the distance from point A to point B is the same as from point B to point A.
    3. It satisfies the triangle inequality: d(A, C) ≤ d(A, B) + d(B, C), so a detour through a third point is never shorter than the direct path.
    Euclidean distance: $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
    Treating the weight and height columns as two n-dimensional points:
        import numpy as np
        point1 = data['weight']
        point2 = data['height']
        distance = np.linalg.norm(point1 - point2)
    Euclidean distance: 2698.051
  • 14. Minkowski Distance
    The Minkowski distance is a metric used to measure the distance between two points in a multidimensional space. It is a generalization of other distance metrics such as the Euclidean distance and the Manhattan distance.
    Minkowski distance: $d = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
    Some properties of the Minkowski distance:
    1. When p = 1, it is called the Manhattan distance or L1 norm.
    2. When p = 2, it is called the Euclidean distance or L2 norm.
    3. As p approaches infinity, the Minkowski distance approaches the Chebyshev distance.
        point1 = data['weight']
        point2 = data['height']
        p = 2
        distance = np.power(np.sum(np.abs(point1 - point2) ** p), 1/p)
    Minkowski distance (p=2): 2698.0517
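The three properties listed above can be checked on a small example. The sketch below is plain Python on two toy points (not the restaurant dataset); the function names `minkowski` and `chebyshev` are chosen here for illustration.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def chebyshev(x, y):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = [0, 0], [3, 4]
for p in (1, 2, 10, 100):
    # p = 1 is Manhattan, p = 2 is Euclidean; larger p moves toward Chebyshev.
    print(p, minkowski(x, y, p))
print("chebyshev", chebyshev(x, y))
```

As p grows, the largest coordinate difference dominates the sum, which is why the value settles toward the Chebyshev distance.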
  • 15. Regression Analysis
    Regression analysis is a statistical method that describes the relationship between two or more variables.
    - Usually expressed in a graph, the method tests the relationship between a dependent variable and one or more independent variables.
    - Typically, the dependent variable changes with the independent variable(s), and regression analysis attempts to answer which factors matter most to that change.
    - Generally, regression analysis is used to:
        - Try to explain a phenomenon
        - Predict future events
        - Optimize manufacturing and delivery processes
        - Resolve errors
        - Provide new insights
  • 16. Linear Regression
    Linear regression is a supervised machine learning algorithm that computes the linear relationship between a dependent variable and one or more independent features.
    The equation for linear regression is $y = a + bx$, where
        x is the independent variable
        y is the dependent variable
        a is the intercept of the regression line
        b is the slope of the regression line
    The coefficients are computed as
        $b = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}}$ and $a = \bar{y} - b\,\bar{x}$
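The slope and intercept formulas above can be evaluated directly. The following is a minimal sketch in plain Python on made-up toy data that lies exactly on the line y = 1 + 2x; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (sum(xy) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: mean(y) - b * mean(x)
    a = sum_y / n - b * sum_x / n
    return a, b


# Toy data, not taken from the restaurant dataset.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print("intercept:", a, "slope:", b)
```

Because the toy points lie on a straight line, the fitted intercept and slope recover that line; on real data the same formulas give the least-squares fit.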
  • 17. Linear Regression
        from sklearn.linear_model import LinearRegression
        model = LinearRegression()
        model.fit(X, Y)
        slope = model.coef_[0]
        intercept = model.intercept_
    Slope (Coefficient): 2.889
    Intercept: -68.252
  • 18. Covariance
    Covariance is a measure of the relationship between two random variables and of the extent to which they change together. A positive covariance means the two variables tend to increase or decrease together; a negative covariance means one tends to increase as the other decreases.
    Sample covariance formula: $\mathrm{Cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
        X = data['weight']
        Y = data['height']
        mean_X = np.mean(X)
        mean_Y = np.mean(Y)
        covariance = np.sum((X - mean_X) * (Y - mean_Y)) / (len(X) - 1)
    Covariance of Height and Weight: 11.17
  • 19. Standard Deviation
    Standard deviation is a statistical measure that quantifies the amount of variation or dispersion in a set of data points. It describes how spread out the values in a dataset are around the mean.
    Standard deviation: $\sigma = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
        X = data['height']
        Y = data['weight']
        mean_X = np.mean(X)
        std_dev_X = np.sqrt(np.mean((X - mean_X)**2))
    Standard Deviation of Height: 1.97
    Standard Deviation of Weight: 11.50
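The code on this slide divides by n (the population form). A short sketch using only the Python standard library shows how that compares with the sample form, which divides by n - 1; the values below are toy numbers, not the dataset, and `population_std` is a name chosen here.

```python
import math
import statistics


def population_std(values):
    """Population standard deviation: square root of the mean squared deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


sample = [2, 4, 4, 4, 5, 5, 7, 9]
print("population:", population_std(sample))       # divides by n
print("pstdev:    ", statistics.pstdev(sample))    # standard-library equivalent
print("sample:    ", statistics.stdev(sample))     # divides by n - 1
```

For a dataset of 2000 records the two forms differ very little, but the distinction matters for small samples.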
  • 20. Prediction Algorithm
    Prediction is the process of estimating or forecasting future events, outcomes, or values based on existing data and patterns.
    Key points:
    - Methodology: Predictions are made using various techniques and models, including statistical methods, machine learning algorithms, regression analysis, and time series analysis.
    - Training Data: To make accurate predictions, models are typically trained on historical or existing data so that the patterns they learn can be applied to new or unseen data.
    - Accuracy and Performance: The accuracy of predictions is a critical metric. Models are evaluated on how well they generalize to new data.
    - Applications: Prediction is widely used across domains. For instance, in finance, predictions are made about stock prices; in healthcare, about disease progression; in weather forecasting, about future weather conditions.
  • 21. Support Vector Machine (SVM)
    - Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or nonlinear classification, regression, and outlier detection tasks.
    - SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection.
  • 22. Support Vector Machine (SVM)
    - Drop function: drop is a method of the pandas DataFrame (it is not part of the Python standard library). It is used here for removing columns.
        CustomerID  height  weight  age  annual_income  rate  price  payment
        1           65      112     19   15000          3.4   1325   cash
        2           71      136     21   35000          3.9   1600   cash
        3           69      153     20   86000          3.7   1850   VISA
        4           68      142     23   59000          2.7   2075   VISA
        5           67      144     31   38000          2.8   1600   VISA
        6           68      123     22   58000          3.4   2075   VISA
        7           69      141     35   31000          4.1   1650   VISA
        8           70      136     23   84000          2.8   2075   VISA
        9           67      112     64   97000          3.2   1650   cash
  • 23. Support Vector Machine (SVM)
    - Label Encoder: Label encoding is a technique used to convert categorical columns into numerical ones so that they can be fitted by machine learning models that only accept numerical data. It is an important pre-processing step in a machine learning project.
    - fillna: fillna is a pandas method for filling missing values in a DataFrame or Series. It is a common operation when working with data, as missing values can cause issues when performing calculations or visualizing data.
    - mean: mean refers to the average of a set of numbers. It can be computed in either of two ways:
        mean = sum(numbers) / len(numbers)
        mean = np.mean(numbers)
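What label encoding and mean imputation do to a column can be sketched in plain Python. This is an illustration of the idea only, not the pandas or scikit-learn implementation; the helper names `label_encode` and `fill_missing_with_mean` are hypothetical, and the ordering of codes (sorted class names) is an assumption made to mirror how such encoders commonly behave.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


codes, mapping = label_encode(['cash', 'cash', 'VISA'])
print(codes, mapping)
print(fill_missing_with_mean([1.0, None, 3.0]))
```

After these two steps every column is numeric and free of gaps, which is the form a model such as an SVM expects as input.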
  • 24. Support Vector Machine (SVM)
    Test data and training data: In machine learning and statistical modeling, datasets are typically divided into two main subsets, which serve distinct purposes in developing and evaluating predictive models:
    - Training Data: used to build the predictive model, that is, to teach the model how to make predictions or classifications.
    - Test Data: used to evaluate the model's performance and assess how well it generalizes to new, unseen data.
    Here, 20% of the data is held out for testing, and the split uses random state = 57 so that it is reproducible.
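The idea of a seeded 80/20 split can be sketched with the standard library alone. This is an assumption-laden illustration: the function `train_test_split_simple` is hypothetical, and it will not reproduce the exact rows that a library splitter would select for the same seed.

```python
import random


def train_test_split_simple(rows, test_fraction=0.2, seed=57):
    """Shuffle rows with a fixed seed, then hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # same seed gives the same order
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split_simple(range(10))
print("train:", train)
print("test: ", test)
```

Fixing the seed means that rerunning the experiment evaluates the model on the same held-out rows, so reported metrics are comparable between runs.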
  • 25. Support Vector Machine (SVM)
    - Accuracy: the ratio of correctly predicted instances (both true positives and true negatives) to the total instances in the dataset. Accuracy: 0.615
    - Precision: also known as Positive Predictive Value, the ratio of true positives to the sum of true positives and false positives. It measures the accuracy of the positive predictions. Precision: 1.0
    - Recall: also known as Sensitivity, Hit Rate, or True Positive Rate, the ratio of true positives to the sum of true positives and false negatives. It measures the ability to detect the positive class. Recall: 0.615
    - F1-measure: the harmonic mean of precision and recall. It balances the two and is particularly useful when dealing with imbalanced datasets. F1-measure: 0.761
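The four definitions above translate directly into arithmetic on confusion-matrix counts. In the sketch below the counts are hypothetical, chosen only so that the results land close to the values reported on the slide; they are not the project's actual confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts: no false positives and no true negatives.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=0, fn=5, tn=0)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Note that accuracy equals recall whenever there are no false positives and no true negatives, which is one situation in which a precision of 1.0 can appear alongside equal accuracy and recall.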
  • 26. ROC Curve
    The Receiver Operating Characteristic (ROC) curve is a graphical representation that illustrates the diagnostic ability of a binary classification model. It plots the True Positive Rate against the False Positive Rate for different classification thresholds.
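How the points of a ROC curve arise from thresholds can be sketched in plain Python. The labels and scores below are toy values, not the project's SVM output, and the function names `roc_points` and `auc` are chosen here for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # Predict positive when the score is at least the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
print(points)
print("AUC:", auc(points))
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1); a model that ranks positives above negatives more often keeps the curve closer to the top-left corner and yields a larger area.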