Data Mining and Neural Networks
Computational Task 1
Task 1
a. What is the problem the authors aimed to solve?
The authors aimed to distinguish malignant from benign breast tumours, using nuclear size, shape, and texture as features.
b. Which methods did they use?
The authors used inductive machine learning and logistic regression to label each case as malignant or benign.
c. How did they test the accuracy of classification?
The authors used cross-validation to estimate the accuracy of the predictions. The accuracy of logistic regression was 96.2%, whereas the accuracy of inductive machine learning was 97.5%.
Task 2
For Task 2, the data table was downloaded from ics.uci.edu as the file wdbc.data. There are 32 columns in total: 1 ID column, 1 Diagnosis column, and 30 attribute columns. The 30 attributes fall into 3 groups: the mean, standard error, and worst value of each of ten nuclear features. There are 212 malignant cases (M) and 357 benign cases (B), as shown in Figure 1.
Figure 1. Number of features and count of each target class
Figure 2 shows the mean, variance, and standard deviation of all attributes (columns 3-32). These are calculated before normalizing the attributes to zero mean and unit variance.
Figure 2. Mean, Variance and Standard Deviation of each
attribute (0-29)
Figure 3 shows the mean, variance, and standard deviation of all attributes for the malignant class (M).
Figure 3. Mean, Variance and Standard Deviation of each
attribute for Malignant class (M)
Figure 4 shows the mean, variance, and standard deviation of all attributes for the benign class (B).
Figure 4. Mean, Variance and Standard Deviation of each
attribute for Benign class (B)
The attributes are clearly not normalized, as the means, variances, and standard deviations show. To normalize, we subtract the mean of each attribute from each of its values to obtain zero mean, and divide by the standard deviation to obtain unit variance, as shown in Figure 5.
Figure 5. Mean and standard deviation after normalization
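For reference, a minimal sketch of this z-score step in pandas (it mirrors Appendix item 8 and assumes the attribute columns start at index 2, after the ID and Diagnosis columns):

# z-score: subtract each column's mean, then divide by its standard deviation
data.iloc[:, 2:] = data.iloc[:, 2:].apply(lambda x: (x - x.mean()) / x.std())
# each attribute column now has mean ~0 and standard deviation ~1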
Task 3
To create single-attribute predictors, we plotted histograms of each attribute for each class. Some of these histograms are shown in Figure 6.
Figure 6. Histogram plots of first 4 columns
To calculate the optimal threshold for each single-attribute classifier, we set the threshold from 0-20 (one candidate per bin) and calculated the accuracy and specificity. We then chose the threshold that maximizes the accuracy. The thresholds of each single-attribute classifier are shown in Figure 7.
Figure 7. Optimal Thresholds of all single attribute classifiers
sorted by accuracy
From Figure 7, we can determine that attribute ‘20’ gives the best accuracy with the fewest classification errors. The following are some of the classification rules:
Attribute Accuracy Error Threshold Classification Rule
20 89.99% 10.03% 16 If x <= 16 then Class B else Class M
0 89.39% 10.60% 15 If x <= 15 then Class B else Class M
12 80.63% 19.36% 3 If x > 3 then Class M else Class B
Table 1. Classification rules of the top 3 single attribute
classifiers
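As an illustration, the first rule in Table 1 can be checked directly against the raw (unnormalized) data; the column name '20' follows the header scheme of Appendix item 2, and the 89.99% figure comes from Table 1, not from rerunning this snippet:

# apply "if x <= 16 then Class B else Class M" to attribute '20'
predictions = data['20'].apply(lambda x: 'B' if x <= 16 else 'M')
accuracy = (predictions == data['Diagnosis']).mean()
print("Rule accuracy:", accuracy)  # expected to be close to the 89.99% reported above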
Task 4
To test 1NN and 3NN classification rules, we normalized the
values to zero mean and unit variance. We
also divided the dataset into 60% training data and 40% test
data to test the classification accuracy and
error. The following Figure 8 shows the accuracy of both 1NN
and 3NN classifiers.
Figure 8. Accuracy of 1NN and 3NN classifiers
Figure 9 shows the classification errors of both the 1NN and 3NN classifiers.
Figure 9. Classification errors of 1NN and 3NN classifiers
Based on this, the 3NN classifier is more accurate than the 1NN classifier, and is therefore better at classifying malignant vs benign cases.
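Since a single 60/40 split can be noisy, one optional sanity check (a sketch, not part of the original experiment) is to compare the two classifiers with 5-fold cross-validation on the normalized attributes:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# mean 5-fold accuracy for 1NN and 3NN over the whole normalized dataset
for k in (1, 3):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             data.iloc[:, 2:], data.iloc[:, 1], cv=5)
    print("{}NN mean accuracy: {:.4f}".format(k, scores.mean()))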
Task 5
Fisher’s linear discriminant finds a projection direction (equivalently, a separating hyperplane) that optimizes a signal-to-noise ratio: it maximizes the distance between the means of the projected instances of the two classes while minimizing the variance among the projected instances within each class. That is, it seeks the direction along which the two groups of projected instances lie as far apart as possible while each group remains tightly packed.
Figure 10. A hyperplane whose projections separate the two classes clearly
Here the projections of the data points onto the normal of the hyperplane are well separated between the classes, and within each class they are tightly clustered. This allows us to classify by thresholding along that normal.
Fisher’s linear discriminant hence finds the hyperplane by maximizing the following ratio:

J(w) = (w^T S_B w) / (w^T S_W w)

where S_B is the between-class scatter matrix, S_W is the within-class scatter matrix, and w is the normal to the hyperplane.
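For two classes, the maximizing direction has the closed form w = S_W^-1 (m1 - m2). A minimal numpy sketch, assuming X_m and X_b are the class-wise data matrices with one instance per row (these names are illustrative, not from the report):

import numpy as np

def fisher_direction(X_m, X_b):
    # class means
    m1, m2 = X_m.mean(axis=0), X_b.mean(axis=0)
    # within-class scatter: sum of the two classes' scatter matrices
    Sw = (X_m - m1).T @ (X_m - m1) + (X_b - m2).T @ (X_b - m2)
    # solve Sw w = m1 - m2 (use np.linalg.pinv if Sw is singular)
    return np.linalg.solve(Sw, m1 - m2)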
Task 6
We applied Fisher’s linear discriminant to the Breast Cancer Wisconsin (Diagnostic) data set using sklearn’s LinearDiscriminantAnalysis classifier. Figure 11 shows the accuracy of Fisher’s classifier:
Figure 11. Accuracy of Fisher’s Linear Discriminant
Figure 12 shows the confusion matrix and classification errors:
Figure 12. Confusion matrix and classification errors of Fisher’s classifier
Compared to 1NN, this method was more accurate, and its accuracy was on par with that of 3NN.
Appendix
1. # Import statements
import pandas as pd
import numpy as np
from sklearn import preprocessing
from matplotlib import pyplot
2. # Data import
headers = ['ID', 'Diagnosis']
headers.extend([str(i) for i in range(30)])
data = pd.read_csv('wdbc.data', sep=",", header=None, names=headers)
data
3. # Stats
attributes = data.shape[1] - 2  # exclude the ID and Diagnosis columns
benign, malignant = 0, 0
for index, row in data.iterrows():
    if row[1] == 'M':
        malignant += 1
    elif row[1] == 'B':
        benign += 1
    else:
        print(row[1])
print("There are {} attributes".format(attributes))
print("There are {} malignant cases (M) and {} benign cases (B)".format(malignant, benign))
4. # mean variance and standard deviation: All classes
all_means = []
all_std_deviations = []
all_variations = []
for column in data.columns[2:]:
    all_means.append(data[column].mean())
    all_std_deviations.append(data[column].std())
    all_variations.append(data[column].var())
pd.DataFrame({'Mean': all_means, 'Variance': all_variations,
              'Standard Deviation': all_std_deviations})
5. # mean variance and standard deviation: Class: malignant
malignant_means = []
malignant_std_deviations = []
malignant_variations = []
filtered_data = data.loc[data['Diagnosis'] == 'M']  # filter once; it does not depend on the column
for column in data.columns[2:]:
    malignant_means.append(filtered_data[column].mean())
    malignant_std_deviations.append(filtered_data[column].std())
    malignant_variations.append(filtered_data[column].var())
pd.DataFrame({'Malignant Mean': malignant_means,
              'Malignant Variance': malignant_variations,
              'Malignant Standard Deviation': malignant_std_deviations})
6. # mean variance and standard deviation: Class: benign
benign_means = []
benign_std_deviations = []
benign_variations = []
filtered_data = data.loc[data['Diagnosis'] == 'B']
for column in data.columns[2:]:
    benign_means.append(filtered_data[column].mean())
    benign_std_deviations.append(filtered_data[column].std())
    benign_variations.append(filtered_data[column].var())
pd.DataFrame({'Benign Mean': benign_means,
              'Benign Variance': benign_variations,
              'Benign Standard Deviation': benign_std_deviations})
7. # Optimal thresholds for all attributes
column_specificity_map = {}
results = {}
for column in data.columns[2:]:
    # find min, max and step
    num_bins = 20
    col_min = data[column].min()
    col_max = data[column].max()
    step = (col_max - col_min) / num_bins
    # bin edges: num_bins + 1 edges give num_bins bins (the original built only
    # num_bins edges, which silently dropped the last bin)
    bins = [col_min + i * step for i in range(num_bins + 1)]
    # per-class histograms (the deprecated normed=False argument is dropped)
    class_m = np.histogram(data.loc[data.Diagnosis == 'M', column], bins=bins)[0]
    class_b = np.histogram(data.loc[data.Diagnosis == 'B', column], bins=bins)[0]
    total_class_m = sum(class_m)
    total_class_b = sum(class_b)
    new_data_m = [item / total_class_m for item in class_m]
    new_data_b = [item / total_class_b for item in class_b]
    new_data = pd.DataFrame({'M': new_data_m, 'B': new_data_b})
    new_data.plot.bar(title="Column: " + column)
    pyplot.show()
    # find the optimal threshold; candidate thresholds are the integers 0-19,
    # applied to the raw (not yet normalized) attribute values
    threshold_specificity_map = {}
    for i in range(0, num_bins):
        # rule 1: a <= threshold -> class M, a > threshold -> class B
        class_m_correct = len([item for item in data.loc[data.Diagnosis == 'M', column] if item <= i])
        class_b_correct = len([item for item in data.loc[data.Diagnosis == 'B', column] if item > i])
        norm_class_m_correct = class_m_correct / total_class_m
        norm_class_b_correct = class_b_correct / total_class_b
        accuracy_1 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        # "specificity" here is the average of the per-class accuracies
        specificity_1 = (norm_class_m_correct + norm_class_b_correct) / 2
        # rule 2: a <= threshold -> class B, a > threshold -> class M
        class_b_correct = len([item for item in data.loc[data.Diagnosis == 'B', column] if item <= i])
        class_m_correct = len([item for item in data.loc[data.Diagnosis == 'M', column] if item > i])
        norm_class_m_correct = class_m_correct / total_class_m
        norm_class_b_correct = class_b_correct / total_class_b
        accuracy_2 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        specificity_2 = (norm_class_m_correct + norm_class_b_correct) / 2
        # keep whichever rule direction scores better
        specificity, accuracy = specificity_1, accuracy_1
        if specificity < specificity_2:
            specificity, accuracy = specificity_2, accuracy_2
        threshold_specificity_map[i] = {'specificity': specificity, 'accuracy': accuracy}
    # get the optimal threshold (selected by the averaged per-class accuracy)
    max_specificity = -100
    max_accuracy = -100
    optimal_threshold = -100
    for threshold, item in threshold_specificity_map.items():
        if item['specificity'] > max_specificity:
            max_specificity = item['specificity']
            max_accuracy = item['accuracy']
            optimal_threshold = threshold
    column_specificity_map[column] = max_specificity
    results[column] = {
        'Optimal Threshold': optimal_threshold,
        'Accuracy': max_accuracy,
        'Error': 1 - max_accuracy
    }
# print in order of prediction ability
dict(sorted(column_specificity_map.items()))
pd.DataFrame(results).transpose().sort_values(by=['Accuracy'], ascending=False)
8. # Normalization and train/test split
from sklearn.model_selection import train_test_split
# Normalization to zero mean and unit variance
data.iloc[:, 2:] = data.iloc[:, 2:].apply(lambda x: (x - x.mean()) / x.std())
train, test, train_labels, test_labels = train_test_split(
    data.iloc[:, 2:], data.iloc[:, 1], test_size=0.40, random_state=3)
9. # KNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
model.fit(train, train_labels)
nn1_predictions = model.predict(test)
accuracy_1 = model.score(test, test_labels)
print("1NN: Accuracy: ", accuracy_1)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train, train_labels)
nn3_predictions = model.predict(test)
accuracy_3 = model.score(test, test_labels)
print("3NN: Accuracy: ", accuracy_3)
pd.DataFrame({'1NN Accuracy': accuracy_1, '3NN Accuracy': accuracy_3}, index=[0])
10. # Misclassified test cases
print("Classification errors of 1NN:")
print("Prediction\tActual")
for prediction, actual in zip(nn1_predictions, test_labels):
    if prediction != actual:
        print(prediction, "\t\t", actual)
print("Classification errors of 3NN:")
print("Prediction\tActual")
for prediction, actual in zip(nn3_predictions, test_labels):
    if prediction != actual:
        print(prediction, "\t\t", actual)
11. # Fisher's Linear Discriminant
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
fisher_classifier = LinearDiscriminantAnalysis()
fisher_classifier.fit(train, train_labels)
fisher_predictions = fisher_classifier.predict(test)
print("Fisher's Accuracy: ", accuracy_score(test_labels, fisher_predictions))
# Confusion matrix (rows are predictions, columns are actual labels,
# given the argument order used here)
print("Confusion Matrix")
print(confusion_matrix(fisher_predictions, test_labels))
print("Classification errors of Fisher's:")
print("Prediction\tActual")
for prediction, actual in zip(fisher_predictions, test_labels):
    if prediction != actual:
        print(prediction, "\t\t", actual)
MA4022/MA7022 DATA MINING and NEURAL NETWORKS
Computational Task 3, 2021
Due date 17.04.2021, 23:59
For this task you need to download 4 time series from the Yahoo! Finance website. Every student should have their own unique set of time series!
Please collect the available data for the three years 2018-2020.
Please pay attention that for your analysis the time moments should be sorted from oldest to newest.
Use the daily closing price.
1. Data evaluation and elementary preprocessing. Analyse the completeness of the data. Are there missing data (besides weekends)? How many missing data points are in your time series? Are the dates of the missing values the same for all your time series? What may be the reasons for the missing data? How can you handle the missing values in your data (explain at least three approaches)? Use the simple rule: fill in a missing value with the closest past existing value, as sketched below. Plot the results. Normalise to the z-score (zero mean and unit standard deviation). Plot the results. (15 marks)
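A minimal pandas sketch of this "closest past value" rule (a forward fill) followed by the z-score; the file name series1.csv and the column names Date and Close are assumptions about how the Yahoo! Finance CSV is saved:

import pandas as pd

prices = pd.read_csv('series1.csv', parse_dates=['Date'], index_col='Date')
prices = prices.sort_index()               # oldest to newest
close = prices['Close'].ffill()            # fill a gap with the closest past value
z = (close - close.mean()) / close.std()   # z-score: zero mean, unit standard deviation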
3. Segmentation. Prepare the bottom-up piecewise linear segmentation for the transformed and normalised log-return time series. Use the following mean-square-error tolerance levels: 1%, 5%, 10% (the thresholds of the mean square errors). Plot the results. Are the segments similar for the different time series you analysed? (25 marks)
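One possible bottom-up merge, sketched under stated assumptions: y is the z-scored log-return series as a numpy array, tol is the mean-square-error threshold (e.g. 0.01 for 1% on a unit-variance series), and the stopping rule (halt when the cheapest merge would exceed tol) is one of several reasonable variants of the algorithm:

import numpy as np

def segment_cost(y, i, j):
    # mean square error of a least-squares line fitted to y[i:j]
    x = np.arange(i, j)
    coeffs = np.polyfit(x, y[i:j], 1)
    return np.mean((np.polyval(coeffs, x) - y[i:j]) ** 2)

def bottom_up(y, tol):
    # start from many two-point segments and greedily merge the cheapest neighbours
    bounds = list(range(0, len(y), 2)) + [len(y)]
    while len(bounds) > 2:
        costs = [segment_cost(y, bounds[k], bounds[k + 2]) for k in range(len(bounds) - 2)]
        k = int(np.argmin(costs))
        if costs[k] > tol:       # the cheapest merge is already too lossy
            break
        del bounds[k + 1]        # merge the two segments that meet at bounds[k+1]
    return bounds                # segment boundaries, e.g. [0, 37, 120, ..., len(y)]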
4. Prediction. Choose one of the transformed and normalised time series as a target g(t) and the other 3 as supporting data g1(t), g2(t), g3(t), where t = 1, ..., N. Provide scatter diagrams of (g(t), g(t+1)). Evaluate the error of the "next-day forecast", g_hat(t + 1) = g(t).
Use data for 2018 as the training set and find the predictor of g(t + 1) (the next-day value) as a linear function Ψ of g(t), g1(t), g2(t), g3(t):

g_hat(t + 1) = Ψ(g(t), g1(t), g2(t), g3(t))    (1)

(linear regression). Evaluate the training set error. Use data for 2019 as a test set and evaluate the test set error for this set. Also, use data for 2020 as a test set and evaluate the test set error for this set. Compare these errors. Compare these errors to the errors of the "next-day forecast". Comment. Provide plots of g(t), g_hat(t), and the residual. Present the (g(t), g_hat(t)) scatter diagram. (30 marks)
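A sketch of predictor (1) with scikit-learn; it assumes g, g1, g2, g3 are aligned pandas Series indexed by date (names chosen here for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.column_stack([g[:-1], g1[:-1], g2[:-1], g3[:-1]])  # inputs at time t
y = g.values[1:]                                          # targets g(t+1)
train = g.index[:-1].year == 2018                         # 2018 as the training set
model = LinearRegression().fit(X[train], y[train])
for year in (2018, 2019, 2020):
    mask = g.index[:-1].year == year
    print(year, "MSE:", np.mean((model.predict(X[mask]) - y[mask]) ** 2))
# baseline "next-day forecast" g_hat(t+1) = g(t)
print("naive MSE:", np.mean((g.values[:-1] - y) ** 2))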
5. Adaptive predictors. For each given value of the "frame width", Δ = 5, 10, 30, create and test the following adaptive predictor. For every T > Δ, create the training set with Δ input vectors (g(t), g1(t), g2(t), g3(t)) (t = T − Δ, ..., T − 1) and the corresponding outputs g(t + 1).

In more detail, the input vectors x_i and the output values y_i for a given T are

x_1 = (g(T − Δ), g1(T − Δ), g2(T − Δ), g3(T − Δ)),    y_1 = g(T − Δ + 1)
...
x_i = (g(T − Δ + i − 1), g1(T − Δ + i − 1), g2(T − Δ + i − 1), g3(T − Δ + i − 1)),    y_i = g(T − Δ + i)

where i = 1, 2, ..., Δ.

Find the linear regression (1) for each T > Δ. Test this linear regression for the next time value, t = T + 1. In more detail, for each T there is one test example with the input vector x_test and output value y_test:

x_test = (g(T), g1(T), g2(T), g3(T)),    y_test = g(T + 1)

Please pay attention that this example does not belong to the training set for this value of T.

Find the residuals at these test time moments. Plot these residuals and the values g(t), g_hat(t). Present the (g(t), g_hat(t)) scatter diagram (t = T + 1). Calculate the mean square error. Compare to the previous task. Comment. (30 marks)
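The rolling-window predictor can reuse X and y from the previous sketch; delta below is Δ, and each model sees only the Δ most recent input/output pairs before the test moment:

from sklearn.linear_model import LinearRegression
import numpy as np

def adaptive_predict(X, y, delta):
    # for each T > delta: fit on pairs t = T-delta .. T-1, then predict at t = T
    preds = []
    for T in range(delta, len(y)):
        model = LinearRegression().fit(X[T - delta:T], y[T - delta:T])
        preds.append(model.predict(X[T:T + 1])[0])
    return np.array(preds)

for delta in (5, 10, 30):
    preds = adaptive_predict(X, y, delta)
    residuals = y[delta:] - preds
    print("frame width", delta, "MSE:", np.mean(residuals ** 2))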