Data Mining and Neural Networks
Computational Task 1
Task 1
a. What problem did the authors aim to solve?
The authors aimed to distinguish malignant from benign breast tumours, using nuclear size, shape, and texture as features.
b. Which methods did they use?
The authors used inductive machine learning and logistic regression to label each case as malignant or benign.
c. How did they test the accuracy of classification?
The authors used cross-validation to test the accuracy of the predicted results. The accuracy of logistic regression was 96.2%, whereas the accuracy of inductive machine learning was 97.5%.
Task 2
For Task 2, the data table was downloaded from ics.uci.edu as the file wdbc.data. It has 32 columns in total: 1 ID column, 1 Diagnosis column, and 30 attribute columns. The 30 attributes fall into 3 groups of 10: the mean, the standard error, and the worst (largest) value of each measured feature. There are 212 malignant cases (M) and 357 benign cases (B), as shown in Figure 1.
Figure 1. Number of features and count of each target class
The mean, variance, and standard deviation of all attributes (columns 3-32) are shown in Figure 2. These are calculated before normalizing the attributes to unit variance.
Figure 2. Mean, Variance and Standard Deviation of each attribute (0-29)
The mean, variance, and standard deviation of all attributes for the Malignant class (M) are shown in Figure 3.
Figure 3. Mean, Variance and Standard Deviation of each attribute for Malignant class (M)
The mean, variance, and standard deviation of all attributes for the Benign class (B) are shown in Figure 4.
Figure 4. Mean, Variance and Standard Deviation of each attribute for Benign class (B)
The attributes are not normalized, as the means, variances, and standard deviations show. To normalize, we subtract the mean of each attribute from each of its values to get zero mean, and then divide by the standard deviation to get unit variance, as shown in Figure 5.
Figure 5. Mean and standard deviation after normalization
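The normalization described above can be sketched in a few lines. This is a minimal illustration on made-up numbers, not the wdbc.data values; the function name `zscore` is hypothetical, and `ddof=1` is chosen on the assumption that the sample standard deviation (the pandas default) is wanted.

```python
import numpy as np

def zscore(X, ddof=1):
    """Standardise each column of X to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # ddof=1 gives the sample standard deviation (assumption: matches pandas' default)
    std = X.std(axis=0, ddof=ddof)
    return (X - mean) / std

# Toy data: 3 samples, 2 attributes (illustrative values only)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
Z = zscore(X)
print(Z.mean(axis=0))          # column means after normalization
print(Z.std(axis=0, ddof=1))   # column standard deviations after normalization
```

Each column is shifted and scaled independently, so attributes measured on very different scales become comparable.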
Task 3
To build single-attribute predictors, we plotted histograms of each attribute for each class. Some of these histograms are shown in Figure 6.
Figure 6. Histogram plots of first 4 columns
To calculate the optimal threshold for each single-attribute classifier, we swept the threshold from 0 to 20 (bins) and calculated the accuracy and specificity at each value. We chose the threshold that maximizes the accuracy. The thresholds of each single-attribute classifier are shown in Figure 7.
Figure 7. Optimal Thresholds of all single attribute classifiers sorted by accuracy
From Figure 7, we can determine that attribute '20' gives the best accuracy with the fewest classification errors.
The following are some of the classification rules:
Attribute | Accuracy | Error  | Threshold | Classification Rule
20        | 89.99%   | 10.03% | 16        | If x <= 16 then Class B else Class M
0         | 89.39%   | 10.60% | 15        | If x <= 15 then Class B else Class M
12        | 80.63%   | 19.36% | 3         | If x > 3 then Class M else Class B
Table 1. Classification rules of the top 3 single attribute classifiers
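A single-attribute rule of the form "if x <= t then Class B else Class M" can be tuned by trying each candidate threshold and keeping the most accurate one. The sketch below assumes that rule direction and uses made-up values; `best_threshold` is a hypothetical helper, not the appendix code.

```python
import numpy as np

def best_threshold(values, labels, thresholds):
    """Return (threshold, accuracy) for the rule 'x <= t -> B, x > t -> M'."""
    values = np.asarray(values, dtype=float)
    is_malignant = np.asarray(labels) == "M"
    best_t, best_acc = None, -1.0
    for t in thresholds:
        predicted_malignant = values > t
        acc = float(np.mean(predicted_malignant == is_malignant))
        if acc > best_acc:  # strict '>' keeps the first of several tied thresholds
            best_t, best_acc = t, acc
    return best_t, best_acc

# Illustrative attribute values and labels (not taken from wdbc.data)
values = [11.0, 12.5, 13.0, 14.0, 17.0, 18.5, 20.0, 15.5]
labels = ["B", "B", "B", "B", "M", "M", "M", "B"]
t, acc = best_threshold(values, labels, thresholds=range(0, 21))
print(t, acc)
```

For an attribute where the malignant class tends to have smaller values, the same search would be run with the inequality reversed and the better of the two rules kept.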
Task 4
To test the 1NN and 3NN classification rules, we normalized the values to zero mean and unit variance. We also divided the dataset into 60% training data and 40% test data to measure the classification accuracy and error. Figure 8 shows the accuracy of both the 1NN and 3NN classifiers.
Figure 8. Accuracy of 1NN and 3NN classifiers
Figure 9 shows the classification errors of both the 1NN and 3NN classifiers.
Figure 9. Classification errors of 1NN and 3NN classifiers
Based on this, 3NN is more accurate than 1NN on the test set; hence the 3NN classifier is better at distinguishing malignant from benign cancers.
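To illustrate why 1NN and 3NN can disagree, here is a small from-scratch sketch of the k-nearest-neighbour rule with Euclidean distance and majority voting. The data are invented, with one deliberately mislabelled training point; `knn_predict` is a hypothetical helper and is not the sklearn classifier used in the appendix.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    distances = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: the last point sits among the B cluster but is labelled M
train_X = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.45, 0.45]]
train_y = ["B", "B", "M", "M", "M"]
query = [0.3, 0.3]

print(knn_predict(train_X, train_y, query, k=1))  # follows the single closest point
print(knn_predict(train_X, train_y, query, k=3))  # outvotes the isolated point
```

With k=1 the prediction follows the single noisy neighbour, while k=3 lets the two nearby B points outvote it, which is one reason a larger k can generalise better.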
Task 5
Fisher's linear discriminant is used to obtain a hyperplane that optimizes the signal-to-noise ratio: the hyperplane whose normal direction maximizes the distance between the means of the projected instances and minimizes the variance among the projected instances of each class. That is, it tries to find the hyperplane that increases the distance between the two groups of projected instances while keeping each group closely packed.
Figure 10. A hyperplane that clearly separates all projections
Here the projections of the data points onto the normal of the hyperplane are well separated, and the projections within each class are closely packed. This allows us to take the normal to the hyperplane, project each point onto it, and classify.
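Fisher's Linear Discriminant hence finds the hyperplane by maximizing the following ratio:
J(\vec{w}) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}
Here \vec{w} is normal to the hyperplane, \tilde{m}_k is the mean of the projections \vec{w}^T \vec{x} of the instances of class k, and \tilde{s}_k^2 is the scatter of those projections within class k.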
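As a hedged illustration of the ratio being maximized, the sketch below uses the standard closed form for the maximizing direction, w proportional to S_W^{-1}(m1 - m2), where S_W is the sum of the two within-class scatter matrices. This closed form is textbook material rather than something stated in the report, and the data and function names are invented for the example.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Unit vector w proportional to S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_w, m1 - m2)
    return w / np.linalg.norm(w)

def fisher_ratio(w, X1, X2):
    """Squared distance of projected means over summed projected scatter."""
    p1, p2 = X1 @ w, X2 @ w
    scatter = p1.var() * len(p1) + p2.var() * len(p2)
    return (p1.mean() - p2.mean()) ** 2 / scatter

# Synthetic two-class data (assumed for illustration)
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(100, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=[1.0, 3.0], size=(100, 2))

w = fisher_direction(X1, X2)
diff = X1.mean(axis=0) - X2.mean(axis=0)
print("ratio along w:            ", fisher_ratio(w, X1, X2))
print("ratio along mean difference:", fisher_ratio(diff / np.linalg.norm(diff), X1, X2))
```

Projecting onto the plain difference of class means ignores the within-class spread; the Fisher direction trades some mean separation for tighter projected classes, so its ratio is at least as large.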
Task 6
We applied Fisher's linear discriminant to the Breast Cancer Wisconsin (Diagnostic) data set using sklearn's LinearDiscriminantAnalysis classifier. Figure 11 shows the accuracy of Fisher's classifier:
Figure 11. Accuracy of Fisher's Linear Discriminant
Figure 12 shows the confusion matrix and classification errors:
Figure 12. Confusion matrix and classification errors of Fisher's Classifier
Compared with 1NN, this method was more accurate, and its accuracy was on par with that of 3NN.
Appendix

1. # Import statements
import pandas as pd
import numpy as np
from sklearn import preprocessing
from matplotlib import pyplot
2. # Data import
headers = ['ID', 'Diagnosis']
headers.extend([str(i) for i in range(30)])  # attribute columns are named '0'..'29'
data = pd.read_csv('wdbc.data', sep=",", header=None, names=headers)
data
3. # Stats
attributes = data.shape[1] - 2  # exclude the ID and Diagnosis columns
benign, malignant = 0, 0
for label in data['Diagnosis']:
    if label == 'M':
        malignant += 1
    elif label == 'B':
        benign += 1
    else:
        print(label)  # report any unexpected class label
print("There are {} attributes".format(attributes))
print("There are {} malignant cases (M) and {} benign cases (B)".format(malignant, benign))
4. # mean variance and standard deviation: All classes
all_means = []
all_std_deviations = []
all_variations = []
for column in data.columns[2:]:
    all_means.append(data[column].mean())
    all_std_deviations.append(data[column].std())
    all_variations.append(data[column].var())
pd.DataFrame({'Mean': all_means, 'Variance': all_variations,
              'Standard Deviation': all_std_deviations})
5. # mean variance and standard deviation: Class: malignant
malignant_means = []
malignant_std_deviations = []
malignant_variations = []
condition = data['Diagnosis'] == 'M'
filtered_data = data.loc[condition]  # rows of the malignant class only
for column in data.columns[2:]:
    malignant_means.append(filtered_data[column].mean())
    malignant_std_deviations.append(filtered_data[column].std())
    malignant_variations.append(filtered_data[column].var())
pd.DataFrame({'Malignant Mean': malignant_means,
              'Malignant Variance': malignant_variations,
              'Malignant Standard Deviation': malignant_std_deviations})
6. # mean variance and standard deviation: Class: benign
benign_means = []
benign_std_deviations = []
benign_variations = []
condition = data['Diagnosis'] == 'B'
filtered_data = data.loc[condition]  # rows of the benign class only
for column in data.columns[2:]:
    benign_means.append(filtered_data[column].mean())
    benign_std_deviations.append(filtered_data[column].std())
    benign_variations.append(filtered_data[column].var())
pd.DataFrame({'Benign Mean': benign_means,
              'Benign Variance': benign_variations,
              'Benign Standard Deviation': benign_std_deviations})
7. # Optimal thresholds for all attributes
column_specificity_map = {}
results = {}
num_bins = 20
for column in data.columns[2:]:
    m_values = data.loc[data.Diagnosis == 'M', column]
    b_values = data.loc[data.Diagnosis == 'B', column]
    # bin edges spanning the full range of the attribute, so every sample is counted
    # (renamed from min/max to avoid shadowing the built-ins)
    col_min = data[column].min()
    col_max = data[column].max()
    bins = np.linspace(col_min, col_max, num_bins + 1)
    # per-class histograms; 'normed' is no longer accepted by recent NumPy,
    # and raw counts are the default
    class_m = np.histogram(m_values, bins=bins)[0]
    class_b = np.histogram(b_values, bins=bins)[0]
    total_class_m = sum(class_m)
    total_class_b = sum(class_b)
    new_data_m = [item / total_class_m for item in class_m]
    new_data_b = [item / total_class_b for item in class_b]
    new_data = pd.DataFrame({'M': new_data_m, 'B': new_data_b})
    new_data.plot.bar(title="Column: " + column)
    pyplot.show()
    # find the optimal threshold; candidates are the raw integer values 0..num_bins-1
    threshold_specificity_map = {}
    for i in range(0, num_bins):
        # rule 1: a <= threshold -> class M, a > threshold -> class B
        class_m_correct = (m_values <= i).sum()
        class_b_correct = (b_values > i).sum()
        norm_class_m_correct = class_m_correct / total_class_m
        norm_class_b_correct = class_b_correct / total_class_b
        accuracy_1 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        # 'specificity' here is the mean of the two per-class correct rates
        specificity_1 = (norm_class_m_correct + norm_class_b_correct) / 2
        # rule 2: a <= threshold -> class B, a > threshold -> class M
        class_b_correct = (b_values <= i).sum()
        class_m_correct = (m_values > i).sum()
        norm_class_m_correct = class_m_correct / total_class_m
        norm_class_b_correct = class_b_correct / total_class_b
        accuracy_2 = (class_m_correct + class_b_correct) / (total_class_m + total_class_b)
        specificity_2 = (norm_class_m_correct + norm_class_b_correct) / 2
        # keep the better of the two rules for this threshold
        specificity = specificity_1
        accuracy = accuracy_1
        if specificity < specificity_2:
            specificity = specificity_2
            accuracy = accuracy_2
        threshold_specificity_map[i] = {'specificity': specificity, 'accuracy': accuracy}
    # Get the optimal threshold
    max_specificity = -100
    max_accuracy = -100
    optimal_threshold = -100
    for threshold, item in threshold_specificity_map.items():
        if item['specificity'] > max_specificity:
            max_specificity = item['specificity']
            max_accuracy = item['accuracy']
            optimal_threshold = threshold
    column_specificity_map[column] = max_specificity
    results[column] = {
        'Optimal Threshold': optimal_threshold,
        'Accuracy': max_accuracy,
        'Error': 1 - max_accuracy
    }
# print in order of prediction ability (sort by value, best first)
dict(sorted(column_specificity_map.items(), key=lambda kv: kv[1], reverse=True))
pd.DataFrame(results).transpose().sort_values(by=['Accuracy'], ascending=False)
8. # Normalization and train/test split
from sklearn.model_selection import train_test_split
# Normalization to zero mean and unit variance
data.iloc[:, 2:] = data.iloc[:, 2:].apply(lambda x: (x - x.mean()) / x.std())
# 60% training data, 40% test data
train, test, train_labels, test_labels = train_test_split(
    data.iloc[:, 2:], data.iloc[:, 1], test_size=0.40, random_state=3)
9. # KNN
from sklearn.neighbors import KNeighborsClassifier
# 1NN classifier
model = KNeighborsClassifier(n_neighbors=1)
model.fit(train, train_labels)
nn1_predictions = model.predict(test)
accuracy_1 = model.score(test, test_labels)
print("1NN: Accuracy: ", accuracy_1)
# 3NN classifier (predictions are stored in nn2_predictions)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train, train_labels)
nn2_predictions = model.predict(test)
accuracy_2 = model.score(test, test_labels)
print("3NN: Accuracy: ", accuracy_2)
pd.DataFrame({'1NN Accuracy': accuracy_1, '3NN Accuracy': accuracy_2}, index=[0])
10. # Classification errors of 1NN and 3NN
print("Classification errors of 1NN:")
print("Prediction\tActual")
for prediction, actual in zip(nn1_predictions, test_labels):
    if prediction != actual:  # list only the misclassified test cases
        print(prediction, "\t\t", actual)
print("Classification errors of 3NN:")
for prediction, actual in zip(nn2_predictions, test_labels):
    if prediction != actual:
        print(prediction, "\t\t", actual)
11. # Fisher's Linear Discriminant
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
fisher_classifier = LinearDiscriminantAnalysis()
fisher_classifier.fit(train, train_labels)
fisher_predictions = fisher_classifier.predict(test)
print("Fisher's Accuracy: ", accuracy_score(test_labels, fisher_predictions))
# Confusion Matrix (sklearn expects the true labels first, then the predictions)
print("Confusion Matrix")
print(confusion_matrix(test_labels, fisher_predictions))
print("Classification errors of Fisher's:")
for prediction, actual in zip(fisher_predictions, test_labels):
    if prediction != actual:  # list only the misclassified test cases
        print(prediction, "\t\t", actual)
MA4022/MA7022 DATA MINING and NEURAL NETWORKS
Computational Task 3, 2021
Due date 17.04.2021, 23:59
For this task you need to download 4 time series from the Yahoo!Finance website. Every student should have their own unique set of time series!
Please collect available data for the three years 2018-2020.
Please note that for your analysis the time moments should be sorted from oldest to newest.
Use the daily closing price.
1. Data evaluation and elementary preprocessing. Analyse the completeness of the data. Are there missing data (besides weekends)? How many missing data points are in your time series? Are the dates of the missing values the same for all your time series? What may be the reasons for the missing values? How can you handle the missing values in your data (explain at least three approaches)? Use the simple rule: fill in a missing value with the closest existing past value. Plot the results. Normalise to the z-score (zero mean and unit standard deviation). Plot the results. (15 marks)
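The "closest past value" rule is a forward fill. A minimal pandas sketch follows, assuming the series is indexed by date and that "missing" means a business day with no quote; the prices and dates are invented for illustration and are not real market data.

```python
import pandas as pd

# Hypothetical closing prices with two absent business days
prices = pd.Series(
    [100.0, 101.0, 103.0, 102.0],
    index=pd.to_datetime(["2018-01-02", "2018-01-03", "2018-01-05", "2018-01-09"]),
)

# Reindex onto every business day, oldest to newest; absent days become NaN
calendar = pd.bdate_range(prices.index.min(), prices.index.max())
aligned = prices.sort_index().reindex(calendar)
n_missing = int(aligned.isna().sum())

# Fill each gap with the closest past value, then normalise to the z-score
filled = aligned.ffill()
z = (filled - filled.mean()) / filled.std()

print("missing business days:", n_missing)
print(filled)
```

Comparing `aligned.isna()` across the four series would show whether the missing dates coincide, which usually points to exchange holidays rather than data errors.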
3. Segmentation. Prepare the bottom-up piecewise linear segmentation for the transformed and normalised log-return time series. Use the following tolerance levels (thresholds) for the mean square error: 1%, 5%, 10%. Plot the results. Are the segments similar for the different time series you analysed? (25 marks)
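One common way to implement bottom-up segmentation is to start from the finest segments and repeatedly merge the adjacent pair whose joint linear fit is cheapest, stopping when the cheapest merge would exceed the tolerance. The sketch below follows that scheme; treating the tolerance as an absolute mean square error (for example 0.01 for "1%" on a z-scored series) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def fit_error(y, start, end):
    """Mean squared error of the least-squares line through y[start:end]."""
    if end - start <= 2:
        return 0.0  # a line passes exactly through one or two points
    x = np.arange(start, end)
    slope, intercept = np.polyfit(x, y[start:end], 1)
    return float(np.mean((y[start:end] - (slope * x + intercept)) ** 2))

def bottom_up(y, max_error):
    """Return a list of (start, end) index pairs, end exclusive."""
    y = np.asarray(y, dtype=float)
    # Finest segmentation: pairs of points (the last segment may hold three)
    bounds = list(range(0, len(y) - 1, 2)) + [len(y)]
    while len(bounds) > 2:
        # Cost of merging each pair of adjacent segments
        costs = [fit_error(y, bounds[i], bounds[i + 2]) for i in range(len(bounds) - 2)]
        i = int(np.argmin(costs))
        if costs[i] > max_error:
            break  # even the cheapest merge violates the tolerance
        del bounds[i + 1]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# Toy series: a ramp up followed by a ramp down (illustrative only)
y = np.concatenate([np.arange(10), 9 - np.arange(10)])
segments = bottom_up(y, max_error=1e-9)
print(segments)
```

Recomputing every merge cost on each pass is quadratic in the series length, which is acceptable for about 750 daily points; a faster version would update only the two costs next to the merged pair.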
4. Prediction. Choose one of the transformed and normalised time series as a target g(t) and the other 3 as supporting data s1(t), s2(t), s3(t), where t = 1, …, N. Provide scatter diagrams of (g(t), g(t+1)). Evaluate the error of the “next-day forecast”, ĝ(t + 1) = g(t).
Use data for 2018 as the training set and find the predictor of g(t + 1) (the next day value) as a linear function Ψ of g(t), s1(t), s2(t), s3(t):
ĝ(t + 1) = Ψ(g(t), s1(t), s2(t), s3(t)) (1)
(linear regression). Evaluate the training set error. Use data for 2019 as a test set and evaluate the test set error for this set. Also, use data for 2020 as a test set and evaluate the test set error for this set. Compare these errors. Compare these errors to the errors of the “next-day forecast”. Comment.
Provide plots of g(t), ĝ(t), and the residual. Present the (g(t), ĝ(t)) scatter diagram. (30 marks)
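The linear predictor and the naive next-day baseline can be compared with an ordinary least-squares fit. The sketch below runs on synthetic series rather than Yahoo!Finance data; including an intercept term and the names `design_matrix`, `supports`, and `mse` are assumptions made for the example.

```python
import numpy as np

def design_matrix(g, supports):
    """Rows (1, g(t), support_1(t), ...) for every t that has a next-day value."""
    rows = len(g) - 1
    return np.column_stack([np.ones(rows), g[:-1]] + [s[:-1] for s in supports])

def fit_linear_predictor(g, supports):
    """Least-squares coefficients for predicting g(t+1) from the values at t."""
    coef, *_ = np.linalg.lstsq(design_matrix(g, supports), g[1:], rcond=None)
    return coef

def mse(a, b):
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

# Synthetic stand-ins for one training year of normalised series
rng = np.random.default_rng(0)
n = 250
supports = [rng.normal(size=n) for _ in range(3)]
g = 0.1 * np.cumsum(rng.normal(size=n))

coef = fit_linear_predictor(g, supports)
pred = design_matrix(g, supports) @ coef
train_mse = mse(pred, g[1:])
naive_mse = mse(g[:-1], g[1:])  # "next-day forecast": tomorrow equals today
print("training MSE:", train_mse, " naive MSE:", naive_mse)
```

On the training set the regression can never do worse than the naive forecast, because the naive rule is the special case with coefficient 1 on g(t) and 0 elsewhere; the informative comparison is on the 2019 and 2020 test sets, using `coef` fitted on 2018 only.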
5. Adaptive predictors. For each given value of the “frame width”, Δ = 5, 10, 30, create and test the following adaptive predictor. For every T > Δ create the training set with Δ input vectors (g(t), s1(t), s2(t), s3(t)) (t = T − Δ, …, T − 1) and the corresponding outputs g(t + 1).
In more detail, the input vectors x_i and the output values y_i for a given T are
x_1 = (g(T − Δ), s1(T − Δ), s2(T − Δ), s3(T − Δ)), y_1 = g(T − Δ + 1)
………..
x_i = (g(T − Δ + i − 1), s1(T − Δ + i − 1), s2(T − Δ + i − 1), s3(T − Δ + i − 1)), y_i = g(T − Δ + i), where i = 1, 2, …, Δ.
Find the linear regression (1) for each T > Δ. Test this linear regression for the next time value, t = T + 1.
In more detail, for each T there is one test example with the input vector x_test and output value y_test:
x_test = (g(T), s1(T), s2(T), s3(T)), y_test = g(T + 1)
Please note that this example does not belong to the training set for this value of T.
Find the residuals at these test time moments. Plot these residuals and the values g(t), ĝ(t). Present the (g(t), ĝ(t)) scatter diagram (t = T + 1). Calculate the mean square error. Compare to the previous task. Comment. (30 marks)
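The sliding-window scheme can be sketched as a loop that refits the regression on the last Δ input/output pairs and predicts one step ahead. The example uses 0-based indices, an intercept term, and synthetic data in which the next target value is an exact linear function of the current inputs; all of these are assumptions for illustration, as is the name `adaptive_forecast`.

```python
import numpy as np

def adaptive_forecast(series, delta):
    """series has shape (n, d) with the target in column 0.

    Returns {t: prediction of series[t, 0]} for every t = T + 1 that can be tested.
    """
    n = len(series)
    preds = {}
    for T in range(delta, n - 1):
        X = series[T - delta:T]              # inputs at t = T-delta .. T-1
        y = series[T - delta + 1:T + 1, 0]   # outputs: target at t + 1
        A = np.column_stack([np.ones(delta), X])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        x_test = np.concatenate([[1.0], series[T]])  # input at time T, not in training
        preds[T + 1] = float(x_test @ coef)
    return preds

# Synthetic data: the next target value is exactly linear in the current inputs
rng = np.random.default_rng(1)
n = 60
s = rng.normal(size=(n, 3))
g = np.zeros(n)
for t in range(n - 1):
    g[t + 1] = 0.05 + 0.5 * g[t] + 0.2 * s[t, 0] - 0.1 * s[t, 1] + 0.3 * s[t, 2]
series = np.column_stack([g, s])

preds = adaptive_forecast(series, delta=10)
test_mse = float(np.mean([(p - series[t, 0]) ** 2 for t, p in preds.items()]))
print("adaptive test MSE:", test_mse)
```

With Δ = 5 and five coefficients the system is exactly determined, so each window's fit interpolates its training points and can be very noisy on real data; `lstsq` still returns a solution, but larger Δ is usually more stable.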
