This master's dissertation analyzes home electricity consumption with the K-means clustering algorithm from a silhouette-score perspective. The dissertation contains two papers. Paper 1 applies K-means clustering to a full home electricity usage dataset to obtain the optimal clusters, evaluated using the Calinski-Harabasz index, the Davies-Bouldin index, and the silhouette score. Paper 2 reduces the dataset to 1/8 of its size and finds that the silhouette score results are similar, showing the approach remains effective on smaller datasets. The dissertation applies machine learning clustering techniques to optimize home electricity usage and costs.
Master’s Dissertation
Comparative Analysis of Electricity
Consumption at Home through a
Silhouette-score perspective
Hyun Wong Choi
Department of Electrical and Computer Engineering
The Graduate School
Sungkyunkwan University
Comparative Analysis of Electricity
Consumption at Home through a
Silhouette-score perspective
Hyun Wong Choi
A Dissertation Submitted to the Department of
Electrical and Computer Engineering and
the Graduate School of Sungkyunkwan University
in partial fulfillment of the requirements
for the degree of Master of Science in Engineering
April 2019
Approved by
Professor Dr. Dong Ryeol Shin
This certifies that the dissertation of
Hyun Wong Choi is approved.

Committee Chair: Prof. Dr. Muhammad Mannan Saeed
Committee Member: Prof. Dr. Eung Mo Kim
Major Advisor: Prof. Dr. Dong Ryeol Shin
Co-Advisor: Prof. Dr. Nawab Muhammad Faseeh Querish

The Graduate School
Sungkyunkwan University
June 2019
Abstract
Big data analytics has simplified the complexity of large-scale dataset processing in a parallel distributed environment, and machine learning has emerged as a new tool for data analytics in such environments. In several respects, machine learning has improved both processing capacity and the effectiveness of analysis. In this dissertation, home electricity usage is analyzed through the K-means clustering algorithm to obtain the optimal grouping of home electricity usage data points. The Davies-Bouldin index and the silhouette score find the optimal number of clusters for the K-means algorithm, and an application scenario of machine-learning clustering analytics is presented.

Machine learning is a state-of-the-art sub-field of artificial intelligence that has evolved for large-scale intelligent analytics in distributed computing environments. In the second paper, we perform a comparative analysis on the home electricity usage dataset based on the K-means clustering algorithm, comparing silhouette scores between the full dataset and a 1/8 subset. The performance evaluation shows that the silhouette scores remain similar even when the dataset is smaller than before.

Keywords: Machine Learning, K-means clustering
Chapter 1
Introduction
Electricity consumption from the power grid

In the power grid, we measure consumption through sensors:
- Industrial consumption
- Housing consumption
- Factory consumption

Housing consumption has two sides:
- Front end (consumer end)
- Back end (electricity company end)

- Dataset for consumption: UC Irvine Machine Learning Repository

Many techniques solve the electricity optimization problem, but none of them focus on housing electricity optimization:
- Reducing the cost
- Factors of overcharge
- Prediction
are not available.

Solution: the K-means algorithm.

Why choose K-means clustering?
- It predicts the answer from the dataset.
- No ground-truth answer is available, which suits K-means.
Why predict the answers?
- No clear result exists beforehand.
In this paper, home electricity usage is analyzed through the K-means clustering algorithm to obtain the optimal grouping of home electricity usage data points.

Paper 1 (3A) analyzes the data through the K-means clustering algorithm to obtain the optimal home electricity data points. The Calinski-Harabasz index, Davies-Bouldin index, and silhouette score find the detailed optimal number of clusters in the K-means algorithm and present the application scenario of the machine learning algorithm.

Paper 2 (3B) reduces the dataset to 1/8 of its size and obtains the same result.

The proposed approach delivers efficient and meaningful prediction results not obtained before.
Machine learning is an analyzing mechanism that fetches and identifies matching patterns from existing datasets to form new results. This paper discusses comparative analytics of unsupervised learning algorithms, in which we compare the K-means clustering result on a reduced dataset against the silhouette score result on the full dataset. Our analysis found that the Davies-Bouldin index does not work smoothly in the scikit-learn library, so we performed a check analysis with the Calinski-Harabasz index and silhouette score alongside the Davies-Bouldin index and compared the results of each, learning that when we reduce the dataset to the mentioned proportion, the resulting dataset shows half the score of the traditional dataset score.
Chapter 2
Overview & Motivation

In real life, household power consumption allows diverse analytics: the management period of electricity transformers and transmission power can be estimated from it. Per-household electricity consumption data can be used for progressive taxation, region-to-region demand forecasting, and maintenance of power plants and facilities. A gas or car company can likewise estimate its consumption rate via the K-means algorithm and the clustering indices.

Motivated by Google AI, TensorFlow Conference 2017.
Chapter 3
Paper-1 Content
3.1. Introduction
Machine learning is a sub-field of artificial intelligence that is used to develop algorithms and techniques for enabling computers to learn [1]. It is used to train the computer for various tasks such as (i) distinguishing whether received e-mails are spam or not, (ii) data classification, (iii) association rule identification, and (iv) character recognition.

Machine learning includes a series of processes in which a computer (i) looks for similar patterns, (ii) generates a novel classification system, (iii) analyzes data, and (iv) produces meaningful results. It is a kind of artificial intelligence that can make predictions based on results when supported by analytics algorithms. Machine learning is a step-by-step evolution from big data analytics toward predicting future actions and making decisions on its own from past learned results. The key issues in building a successful prediction model remain increasing the probability and reducing the error, and these problems are resolved through numerous iterative learnings [2].
At the heart of machine learning are representation and generalization, where representation is an evaluation of the data and generalization is the processing of future data. Unsupervised learning is a type of machine learning used primarily to determine how data is organized. Unlike supervised learning or reinforcement learning, this method is not given target values for its inputs [3].

Unsupervised learning is closely related to density estimation in statistics. It can summarize and describe the main characteristics of the data; an example of unsupervised learning is clustering. In this paper, we use the K-means algorithm to measure the optimal number of clusters based on the Calinski-Harabasz index, silhouette score, and Davies-Bouldin index, and then apply it to household electricity consumption analysis.
Paper-1 Methodology
3.1.1.1. Sub-topics
3.4. Paper-1 Evaluation
3.4.1. Experimental Environment
Software: Anaconda3 + PyCharm 3
OS: Windows 10 Professional
RAM: 16.0 GB
Processor: i7-6600U CPU @ 2.60 GHz
Hard disk: 420 GB SSD
3.4.2. Experimental Dataset
3.2. Previous work
Machine Learning
Machine learning is like data mining, but it differs in predicting data based on learned attributes, mainly through training data. In addition to the three main techniques of unsupervised learning, supervised learning, and reinforcement learning, various other types of machine learning, such as semi-supervised learning and deep learning algorithms, have been developed and used.
Clustering

Clustering is a data-mining method that defines a cluster of data by considering the characteristics of the given data and finding a representative point for each data group. A cluster is a group of data with similar characteristics; if the characteristics of the data differ, they must belong to different clusters. Clustering is the main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including pattern recognition, information retrieval, machine learning, and computer graphics [3]. A good clustering aims at:

(1) Maximizing the inter-cluster variance
(2) Minimizing the intra-cluster variance

Note, however, that clustering should be distinguished from classification. Clustering is unsupervised learning without correct answers; in other words, we group similar objects without group information for each object. Classification, on the other hand, is supervised learning: when you carry out classification tasks, you learn to predict the dependent variable (Y) from the independent variable (X) of the data [4].
Cluster Validity Assessment
Since clustering tasks have no correct answers, they cannot be evaluated with simple indicators such as accuracy, as in a typical supervised machine learning algorithm. As the examples below show, it is not easy to find the optimal number of clusters without correct answers. Cluster analysis itself is not one specific algorithm but a general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances between cluster members and dense areas of the data space.
Scikit-learn
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry, it is said to have several attributes or features.

In supervised learning, the data comes with additional attributes that we want to predict. One such problem is classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning, where one has a limited number of categories and, for each of the n samples provided, one tries to label them with the correct category or class.

Scikit-learn is a machine learning platform distributed as a broad Python module; the package offers a high-level API and good documentation, so it can be used easily. It is released under the BSD license, so it can be used academically or commercially. Source code and documentation can be downloaded from its websites [10].

Many supervised and unsupervised learning problems are covered in scikit-learn: generalized linear models, linear and quadratic discriminant analysis, kernel ridge regression, support vector machines, and stochastic gradient descent models are all included.
3.3. Proposed Approach
The K-means algorithm is one of the partitioning clustering methods: partitioning distributes the data among many partitions. For example, given n data objects, the input data is divided into K (K ≤ n) groups, each group forming a cluster. The equation below is the cost function the K-means algorithm minimizes when forming the clusters [11]:

    argmin_S Σ_{i=1}^{k} Σ_{x∈S_i} ‖x − μ_i‖²

In other words, the data objects are divided into K groups by reducing this cost function, a measure of dissimilarity. In this way, the similarity within each group increases while the similarity between different groups decreases [12]. The K-means algorithm sums, for each group, the squared distances between the centroid and the group's data objects; based on this function's result, the assignment of data objects to groups is updated and the clustering progresses [5].
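The cost function above can be exercised directly with scikit-learn's KMeans; a minimal sketch, assuming synthetic 2-D blobs in place of the thesis dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated synthetic blobs standing in for electricity data points
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# inertia_ is the within-cluster sum of squares the cost function minimizes:
# sum over clusters i of sum over x in S_i of ||x - mu_i||^2
print(km.inertia_)
print(km.cluster_centers_)
```

Each call to `fit` alternates between assigning points to the nearest centroid and recomputing centroids, which is exactly the update-and-repeat loop described above.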
How well a clustering performs can be measured internally with the Calinski-Harabasz index, the Davies-Bouldin index, the Dunn index, and the silhouette score. In this paper, we evaluate via the Calinski-Harabasz index and the silhouette score.

For a clustering into k clusters, the Calinski-Harabasz index s(k) is the ratio of the between-cluster dispersion to the within-cluster dispersion:

    s(k) = [Tr(B_k) / Tr(W_k)] × [(N − k) / (k − 1)]

Here B_k is the between-group dispersion matrix and W_k the within-cluster dispersion matrix, defined as:

    W_k = Σ_{q=1}^{k} Σ_{x∈C_q} (x − c_q)(x − c_q)^T

    B_k = Σ_q n_q (c_q − c)(c_q − c)^T

where N is the number of data points, C_q is the set of points in cluster q, c_q is the centroid of cluster q, c is the centroid of all the data, and n_q is the number of points in cluster q.
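As a sanity check of the definition above, the index can be computed both manually and with `sklearn.metrics.calinski_harabasz_score`; a sketch on synthetic data (not the thesis dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
# two synthetic blobs standing in for clustered electricity readings
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

score = calinski_harabasz_score(X, labels)

# manual computation of s(k) = [Tr(B_k)/Tr(W_k)] * [(N-k)/(k-1)]
N, k = len(X), 2
c = X.mean(axis=0)                                # centroid of all the data
tr_W = sum(((X[labels == q] - X[labels == q].mean(axis=0)) ** 2).sum()
           for q in range(k))                     # Tr(W_k): within-cluster dispersion
tr_B = sum((labels == q).sum() * ((X[labels == q].mean(axis=0) - c) ** 2).sum()
           for q in range(k))                     # Tr(B_k): between-cluster dispersion
manual = (tr_B / tr_W) * (N - k) / (k - 1)
```

The two values agree, since the trace of each scatter matrix is just the corresponding sum of squared deviations.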
20. - 18 -
The silhouette score is a simple way to evaluate a clustering. For each data point i, let a(i) be the mean distance from i to the other points in its own cluster, and b(i) the mean distance from i to the points of the nearest cluster that i does not belong to. The silhouette score s(i) is then calculated as:

    s(i) = [b(i) − a(i)] / max{a(i), b(i)}

From this definition, s(i) satisfies

    −1 ≤ s(i) ≤ 1

A value of s(i) close to 1 means that point i is assigned to the correct cluster, while a value close to −1 means it has been assigned to the wrong cluster. In this paper, we use the machine learning library scikit-learn for the household power consumption clustering [7].
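With `sklearn.metrics.silhouette_score`, the per-point values s(i) are averaged into a single score per clustering, which can then be compared across candidate values of K; a sketch on synthetic data (the three-blob layout is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# three well-separated synthetic blobs
X = np.vstack([rng.normal(i * 4, 0.4, (30, 2)) for i in range(3)])

scores = {}
for k in range(2, 6):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean of s(i) over all points

best_k = max(scores, key=scores.get)  # K with the highest average silhouette
```

On this toy layout the score peaks at the true number of blobs, mirroring how the thesis selects the optimal K for the electricity data.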
The household power consumption dataset is downloaded from the University of California Irvine Machine Learning Data Repository [8]. The dataset is delimiter-separated into fields including Global_active_power, Global_reactive_power, Voltage, and Global_intensity. Global_active_power and Global_reactive_power are used as the X and Y axes of the experiments. The Python distribution used is Anaconda3. The key point of the K-means algorithm is to keep the data in K clusters while reducing the distance within each cluster; the K-means algorithm assigns labels to the input data. Figure 1 shows the result of executing the K-means algorithm before checking the Calinski-Harabasz index and silhouette score. Figures 1 to 11 show the K-means clustering results for the household power consumption data from the UCI repository, with the dataset reduced to 1/8 of the original size.
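Assuming the usual layout of the UCI household power consumption file (semicolon-delimited, with '?' marking missing readings; the inline sample rows below are illustrative, not real measurements), loading the two plotted columns might look like:

```python
import io
import pandas as pd

# A few rows in the UCI household-power-consumption layout; in the real
# experiment this text would come from the downloaded file instead.
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;233.290;23.000;0.000;1.000;16.000\n"
)

# the file is ';'-delimited and uses '?' for missing readings
df = pd.read_csv(sample, sep=";", na_values="?")

# keep the two columns used as the X and Y axes of the clustering plots
xy = df[["Global_active_power", "Global_reactive_power"]].dropna()
```

The resulting `xy` array is what would be fed to `KMeans.fit` in the experiments.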
3.4.3. Experimental Results
Figure 1. Clustering result at K=1    Figure 2. Clustering result at K=2
Figure 3. Clustering result at K=3    Figure 4. Clustering result at K=4
Figure 5. Clustering result at K=5    Figure 6. Clustering result at K=6
Figure 7. Clustering result at K=7    Figure 8. Clustering result at K=8
Figure 9. Clustering result at K=9    Figure 10. Clustering result at K=10
After reducing the distance within each cluster, the Calinski-Harabasz index is calculated for each number of clusters. As the number of clusters increases, the Calinski-Harabasz index decreases; when the index for a given K is too low, we estimate whether the cluster partition should be split once more. The electricity consumption rate is the most important factor here.
Figure 11. Silhouette score according to change of cluster number.

As with the Calinski-Harabasz index estimation, we calculate the silhouette score. As the number of clusters increases, the silhouette score decreases as K grows, and the optimal K is indicated before the score becomes low.

Determining the proper number of clusters is very important for the K-means algorithm. Estimating the silhouette score from the data, the result is that K=7 is optimal: the silhouette score between each cluster centroid and the data points is 0.799, and the corresponding Calinski-Harabasz index result is 560.3999. This silhouette analysis for the K-means algorithm is shown in Figure 11.

With K=7 clusters, the distance between each group's centroid and its data is optimal. From this result, each centroid divides the household power consumption rate via clustering.
Figure 12. Clustering result at K=7
Davies-Bouldin index

If the ground-truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used. For clusters i and j, with s_i the average distance of the points in cluster i to its centroid and d_ij the distance between the centroids of clusters i and j:

    R_ij = (s_i + s_j) / d_ij
Then the Davies-Bouldin index is defined as

    DB = (1/k) Σ_{i=1}^{k} max_{i≠j} R_ij

Zero is the lowest possible score, and values closer to zero indicate a better partition. The problem is that this metric was not attached to the scikit-learn library at the time; it is only explained in the documentation pages, so it could not be experimented with easily.
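For reference, scikit-learn versions 0.20 and later do ship this metric as `sklearn.metrics.davies_bouldin_score`; a sketch on synthetic data (not the thesis dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
# two well-separated synthetic blobs
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DB averages, over clusters, the worst-case ratio R_ij = (s_i + s_j) / d_ij;
# lower values (closer to zero) indicate a better partition
db = davies_bouldin_score(X, labels)
print(db)
```

On well-separated blobs the score stays well below 1, consistent with the "closer to zero is better" reading above.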
3.5. Related work

Machine learning is a sub-field of artificial intelligence that is used to develop algorithms and techniques for enabling computers to learn [1]. It is used to train the computer for various tasks such as (i) distinguishing whether received e-mails are spam or not, (ii) data classification, (iii) association rule identification, and (iv) character recognition.

Machine learning includes a series of processes in which a computer (i) looks for similar patterns, (ii) generates a novel classification system, (iii) analyzes data, and (iv) produces meaningful results. It is a kind of artificial intelligence that can make predictions based on results when supported by analytics algorithms. Machine learning is a step-by-step evolution from big data analytics toward predicting future actions and making decisions on its own from past learned results. The key issues in building a successful prediction model remain increasing the probability and reducing the error, and these problems are resolved through numerous iterative learnings [2].

At the heart of machine learning are representation and generalization, where representation is an evaluation of the data and generalization is the processing of future data. Unsupervised learning is a type of machine learning used primarily to determine how data is organized. Unlike supervised learning or reinforcement learning, this method is not given target values for its inputs [3].

Unsupervised learning is closely related to density estimation in statistics. It can summarize and describe the main characteristics of the data; an example of unsupervised learning is clustering. In this paper, we use the K-means algorithm to measure the optimal number of clusters based on the Calinski-Harabasz index, silhouette score, and Davies-Bouldin index, and then apply it to household electricity consumption analysis.
3.6. Summary

In this paper, household power consumption was clustered via K-means. The library used is scikit-learn with Anaconda3; being open source under the BSD license, individuals can easily follow the work and use it in real projects without difficulty. Not only the K-means algorithm but also PCA, SVM, and other machine learning algorithms can perform such clustering. From this result, diverse analytics of real-life household power consumption become possible: the management period of electricity transformers and transmission power can be estimated, and per-household electricity consumption data can be used for progressive taxation, region-to-region demand forecasting, and maintenance of power plants and facilities. A gas company can likewise estimate the gas consumption rate via K-means clustering and its indices.
Chapter 4
Paper-2 Comparative Analysis of Electricity Consumption at Home
through a Silhouette-score perspective
4.1 Introduction

Machine learning is an analyzing mechanism that fetches and identifies matching patterns from existing datasets to form new results. This paper discusses comparative analytics of unsupervised learning algorithms, in which we compare the K-means clustering result on a reduced dataset against the silhouette score results on the full dataset. Our analysis found that the Davies-Bouldin index does not work smoothly in scikit-learn, so we performed a check analysis with the Calinski-Harabasz index and silhouette score alongside the Davies-Bouldin index and compared the results of each, learning that when we reduce the dataset to the mentioned proportion, the resulting dataset shows half the score of the traditional dataset score.
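Paper 2's full-versus-reduced comparison can be sketched as follows; the synthetic blobs and the uniform 1/8 subsampling are assumptions for illustration, not the thesis pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
# synthetic stand-in for the household electricity readings
X = np.vstack([rng.normal(i * 5, 0.5, (400, 2)) for i in range(3)])

def kmeans_silhouette(data, k=3):
    """Cluster with K-means and return the average silhouette score."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

full = kmeans_silhouette(X)
# uniform random subsample at the 1/8 ratio studied in Paper 2
eighth = kmeans_silhouette(X[rng.permutation(len(X))[: len(X) // 8]])
```

On well-clustered data the two scores stay close, which is the kind of stability under a 1/8 reduction that the silhouette comparison in this chapter examines.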
4.2 Related work

Machine learning is a field of artificial intelligence that is used to develop algorithms and techniques that enable computers to learn [1]. It is used to train the computer to distinguish whether received e-mails are spam or not, and there are various applications such as data classification, association rule identification, and character recognition, which comply with the standard machine learning perspectives.
It includes a series of processes in which a computer finds its own patterns, creates a new classification system, analyzes the data, and produces meaningful results. Successful prediction comes with increasing the probability and decreasing the error, and machine learning sorts out these issues with various iterative learnings [2]. Among the learning methods, supervised learning is highly related to reinforcement mechanisms [3].

Clustering is a process of mining a dataset by defining a cluster of data that considers the characteristics of the input and finds a representative method to point out each data group. In this way, a cluster is a group of relevant data elements with similar characteristics; if the characteristics are not the same, the elements belong to contrasting clusters [3]. Clustering is unsupervised learning without correct answers: objects carrying similar information are grouped together. Classification, however, is a form of supervised learning. When you perform classification operations, the system learns to predict the dependent variable (Y) from the independent variable (X) of the data [4].
Scikit-learn is a machine learning platform implemented as a Python
module; this high-level package offers easy-to-use, well-documented
functionality and a clean API. It is distributed under the BSD license, so it
can be used both academically and commercially; source code and
documentation can be downloaded from its website [10]. Many supervised and
unsupervised learning problems are covered by scikit-learn, including
generalized linear models, linear and quadratic discriminant analysis, kernel
ridge regression, support vector machines, and stochastic gradient descent.
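As an illustration of the uniform estimator API described above, the following minimal sketch (assuming NumPy and scikit-learn are installed; the toy data values are invented for illustration) constructs a K-means estimator, fits it, and reads back cluster labels and centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two well-separated groups (illustrative values only).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# The scikit-learn pattern is uniform across estimators:
# construct with hyperparameters, then fit / predict.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(labels)                  # one cluster id per row
print(model.cluster_centers_)  # one centroid per cluster
```

The same construct-fit-predict pattern applies to the other estimators mentioned above, such as support vector machines or ridge regression.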
3.5. Paper-2 Methodology
The K-means algorithm is one of the partitioning clustering methods,
where partitioning means distributing the data among a number of groups.
Given n data objects, the input data are divided into K (≤ n) groups, each
constituting a cluster. K-means forms the clusters by minimizing the
following cost function [11]:
arg min_S ∑_{i=1}^{k} ∑_{x ∈ S_i} ‖x − μ_i‖²
In other words, the data objects are divided into K groups so that the
dissimilarity within each group, as measured by this cost function, is
reduced. Under this criterion, the similarity of objects within each group
increases while the similarity between different groups decreases [12]. At
each step, the K-means algorithm sums, for every group, the squared distances
between the centroid and the group's data objects; based on this result the
group assignments are updated and the clustering proceeds [5].
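The cost function above can be sketched directly: for a fixed partition, it is the sum of squared distances from each point to its own cluster mean. The following stdlib-only sketch (with invented numbers) computes it:

```python
def centroid(points):
    """Mean of a list of equal-length tuples: mu_i for cluster S_i."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans_cost(clusters):
    """Sum over clusters S_i of sum over x in S_i of ||x - mu_i||^2."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        for x in points:
            total += sum((xd - md) ** 2 for xd, md in zip(x, mu))
    return total

# Two illustrative clusters in the plane.
clusters = [[(0.0, 0.0), (2.0, 0.0)],      # centroid (1, 0)
            [(10.0, 10.0), (10.0, 12.0)]]  # centroid (10, 11)
print(kmeans_cost(clusters))  # 1 + 1 + 1 + 1 = 4.0
```

K-means alternates between recomputing centroids and reassigning points so that this total keeps decreasing.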
The silhouette score is a simple way to assess, for each data point i,
how well it fits its cluster. Define a(i) as the mean distance from point i
to the other points in its own cluster, and b(i) as the smallest mean distance
from point i to the points of any cluster it does not belong to. The
silhouette score s(i) is then calculated as
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
From this definition it follows that −1 ≤ s(i) ≤ 1.
A value of s(i) close to 1 means that point i is assigned to the correct
cluster, while a value close to −1 means that it could not be assigned to a
suitable cluster. In this paper, the machine learning library scikit-learn is
used to cluster household power consumption [7]. The household power
consumption dataset is downloaded from the University of California Irvine
(UCI) Machine Learning Repository [8]; it is a delimiter-separated file whose
fields include Global_active_power, Global_reactive_power, Voltage, and
Global_intensity. Global_active_power and Global_reactive_power are used as
the X and Y axes of the experiment.
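The definitions of a(i), b(i), and s(i) above can be sketched with plain Python on a toy labelled dataset (the point values are invented for illustration; a production run would use scikit-learn's silhouette routines instead):

```python
from math import dist  # Euclidean distance, Python >= 3.8

def silhouette(points, labels, i):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for point index i."""
    own = [p for p, l in zip(points, labels) if l == labels[i]]
    # a(i): mean distance from point i to the OTHER members of its cluster
    # (assumes the cluster has at least two members and no duplicate points).
    a = sum(dist(points[i], p) for p in own if p != points[i]) / (len(own) - 1)
    # b(i): smallest mean distance from point i to the members of any
    # cluster it does not belong to.
    b = min(
        sum(dist(points[i], p) for p in grp) / len(grp)
        for grp in ([p for p, l in zip(points, labels) if l == c]
                    for c in set(labels) - {labels[i]})
    )
    return (b - a) / max(a, b)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(silhouette(points, labels, 0))  # close to 1: point 0 is well clustered
```

Averaging s(i) over all points gives the overall silhouette score used in the experiments below.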
The Python distribution used is Anaconda3. The key point of the K-means
algorithm is to keep the data in K clusters while reducing the within-cluster
distances; the algorithm attaches a cluster label to each input record.
Figure 1 shows the result of executing the K-means algorithm before checking
the Calinski-Harabasz index and the silhouette score. Figures 1 to 11 show
the K-means clustering results for the household power consumption data from
the UCI Machine Learning Repository, with the dataset reduced to 1/8 of its
original size.
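The UCI file is ';'-delimited with one record per minute. The following sketch shows how the two columns used as the X and Y axes can be extracted; the inline excerpt stands in for the downloaded file, and its values are illustrative rather than quoted from the real data:

```python
import csv
import io

# Inline excerpt in the dataset's ';'-delimited format; the rows are
# illustrative stand-ins for the downloaded UCI file.
sample = """\
Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
"""

rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))

# Global_active_power vs Global_reactive_power form the X/Y axes
# of the clustering experiment.
xy = [(float(r["Global_active_power"]), float(r["Global_reactive_power"]))
      for r in rows]
print(xy)
```

For the real experiment, `io.StringIO(sample)` would be replaced by the downloaded file opened with the same delimiter.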
1.1.1. Experimental Environment
Software: Anaconda3 + PyCharm3
OS: Windows 10 Professional
RAM: 16.0 GB
Processor: i7-6600U CPU @ 2.60 GHz
Hard disk: 420 GB SSD
1.1.2. Experimental Dataset
1. date: date in format dd/mm/yyyy
2. time: time in format hh:mm:ss
3. global_active_power: household global minute-averaged active power (in kilowatts)
4. global_reactive_power: household global minute-averaged reactive power (in kilowatts)
5. voltage: minute-averaged voltage (in volts)
6. global_intensity: household global minute-averaged current intensity (in amperes)
7. sub_metering_1: energy sub-metering No. 1 (in watt-hours of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
8. sub_metering_2: energy sub-metering No. 2 (in watt-hours of active energy). It corresponds to the laundry room, containing a washing machine, a tumble drier, a refrigerator and a light.
9. sub_metering_3: energy sub-metering No. 3 (in watt-hours of active energy). It corresponds to an electric water heater and an air conditioner.
1.1.3. Experimental Results
Figure 12. Silhouette score according to change of cluster number.
Figure 13. 1/8 dataset Silhouette score according to change of cluster number.
Calculating the proper number of clusters is very important in K-means.
Estimating the silhouette score on the data shows that K = 7 is optimal: the
silhouette score computed over each cluster centroid and its data points is
0.799. Even though the 1/8 dataset is much smaller, K = 7 is again optimal
there, with a silhouette score of 0.810. Therefore, for both the full dataset
and the 1/8 dataset, seven clusters yield the optimal distances between each
group's centroid and its members. Although the dataset is reduced, the class
vector space of the K-means clustering keeps the same optimal cluster count
as with the original household power consumption dataset.
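The selection of K described above can be sketched with scikit-learn (assuming it is installed): run K-means for a range of candidate K values and keep the K with the highest silhouette score. Synthetic, well-separated blobs stand in here for the power-consumption data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the power data: 3 clearly separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Fit K-means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest silhouette score is taken as optimal.
best_k = max(scores, key=scores.get)
print(best_k)
```

Applied to the household data, the same loop would peak at K = 7 for both the full and the 1/8 dataset, per the results above.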
Summary
In this paper, household power consumption was clustered via K-means.
The libraries used, scikit-learn and Anaconda3, are open source, so an
individual can easily reproduce the work, and the BSD license raises no
difficulties for practical use. The results show that even when the dataset
is reduced to 1/8 of its size, the silhouette score and the entire clustering
result are the same as before. As the population grows, the classification
and the vector space can show an even clearer result. Going from the large
dataset to the small one clearly preserves the silhouette score result, but
the opposite direction is not guaranteed, because the dataset is a
four-dimensional vector space. The experiment shows that the estimated
analysis time can be reduced when a huge dataset is received.
Chapter 5
Conclusion
This dissertation approaches diverse aspects of K-means clustering
applications. The first attempt was to reduce the K-means algorithm's time
consumption; the approach then shifted toward how to reduce the time needed
for a large dataset. These days, machine learning algorithms can also
estimate, for example, when a part should be replaced (its life span).
Throughout all of the experiments, the libraries used were scikit-learn and
Anaconda3; being open source under the BSD license, they can easily be
deployed in any environment. The first experiment analyzed diverse cluster
validity indexes. The second experiment showed that when a dataset is huge,
determining how many centroids make a proper K-means clustering takes
considerable time; this time can be reduced by comparing against the 1/8
dataset, at the cost of a limited classification and vector space. The
experiment thus reduces the estimated analysis time when a huge dataset is
received.
Acknowledgement
During my master's course I participated in a total of 114 conferences
and gave 7 presentations. IEEE Globecom 2017 was the most impressive among
them, and the experiments in this thesis were motivated by the Google AI
TensorFlow Conference 2017. I express my gratitude to my advisor, Professor
Dong-Ryul Shin, president of Sungkyunkwan University, and to my co-advisor,
Nawab Muhammad Fasheeh Queshi, who always guided me with a smile throughout
our work together, despite my shortcomings.
I also thank Professor Hee Yong Youn, SKKU Fellow, who helped me first
join the Mobile Computing Lab at Sungkyunkwan University; Dr. Choon-Sung Nam,
who helped me settle into the open lab without any inconvenience; Dr. Ki-Hyun
Choi, who shared it with me; and Muhammad Hamza, Janaid, Woo-Hyun Kim and
Chung So.
I am grateful to my mother, Professor Bong-Soon Lee of Dongnam Health
University, who supported me to the end throughout my degree, and to my
father, Han-Chung Choi, senior manager and founding member of the LG
Electronics Pyeongtaek Campus (now a director at Onnuri ENG).
I also thank my older brother, Hyun-Seok Choi, manager at POSCO E&C,
who often saw me home during my degree; nurse Ye-Ul Ahn of Seoul National
University Bundang Hospital; and my cute nephew Yeon-Woo.
Finally, I thank my cousins, who served as milestones during my degree.
2019년 06월 19일
Acknowledgment
I participated in a total of 114 conferences and gave 7 presentations
during my graduate school life. IEEE Globecom 2017 was the most impressive
among them, and the experiments in this thesis were motivated by the Google
AI TensorFlow Conference 2017. I would like to express my gratitude to my
advisor, Professor Dong-Ryul Shin, president of Sungkyunkwan University, and
to my co-advisor, Assistant Professor Nawab Muhammad Fasheeh Queshi, for our
work together.
Thank you to SKKU Fellow Professor Hee Yong Youn, director of the Mobile
Computing Lab, for helping me join Sungkyunkwan University in the first
place. I am also thankful to Dr. Min Ki Hyun, Muhammad Hamza, Janaid, and
Woo-Hyun Kim.
I would like to extend my sincere thanks to my mother, Bong-Soon Lee,
Professor at Dongnam Health University, who supported me for the duration of
my degree, and to my father, Han-Chung Choi, a founding member of the LG
Electronics Pyeongtaek Campus.
I am also grateful to my older brother, Hyun-Seok Choi, who often drove
me home during my degree, to nurse Ye-Ul Ahn of Seoul National University
Bundang Hospital, and to my cute nephew Yeon-Woo.
I give my thanks to my cousins, who gave me a milestone in my degree.
June 19, 2019
References
[1] https://en.wikipedia.org/wiki/K-means_clustering
[2] https://en.wikipedia.org/wiki/Cluster_analysis
[3] https://en.wikipedia.org/wiki/Silhouette_(clustering)
[4] https://github.com/sarguido.
[5] http://archive.ics.uci.edu/ml/datasets.html.
[6] http://scikit-learn.org/stable/modules/clustering.html#calinski-harabaz-index
[7] http://scikit-learn.org/stable/.
[8] Caliński, T., and J. Harabasz. "A dendrite method for cluster analysis."
Communications in Statistics 3.1 (1974): 1-27.
[9] Kanungo, Tapas et al. “An Efficient k-Means Clustering Algorithm: Analysis and
Implementation.” IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002): 881-892.
[10]Arthur, David, and Sergei Vassilvitskii. "k-means++: The advantages of careful
seeding." Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete
algorithms. Society for Industrial and Applied Mathematics (2007): 1027-1035.
[11]Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001, June). Constrained k-
means clustering with background knowledge. In ICML (Vol. 1, pp. 577-584).
[12]Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering
algorithm. Journal of the Royal Statistical Society. Series C (Applied
Statistics), 28(1), 100-108.
[13]Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu,
A. Y. (2002). An efficient k-means clustering algorithm: Analysis and
implementation. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7),
881-892.
[14]Alsabti, K., Ranka, S., & Singh, V. (1997). An efficient k-means clustering algorithm.
[15]Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering
algorithm. Pattern recognition, 36(2), 451-461.
[16]Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &
Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine
learning research, 12(Oct), 2825-2830.
[17]Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., ... &
Layton, R. (2013). API design for machine learning software: experiences from the
scikit-learn project. arXiv preprint arXiv:1309.0238.
[18]Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., ...
& Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-
learn. Frontiers in neuroinformatics, 8, 14.
[19]Fabian, P., Gaël, V., Alexandre, G., Vincent, M., Bertrand, T., Olivier, G., ... &
Alexandre, P. (2011). Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12, 2825-2830.
[20]Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support
vector machines. IEEE Intelligent Systems and their applications, 13(4), 18-28.
[21]Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of
machine learning research 12.Oct (2011): 2825-2830.
[22]Alsabti, Khaled, Sanjay Ranka, and Vineet Singh. "An efficient k-means clustering
algorithm." (1997).
[23]Ding, Chris, and Xiaofeng He. "K-means clustering via principal component
analysis." Proceedings of the twenty-first international conference on Machine
learning. ACM, 2004.
[24]Paneque-Gálvez, Jaime, et al. "Small drones for community-based forest monitoring:
An assessment of their feasibility and potential in tropical areas." Forests 5.6 (2014):
1481-1507.
[26]Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.
[27]Rasmussen, Carl Edward. "Gaussian processes in machine learning." Summer
School on Machine Learning. Springer, Berlin, Heidelberg, 2003.
[28]Hartigan, John A., and Manchek A. Wong. "Algorithm AS 136: A k-means clustering
algorithm." Journal of the Royal Statistical Society. Series C (Applied Statistics) 28.1
(1979): 100-108.
[30]Sass, Ron, et al. "Reconfigurable computing cluster (RCC) project: Investigating the
feasibility of FPGA-based petascale computing." 15th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines (FCCM 2007). IEEE, 2007.
[31] Duda, Richard O., Peter E. Hart, and David G. Stork. Pattern classification. John
Wiley & Sons, 2012.
[32]Cover, Thomas M., and Peter E. Hart. "Nearest neighbor pattern
classification." IEEE transactions on information theory13.1 (1967): 21-27.
[33]Breiman, Leo. Classification and regression trees. Routledge, 2017.
[34]Haralick, Robert M., and Karthikeyan Shanmugam. "Textural features for image
classification." IEEE Transactions on systems, man, and cybernetics 6 (1973): 610-
621.
[35]Chapelle, Olivier, Bernhard Scholkopf, and Alexander Zien. "Semi-supervised
learning (chapelle, o. et al., eds.; 2006)[book reviews]." IEEE Transactions on
Neural Networks 20.3 (2009): 542-542.
[36]Zhu, Xiaojin, Zoubin Ghahramani, and John D. Lafferty. "Semi-supervised learning
using gaussian fields and harmonic functions." Proceedings of the 20th International
conference on Machine learning (ICML-03). 2003.
[37]Caruana, Rich, and Alexandru Niculescu-Mizil. "An empirical comparison of
supervised learning algorithms." Proceedings of the 23rd international conference
on Machine learning. ACM, 2006.
[38]Jain, Anil K. "Data clustering: 50 years beyond K-means." Pattern recognition
letters 31.8 (2010): 651-666.
[39]Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation
learning with deep convolutional generative adversarial networks." arXiv preprint
arXiv:1511.06434 (2015).
[40]Figueiredo, Mario A. T., and Anil K. Jain. "Unsupervised learning of finite mixture
models." IEEE Transactions on Pattern Analysis & Machine Intelligence 3 (2002):
381-396.
[41]Lovmar, Lovisa, et al. "Silhouette scores for assessment of SNP genotype clusters."
BMC genomics 6.1 (2005): 35.
[42]Collins, Robert T., Ralph Gross, and Jianbo Shi. "Silhouette-based human
identification from body shape and gait." Proceedings of fifth IEEE international
conference on automatic face gesture recognition. IEEE, 2002.
[43]Gat-Viks, Irit, Roded Sharan, and Ron Shamir. "Scoring clustering solutions by their
biological relevance." Bioinformatics 19.18 (2003): 2381-2389.
[44]Maulik, Ujjwal, and Sanghamitra Bandyopadhyay. "Performance evaluation of some
clustering algorithms and validity indices." IEEE Transactions on pattern analysis
and machine intelligence 24.12 (2002): 1650-1654.
[45]Łukasik, Szymon, et al. "Clustering using flower pollination algorithm and calinski-
harabasz index." 2016 IEEE Congress on Evolutionary Computation (CEC). IEEE,
2016.
[46]Desgraupes, Bernard. "Clustering indices." University of Paris Ouest-Lab Modal’X
1 (2013): 34.
[47]Petrovic, Slobodan. "A comparison between the silhouette index and the davies-
bouldin index in labelling ids clusters." Proceedings of the 11th Nordic Workshop of
Secure IT Systems. sn, 2006.
[50] https://scikit-learn.org/stable/
[51] https://www.anaconda.com/
[52] https://www.jetbrains.com/pycharm/
[54] Bandyopadhyay, Sanghamitra, and Ujjwal Maulik. "Nonparametric genetic
clustering: comparison of validity indices." IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews) 31.1 (2001): 120-125.
[55] https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption