This master's dissertation analyzes home electricity consumption with the K-means clustering algorithm from a silhouette-score perspective. The dissertation contains two papers. Paper 1 applies K-means clustering to a full home electricity usage dataset to obtain the optimal clusters, evaluated using the Calinski-Harabasz index, the Davies-Bouldin index, and the silhouette score. Paper 2 reduces the dataset to 1/8 of its size and finds that the silhouette score results are similar, showing the approach remains effective on smaller datasets. The dissertation applies machine learning clustering techniques to optimize home electricity usage and costs.
Master’s Dissertation
Comparative Analysis of Electricity
Consumption at Home through a
Silhouette-score perspective
Hyun Wong Choi
Department of Electrical and Computer Engineering
The Graduate School
Sungkyunkwan University
Comparative Analysis of Electricity
Consumption at Home through a
Silhouette-score perspective
Hyun Wong Choi
A Dissertation Submitted to the Department of
Electrical and Computer Engineering and
the Graduate School of Sungkyunkwan University
in partial fulfillment of the requirements
for the degree of Master of Science in Engineering
April 2019
Approved by
Professor Dr. Dong Ryeol Shin
This certifies that the dissertation of
Hyun Wong Choi is approved.

Committee Chair: Prof. Dr. Muhammad Mannan Saeed
Committee Member: Prof. Dr. Eung Mo Kim
Major Advisor: Prof. Dr. Dong Ryeol Shin
Co-Advisor: Prof. Dr. Nawab Muhammad Faseeh Querish

The Graduate School
Sungkyunkwan University
June 2019
Abstract
Big data analytics has simplified the complexity of large-scale dataset processing in a parallel distributed environment, and machine learning has emerged as a new tool for data analytics in such environments. In several respects, machine learning has improved both processing capacity and the effectiveness of analysis. In this dissertation, home electricity usage is analyzed through the K-means clustering algorithm to obtain the optimal grouping of home electricity usage data points. The Davies-Bouldin index and the silhouette score find the optimal number of clusters for the K-means algorithm, and an application scenario of machine-learning clustering analytics is presented.

Machine learning is a state-of-the-art sub-field of artificial intelligence that has evolved for large-scale intelligent analytics in distributed computing environments. In the second paper, we perform a comparative analysis on the home electricity usage dataset based on the K-means clustering algorithm, comparing silhouette scores between the full dataset and a 1/8 subset. The performance evaluation shows that the silhouette scores remain similar even when the dataset is smaller than before.

Keywords: Machine Learning, K-means clustering
Chapter 1
Introduction
Electricity consumption from the power grid

In the power grid, we measure consumption through sensors:
- Industrial consumption
- Housing consumption
- Factory consumption

Housing consumption has two sides:
- Front end (consumer end)
- Back end (electricity company end)

- Dataset for consumption: UC Irvine Machine Learning Repository

Many techniques solve the electricity optimization problem, but none of them focus on housing electricity optimization:
- Reducing the cost
- Factors of overcharge
- Prediction
are not available.

Solution: the K-means algorithm.

Why choose K-means clustering?
- It predicts the answer from the dataset.
- No ground-truth answer is available, which suits K-means.
Why predict the answers?
- No clear result exists beforehand.
In this paper, home electricity usage is analyzed through the K-means clustering algorithm to obtain the optimal grouping of home electricity usage data points.

Paper 1 (3A) analyzes the data through the K-means clustering algorithm to obtain the optimal home electricity data points. The Calinski-Harabasz index, Davies-Bouldin index, and silhouette score find the detailed optimal number of clusters in the K-means algorithm and present the application scenario of the machine learning algorithm.

Paper 2 (3B) reduces the dataset to 1/8 of its size and obtains the same result.

The proposed approach delivers efficient and meaningful prediction results not obtained before.
Machine learning is an analyzing mechanism that fetches and identifies matching patterns from existing datasets to form new results. This paper discusses comparative analytics of unsupervised learning algorithms, in which we compare the K-means clustering result on a reduced dataset against the silhouette score result on the full dataset. Our analysis found that the Davies-Bouldin index does not work smoothly in the scikit-learn library, so we performed a check analysis with the Calinski-Harabasz index and silhouette score alongside the Davies-Bouldin index and compared the results of each, learning that when we reduce the dataset to the mentioned proportion, the resulting dataset shows half the score of the traditional dataset score.
Chapter 2
Overview & Motivation

In real life, household power consumption allows diverse analytics: the management period of electricity transformers and transmission power can be estimated from it. Per-household electricity consumption data can be used for progressive taxation, region-to-region demand forecasting, and maintenance of power plants and facilities. A gas or car company can likewise estimate its consumption rate via the K-means algorithm and the clustering indices.

Motivated by Google AI, TensorFlow Conference 2017.
Chapter 3
Paper-1 Content
3.1. Introduction
Machine learning is a sub-field of artificial intelligence that is used to develop algorithms and techniques for enabling computers to learn [1]. It is used to train the computer for various tasks such as (i) distinguishing whether received e-mails are spam or not, (ii) data classification, (iii) association rule identification, and (iv) character recognition.

Machine learning includes a series of processes in which a computer (i) looks for similar patterns, (ii) generates a novel classification system, (iii) analyzes data, and (iv) produces meaningful results. It is a kind of artificial intelligence that can make predictions based on results when supported by analytics algorithms. Machine learning is a step-by-step evolution from big data analytics toward predicting future actions and making decisions on its own from past learned results. The key issues in building a successful prediction model remain increasing the probability and reducing the error, and these problems are resolved through numerous iterative learnings [2].
At the heart of machine learning are representation and generalization, where representation is an evaluation of the data and generalization is the processing of future data. Unsupervised learning is a type of machine learning used primarily to determine how data is organized. Unlike supervised learning or reinforcement learning, this method is not given target values for its inputs [3].

Unsupervised learning is closely related to density estimation in statistics. It can summarize and describe the main characteristics of the data; an example of unsupervised learning is clustering. In this paper, we use the K-means algorithm to measure the optimal number of clusters based on the Calinski-Harabasz index, silhouette score, and Davies-Bouldin index, and then apply it to household electricity consumption analysis.
Paper-1 Methodology
3.1.1.1. Sub-topics
3.4. Paper-1 Evaluation
3.4.1. Experimental Environment
Software: Anaconda3 + PyCharm 3
OS: Windows 10 Professional
RAM: 16.0 GB
Processor: i7-6600U CPU @ 2.60 GHz
Hard disk: 420 GB SSD
3.4.2. Experimental Dataset
3.2. Previous work
Machine Learning
Machine learning is like data mining, but it differs in predicting data based on learned attributes, mainly through training data. In addition to the three main techniques of unsupervised learning, supervised learning, and reinforcement learning, various other types of machine learning, such as semi-supervised learning and deep learning algorithms, have been developed and used.
Clustering

Clustering is a data-mining method that defines a cluster of data by considering the characteristics of the given data and finding a representative point for each data group. A cluster is a group of data with similar characteristics; if the characteristics of the data differ, they must belong to different clusters. Clustering is the main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including pattern recognition, information retrieval, machine learning, and computer graphics [3]. A good clustering aims at:

(1) Maximizing the inter-cluster variance
(2) Minimizing the intra-cluster variance

Note, however, that clustering should be distinguished from classification. Clustering is unsupervised learning without correct answers; in other words, we group similar objects without group information for each object. Classification, on the other hand, is supervised learning: when you carry out classification tasks, you learn to predict the dependent variable (Y) from the independent variable (X) of the data [4].
Cluster Validity Assessment
Since clustering tasks have no correct answers, they cannot be evaluated with simple indicators such as accuracy, as in a typical supervised machine learning algorithm. As the examples below show, it is not easy to find the optimal number of clusters without correct answers. Cluster analysis itself is not one specific algorithm but a general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances between cluster members and dense areas of the data space.
Scikit-learn
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry, it is said to have several attributes or features.

In supervised learning, the data comes with additional attributes that we want to predict. One such problem is classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning, where one has a limited number of categories and, for each of the n samples provided, one tries to label them with the correct category or class.

Scikit-learn is a machine learning platform distributed as a broad Python module; the package offers a high-level API and good documentation, so it can be used easily. It is released under the BSD license, so it can be used academically or commercially. Source code and documentation can be downloaded from its websites [10].

Many supervised and unsupervised learning problems are covered in scikit-learn: generalized linear models, linear and quadratic discriminant analysis, kernel ridge regression, support vector machines, and stochastic gradient descent models are all included.
3.3. Proposed Approach
The K-means algorithm is one of the partitioning clustering methods: partitioning distributes the data among many partitions. For example, given n data objects, the input data is divided into K (K ≤ n) groups, each group forming a cluster. The equation below is the cost function the K-means algorithm minimizes when forming the clusters [11]:

    argmin_S Σ_{i=1}^{k} Σ_{x∈S_i} ‖x − μ_i‖²

In other words, the data objects are divided into K groups by reducing this cost function, a measure of dissimilarity. In this way, the similarity within each group increases while the similarity between different groups decreases [12]. The K-means algorithm sums, for each group, the squared distances between the centroid and the group's data objects; based on this function's result, the assignment of data objects to groups is updated and the clustering progresses [5].
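The cost function above can be exercised directly with scikit-learn's KMeans; a minimal sketch, assuming synthetic 2-D blobs in place of the thesis dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated synthetic blobs standing in for electricity data points
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# inertia_ is the within-cluster sum of squares the cost function minimizes:
# sum over clusters i of sum over x in S_i of ||x - mu_i||^2
print(km.inertia_)
print(km.cluster_centers_)
```

Each call to `fit` alternates between assigning points to the nearest centroid and recomputing centroids, which is exactly the update-and-repeat loop described above.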
How well a clustering performs can be measured internally with the Calinski-Harabasz index, the Davies-Bouldin index, the Dunn index, and the silhouette score. In this paper, we evaluate via the Calinski-Harabasz index and the silhouette score.

For a clustering into k clusters, the Calinski-Harabasz index s(k) is the ratio of the between-cluster dispersion to the within-cluster dispersion:

    s(k) = [Tr(B_k) / Tr(W_k)] × [(N − k) / (k − 1)]

Here B_k is the between-group dispersion matrix and W_k the within-cluster dispersion matrix, defined as:

    W_k = Σ_{q=1}^{k} Σ_{x∈C_q} (x − c_q)(x − c_q)^T

    B_k = Σ_q n_q (c_q − c)(c_q − c)^T

where N is the number of data points, C_q is the set of points in cluster q, c_q is the centroid of cluster q, c is the centroid of all the data, and n_q is the number of points in cluster q.
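As a sanity check of the definition above, the index can be computed both manually and with `sklearn.metrics.calinski_harabasz_score`; a sketch on synthetic data (not the thesis dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
# two synthetic blobs standing in for clustered electricity readings
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

score = calinski_harabasz_score(X, labels)

# manual computation of s(k) = [Tr(B_k)/Tr(W_k)] * [(N-k)/(k-1)]
N, k = len(X), 2
c = X.mean(axis=0)                                # centroid of all the data
tr_W = sum(((X[labels == q] - X[labels == q].mean(axis=0)) ** 2).sum()
           for q in range(k))                     # Tr(W_k): within-cluster dispersion
tr_B = sum((labels == q).sum() * ((X[labels == q].mean(axis=0) - c) ** 2).sum()
           for q in range(k))                     # Tr(B_k): between-cluster dispersion
manual = (tr_B / tr_W) * (N - k) / (k - 1)
```

The two values agree, since the trace of each scatter matrix is just the corresponding sum of squared deviations.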
20. - 18 -
The silhouette score is a simple way to evaluate a clustering. For each data point i, let a(i) be the mean distance from i to the other points in its own cluster, and b(i) the mean distance from i to the points of the nearest cluster that i does not belong to. The silhouette score s(i) is then calculated as:

    s(i) = [b(i) − a(i)] / max{a(i), b(i)}

From this definition, s(i) satisfies

    −1 ≤ s(i) ≤ 1

A value of s(i) close to 1 means that point i is assigned to the correct cluster, while a value close to −1 means it has been assigned to the wrong cluster. In this paper, we use the machine learning library scikit-learn for the household power consumption clustering [7].
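With `sklearn.metrics.silhouette_score`, the per-point values s(i) are averaged into a single score per clustering, which can then be compared across candidate values of K; a sketch on synthetic data (the three-blob layout is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# three well-separated synthetic blobs
X = np.vstack([rng.normal(i * 4, 0.4, (30, 2)) for i in range(3)])

scores = {}
for k in range(2, 6):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean of s(i) over all points

best_k = max(scores, key=scores.get)  # K with the highest average silhouette
```

On this toy layout the score peaks at the true number of blobs, mirroring how the thesis selects the optimal K for the electricity data.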
The household power consumption dataset is downloaded from the University of California Irvine Machine Learning Data Repository [8]. The dataset is delimiter-separated into fields including Global_active_power, Global_reactive_power, Voltage, and Global_intensity. Global_active_power and Global_reactive_power are used as the X and Y axes of the experiments. The Python distribution used is Anaconda3. The key point of the K-means algorithm is to keep the data in K clusters while reducing the distance within each cluster; the K-means algorithm assigns labels to the input data. Figure 1 shows the result of executing the K-means algorithm before checking the Calinski-Harabasz index and silhouette score. Figures 1 to 11 show the K-means clustering results for the household power consumption data from the UCI repository, with the dataset reduced to 1/8 of the original size.
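Assuming the usual layout of the UCI household power consumption file (semicolon-delimited, with '?' marking missing readings; the inline sample rows below are illustrative, not real measurements), loading the two plotted columns might look like:

```python
import io
import pandas as pd

# A few rows in the UCI household-power-consumption layout; in the real
# experiment this text would come from the downloaded file instead.
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;233.290;23.000;0.000;1.000;16.000\n"
)

# the file is ';'-delimited and uses '?' for missing readings
df = pd.read_csv(sample, sep=";", na_values="?")

# keep the two columns used as the X and Y axes of the clustering plots
xy = df[["Global_active_power", "Global_reactive_power"]].dropna()
```

The resulting `xy` array is what would be fed to `KMeans.fit` in the experiments.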
3.4.3. Experimental Results
Figure 1. Clustering result at K=1    Figure 2. Clustering result at K=2
Figure 3. Clustering result at K=3    Figure 4. Clustering result at K=4
Figure 5. Clustering result at K=5    Figure 6. Clustering result at K=6
Figure 7. Clustering result at K=7    Figure 8. Clustering result at K=8
Figure 9. Clustering result at K=9    Figure 10. Clustering result at K=10
After reducing the distance within each cluster, the Calinski-Harabasz index is calculated for each number of clusters. As the number of clusters increases, the Calinski-Harabasz index decreases; when the index for a given K is too low, we estimate whether the cluster partition should be split once more. The electricity consumption rate is the most important factor here.
Figure 11. Silhouette score according to change of cluster number.

As with the Calinski-Harabasz index estimation, we calculate the silhouette score. As the number of clusters increases, the silhouette score decreases as K grows, and the optimal K is indicated before the score becomes low.

Determining the proper number of clusters is very important for the K-means algorithm. Estimating the silhouette score from the data, the result is that K=7 is optimal: the silhouette score between each cluster centroid and the data points is 0.799, and the corresponding Calinski-Harabasz index result is 560.3999. This silhouette analysis for the K-means algorithm is shown in Figure 11.

With K=7 clusters, the distance between each group's centroid and its data is optimal. From this result, each centroid divides the household power consumption rate via clustering.
Figure 12. Clustering result at K=7
Davies-Bouldin index

If the ground-truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used. For clusters i and j, with s_i the average distance of the points in cluster i to its centroid and d_ij the distance between the centroids of clusters i and j:

    R_ij = (s_i + s_j) / d_ij
Then the Davies-Bouldin index is defined as

    DB = (1/k) Σ_{i=1}^{k} max_{i≠j} R_ij

Zero is the lowest possible score, and values closer to zero indicate a better partition. The problem is that this metric was not attached to the scikit-learn library at the time; it is only explained in the documentation pages, so it could not be experimented with easily.
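For reference, scikit-learn versions 0.20 and later do ship this metric as `sklearn.metrics.davies_bouldin_score`; a sketch on synthetic data (not the thesis dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
# two well-separated synthetic blobs
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DB averages, over clusters, the worst-case ratio R_ij = (s_i + s_j) / d_ij;
# lower values (closer to zero) indicate a better partition
db = davies_bouldin_score(X, labels)
print(db)
```

On well-separated blobs the score stays well below 1, consistent with the "closer to zero is better" reading above.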
3.5. Related work

Machine learning is a sub-field of artificial intelligence that is used to develop algorithms and techniques for enabling computers to learn [1]. It is used to train the computer for various tasks such as (i) distinguishing whether received e-mails are spam or not, (ii) data classification, (iii) association rule identification, and (iv) character recognition.

Machine learning includes a series of processes in which a computer (i) looks for similar patterns, (ii) generates a novel classification system, (iii) analyzes data, and (iv) produces meaningful results. It is a kind of artificial intelligence that can make predictions based on results when supported by analytics algorithms. Machine learning is a step-by-step evolution from big data analytics toward predicting future actions and making decisions on its own from past learned results. The key issues in building a successful prediction model remain increasing the probability and reducing the error, and these problems are resolved through numerous iterative learnings [2].

At the heart of machine learning are representation and generalization, where representation is an evaluation of the data and generalization is the processing of future data. Unsupervised learning is a type of machine learning used primarily to determine how data is organized. Unlike supervised learning or reinforcement learning, this method is not given target values for its inputs [3].

Unsupervised learning is closely related to density estimation in statistics. It can summarize and describe the main characteristics of the data; an example of unsupervised learning is clustering. In this paper, we use the K-means algorithm to measure the optimal number of clusters based on the Calinski-Harabasz index, silhouette score, and Davies-Bouldin index, and then apply it to household electricity consumption analysis.
3.6. Summary

In this paper, household power consumption was clustered via K-means. The library used is scikit-learn with Anaconda3; being open source under the BSD license, individuals can easily follow the work and use it in real projects without difficulty. Not only the K-means algorithm but also PCA, SVM, and other machine learning algorithms can perform such clustering. From this result, diverse analytics of real-life household power consumption become possible: the management period of electricity transformers and transmission power can be estimated, and per-household electricity consumption data can be used for progressive taxation, region-to-region demand forecasting, and maintenance of power plants and facilities. A gas company can likewise estimate the gas consumption rate via K-means clustering and its indices.
Chapter 4
Paper-2 Comparative Analysis of Electricity Consumption at Home
through a Silhouette-score perspective
4.1 Introduction

Machine learning is an analyzing mechanism that fetches and identifies matching patterns from existing datasets to form new results. This paper discusses comparative analytics of unsupervised learning algorithms, in which we compare the K-means clustering result on a reduced dataset against the silhouette score results on the full dataset. Our analysis found that the Davies-Bouldin index does not work smoothly in scikit-learn, so we performed a check analysis with the Calinski-Harabasz index and silhouette score alongside the Davies-Bouldin index and compared the results of each, learning that when we reduce the dataset to the mentioned proportion, the resulting dataset shows half the score of the traditional dataset score.
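Paper 2's full-versus-reduced comparison can be sketched as follows; the synthetic blobs and the uniform 1/8 subsampling are assumptions for illustration, not the thesis pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(4)
# synthetic stand-in for the household electricity readings
X = np.vstack([rng.normal(i * 5, 0.5, (400, 2)) for i in range(3)])

def kmeans_silhouette(data, k=3):
    """Cluster with K-means and return the average silhouette score."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

full = kmeans_silhouette(X)
# uniform random subsample at the 1/8 ratio studied in Paper 2
eighth = kmeans_silhouette(X[rng.permutation(len(X))[: len(X) // 8]])
```

On well-clustered data the two scores stay close, which is the kind of stability under a 1/8 reduction that the silhouette comparison in this chapter examines.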
4.2 Related work

Machine learning is a field of artificial intelligence that is used to develop algorithms and techniques that enable computers to learn [1]. It is used to train the computer to distinguish whether received e-mails are spam or not, and there are various applications such as data classification, association rule identification, and character recognition, which comply with the standard machine learning perspectives.
It includes a series of processes in which a computer finds its own patterns, creates a new classification system, analyzes the data, and produces meaningful results. Successful prediction comes with increasing the probability and decreasing the error, and machine learning sorts out these issues with various iterative learnings [2]. Among the learning methods, supervised learning is highly related to reinforcement mechanisms [3].

Clustering is a process of mining a dataset by defining a cluster of data that considers the characteristics of the input and finds a representative method to point out each data group. In this way, a cluster is a group of relevant data elements with similar characteristics; if the characteristics are not the same, the elements belong to contrasting clusters [3]. Clustering is unsupervised learning without correct answers: objects carrying similar information are grouped together. Classification, however, is a form of supervised learning. When you perform classification operations, the system learns to predict the dependent variable (Y) from the independent variable (X) of the data [4].
Scikit-learn is a machine learning platform implemented as a Python
module; this high-level package offers easy-to-use, well-documented
functionality and a clean API. It is distributed under the BSD license, so it
can be used both academically and commercially; source code and
documentation can be downloaded from its website [10]. Many supervised and
unsupervised learning problems are covered by scikit-learn, including
generalized linear models, linear and quadratic discriminant analysis, kernel
ridge regression, support vector machines, and stochastic gradient descent.
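As an illustration of the uniform estimator API described above, the following minimal sketch (assuming NumPy and scikit-learn are installed; the toy data values are invented for illustration) constructs a K-means estimator, fits it, and reads back cluster labels and centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two well-separated groups (illustrative values only).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# The scikit-learn pattern is uniform across estimators:
# construct with hyperparameters, then fit / predict.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(labels)                  # one cluster id per row
print(model.cluster_centers_)  # one centroid per cluster
```

The same construct-fit-predict pattern applies to the other estimators mentioned above, such as support vector machines or ridge regression.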
3.5. Paper-2 Methodology
The K-means algorithm is one of the partitioning clustering methods,
where partitioning means distributing the data among a number of groups.
Given n data objects, the input data are divided into K (≤ n) groups, each
constituting a cluster. K-means forms the clusters by minimizing the
following cost function [11]:
arg min_S ∑_{i=1}^{k} ∑_{x ∈ S_i} ‖x − μ_i‖²
In other words, the data objects are divided into K groups so that the
dissimilarity within each group, as measured by this cost function, is
reduced. Under this criterion, the similarity of objects within each group
increases while the similarity between different groups decreases [12]. At
each step, the K-means algorithm sums, for every group, the squared distances
between the centroid and the group's data objects; based on this result the
group assignments are updated and the clustering proceeds [5].
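The cost function above can be sketched directly: for a fixed partition, it is the sum of squared distances from each point to its own cluster mean. The following stdlib-only sketch (with invented numbers) computes it:

```python
def centroid(points):
    """Mean of a list of equal-length tuples: mu_i for cluster S_i."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def kmeans_cost(clusters):
    """Sum over clusters S_i of sum over x in S_i of ||x - mu_i||^2."""
    total = 0.0
    for points in clusters:
        mu = centroid(points)
        for x in points:
            total += sum((xd - md) ** 2 for xd, md in zip(x, mu))
    return total

# Two illustrative clusters in the plane.
clusters = [[(0.0, 0.0), (2.0, 0.0)],      # centroid (1, 0)
            [(10.0, 10.0), (10.0, 12.0)]]  # centroid (10, 11)
print(kmeans_cost(clusters))  # 1 + 1 + 1 + 1 = 4.0
```

K-means alternates between recomputing centroids and reassigning points so that this total keeps decreasing.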
The silhouette score is a simple way to assess, for each data point i,
how well it fits its cluster. Define a(i) as the mean distance from point i
to the other points in its own cluster, and b(i) as the smallest mean distance
from point i to the points of any cluster it does not belong to. The
silhouette score s(i) is then calculated as
s(i) = (b(i) − a(i)) / max{a(i), b(i)}
From this definition it follows that −1 ≤ s(i) ≤ 1.
A value of s(i) close to 1 means that point i is assigned to the correct
cluster, while a value close to −1 means that it could not be assigned to a
suitable cluster. In this paper, the machine learning library scikit-learn is
used to cluster household power consumption [7]. The household power
consumption dataset is downloaded from the University of California Irvine
(UCI) Machine Learning Repository [8]; it is a delimiter-separated file whose
fields include Global_active_power, Global_reactive_power, Voltage, and
Global_intensity. Global_active_power and Global_reactive_power are used as
the X and Y axes of the experiment.
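The definitions of a(i), b(i), and s(i) above can be sketched with plain Python on a toy labelled dataset (the point values are invented for illustration; a production run would use scikit-learn's silhouette routines instead):

```python
from math import dist  # Euclidean distance, Python >= 3.8

def silhouette(points, labels, i):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for point index i."""
    own = [p for p, l in zip(points, labels) if l == labels[i]]
    # a(i): mean distance from point i to the OTHER members of its cluster
    # (assumes the cluster has at least two members and no duplicate points).
    a = sum(dist(points[i], p) for p in own if p != points[i]) / (len(own) - 1)
    # b(i): smallest mean distance from point i to the members of any
    # cluster it does not belong to.
    b = min(
        sum(dist(points[i], p) for p in grp) / len(grp)
        for grp in ([p for p, l in zip(points, labels) if l == c]
                    for c in set(labels) - {labels[i]})
    )
    return (b - a) / max(a, b)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(silhouette(points, labels, 0))  # close to 1: point 0 is well clustered
```

Averaging s(i) over all points gives the overall silhouette score used in the experiments below.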
The Python distribution used is Anaconda3. The key point of the K-means
algorithm is to keep the data in K clusters while reducing the within-cluster
distances; the algorithm attaches a cluster label to each input record.
Figure 1 shows the result of executing the K-means algorithm before checking
the Calinski-Harabasz index and the silhouette score. Figures 1 to 11 show
the K-means clustering results for the household power consumption data from
the UCI Machine Learning Repository, with the dataset reduced to 1/8 of its
original size.
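The UCI file is ';'-delimited with one record per minute. The following sketch shows how the two columns used as the X and Y axes can be extracted; the inline excerpt stands in for the downloaded file, and its values are illustrative rather than quoted from the real data:

```python
import csv
import io

# Inline excerpt in the dataset's ';'-delimited format; the rows are
# illustrative stand-ins for the downloaded UCI file.
sample = """\
Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
"""

rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))

# Global_active_power vs Global_reactive_power form the X/Y axes
# of the clustering experiment.
xy = [(float(r["Global_active_power"]), float(r["Global_reactive_power"]))
      for r in rows]
print(xy)
```

For the real experiment, `io.StringIO(sample)` would be replaced by the downloaded file opened with the same delimiter.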
1.1.1. Experimental Environment
Software: Anaconda3 + PyCharm3
OS: Windows 10 Professional
RAM: 16.0 GB
Processor: i7-6600U CPU @ 2.60 GHz
Hard disk: 420 GB SSD
1.1.2. Experimental Dataset
1. date: date in format dd/mm/yyyy
2. time: time in format hh:mm:ss
3. global_active_power: household global minute-averaged active power (in kilowatts)
4. global_reactive_power: household global minute-averaged reactive power (in kilowatts)
5. voltage: minute-averaged voltage (in volts)
6. global_intensity: household global minute-averaged current intensity (in amperes)
7. sub_metering_1: energy sub-metering No. 1 (in watt-hours of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
8. sub_metering_2: energy sub-metering No. 2 (in watt-hours of active energy). It corresponds to the laundry room, containing a washing machine, a tumble drier, a refrigerator and a light.
9. sub_metering_3: energy sub-metering No. 3 (in watt-hours of active energy). It corresponds to an electric water heater and an air conditioner.
1.1.3. Experimental Results
Figure 12. Silhouette score according to change of cluster number.
Figure 13. 1/8 dataset Silhouette score according to change of cluster number.
Calculating the proper number of clusters is very important in K-means.
Estimating the silhouette score on the data shows that K = 7 is optimal: the
silhouette score computed over each cluster centroid and its data points is
0.799. Even though the 1/8 dataset is much smaller, K = 7 is again optimal
there, with a silhouette score of 0.810. Therefore, for both the full dataset
and the 1/8 dataset, seven clusters yield the optimal distances between each
group's centroid and its members. Although the dataset is reduced, the class
vector space of the K-means clustering keeps the same optimal cluster count
as with the original household power consumption dataset.
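The selection of K described above can be sketched with scikit-learn (assuming it is installed): run K-means for a range of candidate K values and keep the K with the highest silhouette score. Synthetic, well-separated blobs stand in here for the power-consumption data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the power data: 3 clearly separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Fit K-means for each candidate K and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the highest silhouette score is taken as optimal.
best_k = max(scores, key=scores.get)
print(best_k)
```

Applied to the household data, the same loop would peak at K = 7 for both the full and the 1/8 dataset, per the results above.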
Summary
In this paper, household power consumption was clustered via K-means.
The libraries used, scikit-learn and Anaconda3, are open source, so an
individual can easily reproduce the work, and the BSD license raises no
difficulties for practical use. The results show that even when the dataset
is reduced to 1/8 of its size, the silhouette score and the entire clustering
result are the same as before. As the population grows, the classification
and the vector space can show an even clearer result. Going from the large
dataset to the small one clearly preserves the silhouette score result, but
the opposite direction is not guaranteed, because the dataset is a
four-dimensional vector space. The experiment shows that the estimated
analysis time can be reduced when a huge dataset is received.
Chapter 5
Conclusion
This dissertation approaches diverse aspects of K-means clustering
applications. The first attempt was to reduce the K-means algorithm's time
consumption; the approach then shifted toward how to reduce the time needed
for a large dataset. These days, machine learning algorithms can also
estimate, for example, when a part should be replaced (its life span).
Throughout all of the experiments, the libraries used were scikit-learn and
Anaconda3; being open source under the BSD license, they can easily be
deployed in any environment. The first experiment analyzed diverse cluster
validity indexes. The second experiment showed that when a dataset is huge,
determining how many centroids make a proper K-means clustering takes
considerable time; this time can be reduced by comparing against the 1/8
dataset, at the cost of a limited classification and vector space. The
experiment thus reduces the estimated analysis time when a huge dataset is
received.
Acknowledgement
During my master's course I participated in a total of 114 conferences
and gave 7 presentations. IEEE Globecom 2017 was the most impressive among
them, and the experiments in this thesis were motivated by the Google AI
TensorFlow Conference 2017. I express my gratitude to my advisor, Professor
Dong-Ryul Shin, president of Sungkyunkwan University, and to my co-advisor,
Nawab Muhammad Fasheeh Queshi, who always guided me with a smile throughout
our work together, despite my shortcomings.
I also thank Professor Hee Yong Youn, SKKU Fellow, who helped me first
join the Mobile Computing Lab at Sungkyunkwan University; Dr. Choon-Sung Nam,
who helped me settle into the open lab without any inconvenience; Dr. Ki-Hyun
Choi, who shared it with me; and Muhammad Hamza, Janaid, Woo-Hyun Kim and
Chung So.
I am grateful to my mother, Professor Bong-Soon Lee of Dongnam Health
University, who supported me to the end throughout my degree, and to my
father, Han-Chung Choi, senior manager and founding member of the LG
Electronics Pyeongtaek Campus (now a director at Onnuri ENG).
I also thank my older brother, Hyun-Seok Choi, manager at POSCO E&C,
who often saw me home during my degree; nurse Ye-Ul Ahn of Seoul National
University Bundang Hospital; and my cute nephew Yeon-Woo.
Finally, I thank my cousins, who served as milestones during my degree.
2019년 06월 19일
Acknowledgment
I participated in a total of 114 conferences and gave 7 presentations
during my graduate school life. IEEE Globecom 2017 was the most impressive
among them, and the experiments in this thesis were motivated by the Google
AI TensorFlow Conference 2017. I would like to express my gratitude to my
advisor, Professor Dong-Ryul Shin, president of Sungkyunkwan University, and
to my co-advisor, Assistant Professor Nawab Muhammad Fasheeh Queshi, for our
work together.
Thank you to SKKU Fellow Professor Hee Yong Youn, director of the Mobile
Computing Lab, for helping me join Sungkyunkwan University in the first
place. I am also thankful to Dr. Min Ki Hyun, Muhammad Hamza, Janaid, and
Woo-Hyun Kim.
I would like to extend my sincere thanks to my mother, Bong-Soon Lee,
Professor at Dongnam Health University, who supported me for the duration of
my degree, and to my father, Han-Chung Choi, a founding member of the LG
Electronics Pyeongtaek Campus.
I am also grateful to my older brother, Hyun-Seok Choi, who often drove
me home during my degree, to nurse Ye-Ul Ahn of Seoul National University
Bundang Hospital, and to my cute nephew Yeon-Woo.
I give my thanks to my cousins, who gave me a milestone in my degree.
June 19, 2019
References
[1] https://en.wikipedia.org/wiki/K-means_clustering
[2] https://en.wikipedia.org/wiki/Cluster_analysis
[3] https://en.wikipedia.org/wiki/Silhouette_(clustering)
[4] https://github.com/sarguido.
[5] http://archive.ics.uci.edu/ml/datasets.html.
[6] http://scikit-learn.org/stable/modules/clustering.html#calinski-harabaz-index
[7] http://scikit-learn.org/stable/.
[8] Caliński, T., and J. Harabasz. "A dendrite method for cluster analysis."
Communications in Statistics 3.1 (1974): 1-27.
[9] Kanungo, Tapas et al. “An Efficient k-Means Clustering Algorithm: Analysis and
Implementation.” IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002): 881-892.
[10]Arthur, David, and Sergei Vassilvitskii. "k-means++: The advantages of careful
seeding." Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete
algorithms. Society for Industrial and Applied Mathematics (2007): 1027-1035.
[11]Wagstaff, K., Cardie, C., Rogers, S., & Schrödl, S. (2001, June). Constrained k-
means clustering with background knowledge. In ICML (Vol. 1, pp. 577-584).
[12]Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering
algorithm. Journal of the Royal Statistical Society. Series C (Applied
Statistics), 28(1), 100-108.
[13]Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu,
A. Y. (2002). An efficient k-means clustering algorithm: Analysis and
implementation. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7),
881-892.
[14]Alsabti, K., Ranka, S., & Singh, V. (1997). An efficient k-means clustering algorithm.
[15]Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering
algorithm. Pattern recognition, 36(2), 451-461.
[16]Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &
Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine
learning research, 12(Oct), 2825-2830.
[17]Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., ... &
Layton, R. (2013). API design for machine learning software: experiences from the
scikit-learn project. arXiv preprint arXiv:1309.0238.
[18]Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., ...
& Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-
learn. Frontiers in neuroinformatics, 8, 14.
[19]Fabian, P., Gaël, V., Alexandre, G., Vincent, M., Bertrand, T., Olivier, G., ... &
Alexandre, P. (2011). Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12, 2825-2830.
[20]Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support
vector machines. IEEE Intelligent Systems and their applications, 13(4), 18-28.
[21]Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of
machine learning research 12.Oct (2011): 2825-2830.
[22]Alsabti, Khaled, Sanjay Ranka, and Vineet Singh. "An efficient k-means clustering
algorithm." (1997).
[23]Ding, Chris, and Xiaofeng He. "K-means clustering via principal component
analysis." Proceedings of the twenty-first international conference on Machine
learning. ACM, 2004.
[24]Paneque-Gálvez, Jaime, et al. "Small drones for community-based forest monitoring:
An assessment of their feasibility and potential in tropical areas." Forests 5.6 (2014):
1481-1507.
[26]Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.
[27]Rasmussen, Carl Edward. "Gaussian processes in machine learning." Summer
School on Machine Learning. Springer, Berlin, Heidelberg, 2003.
[28]Hartigan, John A., and Manchek A. Wong. "Algorithm AS 136: A k-means clustering
algorithm." Journal of the Royal Statistical Society. Series C (Applied Statistics) 28.1
(1979): 100-108.
[30]Sass, Ron, et al. "Reconfigurable computing cluster (RCC) project: Investigating the
feasibility of FPGA-based petascale computing." 15th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines (FCCM 2007). IEEE, 2007.
[31] Duda, Richard O., Peter E. Hart, and David G. Stork. Pattern classification. John
Wiley & Sons, 2012.
[32]Cover, Thomas M., and Peter E. Hart. "Nearest neighbor pattern
classification." IEEE transactions on information theory13.1 (1967): 21-27.
[33]Breiman, Leo. Classification and regression trees. Routledge, 2017.
[34]Haralick, Robert M., and Karthikeyan Shanmugam. "Textural features for image
classification." IEEE Transactions on systems, man, and cybernetics 6 (1973): 610-
621.
[35]Chapelle, Olivier, Bernhard Scholkopf, and Alexander Zien. "Semi-supervised
learning (chapelle, o. et al., eds.; 2006)[book reviews]." IEEE Transactions on
Neural Networks 20.3 (2009): 542-542.
[36]Zhu, Xiaojin, Zoubin Ghahramani, and John D. Lafferty. "Semi-supervised learning
using gaussian fields and harmonic functions." Proceedings of the 20th International
conference on Machine learning (ICML-03). 2003.
[37]Caruana, Rich, and Alexandru Niculescu-Mizil. "An empirical comparison of
supervised learning algorithms." Proceedings of the 23rd international conference
on Machine learning. ACM, 2006.
[38]Jain, Anil K. "Data clustering: 50 years beyond K-means." Pattern recognition
letters 31.8 (2010): 651-666.
[39]Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation
learning with deep convolutional generative adversarial networks." arXiv preprint
arXiv:1511.06434 (2015).
[40]Figueiredo, Mario A. T., and Anil K. Jain. "Unsupervised learning of finite mixture
models." IEEE Transactions on Pattern Analysis & Machine Intelligence 3 (2002):
381-396.
[41]Lovmar, Lovisa, et al. "Silhouette scores for assessment of SNP genotype clusters."
BMC genomics 6.1 (2005): 35.
[42]Collins, Robert T., Ralph Gross, and Jianbo Shi. "Silhouette-based human
identification from body shape and gait." Proceedings of fifth IEEE international
conference on automatic face gesture recognition. IEEE, 2002.
[43]Gat-Viks, Irit, Roded Sharan, and Ron Shamir. "Scoring clustering solutions by their
biological relevance." Bioinformatics 19.18 (2003): 2381-2389.
[44]Maulik, Ujjwal, and Sanghamitra Bandyopadhyay. "Performance evaluation of some
clustering algorithms and validity indices." IEEE Transactions on pattern analysis
and machine intelligence 24.12 (2002): 1650-1654.
[45]Łukasik, Szymon, et al. "Clustering using flower pollination algorithm and calinski-
harabasz index." 2016 IEEE Congress on Evolutionary Computation (CEC). IEEE,
2016.
[46]Desgraupes, Bernard. "Clustering indices." University of Paris Ouest-Lab Modal’X
1 (2013): 34.
[47]Petrovic, Slobodan. "A comparison between the silhouette index and the davies-
bouldin index in labelling ids clusters." Proceedings of the 11th Nordic Workshop of
Secure IT Systems. sn, 2006.
[50] https://scikit-learn.org/stable/
[51] https://www.anaconda.com/
[52] https://www.jetbrains.com/pycharm/
[54] Bandyopadhyay, Sanghamitra, and Ujjwal Maulik. "Nonparametric genetic
clustering: comparison of validity indices." IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews) 31.1 (2001): 120-125.
[55] https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption