master defense hyun-wong choi_2019_05_14_rev19

International Scholar Pooh ®
Electricity consumption optimization
using K-means clustering algorithm
Hyun Wong Choi (2017710116)
Advisor: Dr. Dong Ryeol Shin, President
Co-advisor: Dr. Nawab Muhammad Faseeh Qureshi
Sungkyunkwan University, South Korea.

Outline
• Introduction
• Related work
• Proposed Approach
• Analysis of Electricity Consumption at Home through a Silhouette-score prospective
• Analysis of Electricity Consumption at Home Using K-means Clustering Algorithm
• Evaluation
• Electricity Consumption at Home analysis
• K-means Clustering analysis
• Conclusion
• References

Introduction
• Electricity consumption
• Power grid
• In the power grid, we measure the consumption through sensors
• Industrial consumption
• Housing consumption
• Factories consumption
• Housing Consumption
• Front end (Consumer End)
• Back end (electrical company end)

Introduction (Cont.)
• Back end (Company End)
• Dataset for consumption: UC Irvine (UCI) Machine Learning Repository
• Many techniques address the electricity optimization problem, but
none of them focuses on housing electricity optimization. The following are not available:
• Reducing the cost
• Factors of overcharge
• Prediction

Introduction (Cont.)
• Solution
• K-means algorithm
• Why choose K-means clustering
• To predict the answer from the dataset
• No existing answer is available in terms of K-means
• Why predict the answers
• No clear result exists so far
• In this paper, home electricity usage is analyzed through the K-means clustering algorithm to
obtain the optimal clustering of home electricity usage.
• 3A analyzes the home electricity usage data points through the K-means clustering algorithm. The
Calinski-Harabasz index, Davies-Bouldin index, and silhouette score are used to find the optimal number of
clusters for the K-means algorithm, and an application scenario of the machine learning algorithm is presented.
• 3B reduces the dataset to 1/8 of its size and obtains the same result.
• The proposed approach delivers efficient and meaningful prediction results not obtained before.

Related work
• Machine learning
• A way to train a machine using dataset values [1]
• Several approaches exist, such as [2], [3], [4]
• K-means Clustering
A machine learning method in the unsupervised learning category. [5]
Clustering is a main task of exploratory data mining and a common technique for statistical analysis, used in
many fields. [6], [7], [8]
• Community Feasibility Assessment
Cluster analysis is not one specific algorithm but the general task of deciding
what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups
with small distances between cluster members and dense areas of the data space. [9], [10], [11]

Related work (Cont.)
• Classification
Classification is a process related to categorization, in which ideas and objects are recognized, differentiated, and understood.
[12], [13], [14]
• Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example
input-output pairs. It infers a function from labeled training data consisting of a set of training examples. [15], [16], [17]
• Unsupervised Learning
Many approaches exist to solve the problem; for clustering, these include K-means, mixture models, DBSCAN, and the OPTICS algorithm.
[18], [19], [20]
• Index
Unsupervised learning needs a way to judge how well clusters have been found. Popular notions of clusters include groups with small
distances between cluster members and dense areas of the data space. Validity indices include the Calinski-Harabasz index, the silhouette score, and the Davies-Bouldin index. [21]

Related work (Cont.)
• Scikit-learn
A machine learning library whose versions have been updated continuously from 2007 through 2019. [22]

[23]
Machine Learning Repository Overview

Dataset Parameters
• 1.date: Date in format dd/mm/yyyy
• 2.time: Time in format hh:mm:ss
• 3.global_active_power: household global minute-averaged active power (in kilowatt)
• 4.global_reactive_power: household global minute-averaged reactive power (in kilowatt)
• 5.voltage: minute-averaged voltage (in volt)
• 6.global_intensity: household global minute-averaged current intensity (in ampere)
• 7.sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a
dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
• 8.sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a
washing-machine, a tumble-drier, a refrigerator and a light.
• 9.sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-
conditioner.

Definition of Dataset.
• The household power consumption dataset was downloaded from the
University of California, Irvine (UCI) Machine Learning Repository [8].
The fields in the file are separated by a delimiter, including
Global_active_power, Global_reactive_power, Voltage, and
Global_intensity. In the experiment, Global_active_power and
Global_reactive_power are used as the X and Y axes.
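The loading step described above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: it assumes the delimiter is a semicolon and that missing measurements appear as "?" or an empty field, and the three sample rows are invented values rather than real records. The function name `load_power_points` is hypothetical.

```python
import csv
import io

# Invented rows in the delimited layout described above.
# Assumption: fields are separated by ";" and "?" marks a missing value.
SAMPLE = """Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
01/01/2007;00:00:00;1.5;0.1;240.0;6.2;0.0;1.0;17.0
01/01/2007;00:01:00;?;?;?;?;?;?;
01/01/2007;00:02:00;2.0;0.2;239.5;8.4;0.0;2.0;17.0
"""

def load_power_points(text_file):
    """Return (active, reactive) power pairs, skipping incomplete rows."""
    points = []
    for row in csv.DictReader(text_file, delimiter=";"):
        try:
            active = float(row["Global_active_power"])
            reactive = float(row["Global_reactive_power"])
        except (TypeError, ValueError):
            continue  # "?" or a missing field cannot be parsed as a number
        points.append((active, reactive))
    return points

points = load_power_points(io.StringIO(SAMPLE))
print(points)
```

Each returned pair is one point in the X, Y plane used for clustering. For the real file, the same function could be called on an opened text file instead of the in-memory sample.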

Proposed Approach (a)
Silhouette score
• The silhouette score is a simple way to check how well each data point i fits its cluster. Let a(i) be the mean
distance from i to the other points in its own cluster, and let b(i) be the smallest mean distance from i to the points of any other cluster. The silhouette score s(i) is then
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]
• From this definition, s(i) satisfies
\[ -1 \le s(i) \le 1 \]
• When s(i) is close to 1, data point i is assigned to the correct cluster; when it is close to -1, the point is assigned
to the wrong cluster. In this paper, the machine learning library scikit-learn is used to cluster the household
power consumption data. [23],[24],[25]
Analysis of Electricity Consumption at Home through a
Silhouette-score prospective
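As an illustration of the silhouette definition on this slide, the following sketch computes s(i) for every point with NumPy. It is a didactic version under stated assumptions: Euclidean distance, at least two clusters, and s(i) = 0 for a cluster with a single member. The function name `silhouette_values` and the four sample points are hypothetical; scikit-learn offers `sklearn.metrics.silhouette_score` for practical use.

```python
import numpy as np

def silhouette_values(X, labels):
    """Silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)) for each point."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention assumed here: singleton cluster scores 0
        # a(i): mean distance to the other members of the same cluster.
        a = dist[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of another cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two tight, well-separated groups (invented points).
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
scores = silhouette_values(X, labels)
print(scores.mean())
```

Averaging s(i) over all points gives the single silhouette score reported for a given number of clusters.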

Calinski-Harabasz Index
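• Indices for judging how well a clustering is formed internally include the Calinski-Harabasz index, the Davies-Bouldin index, the Dunn index, and the
silhouette score. In this paper, the Calinski-Harabasz index and the silhouette score are used for evaluation.
• For k clusters, the Calinski-Harabasz index s(k) is the ratio of the between-cluster dispersion to the
within-cluster dispersion:
\[ s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1} \]
• Here B_k is the between-group dispersion matrix and W_k is the within-cluster dispersion matrix, defined as [26],[27],[28]
\[ W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T \]
\[ B_k = \sum_{q} n_q (c_q - c)(c_q - c)^T \]
where N is the number of data points, C_q is the set of points in cluster q, c_q is the center of cluster q, c is the center of all the data, and n_q is the number of points in cluster q.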
Silhouette-score prospective (Cont.)
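The Calinski-Harabasz formula can be checked by hand with a short NumPy sketch. It relies on the fact that the trace of an outer product (x - c)(x - c)^T equals the squared Euclidean norm of (x - c), so the dispersion matrices never need to be built. The function name `calinski_harabasz` and the sample points are hypothetical; scikit-learn provides `sklearn.metrics.calinski_harabasz_score` for real use.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """s(k) = Tr(B_k) / Tr(W_k) * (N - k) / (k - 1)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, k = len(X), len(clusters)
    overall_center = X.mean(axis=0)
    trace_b = 0.0  # between-cluster dispersion, Tr(B_k)
    trace_w = 0.0  # within-cluster dispersion, Tr(W_k)
    for q in clusters:
        members = X[labels == q]
        center_q = members.mean(axis=0)
        trace_b += len(members) * np.sum((center_q - overall_center) ** 2)
        trace_w += np.sum((members - center_q) ** 2)
    return (trace_b / trace_w) * (n - k) / (k - 1)

# Invented points: Tr(B_k) = 100, Tr(W_k) = 1, N = 4, k = 2.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
ch = calinski_harabasz(X, labels)
print(ch)
```

A larger value means the clusters are far apart relative to their internal spread, so the index is maximized when choosing the number of clusters.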

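• Davies-Bouldin index
If the ground truth labels are not known, the Davies-Bouldin index (sklearn.metrics.davies_bouldin_score) can be used to evaluate the model. With s_i the average distance between each point of cluster i and its centroid, and d_ij the distance between the centroids of clusters i and j, the similarity is
\[ R_{ij} = \frac{s_i + s_j}{d_{ij}} \]
Then the Davies-Bouldin index is defined as
\[ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij} \]
Zero is the lowest possible score. Values closer to zero indicate a better partition.
However, this index was only explained on the scikit-learn documentation page and was not included in the library used for this work, so it could not be tested easily. [29],[30],[31]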
Silhouette-score prospective (Cont.)
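Because the Davies-Bouldin index depends only on centroids and point-to-centroid distances, it can be computed directly with NumPy when a library routine is not at hand. The sketch below is an illustration under the definitions on this slide (Euclidean distance, at least two clusters); the function name `davies_bouldin` and the sample points are hypothetical. scikit-learn releases from 0.20 onward also provide `sklearn.metrics.davies_bouldin_score`.

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB = (1/k) * sum_i max_{j != i} (s_i + s_j) / d_ij."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    k = len(clusters)
    centroids = np.array([X[labels == q].mean(axis=0) for q in clusters])
    # s_i: mean distance from the points of cluster i to its centroid.
    spread = np.array([
        np.linalg.norm(X[labels == q] - centroids[idx], axis=1).mean()
        for idx, q in enumerate(clusters)
    ])
    total = 0.0
    for i in range(k):
        # R_ij for every other cluster j; keep the worst (largest) one.
        ratios = [
            (spread[i] + spread[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        total += max(ratios)
    return total / k

# Invented points: s_0 = s_1 = 0.5 and d_01 = 10, so DB = 0.1.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
db = davies_bouldin(X, labels)
print(db)
```

Unlike the silhouette score and the Calinski-Harabasz index, this index is minimized: the candidate number of clusters with the value closest to zero is preferred.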

• Reducing the Dataset
The experiment is repeated on 1/8 of the dataset. The repository describes the full dataset as follows:
“This archive contains 2075259 measurements gathered in a house located
in Sceaux (7km of Paris, France) between December 2006 and November
2010 (47 months).”
The reduced dataset fits the same result. The data is already labeled and cleaned for machine learning.
Proposed Approach (b)
Analysis of Electricity Consumption at Home Using
K-means Clustering Algorithm
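The slides do not state how the 1/8 subset was drawn, so the following sketch shows two plausible ways to reduce a dataset to one eighth of its rows: keeping every eighth row, or sampling rows at random without replacement. Both function names and the toy array are hypothetical, and neither is claimed to be the method used in the thesis.

```python
import numpy as np

def every_nth_row(X, step=8):
    """Systematic reduction: keep rows 0, step, 2*step, ..."""
    return np.asarray(X)[::step]

def random_fraction(X, fraction=1 / 8, seed=0):
    """Random reduction: keep a fixed fraction of rows, in original order."""
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    size = max(1, int(len(X) * fraction))
    idx = rng.choice(len(X), size=size, replace=False)
    return X[np.sort(idx)]

# Toy data standing in for (active power, reactive power) pairs.
full = np.arange(32).reshape(16, 2)
systematic = every_nth_row(full)
sampled = random_fraction(full)
print(systematic.shape, sampled.shape)
```

For minute-by-minute measurements, systematic selection keeps the time coverage even, while random selection avoids aligning with any periodic pattern in the data.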

Evaluation
• Declaration & Resources
• System PC configuration Software
a) 1st paper execution snapshot
b) 2nd paper execution snapshot

Analysis (a)
• For the K-means algorithm, choosing the proper number of clusters is very important. Estimating the
silhouette score from the data gives K = 7 as the result: with the centroids and data points of these clusters, the silhouette score of 0.799
is the optimal score.
• The Calinski-Harabasz index gives 560.3999 as the optimal result.
• The figure shows the K-means clustering result.
K = 7
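The selection of K described on this slide can be sketched with scikit-learn as a loop over candidate cluster counts that records both indices. This is a minimal sketch on synthetic data, not the thesis experiment: the three invented blobs, the helper name `score_cluster_counts`, the seed, and `n_init=10` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def score_cluster_counts(X, k_values, seed=0):
    """Fit K-means for each k and return {k: (silhouette, calinski_harabasz)}."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        results[k] = (silhouette_score(X, labels),
                      calinski_harabasz_score(X, labels))
    return results

# Synthetic stand-in for the (active, reactive) power points:
# three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

scores = score_cluster_counts(X, range(2, 7))
best_k = max(scores, key=lambda k: scores[k][0])  # highest silhouette score
print(best_k)
```

On the household data the same loop would be run over the chosen range of K, and the K with the highest silhouette score (K = 7 in the reported results) would be selected.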

Analysis (a) (Cont.)
However, the Davies-Bouldin index was only explained on the scikit-learn documentation page and was not
included in the library, so it could not be evaluated easily.

Analysis (b)
• For the K-means algorithm, choosing the proper number of clusters is very important. Estimating the silhouette score from the data gives
K = 7 as the result: with the centroids and data points of these clusters, the silhouette score of 0.799 is the optimal score.
• Even though the 1/8 dataset is much smaller, K = 7 is again the result, and the silhouette score of 0.810 is the optimal
score.
• With 7 clusters from the K-means algorithm (for both the full dataset and the 1/8 dataset), each group's centroid and the distances between centroids take
optimal values. The dataset is reduced, but the vector space of the K-means clusters is preserved.
• The optimal clustering is the same as that of the original household power consumption dataset.
Figure. Silhouette score according to the change of cluster number. Figure. 1/8 dataset: silhouette score according to the change of cluster number.

Evaluation (a) execution snapshot
[Figure panels: K-means clustering results for K = 2, 3, 4, 5, 6, 7, 8, 9, 10]

Evaluation (b) execution snapshot
[Figure panels: K-means clustering results on the 1/8 dataset for K = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

Software & Workstation Environment
PC Performance
• Software: Anaconda 3 + PyCharm 3
• OS: Windows 10 Professional
• RAM: 16.0 GB
• Processor: i7-6600U CPU @ 2.60 GHz
• Hard disk: 420 GB SSD

System PC configuration Software
• Dataset: UC Irvine Machine Learning Repository
https://archive.ics.uci.edu/ml/index.php
• scikit-learn, Anaconda 3, PyCharm 3
https://scikit-learn.org/stable/
https://www.anaconda.com/
https://www.jetbrains.com/pycharm/
The software is open source, so individuals can easily follow the work, and because it uses the BSD
License, applying it to real work poses no difficulty.

Conclusion
• Household power consumption was analyzed via K-means clustering. The libraries used, scikit-learn and
Anaconda 3, are open source, so individuals can easily follow the work, and because they use the BSD License,
applying them to real work poses no difficulty.
• Besides the K-means algorithm, other machine learning algorithms such as PCA and SVM
can also be applied.
• The result enables diverse analytics of real-life household power consumption.
• The management periods of electricity transformers and transmission power can also be estimated.
• The electricity consumption data can be used for progressive taxation,
region-to-region demand forecasting, and maintenance of power plants and facilities.
• A gas company (SeoulGas, Seoul City Gas Corporation; Google Tensorflow Meetup 2nd)
can likewise estimate the gas consumption rate
via K-means clustering and the indices.

Published paper
1. Hyun Wong Choi, Nawab Muhammad Faseeh Qureshi and Dong Ryeol Shin, “Comparative Analysis of
Electricity Consumption at Home through a Silhouette-score prospective”, ICACT 2019, South Korea, 2019.
Sungkyunkwan University, Korea
2. Hyun Wong Choi, Nawab Muhammad Faseeh Qureshi and Dong Ryeol Shin, “Analysis of Electricity
Consumption at Home Using K-means Clustering Algorithm”, ICACT 2019, South Korea, 2019.
Sungkyunkwan University, Korea

Acknowledgement
Advisor, Dr. Dong Ryeol Shin, President of Sungkyunkwan University (as of May 14, 2019)
Co-advisor, Dr. Nawab Muhammad Faseeh Qureshi , Assistant Professor
- First Join at SKKU, Mobile computing Laboratory, Professor, HY Youn, SKKU Fellow. http://mobile.skku.ac.kr/
- Advising for Pre-defense, Dr. Navrati Saxena Professor.
- First Join the Open-Lab, Dr. Chun Sung Nam,
- POSCO E&C, Hyun Suk Choi, Deputy Manager
- Co-operate Partner : LG Electronics, LG CNS, LG U+
- Myoung Sun Noh, MD, PhD
Google Tensorflow Meetup 2nd 2017 Conference, Seoul City Gas Corporation (SeoulGas)
Open-Lab members (as of May 14, 2019)
- Dr. Kee Hyun Choi
- Muhammand Hamza , Janaid
- Woo Hyun Kim , So Chung

Acknowledgement
Academic – Tuition
- LG CNS
- LG Electronics
- LG U+
Transportation Support Motivation from Conference
Tensorflow Meetup 2017
Morning Calm Service
At Participate Conference
Safe Security
At Relaxation time
Release Stress
at Volunteer works
Vision management. Spirit support

References
[1] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of machine learning research 12.Oct (2011):
2825-2830.
[2] Alsabti, Khaled, Sanjay Ranka, and Vineet Singh. "An efficient k-means clustering algorithm." (1997).
[3] Ding, Chris, and Xiaofeng He. "K-means clustering via principal component analysis." Proceedings of the twenty-first
international conference on Machine learning. ACM, 2004.
[4] Paneque-Gálvez, Jaime, et al. "Small drones for community-based forest monitoring: An assessment of their feasibility
and potential in tropical areas." Forests 5.6 (2014): 1481-1507.
[5] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of machine learning research 12.Oct (2011):
2825-2830.
[6] Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.
[7] Rasmussen, Carl Edward. "Gaussian processes in machine learning." Summer School on Machine Learning. Springer,
Berlin, Heidelberg, 2003.
[8] Hartigan, John A., and Manchek A. Wong. "Algorithm AS 136: A k-means clustering algorithm." Journal of the Royal
Statistical Society. Series C (Applied Statistics) 28.1 (1979): 100-108.

References (Cont.)
[11] Duda, Richard O., Peter E. Hart, and David G. Stork. Pattern classification. John Wiley & Sons, 2012.
[12] Cover, Thomas M., and Peter E. Hart. "Nearest neighbor pattern classification." IEEE transactions on information
theory 13.1 (1967): 21-27.
[13] Breiman, Leo. Classification and regression trees. Routledge, 2017.
[14] Haralick, Robert M., and Karthikeyan Shanmugam. "Textural features for image classification." IEEE
Transactions on systems, man, and cybernetics 6 (1973): 610-621.
[15] Chapelle, Olivier, Bernhard Scholkopf, and Alexander Zien. "Semi-supervised learning (chapelle, o. et al., eds.;
2006)[book reviews]." IEEE Transactions on Neural Networks 20.3 (2009): 542-542.
[16] Zhu, Xiaojin, Zoubin Ghahramani, and John D. Lafferty. "Semi-supervised learning using gaussian fields and
harmonic functions." Proceedings of the 20th International conference on Machine learning (ICML-03). 2003.
[17] Caruana, Rich, and Alexandru Niculescu-Mizil. "An empirical comparison of supervised learning
algorithms." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[18] Jain, Anil K. "Data clustering: 50 years beyond K-means." Pattern recognition letters 31.8 (2010): 651-666.
[19] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional
generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
[20] Figueiredo, Mario A. T., and Anil K. Jain. "Unsupervised learning of finite mixture models." IEEE Transactions
on Pattern Analysis & Machine Intelligence 3 (2002): 381-396.

References (Cont.)
[23] Lovmar, Lovisa, et al. "Silhouette scores for assessment of SNP genotype clusters." BMC genomics 6.1 (2005): 35.
[24] Collins, Robert T., Ralph Gross, and Jianbo Shi. "Silhouette-based human identification from body shape and
gait." Proceedings of fifth IEEE international conference on automatic face gesture recognition. IEEE, 2002.
[25] Gat-Viks, Irit, Roded Sharan, and Ron Shamir. "Scoring clustering solutions by their biological
relevance." Bioinformatics 19.18 (2003): 2381-2389.
[26] Maulik, Ujjwal, and Sanghamitra Bandyopadhyay. "Performance evaluation of some clustering algorithms and
validity indices." IEEE Transactions on pattern analysis and machine intelligence 24.12 (2002): 1650-1654.
[27] Łukasik, Szymon, et al. "Clustering using flower pollination algorithm and calinski-harabasz index." 2016 IEEE
Congress on Evolutionary Computation (CEC). IEEE, 2016.
[28] Desgraupes, Bernard. "Clustering indices." University of Paris Ouest-Lab Modal’X 1 (2013): 34.
[29] Petrovic, Slobodan. "A comparison between the silhouette index and the davies-bouldin index in labelling ids
clusters." Proceedings of the 11th Nordic Workshop of Secure IT Systems. sn, 2006.
[30] Maulik, Ujjwal, and Sanghamitra Bandyopadhyay. "Performance evaluation of some clustering algorithms and
validity indices." IEEE Transactions on pattern analysis and machine intelligence 24.12 (2002): 1650-1654.

References (Cont.)
[31] Petrovic, Slobodan. "A comparison between the silhouette index and the davies-bouldin index in
labelling ids clusters." Proceedings of the 11th Nordic Workshop of Secure IT Systems. sn, 2006.
[32] https://scikit-learn.org/stable/
[33] https://www.anaconda.com/
[34] https://www.jetbrains.com/pycharm/
[35] Petrovic, Slobodan. "A comparison between the silhouette index and the davies-
bouldin index in labelling ids clusters." Proceedings of the 11th Nordic Workshop of Secure
IT Systems. sn, 2006.
[36] Bandyopadhyay, Sanghamitra, and Ujjwal Maulik. "Nonparametric genetic clustering:
comparison of validity indices." IEEE Transactions on Systems, Man, and Cybernetics, Part
C (Applications and Reviews) 31.1 (2001): 120-125.
[37] https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption
[38] https://github.com/sarguido

Q & A
