3. Introduction
Water quality indices
Chemical oxygen demand (COD): the amount of oxygen consumed when a strong
chemical oxidizing agent breaks down the organic material present in a given
water sample at a certain temperature over a specific time period.
Biological oxygen demand (BOD): the amount of dissolved oxygen needed by
aerobic biological organisms to break down the organic material present in a
given water sample at a certain temperature over a specific time period.
Both indirectly measure the amount of organic compounds in water, so COD and
BOD should be correlated.
Suspended solids (SS)
Volatile suspended solids (SSV)
Sediments (SED)
Inorganic elements (N-NH3, P, S, etc.)
pH
These directly measure the amount of a certain contaminant in water.
4. Data Description
The dataset comes from daily sensor measurements at an urban wastewater treatment
plant.
The data were collected by Manel Poch at Universitat Autonoma de Barcelona, Bellaterra,
Barcelona, Spain.
The full dataset was donated by Javier Bejar and Ulises Cortes at Universitat Politecnica
de Catalunya, Barcelona, Spain, and is available at:
http://archive.ics.uci.edu/ml/machine-learning-databases/water-treatment/
5. Data Description
Date
In dd/mm/yy format: 1/1/90 to 30/10/91. Some days in this period are not
included.
Water volume
The daily flow volume to the plant in m3: 10005 to 60081
Water quality indices (28 variables)
Water quality indices were recorded before and/or after a process step.
BOD, COD, SS, SSV, SED ...
Performance (9 variables)
Performance variables were calculated directly from the water quality indices. They
can be used to evaluate the performance of each process unit. Values range from
0.6% to 100%.
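The slides do not give the formula behind the performance variables. A minimal sketch, assuming they follow the common removal-efficiency definition (the share of an inlet concentration that a process unit removes); the function name and the example values are hypothetical:

```python
def removal_efficiency(inlet: float, outlet: float) -> float:
    """Percent of a contaminant removed between a unit's inlet and outlet.

    Assumed definition (not confirmed by the slides):
    100 * (inlet - outlet) / inlet.
    """
    if inlet <= 0:
        raise ValueError("inlet concentration must be positive")
    return 100.0 * (inlet - outlet) / inlet


# Made-up example: BOD drops from 200 to 20 across a unit.
print(removal_efficiency(200.0, 20.0))  # 90.0
```

Under this assumption a value of 100% means the contaminant was removed completely, which matches the upper end of the range reported above.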
7. Data Management
Data transformation
The original variable “date” is a character string and is too long, so I
transformed it into a categorical variable “day”:
date day
1/1/1990 1
2/1/1990 2
……
30/10/1991 668
Then use the variable day as the row names of the data frame.
Correct the wrongly formatted values in the variable BOD.in3.
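The date-to-day mapping in the table above can be sketched as a count of calendar days from the first record. This is only an illustration, assuming dates are dd/mm/yyyy strings and that 1/1/1990 is day 1; `date_to_day` is a hypothetical helper name:

```python
from datetime import datetime

# Assumption: the first day of the record (1/1/1990) is day 1.
START = datetime(1990, 1, 1)


def date_to_day(date_str: str) -> int:
    """Convert a dd/mm/yyyy string to a 1-based day index."""
    parsed = datetime.strptime(date_str, "%d/%m/%Y")
    return (parsed - START).days + 1


for s in ["1/1/1990", "2/1/1990", "30/10/1991"]:
    print(s, date_to_day(s))
```

With this rule 30/10/1991 maps to 668, in agreement with the table. Days missing from the record simply leave gaps in the index.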
Subset data
In this study, five water quality indices of the influent/effluent were used: pH, COD,
BOD, SS, SED.
Omit the missing values in each subset.
Process flow: influent1 -> Pretreatment -> influent2 -> Primary -> influent3 -> Secondary -> effluent
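Omitting missing values per subset means a day is dropped only from the subsets in which one of its five indices is missing. A minimal sketch, assuming rows are dictionaries and that missing values are coded as "?" or None; the column names and sample rows are hypothetical:

```python
MISSING = "?"  # assumption: marker used for missing values in the raw file


def complete_cases(rows, columns):
    """Keep only the rows that have a value for every listed column."""
    return [
        row for row in rows
        if all(row.get(col) not in (None, MISSING) for col in columns)
    ]


# Hypothetical column names for the influent1 subset.
influent1_cols = ["pH.in1", "COD.in1", "BOD.in1", "SS.in1", "SED.in1"]

# Made-up rows for illustration only.
rows = [
    {"day": 1, "pH.in1": 7.8, "COD.in1": 407, "BOD.in1": 182,
     "SS.in1": 166, "SED.in1": 4.5},
    {"day": 2, "pH.in1": 7.7, "COD.in1": 443, "BOD.in1": "?",
     "SS.in1": 214, "SED.in1": 5.5},
]

influent1 = complete_cases(rows, influent1_cols)
print([row["day"] for row in influent1])  # [1]
```

Because each subset is filtered on its own columns, the subsets can end up with different numbers of days.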
9. Method Description
Step 1: Principal component analysis (PCA) on each influent/effluent subset
Visualize the data to see the relationships among the observations and
variables in low dimensions
Step 2: Clustering days based on the daily performance
Identify subgroups of similar days based on the daily performance of each
process unit or the whole plant
10. Step 1: Principal component analysis (PCA)
on influent1 subset
Principal component loading vector of influent1 (influent to the pretreatment unit)
Proportion of variance explained (PVE) by each PC and cumulative PVE
[Figure: two panels for influent1, each plotted against Principal Component (1 to 6): Proportion of Variance Explained, and Cumulative Proportion of Variance Explained.]
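The PVE of a principal component is its variance divided by the total variance of all components, and the cumulative PVE is the running sum. A minimal sketch; the variances below are made-up values for six components, not the results for influent1:

```python
from itertools import accumulate


def pve(variances):
    """Proportion of variance explained by each principal component."""
    total = sum(variances)
    return [v / total for v in variances]


def cumulative_pve(variances):
    """Running sum of the PVE values, ending at 1."""
    return list(accumulate(pve(variances)))


# Made-up component variances, sorted from largest to smallest.
variances = [3.0, 1.5, 0.75, 0.45, 0.2, 0.1]

for k, (p, c) in enumerate(zip(pve(variances), cumulative_pve(variances)), 1):
    print(f"PC{k}: PVE = {p:.3f}, cumulative = {c:.3f}")
```

The cumulative curve is what one reads to decide how many components are needed to reach a chosen share of the total variance.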
11. Step 1: Principal component analysis (PCA)
on influent1 subset
Biplot for influent1
13. Step 2: Clustering days based on the daily
performance
What dissimilarity measure should be used to cluster the days?
If Euclidean distance is used, then days on which the process unit/the whole plant
has similar overall performance will be clustered together (Yes, this is
desirable).
If correlation-based distance is used, then days with similar “preferences” (e.g.
days with better BOD and COD performance but worse SS and SED
performance) will be clustered together, even if some days with these
“preferences” had better overall performance than others.
Scale to the unit variance or not?
Data must be scaled, otherwise the water volume will dominate.
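The two choices above can be illustrated with a small sketch using only the standard library. All numbers are made up: two days share the same performance pattern at different levels, and a second table mixes water volume with a performance value to show the effect of scaling.

```python
import math
import statistics


def scale_columns(rows):
    """Scale every column to zero mean and unit (sample) variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def correlation_distance(a, b):
    """1 minus the Pearson correlation between two days' profiles."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(va * vb)


# Made-up performance (%) for BOD, COD, SS, SED: same pattern, different level.
day_a = [90.0, 85.0, 60.0, 55.0]
day_b = [70.0, 65.0, 40.0, 35.0]
print("Euclidean:", math.dist(day_a, day_b))
print("Correlation-based:", correlation_distance(day_a, day_b))

# Made-up rows of [water volume in m3, performance in %].
raw = [[10005.0, 60.0], [35000.0, 80.0], [60081.0, 100.0]]
scaled = scale_columns(raw)
print("Raw distance:", math.dist(raw[0], raw[1]))
print("Scaled distance:", math.dist(scaled[0], scaled[1]))
```

The correlation-based distance between the two days is essentially zero although one day performs 20 points better on every index, whereas the Euclidean distance separates them. On the raw table the distance is driven almost entirely by the volume column; after scaling, both columns contribute on the same footing.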
Hierarchical clustering will be used.
K-means or K-medoids?
K-medoids is more robust than K-means in the presence of outliers.
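The robustness claim can be seen on a single cluster: a mean is pulled toward an outlier, while a medoid must be one of the observed points. A minimal one-dimensional sketch with made-up performance values:

```python
import statistics


def medoid(points):
    """Return the observed point with the smallest total distance to the rest."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))


# Made-up performance values (%); 5.0 plays the role of an outlier day.
values = [88.0, 90.0, 91.0, 92.0, 5.0]

print("mean:", statistics.fmean(values))
print("medoid:", medoid(values))
```

The mean lands at 73.2, a value that describes none of the days, while the medoid stays at 90.0 among the typical days.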
17. K-medoids clustering
[Figure: clusplot(pam(x = sdata, k = k, diss = diss)), Component 1 vs. Component 2. These two components explain 80.33 % of the point variability.]
[Figure: Silhouette plot of pam(x = sdata, k = k, diss = diss), n = 430, 2 clusters. Average silhouette width: 0.37.]
Cluster j | size nj | average silhouette width
1 | 149 | -0.01
2 | 281 | 0.57
[Figure: clusplot(pam(x = globalscale2, k = 3)), Component 1 vs. Component 2. These two components explain 80.33 % of the point variability.]
[Figure: Silhouette plot, n = 430.]
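The silhouette width of an observation compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b): s = (b - a) / max(a, b). A minimal sketch for one-dimensional points with made-up data; it illustrates the definition and is not the code that produced the plots:

```python
def silhouette_widths(points, labels):
    """Silhouette width s(i) = (b - a) / max(a, b) for 1-D points."""
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster.
        own = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            widths.append(0.0)  # usual convention for a single-point cluster
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(p - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Made-up data: two well-separated clusters.
points = [1.0, 2.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
print([round(w, 3) for w in widths])
print("average:", round(sum(widths) / len(widths), 3))
```

Widths near 1 indicate well-placed observations and widths near or below 0 indicate observations that sit between clusters. The overall average is the size-weighted mean of the cluster averages; for the two-cluster solution above, (149 × -0.01 + 281 × 0.57) / 430 ≈ 0.37.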
18. Conclusion
The water quality indices and flow volume of the influent/effluent
were visualized with PCA to see the relationships
among the observations and variables in low dimensions.
Clustering methods were used to identify subgroups
of similar days.
19. References
"Avaluacio de tecniques de classificacio per a la gestio de Bioprocessos: Aplicacio a un
reactor de fangs activats". Master Thesis. Dept. de Quimica. Unitat d'Enginyeria Quimica.
Universitat Autonoma de Barcelona. Bellaterra (Barcelona). 1993.
"LINNEO+: A Classification Methodology for Ill-structured Domains". Research report
RT-93-10-R. Dept. Llenguatges i Sistemes Informatics. Barcelona. 1993.
"A knowledge-based system for the diagnosis of waste-water treatment plant".
Proceedings of the 5th international conference of industrial and engineering applications of
AI and Expert Systems IEA/AIE-92. Ed Springer-Verlag. Paderborn, Germany, June 92.