Semi-Supervised Fuzzy C-Means for Regression
We propose a method to perform regression on partially labeled data, which is based on SSFCM (Semi-Supervised Fuzzy C-Means), an algorithm for semi-supervised classification based on fuzzy clustering. The proposed method, called SSFCM-R, precedes the application of SSFCM with a relabeling module based on target discretization. After the application of SSFCM, regression is carried out according to one out of two possible schemes: (i) the output corresponds to the label of the closest cluster; (ii) the output is a linear combination of the cluster labels weighted by the membership degree of the input. Some experiments on synthetic data are reported to compare both approaches.
IJCCI 2023 — 15th International Joint Conference on Computational Intelligence, 13-15 November 2023, Rome, Italy
full paper: https://www.researchgate.net/publication/375671573_Semi-Supervised_Fuzzy_C-Means_for_Regression
5. Semi-Supervised Fuzzy C-Means
• Semi-supervised version of Fuzzy C-Means (FCM)
• Exploits partially labeled data to drive the clustering process
• Minimizes the following objective function:

  J = \sum_{k=1}^{K} \sum_{j=1}^{N} u_{jk}^m d_{jk}^2 + \alpha \sum_{k=1}^{K} \sum_{j=1}^{N} (u_{jk} - b_j f_{jk})^m d_{jk}^2

  (first term: unsupervised component; second term: supervised component)
• Outcomes: the membership matrix U = [u_{jk}] and a set of K centroids

  c_k = ( \sum_{j=1}^{N} u_{jk}^2 x_j ) / ( \sum_{j=1}^{N} u_{jk}^2 )
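The objective and the centroid update above can be evaluated directly; the following is a minimal NumPy sketch (variable and function names are ours, and the full membership-update rules of the algorithm are in Pedrycz and Waletzky, 1997):

```python
import numpy as np

def ssfcm_objective(X, U, centroids, F, b, alpha=1.0, m=2):
    """Evaluate the SSFCM objective J (sketch).

    X: (N, d) data; U: (N, K) memberships; centroids: (K, d);
    F: (N, K) reference memberships built from the known labels;
    b: (N,) indicator, 1 if sample j is labeled and 0 otherwise.
    """
    # Squared distances d_jk^2 between every sample and every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    unsupervised = (U ** m * d2).sum()
    supervised = ((U - b[:, None] * F) ** m * d2).sum()
    return unsupervised + alpha * supervised

def update_centroids(X, U):
    # c_k = sum_j u_jk^2 x_j / sum_j u_jk^2  (fuzzifier m = 2, as on the slide)
    W = U ** 2
    return (W.T @ X) / W.sum(axis=0)[:, None]
```

With no labeled samples (b = 0) and α = 1, the supervised term reduces to a copy of the unsupervised one, so J is simply twice the FCM objective.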
6. Semi-Supervised Fuzzy C-Means for Regression
• SSFCM-R: Semi-Supervised Fuzzy C-Means for Regression
• Regression algorithm based on the classification algorithm SSFCM (Pedrycz and Waletzky, 1997)
• Labeled prototypes
• Three main stages:
• Pre-processing: discretization and relabeling are applied to the target values
• Clustering: SSFCM
• Post-processing: a matching method using the derived labeled prototypes
7. SSFCM-R Pre-processing
• Let Y = {y ∈ 𝒴 | (x, y) ∈ L} be the set of numerical labels
• The set Y is discretized into C intervals [a_i, b_i], i = 1, 2, …, C
• For each interval, the subset Y_i = Y ∩ [a_i, b_i] is computed
• The average value ŷ_i of each Y_i is computed
• New labels: L̂ = {(x, ŷ) ∈ D̂ | y ≠ □}
• The number of intervals C is a hyperparameter
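As an illustration of this pre-processing step, a minimal sketch that assumes equal-width intervals (the function name is ours):

```python
import numpy as np

def relabel_equal_width(y_labeled, C):
    """Discretize labeled targets into C equal-width intervals and replace
    each target with the average value of its interval (sketch)."""
    edges = np.linspace(y_labeled.min(), y_labeled.max(), C + 1)
    # np.digitize returns bin indices 1..C; the right edge falls into C+1,
    # so clip it back into the last bin, then shift to 0-based indices
    bins = np.clip(np.digitize(y_labeled, edges), 1, C) - 1
    means = np.array([y_labeled[bins == i].mean() if np.any(bins == i)
                      else 0.5 * (edges[i] + edges[i + 1]) for i in range(C)])
    return means[bins], means  # relabeled targets ŷ and the C interval averages
```

For example, targets {0, 1, 2, 3} with C = 2 produce the intervals [0, 1.5] and [1.5, 3], whose averages 0.5 and 2.5 become the new discrete labels.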
8. SSFCM-R Pre-processing
• Three discretization strategies:
• D1: equal-width discretization, separating all possible values into C bins, each having the same width;
• D2: equal-frequency discretization, separating all possible values into C bins, each having the same number of observations;
• D3: the intervals are defined on the basis of the centroids produced by K-Means clustering
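The three strategies can be sketched as interval-edge builders (helper names are ours; for D3 we assume a simple one-dimensional Lloyd iteration with quantile initialization and midpoints between sorted centers as borders, which is one plausible reading of the slide):

```python
import numpy as np

def edges_equal_width(y, C):
    # D1: C bins of identical width over [min(y), max(y)]
    return np.linspace(y.min(), y.max(), C + 1)

def edges_equal_freq(y, C):
    # D2: C bins holding (roughly) the same number of observations
    return np.quantile(y, np.linspace(0, 1, C + 1))

def edges_kmeans(y, C, iters=50):
    # D3: borders derived from 1-D K-Means centroids; quantile
    # initialization keeps the sketch deterministic
    centers = np.quantile(y, np.linspace(0, 1, 2 * C + 1)[1::2])
    for _ in range(iters):
        assign = np.abs(y[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(C):
            if np.any(assign == k):
                centers[k] = y[assign == k].mean()
    centers.sort()
    mids = (centers[:-1] + centers[1:]) / 2
    return np.concatenate(([y.min()], mids, [y.max()]))
```

On uniformly spread targets the three strategies coincide; they diverge when the target distribution is skewed or clustered, which is exactly what the experiments vary.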
9. SSFCM-R Post-processing
Given a new input x ∈ 𝒳, the estimated value y is computed according to one out of two possible strategies:
• max: the closest prototype c_k to x is determined, and the output y_max corresponds to its class label ŷ_{i_k}
• sum: the membership degrees of x to each cluster are determined by using SSFCM, and the estimated value corresponds to the weighted average

  y_sum = \sum_{k=1}^{K} u_k(x) ŷ_{i_k}
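A sketch of the two post-processing strategies, assuming standard FCM-style memberships with fuzzifier m = 2 (function names are ours):

```python
import numpy as np

def fcm_memberships(x, centroids, m=2):
    """FCM membership of x to each cluster: u_k = 1 / sum_s (d_k/d_s)^(2/(m-1))."""
    d2 = ((centroids - x) ** 2).sum(axis=1)
    if np.any(d2 == 0):            # x coincides with a centroid: crisp membership
        u = (d2 == 0).astype(float)
        return u / u.sum()
    inv = 1.0 / d2 ** (1.0 / (m - 1))
    return inv / inv.sum()

def predict_max(x, centroids, y_hat):
    # max: output the label of the closest prototype
    k = ((centroids - x) ** 2).sum(axis=1).argmin()
    return y_hat[k]

def predict_sum(x, centroids, y_hat, m=2):
    # sum: membership-weighted average of the prototype labels
    u = fcm_memberships(x, centroids, m)
    return float(u @ y_hat)
```

The max strategy yields a piecewise-constant output (one value per cluster), while sum interpolates between the cluster labels, which is why it can reach lower errors on continuous targets.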
11. Experimental settings
• Three synthetic datasets
• Eight labeling percentages
• Three bin sizes
• Three discretization strategies
• Two post-processing methods
• Evaluation metrics: MSE and TIME
15. Conclusions and future work
• SSFCM-R leverages a discretization mechanism to move from a continuous domain to a discrete one
• The influence of data complexity, discretization strategy, labeling percentage, and number of bins on the results has been studied
• The equal-width strategy has proven to be the most effective
• A small number of bins is preferable
• The post-processing method sum achieved lower errors than the max method
• Future work:
• Study different discretization strategies
• The effectiveness of the proposed approach will be evaluated on real-world applications
• It will be compared with other semi-supervised regression algorithms