This is the material presented at the RecSys 2015 paper-reading meetup held on 2015/10/17. I read Hui Li et al., "Overlapping Community Regularization for Rating Prediction in Social Recommender Systems." Due to time constraints, quite a lot has been omitted.
All Rights Reserved. Copyright (c) Recruit Technologies Co., Ltd.
Related Works
Low Rank Representation (LRR)
A … a given d×n data matrix
When A is clean, find a low-rank matrix X satisfying A = AX:
min_X ||X||_*  s.t.  A = AX
( ||・||_* is the nuclear norm, the sum of the singular values )
The solution is X = V_r V'_r, where A = U_r Σ_r V'_r is the skinny SVD of A.
Related Works
When A is not clean, solve the following optimization problem instead:
min_{X,E} ||X||_* + λ||E||_{2,1}  s.t.  A = AX + E
When the data can be cleanly separated into k subspaces, the optimal X becomes block-diagonal (nonzero only within each subspace).
Related Works
Sparse Subspace Clustering (SSC)
min_{X,S,E} ||X||_1 + λ_S||S||_1 + (λ_E/2)||E||_F^2  s.t.  A = AX + S + E,  diag(X) = 0
・S is a sparse matrix that absorbs large errors (outliers)
・E is the residual (dense noise) matrix
・the constraint diag(X) = 0 rules out the trivial solution X = I
The obtained X is used to build the affinity matrix W = |X| + |X'|, and spectral clustering is applied.
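The last step can be sketched end to end with numpy; this is my own toy illustration, using an assumed perfectly block-diagonal X so the final k-means step is unnecessary (the spectral embedding already separates exactly):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy self-representation matrix X with two groups of 5 points each:
# every point is written only in terms of its own group (block pattern).
n, k = 10, 2
X = np.zeros((n, n))
X[:5, :5] = rng.uniform(0.1, 1.0, (5, 5))
X[5:, 5:] = rng.uniform(0.1, 1.0, (5, 5))
np.fill_diagonal(X, 0.0)                 # SSC's diag(X) = 0 convention

# Affinity matrix as on the slide: W = |X| + |X'|
W = np.abs(X) + np.abs(X.T)

# Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
deg = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt

# Spectral embedding: eigenvectors of the k smallest eigenvalues,
# row-normalized (Ng-Jordan-Weiss style).
vals, vecs = np.linalg.eigh(L)
emb = vecs[:, :k].copy()
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# With a perfectly block-diagonal W, rows of the embedding coincide
# within a group and differ across groups, so clustering is trivial.
print(np.allclose(emb[0], emb[4]))       # -> True  (same group)
print(np.allclose(emb[0], emb[9]))       # -> False (different groups)
```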
Proposed method
Log-determinant Function
nuclear normは大きな特異値が存在する場合にrankの
近似精度が悪くなる
σiが0のときはlog-det function F(X)への寄与は0
σiが0より大きいとき,
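A small numeric illustration of this point, assuming the log-det surrogate takes the common form F(X) = Σ_i log(1 + σi²) (an assumption on my part, but consistent with the zero-contribution property stated on this slide):

```python
import numpy as np

def nuclear_norm(s):
    # Sum of singular values.
    return s.sum()

def log_det_surrogate(s):
    # Assumed form F(X) = sum_i log(1 + sigma_i^2); a zero singular
    # value contributes 0, matching the property on the slide.
    return np.log1p(s ** 2).sum()

# Two rank-2 spectra: balanced singular values vs. one huge leading
# singular value. The true rank is identical (2) in both cases.
balanced = np.array([1.0, 1.0, 0.0, 0.0])
spiky    = np.array([1000.0, 1.0, 0.0, 0.0])

# The nuclear norm is dominated by the large singular value ...
print(nuclear_norm(balanced), nuclear_norm(spiky))      # -> 2.0 1001.0
# ... while the log-det surrogate stays modest for both rank-2 spectra.
print(round(float(log_det_surrogate(balanced)), 2),
      round(float(log_det_surrogate(spiky)), 2))        # -> 1.39 14.51
```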
Proposed method
We place the following assumptions on the data matrix A (A = B + S + E):
・B is the unknown clean data matrix, satisfying B = BX
・S is a sparse matrix representing gross errors → handles outliers
・E is Gaussian noise
Under these assumptions, the objective function is
min_{X,B,S,E} F(X) + α||S||_ℓ + (β/2)||E||_F^2  s.t.  A = B + S + E,  B = BX
Proposed method
Relax the equality constraint B = BX.
Using Augmented Lagrangian Multipliers, each parameter is updated in turn → setting Y = I − X (so that B = BX becomes BY = 0), the problem is rewritten in ALM form.
Optimization
whole ALM Algorithm
0. input α, β, γ, µ, A
1. initialize S0, Y0, Λ0, ρ0, and k=0
2. Update X
3. Update B
4. Update S
5. Update Y
6. Update Λ and ρ
7. repeat 2-6 until convergence or k = kmax
The crucial step is the update of X!
Optimization
Update B
Update S
The paper presents the ℓ = 1 and ℓ = 2,1 cases. When ℓ = 1, the S-update reduces to elementwise soft-thresholding (for ℓ = 2,1, to columnwise shrinkage).
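For the ℓ = 1 case, soft-thresholding is the proximal operator of the ℓ1 norm. A minimal sketch, with `tau` standing in for the slide's threshold constant (its exact value, depending on α and ρ, is not shown here):

```python
import numpy as np

def soft_threshold(M, tau):
    """Elementwise soft-thresholding: the prox of tau * ||.||_1.

    Shrinks every entry toward zero by tau; entries with |m| <= tau
    become exactly zero, which is what makes the S-update sparse.
    """
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

M = np.array([[3.0, -0.5],
              [0.2, -2.0]])
S = soft_threshold(M, 1.0)
print(S[0, 0], S[1, 1])   # -> 2.0 -1.0  (small entries were zeroed out)
```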
Optimization
Update Y
Update Λ and ρ
Optimization
Update X
Here the log-det function is unitarily invariant
and differentiable with respect to each σi.
Optimization
Moreau-Yoshida proximity operator
The optimal X is then obtained as follows:
Optimization
Proof)
Optimization
When σi^D > 0, there is at least one solution contained in (0, σi^D).
When σi^D = 0, the solution is 0.
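Assuming the per-singular-value subproblem has the form min_{σ≥0} log(1+σ²) + (ρ/2)(σ − σ^D)² (my reading of the Moreau-Yosida prox step; the exact scaling is an assumption), its stationary points solve a cubic, and the solution behaves exactly as stated above:

```python
import numpy as np

def prox_sigma(sigma_d, rho):
    """Solve min_{s >= 0} log(1 + s^2) + (rho/2) * (s - sigma_d)^2.

    Setting the derivative 2s/(1+s^2) + rho*(s - sigma_d) to zero and
    multiplying through by (1 + s^2) gives the cubic
        rho*s^3 - rho*sigma_d*s^2 + (rho + 2)*s - rho*sigma_d = 0.
    We compare the objective over its nonnegative real roots (plus s = 0).
    """
    obj = lambda s: np.log1p(s ** 2) + 0.5 * rho * (s - sigma_d) ** 2
    roots = np.roots([rho, -rho * sigma_d, rho + 2.0, -rho * sigma_d])
    cands = [0.0] + [r.real for r in roots
                     if abs(r.imag) < 1e-10 and r.real >= 0.0]
    return min(cands, key=obj)

# sigma^D = 0 gives the zero solution; sigma^D > 0 gives a solution
# strictly inside (0, sigma^D), matching the statement on the slide.
print(prox_sigma(0.0, 1.0))          # -> 0.0
s = prox_sigma(3.0, 1.0)
print(0.0 < s < 3.0)                 # -> True
```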
Optimization
whole ALM Algorithm
0. input α, β, γ, µ, A
1. initialize S0, Y0, Λ0, ρ0, and k=0
2. Update X
3. Update B
4. Update S
5. Update Y
6. Update Λ and ρ
7.repeat 2-6 until convergence or k = kmax
Subspace Clustering
Clustering is performed using the X* obtained from the minimization problem above.
Normalization (of the affinity matrix)
If φ is too large, between-cluster separation increases, but within-cluster variance also grows.
Experiments
・Application to face image clustering (Extended Yale B data)
38 subjects × 64 images each, downsampled to 48×42 pixels and vectorized
Subjects divided into four groups: (1-10, 11-20, 21-30, 31-38)
n ∈ {2, 3, 5, 8, 10} subjects chosen within a group (n ∈ {2, 3, 5, 8} for the last group)
in order to keep the number of clusters manageable
Experiments
Experiments
Change in the misclassification rate as α, β, γ vary
SCLA1 vs SCLA2,1
Summary
・The basic idea is the same as LRR
・The nuclear norm simply sums the singular values, so it is a mediocre rank approximation → use the log-det function instead!
・There are many hyperparameters, and classification accuracy can vary considerably depending on the data (the Experiments section is a bit questionable)
・Good material for catching up on recent trends in clustering and dimension reduction