DisCo: Distributed Co-clustering with Map-Reduce
2008 IEEE International Conference on Data Engineering (ICDE)
Paper by Spiros Papadimitriou and Jimeng Sun, IBM T.J. Watson Research Center, NY, USA

Paper study presented by Tzu-Li Tai, Tse-En Liu, Kai-Wei Chan, He-Chuan Hoh
National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory

  1. DisCo: Distributed Co-clustering with Map-Reduce. 2008 IEEE International Conference on Data Engineering (ICDE). Presented by Tzu-Li Tai, Tse-En Liu, Kai-Wei Chan, He-Chuan Hoh, National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory. Paper by S. Papadimitriou and J. Sun, IBM T.J. Watson Research Center, NY, USA.
  2. Agenda
     A. Motivation
     B. Background: Co-Clustering + MapReduce
     C. Proposed Distributed Co-Clustering Process
     D. Implementation Details
     E. Experimental Evaluation
     F. Conclusions
     G. Discussion
  3. Motivation: Fast Growth in Volume of Data
     • Google processes 20 petabytes of data per day
     • Amazon and eBay handle petabytes of transactional data every day
     Highly variant structure of data
     • Data sources naturally generate data in impure forms
     • Unstructured, semi-structured
  4. Motivation: Problems with Big Data Mining for DBMSs
     • Significant preprocessing costs for the majority of data mining tasks
     • DBMSs lack the performance needed for large amounts of data
  5. Motivation: Why distributed processing can solve these issues
     • MapReduce is agnostic to the schema or form of the input data
     • Many preprocessing tasks are naturally expressible in MapReduce
     • Highly scalable with commodity machines
  6. Motivation: Contributions of this paper
     • Presents the whole process for distributed data mining
     • Specifically, focuses on the co-clustering mining task and designs a distributed co-clustering method using MapReduce
  7. Background: Co-Clustering
     • Also named biclustering or two-mode clustering
     • Input format: a matrix of m rows and n columns
     • Output: co-clusters (sub-matrices) whose rows exhibit similar behavior across a subset of columns
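In the notation used on the later slides (this is our own summary of slides 17 to 26, not a quote from the paper), a co-clustering of an m x n matrix A is a pair of label assignments together with the block-sum matrix G:

    r : \{1,\dots,m\} \to \{1,\dots,k\}, \qquad
    c : \{1,\dots,n\} \to \{1,\dots,l\}, \qquad
    G = (g_{pq}), \quad
    g_{pq} = \sum_{i\,:\,r(i)=p} \; \sum_{j\,:\,c(j)=q} A_{ij},

so g_{pq} counts the nonzeros of A that fall into co-cluster (p, q).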
  8. Background: Co-Clustering. Why Co-Clustering? Traditional clustering on the student score matrix
       Student A: 0 1 0 1 1
       Student B: 1 0 1 0 0
       Student C: 0 1 0 1 1
       Student D: 1 0 1 0 0
     groups rows only (A & C, B & D): we can only tell that students A & C and B & D have similar scores.
  9. Background: Co-Clustering. Why Co-Clustering? Co-clustering the same matrix and permuting rows and columns by cluster gives
       Student D: 1 1 0 0 0
       Student B: 1 1 0 0 0
       Student C: 0 0 1 1 1
       Student A: 0 0 1 1 1
     revealing Cluster 1 (B & D: good at Science + Math) and Cluster 2 (A & C: good at English + Chinese + Social Studies): rows that have similar properties for a subset of selected columns.
  10.-12. Background: Co-Clustering. Another Co-Clustering Example: Animal Data (three figure-only slides).
  13. Background: MapReduce. The MapReduce Paradigm: a map function turns each input pair (k1, v1) into intermediate pairs (k2, v2); all values with the same intermediate key are grouped into (k2, [V2]) and handed to a reduce function, which emits result pairs (k3, v3). Many map and reduce tasks run in parallel (figure).
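A minimal single-process sketch of this paradigm, used for illustration in the later slides (the helper name run_mapreduce and the word-count example are ours, not the paper's or Hadoop's API):

from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Toy illustration of the (k1, v1) -> (k2, v2) -> (k2, [V2]) -> (k3, v3) flow.
    groups = defaultdict(list)
    for k1, v1 in records:                 # map phase
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)          # shuffle & sort: group values by key
    out = []
    for k2, values in groups.items():      # reduce phase: one call per key
        out.extend(reducer(k2, values))
    return out

# Example: word count.
docs = [(1, "map reduce map"), (2, "reduce")]
counts = run_mapreduce(
    docs,
    mapper=lambda _id, text: [(word, 1) for word in text.split()],
    reducer=lambda word, ones: [(word, sum(ones))],
)
# counts == [("map", 2), ("reduce", 2)]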
  14.-15. Distributed Co-Clustering Process: Mining Network Logs to Co-Cluster Communication Behavior (figure-only slides).
  16. Distributed Co-Clustering Process: The Preprocessing Process. Network logs in HDFS go through MapReduce jobs that extract SrcIP + DstIP pairs and build the adjacency matrix, build the adjacency list, and build the transpose adjacency list, with each intermediate result written back to HDFS (figure shows the IPAddress x IPAddress 0/1 adjacency matrix).
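A sketch of how the adjacency-list job could be written against the toy run_mapreduce above (our own illustration; the whitespace-separated log format with SrcIP and DstIP as the first two fields is an assumption, not the paper's actual log layout):

# Build the (SrcIP, [DstIP]) adjacency list from raw log lines.
def adjacency_mapper(_offset, line):
    src, dst = line.split()[:2]    # assumed field positions
    yield src, dst

def adjacency_reducer(src, dsts):
    yield src, sorted(set(dsts))

logs = [(0, "10.0.0.1 10.0.0.2"), (1, "10.0.0.1 10.0.0.3"), (2, "10.0.0.2 10.0.0.1")]
adjacency = run_mapreduce(logs, adjacency_mapper, adjacency_reducer)
# adjacency == [("10.0.0.1", ["10.0.0.2", "10.0.0.3"]), ("10.0.0.2", ["10.0.0.1"])]
# The transpose adjacency list is the same job with (dst, src) emitted instead.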
  17. Distributed Co-Clustering Process: Co-Clustering (Generalized Algorithm). Goal: co-cluster the 4 x 5 matrix
        0 1 0 1 1
        1 0 1 0 0
        0 1 0 1 1
        1 0 1 0 0
      into 2 x 2 = 4 sub-matrices, i.e. row labels in {1, 2} (k = 2) and column labels in {1, 2} (l = 2).
      Random initialization: r = (1, 1, 1, 2), c = (1, 1, 1, 2, 2),
      $G = \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{pmatrix} = \begin{pmatrix} 4 & 4 \\ 2 & 0 \end{pmatrix}$.
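As a check of the slide's numbers, the initial G follows directly from the block-sum definition above:

    g_{11} = \sum_{i \in \{1,2,3\}} \sum_{j \in \{1,2,3\}} A_{ij} = 1 + 2 + 1 = 4, \qquad
    g_{12} = \sum_{i \in \{1,2,3\}} \sum_{j \in \{4,5\}} A_{ij} = 2 + 0 + 2 = 4,
    g_{21} = \sum_{j \in \{1,2,3\}} A_{4j} = 2, \qquad
    g_{22} = \sum_{j \in \{4,5\}} A_{4j} = 0.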
  18. Distributed Co-Clustering Process: Co-Clustering (Generalized Algorithm). Fix the column labels and iterate through the rows: moving row 2 to row group 2 lowers the cost, so r(2) = 2. Now r = (1, 2, 1, 2), c = (1, 1, 1, 2, 2), and $G = \begin{pmatrix} 2 & 4 \\ 4 & 0 \end{pmatrix}$.
  19. Distributed Co-Clustering Process: Co-Clustering (Generalized Algorithm). Fix the row labels and iterate through the columns: here column 2 is reassigned, c(2) = 2, so r = (1, 2, 1, 2), c = (1, 2, 1, 2, 2), and G is updated accordingly (figure shows the permuted matrix).
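A compact serial sketch of this alternating scheme (our own rendering; the objective that DisCo minimizes is abstracted into a generic cost(G) argument, matching the slides' description of the cost as a function of G):

import numpy as np

def block_sums(A, r, c, k, l):
    # G with G[p, q] = sum of A over rows labeled p and columns labeled q (0-based labels).
    G = np.zeros((k, l))
    for p in range(k):
        for q in range(l):
            G[p, q] = A[np.ix_(r == p, c == q)].sum()
    return G

def cocluster(A, k, l, cost, n_sweeps=20, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    r = rng.integers(0, k, size=m)               # random initial row labels
    c = rng.integers(0, l, size=n)               # random initial column labels
    for _ in range(n_sweeps):
        for i in range(m):                       # fix column labels, iterate rows
            trials = []
            for p in range(k):
                r[i] = p
                trials.append(cost(block_sums(A, r, c, k, l)))
            r[i] = int(np.argmin(trials))
        for j in range(n):                       # fix row labels, iterate columns
            trials = []
            for q in range(l):
                c[j] = q
                trials.append(cost(block_sums(A, r, c, k, l)))
            c[j] = int(np.argmin(trials))
    return r, c, block_sums(A, r, c, k, l)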
  20. Distributed Co-Clustering Process: Co-Clustering with MapReduce. The matrix is stored as adjacency lists (row -> columns holding nonzeros), produced by a MapReduce job:
        1 -> 2, 4, 5
        2 -> 1, 3
        3 -> 2, 4, 5
        4 -> 1, 3
  21. Distributed Co-Clustering Process: Co-Clustering with MapReduce. r, c, G are randomly initialized from the parameters k, l and made available to every map task alongside its adjacency-list record: r = (1, 1, 1, 2), c = (1, 1, 1, 2, 2), $G = \begin{pmatrix} 4 & 4 \\ 2 & 0 \end{pmatrix}$.
  22. Distributed Co-Clustering Process: Mapper example (row 1). Input (key_in, value_in) = (1, {2, 4, 5}); each mapper also receives r, c, G.
      With v = {2, 4, 5} and c = (1, 1, 1, 2, 2), the row's per-column-group counts are ℊ1 = (1, 2). Setting r(1) = 2 would increase the cost, so r(1) stays 1, and the mapper emits (r(k), (ℊk, k)) = (1, ((1, 2), 1)).
      Mapper function, for each K-V input (k, v):
        1. Calculate ℊk (from v and c)
        2. Change the row label if that results in a lower cost (a function of G)
        3. Emit (r(k), (ℊk, k))
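Where ℊ1 = (1, 2) comes from: it is the count of row 1's nonzero columns falling into each column group,

    g_1 = \bigl(\,|\{\,j \in \{2,4,5\} : c(j) = 1\,\}|,\ |\{\,j \in \{2,4,5\} : c(j) = 2\,\}|\,\bigr)
        = \bigl(|\{2\}|,\ |\{4,5\}|\bigr) = (1,\ 2).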
  23. Distributed Co-Clustering Process: Mapper example (row 2). Input (2, {1, 3}); with v = {1, 3} and c = (1, 1, 1, 2, 2), ℊ2 = (2, 0). Setting r(2) = 2 lowers the cost, so r(2) becomes 2 and the mapper emits (r(k), (ℊk, k)) = (2, ((2, 0), 2)). (Same mapper function as above.)
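A sketch of this row-iteration mapper (our own Python rendering of the slides' three-step description; labels are 1-based as on the slides, and the cost function, some_cost in the comment, is a placeholder rather than the paper's actual objective):

def row_iteration_mapper(row_id, nonzero_cols, r, c, G, k, cost):
    # One adjacency-list record: row_id -> column indices of its nonzeros.
    # r, c are lists of 1-based labels, G the current k x l group matrix.
    l = len(G[0])
    # Step 1: per-column-group nonzero counts of this row (g_k on the slides).
    g_k = [sum(1 for j in nonzero_cols if c[j - 1] == q + 1) for q in range(l)]

    def g_if_row_in(p):
        # G after moving this row from its current group to group p.
        H = [list(row) for row in G]
        for q in range(l):
            H[r[row_id - 1] - 1][q] -= g_k[q]
            H[p - 1][q] += g_k[q]
        return H

    # Step 2: keep the row label that yields the lowest cost.
    best = min(range(1, k + 1), key=lambda p: cost(g_if_row_in(p)))
    # Step 3: emit (new row label, (g_k, row id)).
    yield best, (tuple(g_k), row_id)

# e.g. next(row_iteration_mapper(1, {2, 4, 5}, [1, 1, 1, 2], [1, 1, 1, 2, 2],
#                                [[4, 4], [2, 0]], k=2, cost=some_cost))
# emits (1, ((1, 2), 1)) when some_cost prefers keeping row 1 in group 1, as on the slide.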
  24. Distributed Co-Clustering Process: Shuffle. The four mappers emit the intermediate pairs
        (1, ((1, 2), 1)), (2, ((2, 0), 2)), (1, ((1, 2), 3)), (2, ((2, 0), 4)),
      which are grouped by key into the reducer inputs
        (1, [((1, 2), 1), ((1, 2), 3)]) and (2, [((2, 0), 2), ((2, 0), 4)]).
  25. Distributed Co-Clustering Process: Reducer example (key 1). Input (1, [((1, 2), 1), ((1, 2), 3)]): ℊ1 = (1, 2) + (1, 2) = (2, 4), I1 = {1, 3}, so the reducer emits (1, ((2, 4), {1, 3})).
      Reducer function, for each K-V input (k, [V]), where each element of [V] is a pair (g, I):
        1. Accumulate all g into ℊk
        2. Ik = union of all I
        3. Emit (k, (ℊk, Ik))
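And a matching sketch of the reducer (again our own rendering of the slide's accumulate-and-union description):

def row_iteration_reducer(label, values):
    # values: list of (g_k, row_id) pairs emitted by mappers for rows assigned to this label.
    l = len(values[0][0])
    g_total = [0] * l
    rows = set()
    for g_k, row_id in values:
        for q in range(l):
            g_total[q] += g_k[q]     # accumulate the per-column-group counts
        rows.add(row_id)             # union of the row ids in this group
    yield label, (tuple(g_total), rows)

# e.g. next(row_iteration_reducer(1, [((1, 2), 1), ((1, 2), 3)]))
# == (1, ((2, 4), {1, 3})), matching slide 25.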
  26. Distributed Co-Clustering Process: Sync results. The reducer outputs (1, ((2, 4), {1, 3})) and (2, ((4, 0), {2, 4})) are synced into the new global state: r = (1, 2, 1, 2), c = (1, 1, 1, 2, 2), $G = \begin{pmatrix} 2 & 4 \\ 4 & 0 \end{pmatrix}$.
  27. Distributed Co-Clustering Process: Putting it together. Preprocessing: HDFS -> MapReduce jobs that build the adjacency and transpose adjacency lists -> HDFS. Co-Clustering: start from random r, c, G given k, l; run a row-iteration MapReduce job (columns fixed) and sync results to obtain the synced r, c, G with the best r permutation; then run a column-iteration MapReduce job (rows fixed); alternating these yields the final co-clustering result with the best r, c permutations.
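A sketch of the driver loop this flow implies, reusing the toy run_mapreduce, row_iteration_mapper and row_iteration_reducer from above (illustrative only; in DisCo each iteration is a Hadoop job and the driver program performs the sync):

def row_iteration_job(adjacency, r, c, G, k, cost):
    # One row iteration: mappers relabel rows, reducers aggregate per row group,
    # then the driver syncs r and G from the reducer outputs.
    results = run_mapreduce(
        adjacency,
        mapper=lambda i, cols: row_iteration_mapper(i, cols, r, c, G, k, cost),
        reducer=row_iteration_reducer,
    )
    for label, (g_total, row_ids) in results:
        G[label - 1] = list(g_total)          # sync this row group's block sums
        for i in row_ids:
            r[i - 1] = label                  # sync the row labels
    return r, G

# Row groups that received no rows keep their previous G row in this sketch.
# The column iteration is the same job run on the transpose adjacency list with
# the roles of r and c (and of k and l) swapped; the two jobs alternate until the
# labels stop changing, as in the flow above.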
  28. Implementation Details: Tuning the number of Reduce Tasks
      • The number of reduce tasks is related to the number of intermediate keys during the shuffle-and-sort phase
      • For the co-clustering row-iteration / column-iteration jobs, the number of intermediate keys is only k (or l)
  29. Implementation Details: In the worked row-iteration example, k = 2, so there are only 2 intermediate keys (same mapper/reducer dataflow figure as slide 24).
  30. Implementation Details: Tuning the number of Reduce Tasks
      • For the row-iteration / column-iteration jobs, 1 reduce task is enough
      • However, preprocessing tasks such as graph construction produce many intermediate keys and need far more reduce tasks
  31. Implementation Details: The Preprocessing Process, revisited: the adjacency-list job's output is keyed as (SrcIP, [DstIP]), i.e. one intermediate key per source IP, which is why graph construction needs many more reduce tasks (same pipeline figure as slide 16).
  32. Experimental Evaluation: Environment
      • 39 nodes across four blade enclosures, connected by Gigabit Ethernet
      • Blade server: two dual-core CPUs (Intel Xeon 2.66 GHz), 8 GB memory, Red Hat Enterprise Linux
      • Hadoop Distributed File System (HDFS) capacity: 2.4 TB
  33. Experimental Evaluation: Datasets (table of datasets shown in the original slide).
  34. Experimental Evaluation: Preprocessing ISS Data
      • Optimal values found for each situation: number of map tasks, number of reduce tasks, and input split size (256 MB), shown in the slide's charts
  35. Experimental Evaluation: Co-Clustering TREC Data
      • Beyond 25 nodes, each iteration takes roughly 20 ± 2 seconds, which is better than what can be obtained on a single machine with 48 GB RAM
  36. Conclusion
      • The authors share lessons learned from mining vast quantities of data, particularly in the context of co-clustering, and recommend a distributed approach
      • Designed a general MapReduce approach for co-clustering algorithms
      • Showed that the MapReduce co-clustering framework scales well on real-world large datasets (ISS, TREC)
  37. Discussion
      • Necessity of the global r, c, G sync action
      • Questionable scalability of DisCo
  38. Discussion: Necessity of the global r, c, G sync action. The co-clustering flow from slide 27 is shown again: each row-iteration or column-iteration MapReduce job is followed by a global sync of r, c, G before the next job can start.
  39. Discussion: The row-iteration mapper/reducer dataflow from slides 22 to 24 is shown again in support of the sync discussion.
  40. Discussion: Questionable Scalability of DisCo
      • For row-iteration (or column-iteration) jobs, the number of intermediate keys is fixed to k (or l)
      • This implies that for a given k and l, as the input matrix grows, the reducer size* increases dramatically
      • Since a single reducer (a key plus its associated values) is sent to one reduce task, the memory capacity of a single computing node becomes a severe bottleneck for overall performance
      *Reference: Upper Bound and Lower Bound of a MapReduce Computation, VLDB 2013
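A rough estimate of why this matters (our own back-of-the-envelope figure, assuming rows are spread roughly evenly over the k row groups): each mapped row contributes one l-vector plus a row id to exactly one of the k reduce keys, so

    \text{reducer input size per key} \;\approx\; \frac{m}{k}\,(l + 1)\ \text{values},

which grows linearly with the number of rows m for fixed k and l, no matter how many machines are added to the cluster.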
  41. Discussion: The same row-iteration dataflow is shown once more: all rows that map to the same label are collected by a single reducer, which is the source of the scalability concern.