SNA & R (20121011)

2012. 10
BIG DATA Warehouse Code Workshop
송주영
bt22dr@gmail.com

Agenda
I. Big Data Warehouse
II. SNA & R
III. Real-time MapReduce

II. SNA & R : Overview
 Social Network Analysis
• Social Influencer (Centrality)
• Degree centrality
• Eigenvector centrality
 Network Analysis with R
• Social network clustering
• Network graph visualization (Gephi)

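Social Influencer
 SNA (Social Network Analysis)
   An analysis technique that identifies the actors playing key roles within the connections among people, things, organizations, technologies, and resources, and suggests more effective ways to use and organize people and resources
   Applicable in any field where social relationships can be understood and analyzed from the viewpoint of network theory
 Applications
   Visualization (social network diagram)
   Centrality
   Community
   Recommendation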

Social Influencer
 Centrality
   A way to measure influence (importance) in a social network
 Social Graph

              Vertex   Edge
  Facebook    People   Relationship (friends, marriage, dating, classmates, colocation, etc.)
  Foursquare  Places   Signal (travel routes, co-visitation, menu, people, etc.)

 Types of centrality
   Degree centrality
   Betweenness centrality
   Closeness centrality
   Eigenvector centrality

Degree Centrality
 Adjacency Matrix
 Network diagram
[Figure: network diagram contrasting a single person with high degree and a single person with low degree but high connectivity (e.g., a CEO)]
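
As a minimal sketch of how degree centrality falls out of an adjacency matrix: in a directed graph, row sums give out-degree and column sums give in-degree. The four-person matrix and the names below are hypothetical and are not taken from the slide's figure.

```python
# Hypothetical 4-person directed graph; A[i][j] = 1 means person i contacted person j.
names = ["A", "B", "C", "D"]
A = [
    [0, 1, 1, 1],  # A contacted B, C, D
    [1, 0, 0, 0],  # B contacted A
    [0, 0, 0, 1],  # C contacted D
    [0, 0, 0, 0],  # D contacted nobody
]

def out_degree(adj):
    """Row sums: how many distinct people each vertex contacted."""
    return [sum(row) for row in adj]

def in_degree(adj):
    """Column sums: how many distinct people contacted each vertex."""
    return [sum(col) for col in zip(*adj)]

out_deg = dict(zip(names, out_degree(A)))
in_deg = dict(zip(names, in_degree(A)))
print(out_deg)
print(in_deg)
```

Here A has the highest out-degree while D has the highest in-degree, which mirrors the caller/receiver split used later in the deck.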

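Degree Centrality
 MapReduce implementation
1. Extract the caller and receiver phone numbers
2. Aggregate the call list for each user
3. Sort by a suitable criterion (number of calls, number of distinct contacts, call duration, etc.)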
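
The three steps can be sketched in plain Python (no Hadoop involved) to show how the choice of sort criterion changes the ranking. The records and the helper names `call_lists` and `rank` are hypothetical; call duration is omitted because the sample records carry no duration field.

```python
from collections import Counter, defaultdict

# Hypothetical call records, one "caller receiver" pair per line.
records = ["A B", "A C", "D E", "D E", "D E"]

def call_lists(lines):
    """Steps 1-2: extract (caller, receiver) and count calls per pair, grouped by caller."""
    lists = defaultdict(Counter)
    for line in lines:
        caller, receiver = line.split()
        lists[caller][receiver] += 1
    return lists

def rank(lists, key):
    """Step 3: sort users in descending order of the chosen criterion."""
    return sorted(lists, key=lambda user: key(lists[user]), reverse=True)

lists = call_lists(records)
by_contacts = rank(lists, key=len)                     # number of distinct contacts
by_calls = rank(lists, key=lambda c: sum(c.values()))  # total number of calls
print(by_contacts, by_calls)
```

With these records, A ranks first by distinct contacts while D ranks first by total calls, so the criterion should match the kind of influencer being sought.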

Degree Centrality
 MapReduce (Key, Value) design

map : (k1, v1) -> list(k2, v2)
reduce : (k2, list(v2)) -> list(v3)

Input file
A B
A C
D E
D E

Mapper
Input (k1 : line number, v1 : input line)
1, A B
2, A C
3, D E
4, D E
Output (k2 : caller, v2 : receiver)
A, B
A, C
D, E
D, E

Sort/merge (intermediate map output)
A, [B, C]
D, [E, E]

Reducer
Input (k2 : caller, v2 : receiver)
Output (v3 : call list)

Output file
A, [(B,1), (C,1)]
D, [(E,2)]
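
The key/value flow above can be simulated in a few lines of Python. This is only an in-memory sketch of the map, sort/merge, and reduce stages, not Hadoop code; the function names `map_fn`, `shuffle`, and `reduce_fn` are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import chain

def map_fn(k1, v1):
    """map: (line number, input line) -> list of (caller, receiver)."""
    caller, receiver = v1.split()
    return [(caller, receiver)]

def shuffle(pairs):
    """Sort/merge: group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)
    return dict(sorted(groups.items()))

def reduce_fn(k2, values):
    """reduce: (caller, list of receivers) -> call list of (receiver, count)."""
    return sorted(Counter(values).items())

lines = ["A B", "A C", "D E", "D E"]
intermediate = chain.from_iterable(
    map_fn(i, line) for i, line in enumerate(lines, start=1)
)
grouped = shuffle(intermediate)
output = {k2: reduce_fn(k2, vs) for k2, vs in grouped.items()}
print(grouped)
print(output)
```

Swapping caller and receiver inside `map_fn` would produce the receiver-keyed lists instead.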

Degree Centrality
 Test with sample data
 Input data
[hadoop@cudatest hadoop]$ bin/hadoop dfs -cat
01090084558 0113924558
01063422787 0113924558
01031223732 01055733732
01040446318 01050454111
01093400831 01041491515
01062576087 01041491515
0109187791 01044122476
01191877911 01091493641
01091877911 01040011473
01091877911 0113924558
01091877911 01190679710
01091877911 01190679710
01091877911 #8910
01065326460 #8910
01065326460 0112287440
01065326460 0112287440
01062570000 01190679710
01062570001 01190679710
01062570002 01190679710
01062570003 01190679710
01062570003 01190677777

Degree Centrality
Call lists by caller (hubVectors) and by receiver (authorityVectors)
[hadoop@cudatest hadoop]$ bin/hadoop dfs -text /user/hadoop/CDR_test/output/hubVectors/part-r-00000
109187791 {1044122476:1.0}
1031223732 {1055733732:1.0}
1040446318 {1050454111:1.0}
1062570000 {1190679710:1.0}
1062570001 {1190679710:1.0}
1062570002 {1190679710:1.0}
1062570003 {1190679710:1.0,1190677777:1.0}
1062576087 {1041491515:1.0}
1063422787 {113924558:1.0}
1065326460 {112287440:2.0}
1090084558 {113924558:1.0}
1091877911 {1040011473:1.0,1190679710:2.0,113924558:1.0}
1093400831 {1041491515:1.0}
1191877911 {1091493641:1.0}
/user/hadoop/CDR_test/output/authorityVectors/part-r-00000
112287440 {1065326460:2.0}
113924558 {1063422787:1.0,1091877911:1.0,1090084558:1.0}
1040011473 {1091877911:1.0}
1041491515 {1093400831:1.0,1062576087:1.0}
1044122476 {109187791:1.0}
1050454111 {1040446318:1.0}
1055733732 {1031223732:1.0}
1091493641 {1191877911:1.0}
1190677777 {1062570003:1.0}
1190679710 {1062570003:1.0,1062570002:1.0,1062570001:1.0,1062570000:1.0,1091877911:2.0}

Degree Centrality
Caller influencers (topHubUsers) and receiver influencers (topAuthorityUsers)
/user/hadoop/CDR_test/output/topHubUsers/part-r-00000
(1091877911,3) {113924558:1.0,1040011473:1.0,1190679710:2.0}
(1062570003,2) {1190679710:1.0,1190677777:1.0}
(1191877911,1) {1091493641:1.0}
(1093400831,1) {1041491515:1.0}
(1090084558,1) {113924558:1.0}
(1065326460,1) {112287440:2.0}
(1063422787,1) {113924558:1.0}
(1062576087,1) {1041491515:1.0}
(1062570002,1) {1190679710:1.0}
(1062570001,1) {1190679710:1.0}
(1062570000,1) {1190679710:1.0}
(1040446318,1) {1050454111:1.0}
(1031223732,1) {1055733732:1.0}
(109187791,1) {1044122476:1.0}
/user/hadoop/CDR_test/output/topAuthorityUsers/part-r-00000
(1190679710,5) {1062570003:1.0,1062570002:1.0,1062570001:1.0,1062570000:1.0,1091877911:2.0}
(113924558,3) {1063422787:1.0,1091877911:1.0,1090084558:1.0}
(1041491515,2) {1093400831:1.0,1062576087:1.0}
(1190677777,1) {1062570003:1.0}
(1091493641,1) {1191877911:1.0}
(1055733732,1) {1031223732:1.0}
(1050454111,1) {1040446318:1.0}
(1044122476,1) {109187791:1.0}
(1040011473,1) {1091877911:1.0}
(112287440,1) {1065326460:2.0}

Degree Centrality
 Degree centrality analysis results on real data
   Caller influencers: about 7 million VOC / SMS records
   Receiver influencers: special numbers (119, etc.) and various service numbers
Caller influencers (topHubUsers) and receiver influencers (topAuthorityUsers)
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -text cdr/output/topHubUsers/part-r-00000 | head -20
(1192838846,6945398) 6945398
(1020770705,4986855) 4986855
(1089673003,2390573) 2390573
(1040079536,1546922) 1546922
(1031076610,1159645) 1159645
(1054435805,627850) 627850
(112074334,579046) 579046
(1190170113,551839) 551839
(1092091847,541011) 541011
(1020972802,529863) 529863
(1047268736,430949) 430949
(1047890008,28289) 28289
(1031296426,20675) 20675
(1092641133,1151) 1151
(1093132716,1110) 1110
(117418885,924) 924
(114099979,924) 924
[hadoop@skcc-nebdap02 hadoop]$ bin/hadoop dfs -text cdr/output/topAuthorityUsers/part-r-00000 | head -20
(15882100,117746) 117746
(10114,112694) 112694
(101508,106191) 106191
(89,78993) 78993
(1056165827,50225) 50225
(15889999,42442) 42442
(111508,30376) 30376
(15885000,22696) 22696
(114,19252) 19252
(15442100,19142) 19142
(16009316,17604) 17604
(15778000,17093) 17093
(131,16985) 16985
(119,16644) 16644
(112,16444) 16444
(101515,15763) 15763
(100,15100) 15100

Eigenvector Centrality
 Adjacency Matrix
 Network diagram
[Figure: network diagram contrasting a single person with high degree and a single person with low degree but high connectivity (e.g., a CEO)]
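
The contrast in the figure can be illustrated with a small sketch: eigenvector centrality scores a vertex by the scores of its neighbors, so a person with few but well-connected contacts can outrank a person with many peripheral contacts. The network and all names below (`CEO`, `P`, `C1`..`C5`, `L1`..`L4`) are hypothetical, and the fixed iteration count is an assumption chosen for simplicity.

```python
import math

# Hypothetical undirected network (all names are made up for illustration).
edges = [
    # a densely connected core of five people
    ("C1", "C2"), ("C1", "C3"), ("C1", "C4"), ("C1", "C5"),
    ("C2", "C3"), ("C2", "C4"), ("C2", "C5"),
    ("C3", "C4"), ("C3", "C5"), ("C4", "C5"),
    # "CEO" knows only two people, but both sit in the core
    ("CEO", "C1"), ("CEO", "C2"),
    # "P" knows five people, four of whom know nobody else
    ("P", "C5"), ("P", "L1"), ("P", "L2"), ("P", "L3"), ("P", "L4"),
]

neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

def eigenvector_centrality(neighbors, iterations=200):
    """Power iteration: each score becomes the normalized sum of its neighbors' scores."""
    x = {n: 1.0 for n in neighbors}
    for _ in range(iterations):
        y = {n: sum(x[m] for m in neighbors[n]) for n in neighbors}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {n: val / norm for n, val in y.items()}
    return x

degree = {n: len(ms) for n, ms in neighbors.items()}
score = eigenvector_centrality(neighbors)
print(degree["CEO"], degree["P"])
print(score["CEO"] > score["P"])
```

In this example `P` has the higher degree, yet `CEO` receives the higher eigenvector score because its two contacts belong to the dense core.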

 PageRank implementation

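 Power Iteration
   Matrix-vector multiplication
   Vector-vector subtraction
   Matrix transpose
   Vector inner product
 Matrix-Vector Multiplication
[Figure: dense matrix vs. sparse matrix]

import numpy as np

def norm2(v):
    # Euclidean (L2) norm of a 1-D array
    return np.sqrt(np.dot(v, v))

def PowerMethod(A, y, e):
    # A: square 2-D array, y: nonzero start vector, e: relative tolerance
    while True:
        v = y / norm2(y)                    # normalize the current iterate
        y = A @ v                           # matrix-vector multiplication
        t = np.dot(v, y)                    # Rayleigh quotient v^T A v
        if norm2(y - t * v) <= e * abs(t):  # stop when the residual is small
            return (t, v)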
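
To connect power iteration with PageRank and with sparse matrix-vector multiplication, here is a minimal sketch that stores the link matrix as adjacency lists and visits only existing edges in each iteration. The damping factor 0.85, the tolerance, and the tiny call graph are assumptions for illustration; the slides do not state which parameters or data layout the actual implementation used.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=100):
    """PageRank by power iteration over a sparse (adjacency-list) link structure."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Sparse matrix-vector product: only existing edges are visited.
        for u, targets in out_links.items():
            if not targets:
                continue
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        diff = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if diff < tol:
            break
    return rank

# Hypothetical call graph: caller -> list of distinct receivers.
calls = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(calls)
print(sorted(rank, key=rank.get, reverse=True))
```

The per-edge loop is the part that a MapReduce job would distribute: each edge contributes one partial product, and the contributions are summed per target vertex.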

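Other Analysis Applications
 Summary
   Detecting heavy users (long usage time)
   Building simple statistics
     Changes in individual call volume, statistics by region/time, counts of failed calls, etc.
 Fraud detection
   Excessive international calls, unexpected destinations, short and repetitive calls, etc.
 Centrality
   Detecting spam senders (degree)
   No need to explicitly remove special or service numbers (eigen)
 Clustering
   Comparing regional clusters with call clusters by linking location data
   Cluster-targeted marketing
 Classification
   Churn prevention
 Real-time analysis
   Service quality monitoring
   Higher utilization of mobile base stations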

Social Network Clustering
 igraph package
> library(igraph)
> sms <- read.table("./sms.sam");sms
V1 V2
1 1092091847 1032466806
2 1192838846 1032466806
3 1031076610 1032466806
4 1020770705 1032466806
5 1192838846 1032466806
6 1020770705 1032466806
> g <- graph.data.frame(sms, directed=TRUE)
> plot(g, vertex.label=V(g)$name, vertex.label.cex=0.8, vertex.size=30)

 GraphML
> write.graph(g, "sms.graphml", format="graphml")
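
For readers unfamiliar with GraphML, the following sketch builds a minimal directed GraphML document from the sms edge list using only the Python standard library. It is not what igraph emits: igraph's own output uses generated node ids and extra key/data elements for attributes such as the vertex name. The helper name `to_graphml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Edge list copied from the sms sample (caller, receiver).
edges = [
    ("1092091847", "1032466806"),
    ("1192838846", "1032466806"),
    ("1031076610", "1032466806"),
    ("1020770705", "1032466806"),
    ("1192838846", "1032466806"),
    ("1020770705", "1032466806"),
]

def to_graphml(edges):
    """Build a minimal directed GraphML document from (source, target) pairs."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    seen = []
    for s, t in edges:
        for n in (s, t):
            if n not in seen:          # declare each vertex once, in first-seen order
                seen.append(n)
                ET.SubElement(graph, "node", id=n)
    for s, t in edges:                 # repeated calls become parallel edges
        ET.SubElement(graph, "edge", source=s, target=t)
    return ET.tostring(root, encoding="unicode")

xml_text = to_graphml(edges)
print(xml_text)
```

A file in this shape can be opened by tools that read GraphML, which is how the graph is handed from R to Gephi later in the deck.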

 Hierarchical Clustering
> library(igraph)
> user.ego <- read.graph("johnmyleswhite_net.graphml", format="graphml")
> user.sp <- shortest.paths(user.ego)
> user.hc <- hclust(dist(user.sp))
> plot(user.hc, labels=F)
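
The R pipeline above clusters vertices by how similar their shortest-path profiles are: `shortest.paths` yields one row of hop counts per vertex, and `dist` (Euclidean by default) compares those rows before `hclust` merges the closest ones. The sketch below reproduces only the first two stages in Python on a hypothetical five-node graph; the clustering step itself is left out.

```python
from collections import deque
import math

# Hypothetical undirected ego network as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
nodes = sorted(graph)

def bfs_distances(graph, source):
    """Hop counts from source to every reachable node (unweighted shortest paths)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Shortest-path matrix: one row per node, columns in the same node order.
rows = {u: bfs_distances(graph, u) for u in nodes}
sp = [[rows[u][v] for v in nodes] for u in nodes]

def euclidean(row_a, row_b):
    """Euclidean distance between two nodes' shortest-path rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(row_a, row_b)))

d = [[euclidean(ra, rb) for rb in sp] for ra in sp]
print(sp[0])
print(round(d[0][1], 3), round(d[0][4], 3))
```

Nodes a and b see the rest of the graph almost identically, so their rows are close; a and e sit at opposite ends, so their rows are far apart and would be merged late in the dendrogram.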

Network Graph Visualization
 Hierarchical Clustering
for(i in 2:10) {
  user.cluster <- as.character(cutree(user.hc, k=i))
  user.cluster[1] <- "0"
  user.ego <- set.vertex.attribute(user.ego, name=paste("HC", i, sep=""), value=user.cluster)
}

Network Graph Visualization
 Gephi
   An interactive visualization and exploration platform for all kinds of networks and complex systems, dynamic and hierarchical graphs.
 Settings
   Distance: nodes that are close to each other in the graph are placed close together
   Size: node size is proportional to in-degree
   Color: assigned by cluster
