[20150829, PyCon2015] Network Link Prediction with NetworkX

Nuritkum Square, Sangam-dong
Network Link Prediction with NetworkX
Kyunghoon Kim
Department of Mathematical Sciences, UNIST
kyunghoon@unist.ac.kr
August 29, 2015

Goals of this talk
1 Freedom to manipulate data
2 Freedom to implement ideas
3 Freedom to combine tools and fields

About me
Speaker
Kyunghoon Kim (graduate student)
UNIST (Ulsan National Institute of Science and Technology)
School of Natural Science, Department of Mathematical Sciences
Lab
Adviser : Bongsoo Jang
Homepage : http://amath.unist.ac.kr
“Be the light that shines the world with science and technology.”

Degree Centrality

Betweenness Centrality
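The two centrality slides are figures only. As a minimal sketch of what they measure, the example below computes both values with NetworkX on a three-node path graph; the toy graph is an assumption chosen for illustration and is not the graph from the slides.

```python
import networkx as nx

# Toy graph (an assumption for illustration): 0 - 1 - 2
G = nx.path_graph(3)

# Degree centrality: fraction of the other nodes each node is connected to
degree = nx.degree_centrality(G)

# Betweenness centrality: normalized share of shortest paths passing through a node
betweenness = nx.betweenness_centrality(G)

print(degree)
print(betweenness)
```

The middle node touches every other node and lies on the only shortest path between the two ends, so it ranks highest on both measures.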

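Contents
1 Feedback from PyCon 2014
K-means Algorithm
How large a matrix can it handle?
Basic books for studying networks
2 Warm-up for link prediction
NumPy
Pandas
3 Network link prediction
What is network link prediction?
The appeal of network link prediction
4 Demo
IPython and d3.js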

K-means Algorithm

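K-means Algorithm
from sklearn import cluster

# `data` is assumed to be an (n_samples, n_features) array prepared beforehand
k = 2  # number of clusters
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(data)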
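The snippet above leaves `data` undefined. A self-contained sketch, assuming two synthetic 2-D blobs as input (the blobs, the seed, and the explicit `n_init` value are illustrative choices, not part of the talk), could look like this:

```python
import numpy as np
from sklearn import cluster

# Synthetic input (an assumption): two well-separated 2-D blobs of 20 points each
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
data = np.vstack([blob_a, blob_b])

k = 2
kmeans = cluster.KMeans(n_clusters=k, n_init=10, random_state=0)
kmeans.fit(data)

# One cluster label per row of `data`
labels = kmeans.labels_
print(labels)
```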

K-means Algorithm
http://cjauvin.blogspot.kr/2014/03/k-means-vs-louvain.html

NetworkX uses a "dictionary of dictionaries of dictionaries" as its basic
network data structure
Advantages of the dict-of-dicts-of-dicts data structure:
Find edges and remove edges with two dictionary look-ups.
Preferred over lists because of fast lookup with sparse storage.
Preferred over sets because data can be attached to an edge.
G[u][v] returns the edge attribute dictionary.
n in G tests if node n is in graph G.
for n in G: iterates through the graph.
for nbr in G[n]: iterates through neighbors.
https://networkx.github.io/documentation/latest/reference/introduction.html
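The access patterns listed above can be tried on a tiny graph. This is a minimal sketch; the node names and the `weight` attribute are made up for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("a", "b", weight=0.5)  # edge data is stored in the innermost dictionary
G.add_edge("b", "c")

edge_data = G["a"]["b"]       # edge attribute dictionary
has_a = "a" in G              # node membership test
nodes = list(G)               # iterating over G yields its nodes
nbrs_of_b = sorted(G["b"])    # iterating over G[n] yields the neighbors of n

G.remove_edge("b", "c")       # removal is also keyed by the two endpoints
still_linked = G.has_edge("b", "c")

print(edge_data, has_a, nodes, nbrs_of_b, still_linked)
```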

Million-scale Graphs Analytic Frameworks
SNAP : http://snap.stanford.edu/snappy/index.html
Billion-scale Graphs Analytic Frameworks
Apache Hama : https://hama.apache.org/ (introduction article)
Pegasus : http://www.cs.cmu.edu/~pegasus/
s2graph : https://github.com/daumkakao/s2graph (slides)
Graph Database
Neo4j : http://neo4j.com/
OrientDB : http://orientdb.com/

Basic books for studying networks
1 Networks: An Introduction by Mark Newman
2 Linked: The New Science of Networks by Albert-László Barabási (Korean edition: 링크 : 21세기를 지배하는 네트워크 과학)

Warm-up for link prediction
1 NumPy : a module optimized for computation speed
2 Pandas : data structures

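NumPy: Numerical Python
Multidimensional arrays
1 Stored in contiguous memory and implemented in C
2 A single data type for all elements
3 Operations are applied to every element of the array at once
http://www.numpy.org/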
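The three properties above can be checked directly on a small array. This is only an illustrative sketch; the array contents are arbitrary.

```python
import numpy as np

a = np.arange(5, dtype=np.float64)

is_contiguous = a.flags["C_CONTIGUOUS"]  # stored as one contiguous block of memory
dtype_name = a.dtype.name                # every element shares this single type
scaled = a * 1.1                         # one expression updates all elements

print(is_contiguous, dtype_name, scaled)
```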

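import numpy as np
import timeit

a = np.arange(1e7)
b = list(a)

# Vectorized multiplication on the ndarray
tic = timeit.default_timer()
a = a * 1.1
toc = timeit.default_timer()
print(toc - tic)
0.029629945755

# Element-by-element loop over the list
tic = timeit.default_timer()
for index, value in enumerate(b):
    b[index] = value * 1.1
toc = timeit.default_timer()
print(toc - tic)
1.82178592682
Depending on how it is used, an ndarray operation can be much faster than the same work on a list (about 60 times faster in this measurement).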

Pandas: Python Data Analysis Library

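Pandas / get data yahoo
%pylab inline
import pandas as pd
import pandas.io.data  # later moved to the separate pandas-datareader package
import datetime

start = datetime.datetime(2015, 1, 1)
end = datetime.datetime(2015, 8, 26)
# Ticker list is truncated on the slide; the full list is in the linked code
text = """A, AAPL, AMCC, AMD, AMGN, AMKR, AMNT.OB, AMZN, APC, ASOG.P"""
text = text.replace(' ', '').split(',')
# Keep only tickers without a '.' suffix
corps = []
for t in text:
    if '.' not in t:
        corps.append(t)
Code : https://goo.gl/8ddrnS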

Pandas / get data yahoo
df = pd.io.data.get_data_yahoo(corps, start=start, end=end)
df['Adj Close'].head()

Pandas / Return Value
returns = df['Adj Close'].pct_change()
corr = returns.corr()
corr

Pandas / Correlation
bm = corr>0.5
bm.astype(int)

Pandas / Convert to array
mat = bm.astype(int).values
mat

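NetworkX / from numpy matrix
import networkx as nx
# from_numpy_matrix was removed in NetworkX 3.0; nx.from_numpy_array is the current equivalent
graph = nx.from_numpy_matrix(mat)
# Replace the integer node ids with the ticker symbols
graph = nx.relabel_nodes(graph, dict(enumerate(bm.columns)))
nx.draw(graph, with_labels=True)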
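The Yahoo download used on the slides depends on an external service, so the sketch below replays the same steps (returns, correlation, threshold, adjacency matrix, graph) on synthetic prices. The tickers `AAA`, `BBB`, `CCC` and the generated series are assumptions for illustration. It uses `nx.from_numpy_array`, and it clears the diagonal first, since every series correlates perfectly with itself and would otherwise produce self-loops.

```python
import numpy as np
import pandas as pd
import networkx as nx

# Synthetic daily returns (assumption): BBB closely follows AAA, CCC is independent
rng = np.random.RandomState(0)
r_a = 0.01 * rng.normal(size=200)
r_b = r_a + 0.002 * rng.normal(size=200)
r_c = 0.01 * rng.normal(size=200)
prices = pd.DataFrame({
    "AAA": 100 * np.cumprod(1 + r_a),
    "BBB": 100 * np.cumprod(1 + r_b),
    "CCC": 100 * np.cumprod(1 + r_c),
})

returns = prices.pct_change()
corr = returns.corr()
bm = corr > 0.5                        # boolean matrix of strong correlations
mat = bm.astype(int).to_numpy().copy()
np.fill_diagonal(mat, 0)               # drop self-correlation to avoid self-loops

graph = nx.from_numpy_array(mat)
graph = nx.relabel_nodes(graph, dict(enumerate(bm.columns)))
print(sorted(graph.edges()))
```

Only the strongly correlated pair ends up connected; the independent series stays in the graph as an isolated node.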

NetworkX / figsize
plt.figure(figsize=(20, 20))
nx.draw_spring(graph, with_labels=True)

NetworkX / figsize
# Keep only the largest connected component
first = sorted(nx.connected_components(graph),
               key=len, reverse=True)[0]
G = graph.subgraph(first)
nx.draw(G, with_labels=True)

NetworkX / Back to Gephi after all?
nx.write_gexf(G, 'graph.gexf')
Open the gexf file in Gephi

KoNLPy

mecab-ko
Eunjeon Hannip project ( http://eunjeon.blogspot.kr/ )
Let's build an open-source Korean morphological analyzer that is good enough for search! by 이용운 and 유영호
$ sudo docker pull koorukuroo/mecab-ko
$ sudo docker run -i -t koorukuroo/mecab-ko:0.1
안녕하세요
안녕 NNG,*,T,안녕,*,*,*,*
하 XSV,*,F,하,*,*,*,*
세요 EP+EF,*,F,세요,Inflect,EP,EF,시/EP/*+어요/EF/*
EOS
https://github.com/koorukuroo/mecab-ko

mecab-ko

mecab-ko-web
$ sudo docker pull koorukuroo/mecab-ko-web
$ sudo docker run -i -t koorukuroo/mecab-ko-web:0.1
172.17.0.43 (Docker Container IP)
127.0.0.1
* Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
>>> import urllib2  # Python 2; Python 3 provides urllib.request instead
>>> response = urllib2.urlopen('http://172.17.0.43:5000/?text=안녕')
>>> text = response.read()
>>> print(text)
안녕 NNG,*,T,안녕,*,*,*,*
EOS
https://github.com/koorukuroo/mecab-ko-web

mecab api
http://information.center/api/korean?sc=APIKEY&s=안녕하세요
http://information.center/korean

mecab api
import Umorpheme.morpheme as um
from collections import OrderedDict

s = '유니스트는 울산에 있습니다'
server = 'http://information.center/api/korean'
apikey = ''  # Register at http://information.center/korean
data = um.analyzer(s, server, apikey, '유니스트,UNIST', 1)
# Re-key the result by integer position so the morphemes can be sorted in order
temp = {}
for key, value in data.items():
    temp[int(key)] = value
data = OrderedDict(sorted(temp.items()))
for i, j in data.items():
    print('%s %s %s' % (i, j['data'], j['feature']))
0 유니스트 CUSTOM
1 는 JX
2 울산 NNP
3 에 JKB
4 있 VV
5 습니다 EC

For more details on Pandas..

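What is link prediction?
In social networks, link prediction means
predicting links that are missing from the current network, or
predicting links that will newly appear or disappear in the future network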

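Current state of link prediction research
Keyword: "link prediction social network"
Wang, Peng, et al. "Link prediction in social networks: the state-of-the-art." Science China Information Sciences 58.1 (2015): 1-38.

1 Recommender systems
Friend recommendation ('12)
Co-author recommendation ('07)
Product recommendation for online shopping malls ('11)
Patent recommendation ('13)
Cross-disciplinary collaborator recommendation ('12)
Contact recommendation ('11)

2 Complex systems research
Network evolution studies ('02)
Website link prediction ('02)

3 Applications in various fields
Healthcare ('12)
Protein networks ('12)
Detecting abnormal communication ('09)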

Network link prediction
Given a social network
G(V, E) at time t,
the task is to find
links that will appear or disappear (at t′ > t), or
links that are missing or unobserved (at t).
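One common way to make the "missing link at time t" setting concrete is to hide some known edges and check whether a score ranks them highly. The sketch below is an assumption-laden illustration: the karate club graph stands in for the observed network, ten hidden edges and a top-20 cutoff are arbitrary choices, and the resource allocation index is used only as an example score.

```python
import random
import networkx as nx

G = nx.karate_club_graph()                  # stand-in for the network observed at time t
rng = random.Random(0)
hidden = rng.sample(sorted(G.edges()), 10)  # pretend these links were never observed

train = G.copy()
train.remove_edges_from(hidden)

# Score every unconnected pair of the training graph, highest score first
scores = list(nx.resource_allocation_index(train))
scores.sort(key=lambda t: t[2], reverse=True)

# Count how many hidden links appear among the top-ranked candidates
hidden_set = {frozenset(e) for e in hidden}
top = scores[:20]
hits = sum(frozenset((u, v)) in hidden_set for u, v, _ in top)
print("recovered %d of %d hidden links in the top %d" % (hits, len(hidden), len(top)))
```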

Link prediction framework
Wang, Peng, et al. "Link prediction in social networks: the state-of-the-art."
Science China Information Sciences 58.1 (2015): 1-38.

Theory of link prediction
https://www.cs.umd.edu/class/spring2008/cmsc828g/Slides/link-prediction.pdf
Liben-Nowell, David, and Jon Kleinberg. "The link-prediction problem
for social networks." Journal of the American Society for Information
Science and Technology 58.7 (2007): 1019-1031.

Taxonomy of link prediction
Wang, Peng, et al. "Link prediction in social networks: the state-of-the-art."
Science China Information Sciences 58.1 (2015): 1-38.

Taxonomy of link prediction

Basic definitions for link prediction
Γ(x) : the set of neighbors of node x
|Γ(x)| : the number of neighbors of node x

Common Neighbors
Common neighbors:
CN(u, v) = |Γ(u) ∩ Γ(v)|
Note: the graph shown is a fictional scenario, not real data
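The common neighbors score is a set intersection of the two neighborhoods. A minimal sketch on a made-up graph follows; `common_neighbor_count` is a hypothetical helper written for this example, while `nx.common_neighbors` is the NetworkX function that lists the shared neighbors.

```python
import networkx as nx

# Toy graph (an assumption): u and v share the neighbors b and c
G = nx.Graph([("u", "a"), ("u", "b"), ("u", "c"),
              ("v", "b"), ("v", "c"), ("v", "d")])

def common_neighbor_count(G, u, v):
    """Hypothetical helper: size of the intersection of the two neighbor sets."""
    return len(set(G[u]) & set(G[v]))

cn = common_neighbor_count(G, "u", "v")
shared = sorted(nx.common_neighbors(G, "u", "v"))
print(cn, shared)
```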

Resource Allocation Index
Resource allocation index:
Σ_{w ∈ Γ(u) ∩ Γ(v)} 1 / |Γ(w)|

preds = nx.resource_allocation_index(G)
for u, v, p in preds:
    print('(%s, %s) -> %.8f' % (u, v, p))

(수지, 혜리) -> 0.33333333
(수지, 경훈) -> 0.83333333
(아이유, 민호) -> 1.00000000
(혜리, 민호) -> 0.00000000
(혜리, 경훈) -> 0.33333333
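To see where a value such as 0.83333333 can come from, the sum can be evaluated by hand. In the made-up graph below (not the graph from the slide), u and v share one neighbor of degree 3 and one of degree 2, which gives 1/3 + 1/2. `resource_allocation` is a hypothetical helper written for this example.

```python
import networkx as nx

# Toy graph (an assumption): a has degree 3, b has degree 2, both are shared by u and v
G = nx.Graph([("u", "a"), ("u", "b"), ("v", "a"), ("v", "b"), ("a", "c")])

def resource_allocation(G, u, v):
    """Hypothetical helper: sum of 1 / degree over the shared neighbors."""
    shared = set(G[u]) & set(G[v])
    return sum(1.0 / G.degree(w) for w in shared)

by_hand = resource_allocation(G, "u", "v")
from_nx = list(nx.resource_allocation_index(G, [("u", "v")]))[0][2]
print("%.8f %.8f" % (by_hand, from_nx))
```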


Displaying Korean text
pip install --upgrade git+https://github.com/koorukuroo/networkx_for_unicode
import matplotlib.font_manager as fm
fp1 = fm.FontProperties(fname="./NotoSansKR-Regular.otf")
# set_fontproperties is not in stock NetworkX; presumably provided by the fork installed above
nx.set_fontproperties(fp1)
G = nx.Graph()
G.add_edge(u'한국어', u'영어')
nx.draw(G, with_labels=True)

Preferential Attachment
Preferential attachment:
|Γ(u)| · |Γ(v)|
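The preferential attachment score is the product of the two degrees. A minimal check on a made-up graph (the node names are illustrative) compares the product with the value returned by `nx.preferential_attachment`:

```python
import networkx as nx

# Toy graph (an assumption): "hub" has degree 3, "x" has degree 1
G = nx.Graph([("hub", "a"), ("hub", "b"), ("hub", "c"), ("x", "a")])

by_hand = G.degree("hub") * G.degree("x")
from_nx = list(nx.preferential_attachment(G, [("hub", "x")]))[0][2]
print(by_hand, from_nx)
```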

Preferential Attachment
import heapq

# `pos` is assumed to be a layout computed earlier, e.g. pos = nx.spring_layout(G)
nx.draw_networkx_nodes(G, pos, node_size=500, node_color='yellow')
nx.draw_networkx_edges(G, pos, alpha=0.2)
nx.draw_networkx_labels(G, pos, font_size=20)

# For every node, keep its five highest-scoring non-neighbors
selected_lines = []
for u in G.nodes():
    preds = nx.preferential_attachment(G, [(u, v) for v in nx.non_neighbors(G, u)])
    largest = heapq.nlargest(5, preds, key=lambda x: x[2])
    for l in largest:
        selected_lines.append(l)

# Draw the predicted links with a score above 1 in red
subG = nx.Graph()
for line in selected_lines:
    print('%s %s %s' % (line[0], line[1], line[2]))
    if line[2] > 1:
        subG.add_edge(line[0], line[1])
pos_subG = dict()
for s in subG.nodes():
    pos_subG[s] = pos[s]
nx.draw_networkx_edges(subG, pos_subG, edge_color='red')

Preferential Attachment

Preferential Attachment
import heapq
import numpy as np

# Scale node size by degree centrality; `pos` is assumed to be a layout computed earlier
degree = nx.degree_centrality(G)
nx.draw_networkx_nodes(G, pos, node_color='yellow', nodelist=list(degree.keys()),
                       node_size=np.array(list(degree.values())) * 10000)
nx.draw_networkx_edges(G, pos, alpha=0.2)
nx.draw_networkx_labels(G, pos, font_size=20)

# For every node, keep its five highest-scoring non-neighbors
selected_lines = []
for u in G.nodes():
    preds = nx.preferential_attachment(G, [(u, v) for v in nx.non_neighbors(G, u)])
    largest = heapq.nlargest(5, preds, key=lambda x: x[2])
    for l in largest:
        selected_lines.append(l)

# Draw the predicted links with a score above 1 in red
subG = nx.Graph()
for line in selected_lines:
    print('%s %s %s' % (line[0], line[1], line[2]))
    if line[2] > 1:
        subG.add_edge(line[0], line[1])
pos_subG = dict()
for s in subG.nodes():
    pos_subG[s] = pos[s]
nx.draw_networkx_edges(subG, pos_subG, edge_color='red')

Preferential Attachment

Link prediction functions in NetworkX
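The listing on this slide did not survive extraction. As a hedged illustration, the sketch below calls four scoring functions that do exist in NetworkX (`jaccard_coefficient`, `adamic_adar_index`, `preferential_attachment`, `resource_allocation_index`) on one node pair of a made-up graph. All of them take a graph plus an optional list of node pairs and yield `(u, v, score)` triples.

```python
import math
import networkx as nx

# Toy graph (an assumption): u and v share neighbors a (degree 3) and b (degree 2)
G = nx.Graph([("u", "a"), ("u", "b"), ("v", "a"), ("v", "b"), ("a", "c")])
pair = [("u", "v")]

scorers = {
    "jaccard": nx.jaccard_coefficient,
    "adamic_adar": nx.adamic_adar_index,
    "preferential_attachment": nx.preferential_attachment,
    "resource_allocation": nx.resource_allocation_index,
}

# Evaluate every score for the same pair
results = {name: list(func(G, pair))[0][2] for name, func in scorers.items()}
for name, value in results.items():
    print("%s -> %.8f" % (name, value))
```

Passing an explicit pair list keeps the comparison focused; omitting it scores every unconnected pair in the graph.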

LPmade
https://github.com/rlichtenwalter/LPmade

Demo
matplotlib
IPython and d3.js

IPython and d3.js
from IPython.display import display, HTML

d3.js (Data-Driven Documents)

Writing files from IPython

Running d3.js from IPython
Code: https://goo.gl/LpxsKc

IPython and d3.js
edges = d3_graph(G)
make_html_graph(edges, 1000, 500) # make_html_graph(edges)
%%HTML
<iframe src="d3.html" width=100% height=500 frameborder=0></iframe>
Demo screen capture: http://i.imgur.com/FeQ9kii.gif
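`d3_graph` and `make_html_graph` are the speaker's own helpers and their source is not shown on the slides. As a rough sketch of the general idea only, the hypothetical helper below (`graph_to_d3_json`, not the speaker's code) serializes a graph into the nodes/links JSON layout that d3.js force-layout examples commonly consume.

```python
import json
import networkx as nx

def graph_to_d3_json(G):
    """Hypothetical helper: dump nodes and edges as a d3-style nodes/links document."""
    payload = {
        "nodes": [{"id": str(n)} for n in G.nodes()],
        "links": [{"source": str(u), "target": str(v)} for u, v in G.edges()],
    }
    # ensure_ascii=False keeps Korean labels readable in the output
    return json.dumps(payload, ensure_ascii=False)

G = nx.Graph([("수지", "경훈"), ("경훈", "혜리")])
d3_json = graph_to_d3_json(G)
print(d3_json)
```

The resulting string can be written to a file and loaded by a d3.js page, for example one embedded through an iframe as on the slide.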

Once more, the goals of this talk
1 Freedom to manipulate data
2 Freedom to implement ideas
3 Freedom to combine tools and fields

The End
