Our project considers the problem of implementing metrics for link prediction in a social network over different types of database systems (MySQL, Redis and Neo4J). In particular, we study how the features of each database system affect the ease with which link prediction can be performed.
3. What backend to choose?
● Premise: <1M Nodes
● DIY vs. existing
● Data model
● Limitations/Features
● TPC-C won't help...
4. Previous work
Compared databases
Limitations and Features
Data model
Implementation
Experiments
Measurements
N. Ruflin, H. Burkhart, and S. Rizzotti. Social-data storage-systems.
In Databases and Social Networks, DBSocial '11, pages 7-12, New York, NY, USA, 2011. ACM.
5. Our work
● Implemented 7 Link-Prediction metrics
● Experimented on 10 social-networks
● Over 3 different backends
○ Relational (MySQL)
○ Key-Value (Redis)
○ Graph (Neo4J)
● What did we find?
○ Stay tuned :)
7. Link Prediction
● Why Link Prediction?
○ Well researched
○ Useful
○ Multiple scoring functions
D. Liben-Nowell and J. Kleinberg.
The link prediction problem for social networks. In CIKM, 2003.
"Given a snapshot of a social network at time t,
we seek to accurately predict the edges that will be added
to a specific node during the interval from time t
to a given future time t'."
8. Link Prediction examples
● Common Neighbors
○ Only neighbors
● Katz measure
○ Paths
● Rooted PageRank
○ Random walk
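As an illustration of the path-based and random-walk metrics listed above, here is a minimal in-memory Python sketch. It is not taken from the gdbb repository. The function names, the dict-of-sets graph representation, the truncation at max_len, and the default values of beta, alpha and iterations are all assumptions made for the example. The Katz sketch counts walks (nodes may repeat) rather than simple paths.

```python
from collections import defaultdict


def katz_scores(adj, x, beta=0.05, max_len=3):
    """Truncated Katz score from x to every other node.

    score(x, y) = sum over l = 1..max_len of beta**l * (number of walks
    of length l from x to y). beta and max_len are illustrative defaults.
    """
    scores = defaultdict(float)
    frontier = {x: 1}  # walks of the current length ending at each node
    for length in range(1, max_len + 1):
        nxt = defaultdict(int)
        for node, count in frontier.items():
            for nb in adj.get(node, ()):
                nxt[nb] += count
        for node, count in nxt.items():
            if node != x:
                scores[node] += (beta ** length) * count
        frontier = nxt
    return dict(scores)


def rooted_pagerank(adj, x, alpha=0.15, iterations=50):
    """Random walk from x that jumps back to x with probability alpha.

    Assumes every neighbour also appears as a key of adj.
    Returns the approximate stationary probability of each node.
    """
    rank = {node: 0.0 for node in adj}
    rank[x] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        nxt[x] += alpha  # restart mass (the total mass is always 1)
        for node, mass in rank.items():
            nbs = adj[node]
            if not nbs:
                # Dead end: send the remaining mass back to the root.
                nxt[x] += (1 - alpha) * mass
                continue
            share = (1 - alpha) * mass / len(nbs)
            for nb in nbs:
                nxt[nb] += share
        rank = nxt
    return rank


if __name__ == "__main__":
    # Tiny undirected path graph a - b - c, stored in both directions.
    graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(katz_scores(graph, "a", beta=0.5, max_len=3))
    print(rooted_pagerank(graph, "a"))
```

Candidate links for x are then the highest-scoring nodes y under either function.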
12. Storage systems
● Why these systems?
○ Popular
○ Open Source
● Perfect implementation?
○ No. But,
■ Unbiased
■ Best practices
■ Same time-frame
Full implementation available on GitHub:
github.com/natict/gdbb
13. Implementation of Common Neighbours
select E2.id2 as y, count(E2.id1) as neighbor_count
from edges as E1 join edges as E2
where E1.id1 = x and E1.id2 = E2.id1
and E1.id1 <> E2.id2
group by y
order by neighbor_count desc
limit 100;
START a=node({n})
MATCH (a)-[:COAUTH]->(b)<-[:COAUTH]-(c)
WHERE a <> c
RETURN a.nid,c.nid,count(b) as score
ORDER BY score DESC
LIMIT 100
local tc = {};
local x = KEYS[1];
for k1,n in pairs(redis.call('smembers', x)) do
  for k2,y in pairs(redis.call('smembers', n)) do
    if x ~= y then
      tc[y] = (tc[y] or 0) + 1;
    end;
  end;
end;
local ttop = {}; -- Extract top 100 results
local min = math.huge;
local mini = '';
for k,v in pairs(tc) do
  if (#ttop < 100) then
    table.insert(ttop, {k,v});
    if v<min then min=v; mini=table.maxn(ttop); end;
  else
    if v>min then
      ttop[mini] = {k,v};
      min = math.huge;
      for i = 1,#ttop,1 do
        if ttop[i][2]<min then min=ttop[i][2]; mini=i; end;
      end;
    end;
  end;
end; -- Now we just need to sort, and format the output...
...
SQL
Cypher
Lua
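All three snippets compute the same thing: for a node x, count the two-hop routes to every other node y and keep the 100 best. A small in-memory Python sketch of that logic can serve as a reference when cross-checking the backends. It is an illustration only, not code from the project. The function name and the dict-of-sets graph are assumptions, and heapq.nlargest stands in for the hand-written top-100 selection in the Lua script. Like the queries above, it does not filter out nodes that are already neighbours of x.

```python
import heapq
from collections import Counter


def common_neighbours_top(adj, x, k=100):
    """Return up to k (node, shared_neighbour_count) pairs for x, best first.

    adj maps each node to the set of its neighbours (hypothetical layout,
    comparable to one Redis set per node).
    """
    counts = Counter()
    for n in adj.get(x, ()):          # each neighbour n of x
        for y in adj.get(n, ()):      # each neighbour y of n
            if y != x:                # skip x itself, as in the queries
                counts[y] += 1
    # nlargest keeps only the k highest counts without sorting everything.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    graph = {1: {2, 3}, 2: {1, 4}, 3: {1, 4, 5}, 4: {2, 3}, 5: {3}}
    print(common_neighbours_top(graph, 1))
```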
19. Conclusions
● MySQL is highly optimised
○ mainly for simple queries (with few joins)
● Redis is very flexible and fast
○ mainly with complex metrics
● Neo4J has implementation simplicity
○ with some limitations
○ still evolving at a fast pace
● Future work
○ More databases
○ More algorithms
20. Thank you
Nati (Netanel) Cohen-Tzemach
linkedin.com/in/natict
Acknowledgments:
● Israel Science Foundation (Grant 143/09)
● Ministry of Science and Technology (Grant 3-8710)
● DBSocial Travel Award