Mining the Social Graph
mixi, Inc.
Shunya Kimura

Introduction

• Name: Shunya Kimura

• Twitter: @kimuras

• Job: Data mining, software engineering

• Text mining, graph mining, search engines

Agenda

• Introduction
• Past work
• Introduction to GraphDB
• Introduction to Neo4j
• Introduction to analysis samples

Motivation for social graph analysis
Testing at the scale of millions of nodes and hundreds of millions of edges.

Advances in distributed processing technology make a wider variety of graph algorithms practical.

It is a challenging problem.

Number of users on mixi
[Figure: number of mixi member IDs by year, 2007 to 2011; y-axis from 0 to 30,000,000]

Approach for SG analysis

Feedback

• Friend recommendation
• Community recommendation

Relational Databases

Friend table
from_id  to_id
1        2
1        3
2        3

Profile table
id  name    age
1   Kimura  18
2   Kato    45
3   Ito     21

Dump & Denormalization

Key-value store
Key     Value
From:1  2,3
From:2  3
Prof:1  Kimura,18
Prof:2  Kato,45

Problems with this approach
• Reimplementation
• Maintenance cost
• Scalability
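
As a rough illustration of the dump and denormalization step, the sketch below groups (from_id, to_id) rows into "From:<id>" adjacency lists. The class and method names are hypothetical, and an in-memory map stands in for the key-value store; it is not the code used at mixi.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Denormalizer {
    /** Groups (from_id, to_id) rows by from_id under keys of the form "From:<id>". */
    static Map<String, List<Integer>> toAdjacency(int[][] friendRows) {
        // Assumption: a LinkedHashMap stands in for the real key-value store.
        Map<String, List<Integer>> store = new LinkedHashMap<>();
        for (int[] row : friendRows) {
            store.computeIfAbsent("From:" + row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return store;
    }

    public static void main(String[] args) {
        // The three friendship rows from the slide.
        int[][] friendRows = {{1, 2}, {1, 3}, {2, 3}};
        System.out.println(toAdjacency(friendRows)); // prints {From:1=[2, 3], From:2=[3]}
    }
}
```

Every new query shape needs another denormalized layout like this one, which is where the reimplementation and maintenance costs come from.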

What is a graph
• Vertex (node)
• Edge
• Undirected graph
• Directed graph

What is GraphDB
• Vertex (node)
  – ID: 1, NAME: kimura, PROP: Male, AGE: 18
  – ID: 2, NAME: ITO, PROP: Female, AGE: 21
• Edge
  – ID: 3, LABEL: Like, Since: 2011/08/06, OutGoing: 2
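
The property graph model on this slide can be sketched in plain Java. This is only an illustration of the data model, not how Neo4j stores data; all class and field names are hypothetical, and the edge is assumed to run from vertex 1 to vertex 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PropertyGraphSketch {
    /** A vertex: an ID plus arbitrary key-value properties. */
    static final class Vertex {
        final long id;
        final Map<String, Object> properties = new LinkedHashMap<>();

        Vertex(long id) {
            this.id = id;
        }
    }

    /** A directed edge: its own ID, a label, properties, and two endpoints. */
    static final class Edge {
        final long id;
        final String label;
        final Vertex from;
        final Vertex to;
        final Map<String, Object> properties = new LinkedHashMap<>();

        Edge(long id, String label, Vertex from, Vertex to) {
            this.id = id;
            this.label = label;
            this.from = from;
            this.to = to;
        }
    }

    /** Builds the example from the slide (direction 1 -> 2 is an assumption). */
    static Edge example() {
        Vertex kimura = new Vertex(1);
        kimura.properties.put("NAME", "kimura");
        kimura.properties.put("PROP", "Male");
        kimura.properties.put("AGE", 18);

        Vertex ito = new Vertex(2);
        ito.properties.put("NAME", "ITO");
        ito.properties.put("PROP", "Female");
        ito.properties.put("AGE", 21);

        Edge like = new Edge(3, "Like", kimura, ito);
        like.properties.put("Since", "2011/08/06");
        return like;
    }

    public static void main(String[] args) {
        Edge e = example();
        System.out.println(e.from.properties.get("NAME")
                + " -[" + e.label + "]-> " + e.to.properties.get("NAME"));
    }
}
```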

GraphDB implementations

http://en.wikipedia.org/wiki/GraphDB

GraphDB Neo4j
• True ACID transactions
• High availability
• Scales to billions of nodes and relationships
• High-speed querying through traversals

            Single instance (GPLv3)   Multiple instances (AGPLv3)
Embedded    EmbeddedGraphDatabase     HighlyAvailableGraphDatabase
Standalone  Neo4j Server              Neo4j Server high availability mode

http://neo4j.org/

My other favorite features of Neo4j

http://www.tinkerpop.com/post/4633229547/tinkerpop-graph-stack

• RESTful APIs
• Query language (Cypher)
• Full indexing
  – Lucene
• Implemented graph algorithms
  – A*, Dijkstra
  – High-speed traversal
• Gremlin supported
  – Like a query language

Simple Neo4j use cases
[Figure: where the analysis system runs in each deployment: Embedded or Server, on a single node or on multiple nodes]

Introduction to simple embedded Neo4j

• Insert vertices & make relationships
• Single node & embedded
• Traversal sample

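Insert vertices, make relationship
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public final class InputVertex {
    public static void main(final String[] args) {
        // Open (or create) an embedded database stored under /tmp/neo4j.
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase("/tmp/neo4j");
        Transaction tx = graphDb.beginTx();
        try {
            // Create two vertices and set a property on each.
            Node firstNode = graphDb.createNode();
            firstNode.setProperty("Name", "Kimura");
            Node secondNode = graphDb.createNode();
            secondNode.setProperty("Name", "Kato");
            // Connect them with a directed LIKE relationship.
            firstNode.createRelationshipTo(secondNode,
                DynamicRelationshipType.withName("LIKE"));
            tx.success(); // mark the transaction as successful
        } finally {
            tx.finish(); // commit if success() was called, otherwise roll back
        }
        graphDb.shutdown();
    }
}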

Insert vertices, make relationship: resulting graph
• Vertex ID: 1, NAME: Kimura
• Vertex ID: 2, NAME: Kato
• Edge ID: 3, Relation: Like

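Batch Insert
• Not thread-safe, not transactional
• But very fast!
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

public final class Batch {
    public static void main(final String[] args) {
        // Open the store directly, bypassing transactions.
        BatchInserter inserter = new BatchInserterImpl("/tmp/neo4j",
            BatchInserterImpl.loadProperties("/tmp/neo4j.props"));
        Map<String, Object> prop = new HashMap<String, Object>();
        prop.put("Name", "Kimura");
        prop.put("Age", 21);
        long node1 = inserter.createNode(prop);

        // Reuse the map for the second node; the values are overwritten.
        prop.put("Name", "Kato");
        prop.put("Age", 21);
        long node2 = inserter.createNode(prop);
        // Relationship without properties (null).
        inserter.createRelationship(node1, node2,
            DynamicRelationshipType.withName("LIKE"), null);
        // shutdown() must be called, or the store is left in an inconsistent state.
        inserter.shutdown();
    }
}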

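Traversal sample
• You can specify the traversal criteria
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.Traverser.Order;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public final class TraversalSample {
    public static void main(final String[] args) {
        GraphDatabaseService graphDB = new EmbeddedGraphDatabase(args[0]);
        try {
            Node node = graphDB.getNodeById(1);
            Traverser friends = node.traverse(
                Order.DEPTH_FIRST,                        // traversal order
                StopEvaluator.END_OF_GRAPH,               // termination condition
                ReturnableEvaluator.ALL_BUT_START_NODE,   // which nodes to return
                DynamicRelationshipType.withName("LIKE"), // relationship type to follow
                Direction.OUTGOING);                      // edge direction to follow
            for (Node nodeBuf : friends) {
                TraversalPosition currentPosition = friends.currentPosition();
                System.out.println(nodeBuf.getId() + " depth=" + currentPosition.depth());
            }
        } finally {
            graphDB.shutdown();
        }
    }
}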

Traversal sample: parameters
• How to traverse: Order.DEPTH_FIRST, Order.BREADTH_FIRST
• Termination condition: StopEvaluator.END_OF_GRAPH, StopEvaluator.DEPTH_ONE
• Which nodes to return: ReturnableEvaluator.ALL_BUT_START_NODE, ReturnableEvaluator.ALL, or a custom ReturnableEvaluator implementing isReturnableNode()
• Relationship type to traverse: DynamicRelationshipType.withName("LIKE")
• Edge direction to traverse: Direction.OUTGOING, Direction.INCOMING, Direction.BOTH

Traversal sample: order
• Order.BREADTH_FIRST: breadth-first search
• Order.DEPTH_FIRST: depth-first search
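
To make the difference between the two orders concrete, here is a minimal plain-Java sketch that walks a small directed graph both ways. It does not use Neo4j; the class name, method names and the example graph are hypothetical. Unlike ALL_BUT_START_NODE, it includes the start node in the result.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TraversalOrders {
    /** Visits nodes level by level, using a FIFO queue. */
    static List<Integer> breadthFirst(Map<Integer, List<Integer>> out, int start) {
        List<Integer> visited = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            int node = queue.poll();
            visited.add(node);
            for (int next : out.getOrDefault(node, List.of())) {
                if (seen.add(next)) { // add() returns false if already seen
                    queue.add(next);
                }
            }
        }
        return visited;
    }

    /** Follows each branch to its end before backtracking. */
    static List<Integer> depthFirst(Map<Integer, List<Integer>> out, int start) {
        List<Integer> visited = new ArrayList<>();
        dfs(out, start, new HashSet<>(), visited);
        return visited;
    }

    private static void dfs(Map<Integer, List<Integer>> out, int node,
                            Set<Integer> seen, List<Integer> visited) {
        if (!seen.add(node)) {
            return;
        }
        visited.add(node);
        for (int next : out.getOrDefault(node, List.of())) {
            dfs(out, next, seen, visited);
        }
    }

    public static void main(String[] args) {
        // Outgoing edges: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 5
        Map<Integer, List<Integer>> out = Map.of(
                1, List.of(2, 3),
                2, List.of(4),
                3, List.of(5));
        System.out.println(breadthFirst(out, 1)); // prints [1, 2, 3, 4, 5]
        System.out.println(depthFirst(out, 1));   // prints [1, 2, 4, 3, 5]
    }
}
```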

Neoclipse sample

http://wiki.neo4j.org/content/Neoclipse

Experiment
• Store mixi’s social graph in Neo4j

• Conditions

• Machine: 24-core CPU, 65 GB memory

• Neo4j: BatchInsert, community edition, embedded

• Data

• Number of nodes: 15 million; number of edges: 600 million

• Processing time: 513 min 17 sec (about 8.6 hours)

Network Dataset
• Stanford Large Network Dataset Collection

• SNAP has a wide variety of graph data:
  – Social networks
  – Communication networks
  – Citation networks
  – Collaboration networks
  – Web graphs
  – Product co-purchasing networks
  – Internet peer-to-peer networks
  – Road networks
  – Autonomous systems graphs
  – Signed networks
  – Wikipedia networks and metadata
  – Memetracker and Twitter

http://snap.stanford.edu/data/index.html

Introduction to Analysis
Sample

Architecture
[Figure: service database (social graph), analysis, visualization]

Introduction to analysis samples

• Centrality
• Clustering coefficient

Centrality
• Measures the importance of each node
• Closeness centrality
• PageRank
• Degree centrality
• Betweenness centrality
• Eigenvector centrality
• Centralization

Degree centrality
• The simplest measure.

• Counts the number of edges of each node.

• e.g. the number of friends

[Figure: example graph with each node labeled by its degree: 2, 1, 1, 5, 2, 1, 2]
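
Degree centrality can be computed in one pass over an edge list. The sketch below assumes an undirected friendship graph with each edge listed once; the class name and the example edges are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class DegreeCentrality {
    /** Counts, for every node, how many edges touch it. */
    static Map<Integer, Integer> degrees(int[][] edges) {
        Map<Integer, Integer> degree = new TreeMap<>();
        for (int[] edge : edges) {
            // An undirected edge adds one to the degree of both endpoints.
            degree.merge(edge[0], 1, Integer::sum);
            degree.merge(edge[1], 1, Integer::sum);
        }
        return degree;
    }

    public static void main(String[] args) {
        // Hypothetical friendships: 1-2, 1-3, 2-3, 3-4
        int[][] edges = {{1, 2}, {1, 3}, {2, 3}, {3, 4}};
        System.out.println(degrees(edges)); // prints {1=2, 2=2, 3=3, 4=1}
    }
}
```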

Degree distribution of mixi

• Random sample of 1,000 users

• Summary of the degree distribution

Min    1st Qu.  Median  Mean   3rd Qu.  Max
1.00   3.00     10.00   25.69  30.00    903.00

Clustering coefficient

• The network density around a given node.

• ≒ the density of relationships among the node's neighbors

[Figure: a node with three neighbors; coefficient = edges among the neighbors / 3 possible edges]
• 0 / 3 = 0 (min)
• 1 / 3
• 2 / 3
• 3 / 3 = 1 (max)
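
The ratio above (edges among a node's neighbors divided by the number of possible edges among them) can be sketched as follows. This assumes a simple undirected graph without self-loops, and returns 0 for nodes with fewer than two neighbors, which is one common convention rather than the only one. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ClusteringCoefficient {
    /** Builds an undirected adjacency map from an edge list. */
    static Map<Integer, Set<Integer>> undirected(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    /** Local clustering coefficient: links among neighbors / possible links. */
    static double local(Map<Integer, Set<Integer>> adj, int node) {
        List<Integer> neighbors = new ArrayList<>(adj.getOrDefault(node, Set.of()));
        int k = neighbors.size();
        if (k < 2) {
            return 0.0; // assumed convention: undefined cases count as 0
        }
        int links = 0;
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                Set<Integer> friendsOfI = adj.getOrDefault(neighbors.get(i), Set.of());
                if (friendsOfI.contains(neighbors.get(j))) {
                    links++;
                }
            }
        }
        // k neighbors can form at most k * (k - 1) / 2 links among themselves.
        return links / (k * (k - 1) / 2.0);
    }

    public static void main(String[] args) {
        // Node 0 has neighbors 1, 2, 3; only 1 and 2 know each other: 1 of 3 links.
        int[][] edges = {{0, 1}, {0, 2}, {0, 3}, {1, 2}};
        System.out.println(local(undirected(edges), 0));
    }
}
```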


• Random sample of 1,000 users

• Summary of the clustering coefficient

Min    1st Qu.  Median  Mean    3rd Qu.  Max
0.00   0.00     0.1157  0.2071  0.2667   1.000

Sample user with a low clustering coefficient
• degree 25, clustering coefficient 0.08

Sample user with a middle clustering coefficient

Sample user with a high clustering coefficient

Sample user with the maximum clustering coefficient
• degree 4, clustering coefficient 1

• Visualize my social graph on mixi

• Weighting the edges

• Amount of communication (color, thickness)

• Weighting the vertices

• Clustering coefficient (color, thickness)

• Visualization tool: Gephi

http://gephi.org/

• Motivation for social graph mining

• Overview of GraphDB

• Introduction to Neo4j

• Samples of graph analysis with R

• Introduction to the visualization tool Gephi
