S3G2 - a Scalable Structure-correlated Social Graph Generator
1. S3G2: a Scalable Structure-correlated Social Graph Generator
Minh-Duc Pham, Peter Boncz, Orri Erling
Database Architectures Group
Centrum Wiskunde & Informatica (CWI)
S3G2 . 27-Aug-12. Page 1/23
2. Data correlations between attributes
SELECT personID FROM person
WHERE firstName = 'Joachim' AND addressCountry = 'Germany'
SELECT personID FROM person
WHERE firstName = 'Cesare' AND addressCountry = 'Italy'
Anti-correlation: swapping the predicates ('Cesare' in Germany, 'Joachim' in Italy) yields very few results
[Figure: the correlated name/country pairs are motivated by the national football team coaches Joachim Löw (Germany) and Cesare Prandelli (Italy)]
Query optimizers may underestimate or overestimate the result size of conjunctive predicates
Correlation between predicates has been studied to some extent in database research (e.g. in the LEO project)
But: correlation-aware query optimization is still hardly mainstream in database products
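The misestimate can be sketched numerically (all frequencies below are invented for illustration, not taken from the slides): under the independence assumption the optimizer multiplies the individual selectivities, so it produces the same estimate for the correlated and the anti-correlated predicate pair.

```python
# Hypothetical table of 1,000,000 persons (numbers invented for illustration).
N = 1_000_000
sel_name = 0.002      # fraction of persons named Joachim (or Cesare)
sel_country = 0.10    # fraction living in Germany (or Italy)

# Independence assumption: multiply the individual selectivities.
estimate = N * sel_name * sel_country            # same estimate for either query

# With correlation, the conditional probabilities differ wildly:
actual_correlated = N * sel_name * 0.80          # most Joachims live in Germany
actual_anti_correlated = N * sel_name * 0.01     # almost no Joachims in Italy

print(estimate, actual_correlated, actual_anti_correlated)
```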
3. Data correlations over joins
SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'TODS'
4. Data correlations over joins
SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'Bioinformatics'  -- 'TODS' replaced by 'Bioinformatics'
A challenge for optimizers: the estimated join hit ratio of pa1.author = pa2.author must be adjusted depending on the other predicates
Correlated predicates are still a frontier area in database research
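The join-order point can be illustrated with tiny hypothetical author sets (invented for illustration, not DBLP data): the overlap, and hence the hit ratio of the pa1.author = pa2.author join, depends strongly on which journal pair the other predicates select.

```python
# Hypothetical author sets per journal (invented for illustration).
vldb_journal = {"Ada", "Ben", "Chris", "Dana"}
tods = {"Ben", "Chris", "Dana", "Eva"}          # same research community
bioinformatics = {"Xu", "Yan", "Zoe", "Ben"}    # cross-disciplinary: little overlap

# Join hits for pa1.author = pa2.author under each predicate pair:
hits_tods = len(vldb_journal & tods)            # large overlap
hits_bio = len(vldb_journal & bioinformatics)   # small overlap

# An independence-based estimate would scale only with the journals' sizes,
# yet TODS (same size as Bioinformatics here) yields 3x the join hits.
print(hits_tods, hits_bio)
```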
5. Graph database systems
Emerging class in database systems
Higher need for correlation-awareness
• graph queries navigate over many steps (=joins)
• well known effect in RDF systems (many self-joins)
• implicit structure of graph/RDF data model re-appears in queries as correlations
(structural correlation)
No existing graph benchmark specifically tests for the effects of correlations
• Synthetic graphs used for benchmarking do not have structural correlations
Need a data generator that generates synthetic graphs with data/structure correlations: S3G2
6. Next …
what data do we generate?
• social network, Facebook-like
how to generate correlated properties?
• with a compact data generator
how to generate correlated structure?
• multiple correlation dimensions
• scalable MapReduce algorithm (multi-pass)
7. S3G2: Generating a Correlated Social Graph
[Figure: example social graph. User nodes are linked by knows edges; Users create Posts, Comments, and Photo Albums, upload photos, like content, and are InRelationShip with each other; users also have attributes such as livesAt "Switzerland" and studyAt "EPFL"]
8. Next …
what data do we generate?
• social network, Facebook-like
how to generate correlated properties?
• with a compact data generator
how to generate correlated structure?
• multiple correlation dimensions
• scalable MapReduce algorithm (multi-pass)
9. Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
10. Generating Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
• a fixed set of values, e.g.,
{“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ...}
Probability density function F()
• steers how the generator chooses values
− cumulative distribution over dictionary entries determines which value to pick
• could be anything: uniform, binomial, geometric, etc…
− geometric (discrete exponential) seems to explain many natural phenomena
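These two ingredients can be sketched as follows (the dictionary entries are from the slide; the parameter p=0.4 and the seed are illustrative): a fixed dictionary D and a geometric F() whose cumulative distribution determines which value is picked.

```python
import random

# Ingredient 1: value dictionary D() -- a fixed set of values.
D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

# Ingredient 2: probability density function F() -- here geometric
# (discrete exponential): P(rank i) ~ p * (1-p)^i, truncated to |D| entries.
def sample_first_name(p=0.4, rng=random.Random(42)):
    weights = [p * (1 - p) ** i for i in range(len(D))]
    total = sum(weights)               # renormalize the truncated tail
    u, cum = rng.random() * total, 0.0
    for value, w in zip(D, weights):   # walk the cumulative distribution
        cum += w
        if u <= cum:
            return value
    return D[-1]

names = [sample_first_name() for _ in range(10)]
```

Low-ranked dictionary entries dominate the samples, matching the skew of natural name frequencies.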
11. Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
Probability density function F()
Ranking Function R()
• Gives each value a unique rank between 1 and |D|
−determines which value gets which probability
• Depends on some parameters (parameterized function)
− the value frequency distribution becomes correlated via the parameters of R()
12. Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
• {“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ...}
Probability density function F()
• geometric distribution
Ranking Function R(gender, country, birthyear)
• gender, country, birthyear are correlation parameters
How to implement R()? We would need a table storing |Gender| X |Country| X |BirthYear| X |D| ranks: the number of parameter combinations is limited, but there are potentially many dictionary values!
Our Solution:
- Just store the rank of the top-N values, not all |D|
- Assign the ranks of the other dictionary values randomly
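The top-N trick can be sketched like this (the TOP_N table contents and parameter values are invented for illustration; the real tables come from DBpedia data): only the top ranks per parameter combination are stored, and every other dictionary value gets a random but stable rank behind them.

```python
import random

D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

# Stored table: only the top-N ranked values per parameter combination
# (hypothetical entries for illustration).
TOP_N = {
    ("male", "Germany", 1990): ["Joachim", "Leon"],
    ("male", "Italy", 1990): ["Cesare", "Andrea"],
}

def R(gender, country, birthyear):
    """Ranking function: a unique rank 1..|D| for every dictionary value."""
    key = (gender, country, birthyear)
    top = TOP_N.get(key, [])
    rest = [v for v in D if v not in top]
    # Ranks of non-top values are assigned randomly, but seeded per
    # parameter combination so the assignment is reproducible.
    random.Random(str(key)).shuffle(rest)
    return {v: rank for rank, v in enumerate(top + rest, start=1)}

ranks_de = R("male", "Germany", 1990)
```

Since values ranked below N are picked with very small probability by the geometric F(), randomizing their ranks barely affects plausibility while keeping the stored table tiny.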
13. Compact Correlated Property Value Generation
Using geometric distribution for function F()
[Figure: name-popularity distributions for Germany (top) and Italy (bottom); only the top-10 ranks are stored, the remaining ranks are assigned randomly]
14. Correlated Properties used in S3G2
Main source of dictionary values from DBpedia (http://dbpedia.org)
Various realistic property value correlations, e.g.,
(person.location, person.gender, person.birthDay) → person.firstName
person.location → person.lastName
person.location → person.university
person.createdDate → person.photoAlbum.createdDate
…
15. Next …
what data do we generate?
• social network, Facebook-like
how to generate correlated properties?
• with a compact data generator
how to generate correlated structure?
• multiple correlation dimensions
• scalable MapReduce algorithm (multi-pass)
16. Correlated Edges in a social network
[Figure: example graph. Persons P1–P5 with properties such as firstName ("Anna", "Laura"), birthYear ("1990"), studyAt ("University of Leipzig", "University of Amsterdam"), livesAt ("Germany", "Netherlands"), and like edges to <Britney Spears>, connected by <knows> edges]
17. How to generate correlated edges?
Compute the connection probability of two nodes based on the similarity of their (correlated) properties, e.g., having studied together
Multiple correlation dimensions:
- studying near each other
- liking the same music
- etc., etc.
Highly similar nodes get a high connection probability, less similar nodes a low one
[Figure: example graph with candidate edges of varying connection probability]
Problem: continuously accessing possibly any node for correlated edges causes expensive random I/Os for graphs of a size > RAM
18. Our observation
The probability that two nodes are connected is skewed w.r.t. the similarity between the nodes (due to the probability distribution)
Trick: disregard nodes with too large a similarity distance (only connect nodes within a similarity window)
[Figure: example graph with a similarity window sliding over the nodes]
19. We can Sort Nodes on a Correlation Dimension
Correlation dimension = similarity metric + probability function
Similarity metric
Sort nodes on similarity (similar nodes are brought near each other)
  P1       P5       P3       P2       P4
  Munich   Dresden  Leipzig  Leipzig  Potsdam
<Ranking along the “having studied together” dimension>
• for multi-dimensional metrics, we use space-filling curves (e.g. Z-order) to get a linear dimension
Probability function
Pick an edge between two nodes based on their ranked distance
(often: geometric distribution, again)
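The sort-then-pick idea can be sketched as follows (the decay parameter p=0.5 and the random seed are illustrative; the node list is the slide's example): after sorting on the similarity dimension, the connection probability decays geometrically with rank distance, so only nearby nodes in the sorted order have a realistic chance of being connected.

```python
import random

# Nodes sorted along the "having studied together" dimension, as on the slide.
sorted_nodes = [("P1", "Munich"), ("P5", "Dresden"),
                ("P3", "Leipzig"), ("P2", "Leipzig"), ("P4", "Potsdam")]

def connect_prob(rank_distance, p=0.5):
    # Geometric decay: adjacent nodes in the sorted order are most likely
    # to be connected; far-apart nodes almost never are.
    return p * (1 - p) ** (rank_distance - 1)

rng = random.Random(7)
edges = []
for i in range(len(sorted_nodes)):
    for j in range(i + 1, len(sorted_nodes)):
        if rng.random() < connect_prob(j - i):
            edges.append((sorted_nodes[i][0], sorted_nodes[j][0]))
```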
20. Generate edges along correlation dimensions
Sort nodes using MapReduce on the similarity metric
Reduce() function keeps a window of nodes to generate edges
• keeps memory usage low (sliding window approach)
Slide the window for multiple passes; each pass corresponds to one correlation dimension (multiple MapReduce jobs)
• for each node we choose a degree per pass (also using a probability function)
  steers how many edges are picked in the window for that node
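A single pass can be sketched as a pure-Python stand-in for the reducer (window size and the degree draw are illustrative assumptions, not the paper's parameters): the reducer streams over nodes in sorted order, keeps only a bounded window in memory, and picks each node's per-pass degree worth of edges from inside the window.

```python
import random
from collections import deque

def generate_edges(sorted_node_ids, window_size=3, rng=random.Random(1)):
    """Sliding-window edge generation for one correlation dimension
    (one MapReduce pass); memory use is bounded by window_size."""
    window = deque(maxlen=window_size)  # the most recent nodes seen
    edges = []
    for node in sorted_node_ids:
        if window:
            # Degree for this pass, also drawn from a probability function.
            degree = min(rng.randint(1, 2), len(window))
            for neighbor in rng.sample(list(window), degree):
                edges.append((neighbor, node))
        window.append(node)
    return edges

edges = generate_edges(["P1", "P5", "P3", "P2", "P4"])
```

Running one such pass per correlation dimension, each on its own sort order, corresponds to the multiple MapReduce jobs described above.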
21. Correlation dimensions for our Social Graph
Having studied together
Having common interests (hobbies)
Random dimension
• motivation: not all friendships are explainable (…)
(of course, these correlation dimensions are still a gross simplification of reality, but they provide some interesting material for benchmark queries)
22. Evaluation (… see the paper)
Social graph characteristics
• Output graph has similar characteristics as observed in real social networks
(i.e., “small-world network” characteristics)
- Power-law social degree distribution
- Low average path-length
- High clustering coefficient
Scalability
• Generates up to 1.2 TB of data (1.2 million users) in half an hour
- Runs on a cluster of 16 nodes (part of the SciLens cluster, www.scilens.org)
• Scales out linearly
23. Conclusion
Propose a novel framework for a scalable graph generator that can
• Generate huge graphs having correlations between the graph structure and the graph data
• Exploit the parallelism offered by the MapReduce paradigm for scalability
Future step: use S3G2 as the base for a novel benchmark in graph query processing
(www.w3.org/wiki/Social_Network_Intelligence_BenchMark)
Editor's Notes
As you can see, data in real life is correlated: for example, people living in Germany have a different distribution of first names than people in Italy. In database systems, data correlations strongly influence query-processing performance: they affect the intermediate result sizes of query plans, the effectiveness of indexing techniques, and the absence or presence of locality in data access patterns. As an example, let's look at the influence of data correlation on the intermediate result size of a selection. Here we have two queries: one looks for people with first name Joachim in Germany, the other for people with first name Cesare in Italy. As Joachim is more popular in Germany than in other countries, and similarly Cesare in Italy, these queries return a large number of results. The queries are actually motivated by the names of Joachim Löw and Cesare Prandelli, the coaches of the German and Italian national football teams. What if we swap the predicates, e.g., look for people named Cesare in Germany? As there is anti-correlation between these names and countries, such queries return very few results. Since query optimizers commonly use the independence assumption to estimate the result size of conjunctive predicates, they may underestimate or overestimate the result size. In relational databases, correlations between predicates on the same table have been studied to some degree; however, techniques for detecting such correlations are still hardly mainstream in database products.
We talked about the correlations between attributes of the same table
Now consider data correlations between predicates separated by joins. Here we consider a DBLP example that looks for all authors who have publications both in the VLDB Journal and in TODS. This query is likely to have a larger result size than a query that substitutes Bioinformatics for TODS, even though Bioinformatics is a much larger publication than TODS. The reason is that database researchers are less likely to do cross-disciplinary work. In this query plan, the query optimizer should be aware of correlated predicates in different tables across a join, and this correlation of course influences the best join order. As far as we know, currently no system handles this well. To summarize: correlated predicates are still a frontier area in database research.
The need to recognize correlations in query processing is even higher in graph database systems, an emerging class of database systems with many recent start-up companies. Consider the most popular graph model, the RDF graph model. RDF workloads contain many self-joins over a big table of RDF triples: the selection of one property is joined with the selection of another property on the same big table, and there can be more than 20 joins. Thus the join hit ratios are heavily affected by correlations. There are also implicit correlations between the structure of the graph and the data in the graph, which strongly influence the performance of systems and algorithms. However, existing graph benchmarks do not specifically test for the effects of correlations, because the synthetic graphs they generate do not have structural correlations. Therefore, in our work we propose a framework for generating a huge, highly connected graph with data/structure correlations.
Now we describe the data we generate to demonstrate our framework: a social network graph that simulates the logical schema of the most popular social network, Facebook.
This social network data generator is actually part of our ongoing work on a graph benchmark; the benchmark itself is out of the scope of this talk. A real social network is a huge graph with many structural and data correlations, which makes it a very good test case for the performance of graph database systems. In addition, we would note that social network data is precious and interesting: marketing companies, for instance, often try to obtain or crawl a subset of social network data for their analyses. The social network therefore seems to be a typical market for graph database systems.
Next we talk about how correlated properties can be generated with a compact data generator.
However, before talking about generating correlated property values,
we discuss how a generator produces non-correlated property values such as first names. For this, the data generator needs two ingredients: a value dictionary, which contains a fixed set of values (here, a set of first names), and a probability density function, which picks a value from the dictionary according to some distribution. The density function could be a uniform, binomial, or geometric distribution.
Back to our question: how to generate correlated property values? To do this, our data generator uses a third ingredient, a ranking function. This function introduces correlation through its correlation parameters: it maps each dictionary value to a unique rank, but given different parameter values it does so in different ways.
Specifically, for generating correlated first-name values, we use the geometric distribution as the probability density function, which is appropriate for many natural phenomena. We use the parameters gender, country, and birth year for the ranking function, since the distribution of first names is influenced by these parameters. The question is how to implement the ranking function. Naively, we would need a table storing the Cartesian product of all parameter combinations and dictionary values. The number of parameter combinations is usually limited, but there are potentially many dictionary values, which would require a very big table, and we do not want our generator to depend on huge data files. Thus we propose a simple solution: store the ranks of only the top-N dictionary values and assign the ranks of the other values randomly. The implicit reason is that values ranked lower than N have a very small probability of being selected, so randomly assigning their ranks only slightly decreases the plausibility of the generated values.
This figure shows the distribution of name popularity in Germany (above) and Italy (below). The x-axis is the rank of the dictionary value produced by the ranking function, and the y-axis is the popularity. We store the top-10 ranks (the green part of the table); the other ranks are produced randomly and none of them are stored, so the representation is compact. The figure also shows that certain names are popular in Germany but not in Italy, and vice versa: we thus obtain location-correlated first names.
For our social network, the main source of our dictionary values is DBpedia, and we generate property values with various correlations, e.g., the last name and the university where people study are correlated with the location. The detailed correlations can be found in the paper.
Finally, we talk about how to generate the structure using so-called correlation dimensions, and how to make the generation algorithm scalable with the MapReduce paradigm.
Here, the edges in the social graph are friendships. Friendship formation between two people is usually correlated with their properties: for example, people who study at the same university have a high probability of being friends, and people are likely to be connected to others who share their hobbies.
How are correlated friendship edges generated? Formalizing the observation that people who study together have a high probability of being friends: to connect two nodes, we compute their similarity based on their correlated properties, and then use a density function that gives a high connection probability to node pairs with a small similarity distance and a low probability to pairs with a large distance. We call the combination of the similarity metric and the probability function a correlation dimension, and there are multiple correlation dimensions. However, if you used a Monte Carlo approach and compared all node pairs with the probability function to decide whether they should be connected, you would get a random access pattern. For a large graph this causes expensive random I/Os, so it is not feasible for generating huge graphs.
To address this problem we need something smarter. We observe that the probability that two nodes are connected is skewed with regard to the similarity between the nodes, and that the connection probability is very small between two nodes that are dissimilar. Thus we use a trick: disregard nodes whose similarity distance is too large, and only consider generating connections for nodes within a similarity window.
To do that for a correlation dimension, we first sort the nodes according to the similarity metric so that similar nodes are brought near each other; the example shows people sorted by the similarity metric "having studied together". Note that for multi-dimensional similarity metrics, we use space-filling curves to map them to a linear dimension so that the values can be sorted. The probability function is then used to select edges according to the distance between the ranks of two nodes along the similarity metric.
We implement this using the MapReduce paradigm. Each MapReduce pass sorts the nodes on one similarity metric. The reducer keeps a window of sorted nodes to generate edges, so we do not need to keep all nodes in memory. Since there can be many correlation dimensions, the window slides over the data in multiple passes, one per correlation dimension, which means running multiple MapReduce jobs. Each node also gets a degree per pass, which specifies how many edges should be generated for that node in that pass.
In our social graph generator we consider three correlation dimensions: having studied together, having common interests or hobbies, and a random dimension. The reason for the random dimension is that not all friendships are explainable; connectivity does not revolve only around the two correlation dimensions above. Random noise occurs in practice, and the random dimension makes the data distribution more realistic. Of course, considering two correlations is still a simplification of reality, but we believe it provides interesting material for graph benchmark queries.
The evaluation of the generated social graph shows that it has all the important characteristics observed in real social networks. The experiments also show that our generation algorithm is scalable: it can generate 1.2 TB of data in about half an hour on a cluster of 16 nodes.
To conclude, we have proposed a novel framework for a scalable graph generator that can generate huge graphs with structure and data correlations, and we have exploited the MapReduce paradigm to implement the generator scalably. As a future step, we will use this data generator for a novel benchmark in graph query processing that we call the Social Network Intelligence Benchmark.