S3G2 - a Scalable Structure-correlated Social Graph Generator
1. S3G2: a Scalable Structure-correlated Social Graph Generator
Minh-Duc Pham, Peter Boncz, Orri Erling
Database Architectures Group
Centrum Wiskunde & Informatica (CWI)
S3G2 . 27-Aug-12. Page 1/23
2. Data correlations between attributes
SELECT personID FROM person
WHERE firstName = 'Joachim' AND addressCountry = 'Germany'
SELECT personID FROM person
WHERE firstName = 'Cesare' AND addressCountry = 'Italy'
Anti-correlation: swapping the predicates ('Cesare' in Germany, 'Joachim' in Italy) yields very few results
[Figure: the correlated name/country pairs are motivated by the national football team coaches Joachim Löw (Germany) and Cesare Prandelli (Italy)]
Query optimizers may underestimate or overestimate the result size of conjunctive predicates
Correlation between predicates has been studied to some extent in database research (e.g. in the LEO project)
But: correlation-aware query optimization is still hardly mainstream in database products
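The misestimate can be sketched numerically (all frequencies below are invented for illustration, not taken from the slides): under the independence assumption the optimizer multiplies the individual selectivities, so it produces the same estimate for the correlated and the anti-correlated predicate pair.

```python
# Hypothetical table of 1,000,000 persons (numbers invented for illustration).
N = 1_000_000
sel_name = 0.002      # fraction of persons named Joachim (or Cesare)
sel_country = 0.10    # fraction living in Germany (or Italy)

# Independence assumption: multiply the individual selectivities.
estimate = N * sel_name * sel_country            # same estimate for either query

# With correlation, the conditional probabilities differ wildly:
actual_correlated = N * sel_name * 0.80          # most Joachims live in Germany
actual_anti_correlated = N * sel_name * 0.01     # almost no Joachims in Italy

print(estimate, actual_correlated, actual_anti_correlated)
```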
3. Data correlations over joins
SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'TODS'
4. Data correlations over joins
SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'Bioinformatics'  -- 'TODS' replaced by 'Bioinformatics'
A challenge for optimizers: the estimated join hit ratio of pa1.author = pa2.author must be adjusted depending on the other predicates
Correlated predicates are still a frontier area in database research
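The join-order point can be illustrated with tiny hypothetical author sets (invented for illustration, not DBLP data): the overlap, and hence the hit ratio of the pa1.author = pa2.author join, depends strongly on which journal pair the other predicates select.

```python
# Hypothetical author sets per journal (invented for illustration).
vldb_journal = {"Ada", "Ben", "Chris", "Dana"}
tods = {"Ben", "Chris", "Dana", "Eva"}          # same research community
bioinformatics = {"Xu", "Yan", "Zoe", "Ben"}    # cross-disciplinary: little overlap

# Join hits for pa1.author = pa2.author under each predicate pair:
hits_tods = len(vldb_journal & tods)            # large overlap
hits_bio = len(vldb_journal & bioinformatics)   # small overlap

# An independence-based estimate would scale only with the journals' sizes,
# yet TODS (same size as Bioinformatics here) yields 3x the join hits.
print(hits_tods, hits_bio)
```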
5. Graph database systems
Emerging class in database systems
Higher need for correlation-awareness
• graph queries navigate over many steps (=joins)
• well known effect in RDF systems (many self-joins)
• implicit structure of graph/RDF data model re-appears in queries as correlations
(structural correlation)
No existing graph benchmark specifically tests for the effects of correlations
• Synthetic graphs used for benchmarking do not have structural correlations
Need a data generator that generates synthetic graphs with data/structure correlations: S3G2
6. Next …
what data do we generate?
• social network, Facebook-like
how to generate correlated properties?
• with a compact data generator
how to generate correlated structure?
• multiple correlation dimensions
• scalable MapReduce algorithm (multi-pass)
7. S3G2: Generating a Correlated Social Graph
[Figure: example social graph. User nodes are linked by knows edges; Users create Posts, Comments, and Photo Albums, upload photos, like content, and are InRelationShip with each other; users also have attributes such as livesAt "Switzerland" and studyAt "EPFL"]
8. Next …
what data do we generate?
• social network, Facebook-like
how to generate correlated properties?
• with a compact data generator
how to generate correlated structure?
• multiple correlation dimensions
• scalable MapReduce algorithm (multi-pass)
9. Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
10. Generating Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
• a fixed set of values, e.g.,
{“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ...}
Probability density function F()
• steers how the generator chooses values
− cumulative distribution over dictionary entries determines which value to pick
• could be anything: uniform, binomial, geometric, etc…
− geometric (discrete exponential) seems to explain many natural phenomena
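These two ingredients can be sketched as follows (the dictionary entries are from the slide; the parameter p=0.4 and the seed are illustrative): a fixed dictionary D and a geometric F() whose cumulative distribution determines which value is picked.

```python
import random

# Ingredient 1: value dictionary D() -- a fixed set of values.
D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

# Ingredient 2: probability density function F() -- here geometric
# (discrete exponential): P(rank i) ~ p * (1-p)^i, truncated to |D| entries.
def sample_first_name(p=0.4, rng=random.Random(42)):
    weights = [p * (1 - p) ** i for i in range(len(D))]
    total = sum(weights)               # renormalize the truncated tail
    u, cum = rng.random() * total, 0.0
    for value, w in zip(D, weights):   # walk the cumulative distribution
        cum += w
        if u <= cum:
            return value
    return D[-1]

names = [sample_first_name() for _ in range(10)]
```

Low-ranked dictionary entries dominate the samples, matching the skew of natural name frequencies.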
11. Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
Probability density function F()
Ranking Function R()
• Gives each value a unique rank between 1 and |D|
−determines which value gets which probability
• Depends on some parameters (parameterized function)
− the value frequency distribution becomes correlated via the parameters of R()
12. Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
• {“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ...}
Probability density function F()
• geometric distribution
Ranking Function R(gender, country, birthyear)
• gender, country, birthyear are correlation parameters
How to implement R()? We would need a table storing |Gender| X |Country| X |BirthYear| X |D| ranks: the number of parameter combinations is limited, but there are potentially many dictionary values!
Our Solution:
- Just store the rank of the top-N values, not all |D|
- Assign the ranks of the other dictionary values randomly
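The top-N trick can be sketched like this (the TOP_N table contents and parameter values are invented for illustration; the real tables come from DBpedia data): only the top ranks per parameter combination are stored, and every other dictionary value gets a random but stable rank behind them.

```python
import random

D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

# Stored table: only the top-N ranked values per parameter combination
# (hypothetical entries for illustration).
TOP_N = {
    ("male", "Germany", 1990): ["Joachim", "Leon"],
    ("male", "Italy", 1990): ["Cesare", "Andrea"],
}

def R(gender, country, birthyear):
    """Ranking function: a unique rank 1..|D| for every dictionary value."""
    key = (gender, country, birthyear)
    top = TOP_N.get(key, [])
    rest = [v for v in D if v not in top]
    # Ranks of non-top values are assigned randomly, but seeded per
    # parameter combination so the assignment is reproducible.
    random.Random(str(key)).shuffle(rest)
    return {v: rank for rank, v in enumerate(top + rest, start=1)}

ranks_de = R("male", "Germany", 1990)
```

Since values ranked below N are picked with very small probability by the geometric F(), randomizing their ranks barely affects plausibility while keeping the stored table tiny.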
13. Compact Correlated Property Value Generation
Using geometric distribution for function F()
[Figure: name-popularity distributions for Germany (top) and Italy (bottom); only the top-10 ranks are stored, the remaining ranks are assigned randomly]
14. Correlated Properties used in S3G2
Main source of dictionary values from DBpedia (http://dbpedia.org)
Various realistic property value correlations, e.g.,
(person.location, person.gender, person.birthDay) → person.firstName
person.location → person.lastName
person.location → person.university
person.createdDate → person.photoAlbum.createdDate
…
15. Next …
what data do we generate?
• social network, Facebook-like
how to generate correlated properties?
• with a compact data generator
how to generate correlated structure?
• multiple correlation dimensions
• scalable MapReduce algorithm (multi-pass)
16. Correlated Edges in a social network
[Figure: example graph. Persons P1–P5 with properties such as firstName ("Anna", "Laura"), birthYear ("1990"), studyAt ("University of Leipzig", "University of Amsterdam"), livesAt ("Germany", "Netherlands"), and like edges to <Britney Spears>, connected by <knows> edges]
17. How to generate correlated edges?
Compute the connection probability of two nodes based on the similarity of their (correlated) properties, e.g., having studied together
Multiple correlation dimensions:
- studying near each other
- liking the same music
- etc., etc.
Highly similar nodes get a high connection probability, less similar nodes a low one
[Figure: example graph with candidate edges of varying connection probability]
Problem: continuously accessing possibly any node for correlated edges causes expensive random I/Os for graphs of a size > RAM
18. Our observation
The probability that two nodes are connected is skewed w.r.t. the similarity between the nodes (due to the probability distribution)
Trick: disregard nodes with too large a similarity distance (only connect nodes within a similarity window)
[Figure: example graph with a similarity window sliding over the nodes]
19. We can Sort Nodes on a Correlation Dimension
Correlation dimension = similarity metric + probability function
Similarity metric
Sort nodes on similarity (similar nodes are brought near each other)
  P1       P5       P3       P2       P4
  Munich   Dresden  Leipzig  Leipzig  Potsdam
<Ranking along the “having studied together” dimension>
• for multi-dimensional metrics, we use space-filling curves (e.g. Z-order) to get a linear dimension
Probability function
Pick an edge between two nodes based on their ranked distance
(often: geometric distribution, again)
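The sort-then-pick idea can be sketched as follows (the decay parameter p=0.5 and the random seed are illustrative; the node list is the slide's example): after sorting on the similarity dimension, the connection probability decays geometrically with rank distance, so only nearby nodes in the sorted order have a realistic chance of being connected.

```python
import random

# Nodes sorted along the "having studied together" dimension, as on the slide.
sorted_nodes = [("P1", "Munich"), ("P5", "Dresden"),
                ("P3", "Leipzig"), ("P2", "Leipzig"), ("P4", "Potsdam")]

def connect_prob(rank_distance, p=0.5):
    # Geometric decay: adjacent nodes in the sorted order are most likely
    # to be connected; far-apart nodes almost never are.
    return p * (1 - p) ** (rank_distance - 1)

rng = random.Random(7)
edges = []
for i in range(len(sorted_nodes)):
    for j in range(i + 1, len(sorted_nodes)):
        if rng.random() < connect_prob(j - i):
            edges.append((sorted_nodes[i][0], sorted_nodes[j][0]))
```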
20. Generate edges along correlation dimensions
Sort nodes using MapReduce on the similarity metric
Reduce() function keeps a window of nodes to generate edges
• keeps memory usage low (sliding window approach)
Slide the window for multiple passes; each pass corresponds to one correlation dimension (multiple MapReduce jobs)
• for each node we choose a degree per pass (also using a probability function)
  steers how many edges are picked in the window for that node
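A single pass can be sketched as a pure-Python stand-in for the reducer (window size and the degree draw are illustrative assumptions, not the paper's parameters): the reducer streams over nodes in sorted order, keeps only a bounded window in memory, and picks each node's per-pass degree worth of edges from inside the window.

```python
import random
from collections import deque

def generate_edges(sorted_node_ids, window_size=3, rng=random.Random(1)):
    """Sliding-window edge generation for one correlation dimension
    (one MapReduce pass); memory use is bounded by window_size."""
    window = deque(maxlen=window_size)  # the most recent nodes seen
    edges = []
    for node in sorted_node_ids:
        if window:
            # Degree for this pass, also drawn from a probability function.
            degree = min(rng.randint(1, 2), len(window))
            for neighbor in rng.sample(list(window), degree):
                edges.append((neighbor, node))
        window.append(node)
    return edges

edges = generate_edges(["P1", "P5", "P3", "P2", "P4"])
```

Running one such pass per correlation dimension, each on its own sort order, corresponds to the multiple MapReduce jobs described above.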
21. Correlation dimensions for our Social Graph
Having studied together
Having common interests (hobbies)
Random dimension
• motivation: not all friendships are explainable (…)
(of course, these correlation dimensions are still a gross simplification of reality, but they provide some interesting material for benchmark queries)
22. Evaluation (… see the paper)
Social graph characteristics
• Output graph has similar characteristics as observed in real social networks
(i.e., “small-world network” characteristics)
- Power-law social degree distribution
- Low average path-length
- High clustering coefficient
Scalability
• Generates up to 1.2 TB of data (1.2 million users) in half an hour
- Runs on a cluster of 16 nodes (part of the SciLens cluster, www.scilens.org)
• Scales out linearly
23. Conclusion
Propose a novel framework for a scalable graph generator that can
• Generate huge graphs having correlations between the graph structure and the graph data
• Exploit the parallelism offered by the MapReduce paradigm for scalability
Future step: use S3G2 as the base for a novel benchmark in graph query processing
(www.w3.org/wiki/Social_Network_Intelligence_BenchMark)
Editor's Notes
As you can see, data in real life is correlated: for example, people living in Germany have a different distribution of first names than people in Italy. In database systems, data correlations strongly influence query-processing performance: they affect the intermediate result sizes of query plans, the effectiveness of indexing techniques, and the absence or presence of locality in data access patterns. As an example, let's look at the influence of data correlation on the intermediate result size of a selection. Here we have two queries: one looks for people with first name Joachim in Germany, the other for people with first name Cesare in Italy. As Joachim is more popular in Germany than in other countries, and similarly Cesare in Italy, these queries return a large number of results. The queries are actually motivated by the names of Joachim Löw and Cesare Prandelli, the coaches of the German and Italian national football teams. What if we swap the predicates, e.g., look for people named Cesare in Germany? As there is anti-correlation between these names and countries, such queries return very few results. Since query optimizers commonly use the independence assumption to estimate the result size of conjunctive predicates, they may underestimate or overestimate the result size. In relational databases, correlations between predicates on the same table have been studied to some degree; however, techniques for detecting such correlations are still hardly mainstream in database products.
We talked about the correlations between attributes of the same table
Now consider data correlations between predicates separated by joins. Here we consider a DBLP example that looks for all authors who have publications both in the VLDB Journal and in TODS. This query is likely to have a larger result size than a query that substitutes Bioinformatics for TODS, even though Bioinformatics is a much larger publication than TODS. The reason is that database researchers are less likely to do cross-disciplinary work. In this query plan, the query optimizer should be aware of correlated predicates in different tables across a join, and this correlation of course influences the best join order. As far as we know, currently no system handles this well. To summarize: correlated predicates are still a frontier area in database research.
The need to recognize correlations in query processing is even higher in graph database systems, an emerging class of database systems with many recent start-up companies. Consider the most popular graph model, the RDF graph model. RDF workloads contain many self-joins over a big table of RDF triples: the selection of one property is joined with the selection of another property on the same big table, and there can be more than 20 joins. Thus the join hit ratios are heavily affected by correlations. There are also implicit correlations between the structure of the graph and the data in the graph, which strongly influence the performance of systems and algorithms. However, existing graph benchmarks do not specifically test for the effects of correlations, because the synthetic graphs they generate do not have structural correlations. Therefore, in our work we propose a framework for generating a huge, highly connected graph with data/structure correlations.
Now we describe the data we generate to demonstrate our framework: a social network graph that simulates the logical schema of the most popular social network, Facebook.
This social network data generator is actually part of our ongoing work on a graph benchmark; the benchmark itself is out of the scope of this talk. A real social network is a huge graph with many structural and data correlations, which makes it a very good test case for the performance of graph database systems. In addition, we would note that social network data is precious and interesting: marketing companies, for instance, often try to obtain or crawl a subset of social network data for their analyses. The social network therefore seems to be a typical market for graph database systems.
Next we talk about how correlated properties can be generated with a compact data generator.
However, before talking about generating correlated property values,
we discuss how a generator produces non-correlated property values such as first names. For this, the data generator needs two ingredients: a value dictionary, which contains a fixed set of values (here, a set of first names), and a probability density function, which picks a value from the dictionary according to some distribution. The density function could be a uniform, binomial, or geometric distribution.
Back to our question: how to generate correlated property values? To do this, our data generator uses a third ingredient, a ranking function. This function introduces correlation through its correlation parameters: it maps each dictionary value to a unique rank, but given different parameter values it does so in different ways.
Specifically, for generating correlated first-name values, we use the geometric distribution as the probability density function, which is appropriate for many natural phenomena. We use the parameters gender, country, and birth year for the ranking function, since the distribution of first names is influenced by these parameters. The question is how to implement the ranking function. Naively, we would need a table storing the Cartesian product of all parameter combinations and dictionary values. The number of parameter combinations is usually limited, but there are potentially many dictionary values, which would require a very big table, and we do not want our generator to depend on huge data files. Thus we propose a simple solution: store the ranks of only the top-N dictionary values and assign the ranks of the other values randomly. The implicit reason is that values ranked lower than N have a very small probability of being selected, so randomly assigning their ranks only slightly decreases the plausibility of the generated values.
This figure shows the distribution of name popularity in Germany (above) and Italy (below). The x-axis is the rank of the dictionary value produced by the ranking function, and the y-axis is the popularity. We store the top-10 ranks (the green part of the table); the other ranks are produced randomly and none of them are stored, so the representation is compact. The figure also shows that certain names are popular in Germany but not in Italy, and vice versa: we thus obtain location-correlated first names.
For our social network, the main source of our dictionary values is DBpedia, and we generate property values with various correlations, e.g., the last name and the university where people study are correlated with the location. The detailed correlations can be found in the paper.
Finally, we talk about how to generate the structure using so-called correlation dimensions, and how to make the generation algorithm scalable with the MapReduce paradigm.
Here, the edges in the social graph are friendships. Friendship formation between two people is usually correlated with their properties: for example, people who study at the same university have a high probability of being friends, and people are likely to be connected to others who share their hobbies.
How are correlated friendship edges generated? Formalizing the observation that people who study together have a high probability of being friends: to connect two nodes, we compute their similarity based on their correlated properties, and then use a density function that gives a high connection probability to node pairs with a small similarity distance and a low probability to pairs with a large distance. We call the combination of the similarity metric and the probability function a correlation dimension, and there are multiple correlation dimensions. However, if you used a Monte Carlo approach and compared all node pairs with the probability function to decide whether they should be connected, you would get a random access pattern. For a large graph this causes expensive random I/Os, so it is not feasible for generating huge graphs.
To address this problem we need something smarter. We observe that the probability that two nodes are connected is skewed with regard to the similarity between the nodes, and that the connection probability is very small between two nodes that are dissimilar. Thus we use a trick: disregard nodes whose similarity distance is too large, and only consider generating connections for nodes within a similarity window.
To do that for a correlation dimension, we first sort the nodes according to the similarity metric so that similar nodes are brought near each other; the example shows people sorted by the similarity metric "having studied together". Note that for multi-dimensional similarity metrics, we use space-filling curves to map them to a linear dimension so that the values can be sorted. The probability function is then used to select edges according to the distance between the ranks of two nodes along the similarity metric.
We implement this using the MapReduce paradigm. Each MapReduce pass sorts the nodes on one similarity metric. The reducer keeps a window of sorted nodes to generate edges, so we do not need to keep all nodes in memory. Since there can be many correlation dimensions, the window slides over the data in multiple passes, one per correlation dimension, which means running multiple MapReduce jobs. Each node also gets a degree per pass, which specifies how many edges should be generated for that node in that pass.
In our social graph generator we consider three correlation dimensions: having studied together, having common interests or hobbies, and a random dimension. The reason for the random dimension is that not all friendships are explainable; connectivity does not revolve only around the two correlation dimensions above. Random noise occurs in practice, and the random dimension makes the data distribution more realistic. Of course, considering two correlations is still a simplification of reality, but we believe it provides interesting material for graph benchmark queries.
The evaluation of the generated social graph shows that it has all the important characteristics observed in real social networks. The experiments also show that our generation algorithm is scalable: it can generate 1.2 TB of data in about half an hour on a cluster of 16 nodes.
To conclude, we have proposed a novel framework for a scalable graph generator that can generate huge graphs with structure and data correlations, and we have exploited the MapReduce paradigm to implement the generator scalably. As a future step, we will use this data generator for a novel benchmark in graph query processing that we call the Social Network Intelligence Benchmark.