SlideShare a Scribd company logo
On Statistical Characteristics
of Real-life Knowledge Graphs
Weining Qian
Institute for Data Science and Engineering
East China Normal University
wnqian@sei.ecnu.edu.cn
NSFC key project on Big Data Benchmarks
• Theory and Methods of Benchmarking Big Data
Management Systems
– 2015.1 – 2019.12
2
East China
Normal University
Renmin University
of China
Institute of Computing Technology
Chinese Academy of Sciences
6/22/2016 LDBC TUC 2016
Research focuses
• BigDataBench Suite (@ict.ac)
– http://prof.ict.ac.cn
• Benchmarking transaction processing in
NewSQL systems (@ecnu)
– SecKillBench
• Benchmarking Big Data systems (@rmu)
• Benchmarks for graph data (@ecnu)
– Social media, knowledge graphs, ...
36/22/2016 LDBC TUC 2016
Knowledge graphs
4
knowledge
basesemantic
web
knowledge
graph
linked data
6/22/2016 LDBC TUC 2016
Some KG‘s
5
• YAGO
– 10M entities in 350K classes
– 120M facts for 100 relations
– 100 languages
– 95% accuracy
• DBPedia
– 4M entities in 250 classes
– 500M facts for 6000 properties
– live updates
• Kosmix
– 6.5M concepts, 6.7M concept
instances,
– 165M relationship instances
• Freebase
– 40M entities in 15000 topics
– 1B facts for 4000 properties
– core of Google Knowledge
Graph
• Google Knowledge Graph
– 600M entities in 15000 topics
– 20B facts
• Probase
– 2.7 million+ concepts
• And many domain/application-
specific knowledge graphs
6/22/2016 LDBC TUC 2016
A natural question
• Knowledge graph can serve as the backbone
of many Web-scale applications, such as
search engine, question answering, text
understanding etc.
• How to effectively and efficiently manage a
large-scale knowledge graph?
– MySQL, Oracle, Neo4j, TITAN, Trinity, or other
triple stores???
66/22/2016 LDBC TUC 2016
Social networks vs. Knowledge graphs
• Though there are some benchmarks for social
networks exist
– Facebook LinkBench, LDBC SNB, BSMA, ...
• Knowledge graph is different with social network
– More semantic labels in both entities and relations
– Topic or domain sensitive
– Contains various kinds of knowledge
– Hard to define a unified schema
76/22/2016 LDBC TUC 2016
Why study their statistical characteristics?
• To better understand knowledge graphs
• To help the selection of seeding data sets in
benchmarks
• To help the development of data generators
86/22/2016 LDBC TUC 2016
Characteristics of large-scale graphs
• Previous research works on analyzing structural
properties of large scale graphs, e.g.
– [Broder et al. Comput. Netw. 2000] studied the web
structure as a graph via a series of metrics, e.g degree,
diameter, component.
– [Kumar et al. KDD, 2006] studied the dynamic social
network’s structure properties, e.g. degree, hop etc.
– [Boccaletti et al. Phys. Rep. 2006] surveyed the studies of
the structure and dynamics of complex network.
96/22/2016 LDBC TUC 2016
Real-life knowledge graphs
• YAGO2
– A huge semantic knowledge graph based on WordNet,
Wikipedia and GeoNames
– 10+ million entities, 120+ million facts
• Separate the YAGO2 into three sub-graphs
– YagoTax: Taxonomy tree of YAGO2
– YagoFact: Facts in YAGO2
– YagoWiki: Hyperlink relations in YAGO2 based on
Wikipedia
106/22/2016 LDBC TUC 2016
Real-life knowledge graphs
• WordNet
– A lexical network for the English language.
– Synonym set as node and semantic relation as edge.
– 98,000 entities, 154,000 relationships
• DBpedia
– A multi-language knowledge base extracted from
Wikipedia info-boxes
– English version of DBPedia
– 4.58 million things and 2,795 different kinds of properties
116/22/2016 LDBC TUC 2016
Real-life knowledge graphs
• Enterprise Knowledge Graph (EKG)
– Describes an enterprise relationships in Chinese
– Extracted from reports from enterprises in Shanghai Stock
Market
– Used for credit and risk analysis in financial companies
– A domain specific knowledge graph
– Seven kinds of relationships between two entities
• Assignment, hold, subcompany, changename, manager,
cooperate, merge
– 51,853 entities and 430,973 relationships.
126/22/2016 LDBC TUC 2016
Plus two social networks
• SNRand
– 0.2 million randomly selected users
– 5 million fellowship relations between users
• SNRank
– 0.2 million most active users.
– 36+ million fellowship relations between users
• The raw data is collected from a famous social
media platform named Sina Weibo in China
136/22/2016 LDBC TUC 2016
Statistical characteristics
146/22/2016 LDBC TUC 2016
Four kinds of distributions
• Distribution of degrees
– In-degree and out-degree
– Power-law distribution
• Distribution of hops
– Reflects the connectivity cost inside a graph
• Distribution of connected components
– Strongly and weakly connected components
– Reflects the connectivity of a graph
• Distribution of clustering coefficients
– Measures the nodes’ tendency to cluster together
156/22/2016 LDBC TUC 2016
In-degrees and out-degrees
16
All the in/out-degree distributions exhibit the power-law (or piece-wise power-
law), except for some initial segments that deviate the power-law.
6/22/2016 LDBC TUC 2016
Size of connected components
17
Both the strongly and weakly connected component distributions of knowledge
graphs exhibit the power-law distribution in general. While the social networks
are nearly in a whole strongly connected component.
6/22/2016 LDBC TUC 2016
Distance between two vertices
• Social networks are
“small worlds”
• Taxonomy’s
diameter is large
(tree-alike)
• Basically, the larger
the network is, the
smaller the diameter
is
186/22/2016 LDBC TUC 2016
196/22/2016 LDBC TUC 2016
Cluster coefficient
• Taxonomy and social
networks are
different
• All other KG’s are of
power-law
distributions
206/22/2016 LDBC TUC 2016
Cluster coefficient
• Taxonomy and social
networks are
different
• All other KG’s are of
power-law
distributions
216/22/2016 LDBC TUC 2016
Node degrees of different parts in EKG
22
Different relationships show different out-degree distributions
6/22/2016 LDBC TUC 2016
Size of connected components of
different parts in EKG
23
Few connected components are much larger than others
6/22/2016 LDBC TUC 2016
Different parts in EKG
24
Highly clustered Small-world network
6/22/2016 LDBC TUC 2016
Conclusions and discussions
25
• KG’s are different to SN’s
– Taxonomy/ontology + fact
• Following different statistical distributions
– KG’s are labeled
• Different subgraphs are of different sizes and characteristics
• Both triple stores and relational databases have reasons to be
used
– The key is to avoid joins over power-law distributed data
Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou:
On
Statistical Characteristics of Real-Life Knowledge Graphs. BPOE 2015: 37-49
6/22/2016 LDBC TUC 2016
谢谢!
Thanks!
http://dase.ecnu.edu.cn
wnqian@sei.ecnu.edu.cn
Statistical characteristics
6/22/2016 LDBC TUC 2016 27

More Related Content

Viewers also liked

Close to Home Cruises
Close to Home Cruises Close to Home Cruises
Close to Home Cruises
Landry & Kling, Inc.
 
Informe de composición de drogas ilíicitas en Euskadi 2015
Informe de composición de drogas ilíicitas en Euskadi 2015Informe de composición de drogas ilíicitas en Euskadi 2015
Informe de composición de drogas ilíicitas en Euskadi 2015
Ai Laket!! elkartea
 
Plc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhiPlc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhi
7532993375
 
Stili comunicativi
Stili comunicativiStili comunicativi
Stili comunicativi
Francesco Menegalli
 
Challenges faced by women entrepreneurs in the present technological era
Challenges faced by women entrepreneurs in the present technological eraChallenges faced by women entrepreneurs in the present technological era
Challenges faced by women entrepreneurs in the present technological era
7411010287
 
Termòmetre Lingüístic: Un eina de recerca per la pràctica docent
Termòmetre Lingüístic: Un eina de recerca  per la pràctica docent Termòmetre Lingüístic: Un eina de recerca  per la pràctica docent
Termòmetre Lingüístic: Un eina de recerca per la pràctica docent
Neus Lorenzo
 

Viewers also liked (7)

Close to Home Cruises
Close to Home Cruises Close to Home Cruises
Close to Home Cruises
 
Informe de composición de drogas ilíicitas en Euskadi 2015
Informe de composición de drogas ilíicitas en Euskadi 2015Informe de composición de drogas ilíicitas en Euskadi 2015
Informe de composición de drogas ilíicitas en Euskadi 2015
 
Plc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhiPlc scada by vishal kumar from niec delhi
Plc scada by vishal kumar from niec delhi
 
Stili comunicativi
Stili comunicativiStili comunicativi
Stili comunicativi
 
Challenges faced by women entrepreneurs in the present technological era
Challenges faced by women entrepreneurs in the present technological eraChallenges faced by women entrepreneurs in the present technological era
Challenges faced by women entrepreneurs in the present technological era
 
Termòmetre Lingüístic: Un eina de recerca per la pràctica docent
Termòmetre Lingüístic: Un eina de recerca  per la pràctica docent Termòmetre Lingüístic: Un eina de recerca  per la pràctica docent
Termòmetre Lingüístic: Un eina de recerca per la pràctica docent
 
Sidetur
SideturSidetur
Sidetur
 

Similar to Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Graphs.

Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
Geoffrey Fox
 
IRJET- Predicting Social Network Communities Structure Changes and Detection ...
IRJET- Predicting Social Network Communities Structure Changes and Detection ...IRJET- Predicting Social Network Communities Structure Changes and Detection ...
IRJET- Predicting Social Network Communities Structure Changes and Detection ...
IRJET Journal
 
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Enrico Motta
 
Profiling Linked Open Data
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open Data
Blerina Spahiu
 
Rubrics for DMPs
Rubrics for DMPsRubrics for DMPs
Rubrics for DMPs
Jisc RDM
 
Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)
Anna Fensel
 
ONS local presents clustering
ONS local presents clusteringONS local presents clustering
ONS local presents clustering
Office for National Statistics
 
An Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-MakingAn Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-Making
Raed Mansour
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
IIRindia
 
Provenance and social network analysis for recommender systems: a literature...
Provenance and social network analysis for recommender  systems: a literature...Provenance and social network analysis for recommender  systems: a literature...
Provenance and social network analysis for recommender systems: a literature...
IJECEIAES
 
NBK update briefing October 2017
NBK update briefing October 2017NBK update briefing October 2017
NBK update briefing October 2017
Bethan Ruddock
 
Towards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data GraphTowards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data Graph
Besnik Fetahu
 
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
CILIP MDG
 
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
Adel Sabour
 
UNIT_1-BD.pptx
UNIT_1-BD.pptxUNIT_1-BD.pptx
UNIT_1-BD.pptx
Ponnusamy S Pichaimuthu
 
Survey on Location Based Recommendation System Using POI
Survey on Location Based Recommendation System Using POISurvey on Location Based Recommendation System Using POI
Survey on Location Based Recommendation System Using POI
IRJET Journal
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs
Jason Riedy
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Citadelh2020
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Gayane Sedrakyan
 
Overview of XSEDE Systems Engineering
Overview of XSEDE Systems EngineeringOverview of XSEDE Systems Engineering
Overview of XSEDE Systems Engineering
John Towns
 

Similar to Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Graphs. (20)

Big Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other thingsBig Data HPC Convergence and a bunch of other things
Big Data HPC Convergence and a bunch of other things
 
IRJET- Predicting Social Network Communities Structure Changes and Detection ...
IRJET- Predicting Social Network Communities Structure Changes and Detection ...IRJET- Predicting Social Network Communities Structure Changes and Detection ...
IRJET- Predicting Social Network Communities Structure Changes and Detection ...
 
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
Research in Intelligent Systems and Data Science at the Knowledge Media Insti...
 
Profiling Linked Open Data
Profiling Linked Open DataProfiling Linked Open Data
Profiling Linked Open Data
 
Rubrics for DMPs
Rubrics for DMPsRubrics for DMPs
Rubrics for DMPs
 
Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)Towards Semantic APIs for Research Data Services (Invited Talk)
Towards Semantic APIs for Research Data Services (Invited Talk)
 
ONS local presents clustering
ONS local presents clusteringONS local presents clustering
ONS local presents clustering
 
An Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-MakingAn Open Spatial Systems Framework for Place-Based Decision-Making
An Open Spatial Systems Framework for Place-Based Decision-Making
 
Service Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data MiningService Level Comparison for Online Shopping using Data Mining
Service Level Comparison for Online Shopping using Data Mining
 
Provenance and social network analysis for recommender systems: a literature...
Provenance and social network analysis for recommender  systems: a literature...Provenance and social network analysis for recommender  systems: a literature...
Provenance and social network analysis for recommender systems: a literature...
 
NBK update briefing October 2017
NBK update briefing October 2017NBK update briefing October 2017
NBK update briefing October 2017
 
Towards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data GraphTowards Integration of Web Data into a coherent Educational Data Graph
Towards Integration of Web Data into a coherent Educational Data Graph
 
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...Managing 'Big Data' in the social sciences: the contribution of an analytico-...
Managing 'Big Data' in the social sciences: the contribution of an analytico-...
 
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
حلقة تكنولوجية 11 بحث علمى بعنوان A Systematic Mapping Study for Big Data Str...
 
UNIT_1-BD.pptx
UNIT_1-BD.pptxUNIT_1-BD.pptx
UNIT_1-BD.pptx
 
Survey on Location Based Recommendation System Using POI
Survey on Location Based Recommendation System Using POISurvey on Location Based Recommendation System Using POI
Survey on Location Based Recommendation System Using POI
 
High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs High-Performance Analysis of Streaming Graphs
High-Performance Analysis of Streaming Graphs
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
 
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
Data Harvesting, Curation and Fusion Model to Support Public Service Recommen...
 
Overview of XSEDE Systems Engineering
Overview of XSEDE Systems EngineeringOverview of XSEDE Systems Engineering
Overview of XSEDE Systems Engineering
 

More from LDBC council

8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
LDBC council
 
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
LDBC council
 
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
LDBC council
 
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
LDBC council
 
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
LDBC council
 
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
LDBC council
 
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
LDBC council
 
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
LDBC council
 
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
LDBC council
 
8th TUC Meeting -
8th TUC Meeting - 8th TUC Meeting -
8th TUC Meeting -
LDBC council
 
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
LDBC council
 
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
LDBC council
 
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
LDBC council
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
LDBC council
 
LDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusionsLDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusions
LDBC council
 
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOXParallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
LDBC council
 
MODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service SelectionMODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service Selection
LDBC council
 
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
LDBC council
 
LDBC SNB Benchmark Auditing
LDBC SNB Benchmark AuditingLDBC SNB Benchmark Auditing
LDBC SNB Benchmark Auditing
LDBC council
 
Social Network Benchmark Interactive Workload
Social Network Benchmark Interactive WorkloadSocial Network Benchmark Interactive Workload
Social Network Benchmark Interactive Workload
LDBC council
 

More from LDBC council (20)

8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
8th TUC Meeting - Juan Sequeda (Capsenta). Integrating Data using Graphs and ...
 
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...8th TUC Meeting -  Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
8th TUC Meeting - Zhe Wu (Oracle USA). Bridging RDF Graph and Property Graph...
 
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
8th TUC Meeting – Yinglong Xia (Huawei), Big Graph Analytics Engine
 
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
8th TUC Meeting – George Fletcher (TU Eindhoven), gMark: Schema-driven data a...
 
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
8th TUC Meeting – Marcus Paradies (SAP) Social Network Benchmark
 
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
8th TUC Meeting - Sergey Edunov (Facebook). Generating realistic trillion-edg...
 
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
8th TUC Meeting - Peter Boncz (CWI). Query Language Task Force status
 
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
8th TUC Meeting | Lijun Chang (University of New South Wales). Efficient Subg...
 
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
8th TUC Meeting - Eugene I. Chong (Oracle USA). Balancing Act to improve RDF ...
 
8th TUC Meeting -
8th TUC Meeting - 8th TUC Meeting -
8th TUC Meeting -
 
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
8th TUC Meeting - David Meibusch, Nathan Hawes (Oracle Labs Australia). Frapp...
 
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
8th TUC Meeting - Martin Zand University of Rochester Clinical and Translatio...
 
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
LDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusionsLDBC 6th TUC Meeting conclusions
LDBC 6th TUC Meeting conclusions
 
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOXParallel and incremental materialisation of RDF/DATALOG in RDFOX
Parallel and incremental materialisation of RDF/DATALOG in RDFOX
 
MODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service SelectionMODAClouds Decision Support System for Cloud Service Selection
MODAClouds Decision Support System for Cloud Service Selection
 
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
E-Commerce and Graph-driven Applications: Experiences and Optimizations while...
 
LDBC SNB Benchmark Auditing
LDBC SNB Benchmark AuditingLDBC SNB Benchmark Auditing
LDBC SNB Benchmark Auditing
 
Social Network Benchmark Interactive Workload
Social Network Benchmark Interactive WorkloadSocial Network Benchmark Interactive Workload
Social Network Benchmark Interactive Workload
 

Recently uploaded

UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 

Weining Qian (ECNU). On Statistical Characteristics of Real-Life Knowledge Graphs.

  • 1. On Statistical Characteristics of Real-life Knowledge Graphs Weining Qian Institute for Data Science and Engineering East China Normal University wnqian@sei.ecnu.edu.cn
  • 2. NSFC key project on Big Data Benchmarks • Theory and Methods of Benchmarking Big Data Management Systems – 2015.1 – 2019.12 2 East China Normal University Renmin University of China Institute of Computing Technology Chinese Academy of Sciences 6/22/2016 LDBC TUC 2016
  • 3. Research focuses • BigDataBench Suite (@ict.ac) – http://prof.ict.ac.cn • Benchmarking transaction processing in NewSQL systems (@ecnu) – SecKillBench • Benchmarking Big Data systems (@rmu) • Benchmarks for graph data (@ecnu) – Social media, knowledge graphs, ... 36/22/2016 LDBC TUC 2016
  • 5. Some KG‘s 5 • YAGO – 10M entities in 350K classes – 120M facts for 100 relations – 100 languages – 95% accuracy • DBPedia – 4M entities in 250 classes – 500M facts for 6000 properties – live updates • Kosmix – 6.5M concepts, 6.7M concept instances, – 165M relationship instances • Freebase – 40M entities in 15000 topics – 1B facts for 4000 properties – core of Google Knowledge Graph • Google Knowledge Graph – 600M entities in 15000 topics – 20B facts • Probase – 2.7 million+ concepts • And many domain/application- specific knowledge graphs 6/22/2016 LDBC TUC 2016
  • 6. A natural question • Knowledge graph can serve as the backbone of many Web-scale applications, such as search engine, question answering, text understanding etc. • How to effectively and efficiently manage a large-scale knowledge graph? – MySQL, Oracle, Neo4j, TITAN, Trinity, or other triple stores??? 66/22/2016 LDBC TUC 2016
  • 7. Social networks vs. Knowledge graphs • Though there are some benchmarks for social networks exist – Facebook LinkBench, LDBC SNB, BSMA, ... • Knowledge graph is different with social network – More semantic labels in both entities and relations – Topic or domain sensitive – Contains various kinds of knowledge – Hard to define a unified schema 76/22/2016 LDBC TUC 2016
  • 8. Why study their statistical characteristics? • To better understand knowledge graphs • To help the selection of seeding data sets in benchmarks • To help the development of data generators 86/22/2016 LDBC TUC 2016
  • 9. Characteristics of large-scale graphs • Previous research works on analyzing structural properties of large scale graphs, e.g. – [Broder et al. Comput. Netw. 2000] studied the web structure as a graph via a series of metrics, e.g degree, diameter, component. – [Kumar et al. KDD, 2006] studied the dynamic social network’s structure properties, e.g. degree, hop etc. – [Boccaletti et al. Phys. Rep. 2006] surveyed the studies of the structure and dynamics of complex network. 96/22/2016 LDBC TUC 2016
  • 10. Real-life knowledge graphs • YAGO2 – A huge semantic knowledge graph based on WordNet, Wikipedia and GeoNames – 10+ million entities, 120+ million facts • Separate the YAGO2 into three sub-graphs – YagoTax: Taxonomy tree of YAGO2 – YagoFact: Facts in YAGO2 – YagoWiki: Hyperlink relations in YAGO2 based on Wikipedia 106/22/2016 LDBC TUC 2016
  • 11. Real-life knowledge graphs • WordNet – A lexical network for the English language. – Synonym set as node and semantic relation as edge. – 98,000 entities, 154,000 relationships • DBpedia – A multi-language knowledge base extracted from Wikipedia info-boxes – English version of DBPedia – 4.58 million things and 2,795 different kinds of properties 116/22/2016 LDBC TUC 2016
  • 12. Real-life knowledge graphs • Enterprise Knowledge Graph (EKG) – Describes an enterprise relationships in Chinese – Extracted from reports from enterprises in Shanghai Stock Market – Used for credit and risk analysis in financial companies – A domain specific knowledge graph – Seven kinds of relationships between two entities • Assignment, hold, subcompany, changename, manager, cooperate, merge – 51,853 entities and 430,973 relationships. 126/22/2016 LDBC TUC 2016
  • 13. Plus two social networks • SNRand – 0.2 million randomly selected users – 5 million fellowship relations between users • SNRank – 0.2 million most active users. – 36+ million fellowship relations between users • The raw data is collected from a famous social media platform named Sina Weibo in China 136/22/2016 LDBC TUC 2016
  • 15. Four kinds of distributions • Distribution of degrees – In-degree and out-degree – Power-law distribution • Distribution of hops – Reflects the connectivity cost inside a graph • Distribution of connected components – Strongly and weakly connected components – Reflects the connectivity of a graph • Distribution of clustering coefficients – Measures the nodes’ tendency to cluster together 156/22/2016 LDBC TUC 2016
  • 16. In-degrees and out-degrees 16 All the in/out-degree distributions exhibit the power-law (or piece-wise power- law), except for some initial segments that deviate the power-law. 6/22/2016 LDBC TUC 2016
  • 17. Size of connected components 17 Both the strongly and weakly connected component distributions of knowledge graphs exhibit the power-law distribution in general. While the social networks are nearly in a whole strongly connected component. 6/22/2016 LDBC TUC 2016
  • 18. Distance between two vertices • Social networks are “small worlds” • Taxonomy’s diameter is large (tree-alike) • Basically, the larger the network is, the smaller the diameter is 186/22/2016 LDBC TUC 2016
  • 20. Cluster coefficient • Taxonomy and social networks are different • All other KG’s are of power-law distributions 206/22/2016 LDBC TUC 2016
  • 21. Cluster coefficient • Taxonomy and social networks are different • All other KG’s are of power-law distributions 216/22/2016 LDBC TUC 2016
  • 22. Node degrees of different parts in EKG 22 Different relationships show different out-degree distributions 6/22/2016 LDBC TUC 2016
  • 23. Size of connected components of different parts in EKG 23 Few connected components are much larger than others 6/22/2016 LDBC TUC 2016
  • 24. Different parts in EKG 24 Highly clustered Small-world network 6/22/2016 LDBC TUC 2016
  • 25. Conclusions and discussions 25 • KG’s are different to SN’s – Taxonomy/ontology + fact • Following different statistical distributions – KG’s are labeled • Different subgraphs are of different sizes and characteristics • Both triple stores and relational databases have reasons to be used – The key is to avoid joins over power-law distributed data Wenliang Cheng, Chengyu Wang, Bing Xiao, Weining Qian, Aoying Zhou:
On Statistical Characteristics of Real-Life Knowledge Graphs. BPOE 2015: 37-49 6/22/2016 LDBC TUC 2016