RDF2Vec: RDF Graph Embeddings
for Data Mining
Petar Ristoski and Heiko Paulheim
11/7/2016 2
Introduction
[Figure: data mining pipeline over graph data, with the labels Linking; Exploration / Selection; Consolidation / Cleansing; Graph Data; Transformation; Data Mining; Visualization / Explanation]
Motivation
• Standard data mining algorithms require a propositional feature
vector representation
• Feature space: V = {v1, v2, …, vn}
• Each instance is represented as an n-dimensional feature vector
(v1, v2, …, vn), where for each i with 1 ≤ i ≤ n:
– vi ∈ {true, false}, or vi ∈ {1, 0}
– vi ∈ ℝ
– vi ∈ S, where S is a finite set of symbols
Motivation
Name                 Person   Music Artist   Instrument   Genre
Trent Reznor         1        1              1            0
Wolfgang A. Mozart   1        1              1            1
Barack Obama         1        0              0            0
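A binary feature table like the one on this slide can be derived from graph statements. The following is a minimal sketch, assuming a toy list of (subject, predicate, object) triples; the triples and the function name `propositionalize` are made up for illustration and are not taken from the talk or from DBpedia.

```python
# Toy triples (subject, predicate, object). These are illustrative
# assumptions, not statements copied from a real knowledge graph.
TRIPLES = [
    ("Trent_Reznor", "rdf:type", "Person"),
    ("Trent_Reznor", "rdf:type", "MusicArtist"),
    ("Barack_Obama", "rdf:type", "Person"),
]


def propositionalize(triples, predicate="rdf:type"):
    """Return (feature_names, {entity: binary feature vector}).

    One binary feature is created per distinct object of `predicate`.
    """
    features = sorted({o for _, p, o in triples if p == predicate})
    entities = sorted({s for s, _, _ in triples})
    index = {name: i for i, name in enumerate(features)}
    table = {e: [0] * len(features) for e in entities}
    for s, p, o in triples:
        if p == predicate:
            table[s][index[o]] = 1
    return features, table


features, table = propositionalize(TRIPLES)
for entity, vector in table.items():
    print(entity, dict(zip(features, vector)))
```

The number of columns grows with the number of distinct types and relations in the graph, which is the dimensionality issue the low-dimensional embeddings are meant to avoid.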
Vision
• Preserve the information given in the original graph
• Unsupervised
– task and dataset independent
• Compatible with traditional data mining algorithms and tools
• Efficient computation and application
– Low-dimensional representation
RDF2VEC APPROACH
RDF2Vec
• Adaptation of neural language models
– Word2vec
– Latent representation of words based on a text corpus
• Convert RDF graphs into sequences of entities and relations (sentences)
– Graph Walks
– Weisfeiler-Lehman Subtree RDF Graph Kernels
• Train a neural language model
– Each entity and relation is represented as an N-dimensional numerical vector
– Semantically similar entities appear closer in the embedded space
• Use entity vectors in different ML tasks
Word2vec – Neural Language Model
• Two-layer neural net that converts raw text into vectors
– Each word is represented as a numerical vector
• Continuous Bag-of-Words (CBOW)
– Predict target words from source context words
– Tokyo is the capital of Japan
• Skip-gram
[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS, 2013.
[2] Rong, Xin. "word2vec parameter learning explained." 2014.
CBOW
[Figure: CBOW network diagram labeled with the words Capital, Japan, and Tokyo]
Word Embedding
[Figure: word embedding space showing the countries Japan, Russia, Germany, and Austria and the cities Berlin, Tokyo, Moscow, and Vienna]
Tokyo = [f1, f2, f3, …, fn]
Japan = [f1, f2, f3, …, fn]
v(Japan) - v(Tokyo) + v(Berlin) ≈ v(Germany)
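The vector arithmetic above can be tried out on hand-made vectors. This is only a sketch: the 4-dimensional vectors below are invented so that the analogy holds exactly (they are not learned embeddings), and the helper names `cosine` and `analogy` are assumptions for illustration.

```python
import math

# Hand-made toy vectors, NOT learned embeddings: the first three axes
# encode the country, the last axis encodes "is a capital city".
VECTORS = {
    "Japan":   [1.0, 0.0, 0.0, 0.0],
    "Tokyo":   [1.0, 0.0, 0.0, 1.0],
    "Germany": [0.0, 1.0, 0.0, 0.0],
    "Berlin":  [0.0, 1.0, 0.0, 1.0],
    "Russia":  [0.0, 0.0, 1.0, 0.0],
    "Moscow":  [0.0, 0.0, 1.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to v(a) - v(b) + v(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("Japan", "Tokyo", "Berlin", VECTORS)
print(result)
```

With real embeddings the match is only approximate, which is why the nearest neighbor of the computed vector is used rather than an exact lookup.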
Word2vec – Neural Language Model
• Two-layer neural net that converts raw text into vectors
– Each word is represented as a numerical vector
• Continuous Bag-of-Words (CBOW)
– Predict target words from source context words
– Tokyo is the capital of Japan
• Skip-gram
– Predict context words from the target word
– Tokyo is the capital of Japan
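The difference between the two objectives is easiest to see in the training examples they produce. The sketch below builds both kinds of examples for the slide's sentence, assuming a symmetric context window of 2 words; the window size and the function name `training_pairs` are illustrative assumptions, and no neural network is trained here.

```python
def training_pairs(tokens, window=2):
    """Return (cbow, skipgram) training examples for one sentence.

    cbow:     list of (context_words, target_word)   context -> target
    skipgram: list of (target_word, context_word)    target -> context
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        cbow.append((context, target))
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


sentence = "Tokyo is the capital of Japan".split()
cbow, skipgram = training_pairs(sentence, window=2)

print(cbow[3])       # context words around "capital", paired with "capital"
print(skipgram[:2])  # "Tokyo" paired with each of its context words
```

CBOW gets one example per position with the whole context as input, while skip-gram gets one example per (target, context word) pair.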
Skip-gram
[Figure: Skip-gram network diagram labeled with the words Capital, Japan, and Tokyo]
RDF2vec
• Convert the graph into sequences of tokens (sentences)
– Graph walks
– Weisfeiler-Lehman Subtree RDF Graph Kernels
Graph Walks RDF2vec
• For each entity in the graph:
– Extract a subgraph with depth d
– Extract walks on the subgraph
– Build word2vec model
dbr:Trent_Reznor -> dbo:associatedBand -> dbr:Exotic_Birds -> dbo:bandMember -> dbr:Chris_Vrenna
dbr:Trent_Reznor -> dbo:genre -> dbr:Dark_ambient -> dbo:instrument -> dbr:Field_recording
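The walk extraction step can be sketched as follows. This is a minimal illustration, assuming the graph is a list of (subject, predicate, object) triples and that depth d counts hops along outgoing edges; the toy graph contains only the four edges from the two example walks, and the function names are hypothetical. It enumerates every walk up to the given depth, which is feasible only for small graphs.

```python
from collections import defaultdict

# Toy graph holding only the edges that appear in the two example walks.
TRIPLES = [
    ("dbr:Trent_Reznor", "dbo:associatedBand", "dbr:Exotic_Birds"),
    ("dbr:Exotic_Birds", "dbo:bandMember", "dbr:Chris_Vrenna"),
    ("dbr:Trent_Reznor", "dbo:genre", "dbr:Dark_ambient"),
    ("dbr:Dark_ambient", "dbo:instrument", "dbr:Field_recording"),
]


def build_index(triples):
    """Map each subject to its list of outgoing (predicate, object) edges."""
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))
    return out_edges


def walks_from(entity, out_edges, depth):
    """Enumerate all walks from `entity` with 1..depth hops.

    Each walk is a token list: entity, predicate, entity, predicate, ...
    """
    walks = []
    frontier = [[entity]]
    for _ in range(depth):
        next_frontier = []
        for walk in frontier:
            for pred, obj in out_edges.get(walk[-1], []):
                next_frontier.append(walk + [pred, obj])
        walks.extend(next_frontier)
        frontier = next_frontier
    return walks


out_edges = build_index(TRIPLES)
walks = walks_from("dbr:Trent_Reznor", out_edges, depth=2)
for walk in walks:
    print(" -> ".join(walk))
```

Each token list plays the role of one sentence; the collected walks of all entities form the corpus on which the word2vec model is trained.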
Random Walks RDF2vec
[Figure: diagram relating V*S walks to V vectors]
Entity Embedding
[Figure: entity embedding space showing dbr:Berlin, dbr:Tokyo, dbr:Moscow, dbr:Vienna, dbr:Japan, dbr:Russia, dbr:Germany, and dbr:Austria]
dbr:Tokyo = [f1, f2, f3, …, fn]
dbr:Japan = [f1, f2, f3, …, fn]
Weisfeiler-Lehman Kernel
WL Kernel RDF2vec
• Construct sequences using random walks with depth d after each
iteration for each entity in the graph
• Graph G sequences after 1 iteration:
– 1->6->11; 1->6->11->13; 1->6->11->10 …
– 4->11->6; 4->11->13; 4->11->10; 4->11->10->8 …
– …
de Vries, Gerben KD. "A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data." ECML, 2013.
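The relabeling iteration behind these sequences can be sketched as below. Note the assumptions: this is the generic Weisfeiler-Lehman relabeling step on an undirected toy graph, not the RDF-specific approximation by de Vries, and the graph, the initial labels, and the function name `wl_iteration` are invented for illustration.

```python
def wl_iteration(labels, neighbors, table):
    """One Weisfeiler-Lehman relabeling step.

    labels:    {node: current label}
    neighbors: {node: list of adjacent nodes}
    table:     {signature: compressed label}, shared across iterations
    """
    new_labels = {}
    for node in sorted(labels):
        # Signature = own label plus the sorted multiset of neighbor labels.
        neighbor_labels = tuple(sorted(labels[n] for n in neighbors.get(node, [])))
        signature = (labels[node], neighbor_labels)
        if signature not in table:
            table[signature] = f"L{len(table)}"
        new_labels[node] = table[signature]
    return new_labels


# Hypothetical toy graph: a path a - b - c and a separate pair d - e.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
labels_0 = {node: "x" for node in neighbors}  # identical start labels

table = {}
labels_1 = wl_iteration(labels_0, neighbors, table)
labels_2 = wl_iteration(labels_1, neighbors, table)
print(labels_1)
print(labels_2)
```

After one iteration the nodes a and d still share a label because both have exactly one neighbor; after the second iteration they differ, since the label now encodes a two-hop neighborhood. Walks taken over the relabeled graph after each iteration therefore yield tokens that carry more structural context than plain entity walks.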
WL Kernels RDF2vec
[Figure: diagram relating V*S*I sequences to V vectors]
EVALUATION
Evaluation Setup
• Datasets
– 3 domain-specific RDF datasets
– 2 large cross-domain RDF datasets with 5 evaluation datasets
• Tasks
– Classification: Naive Bayes, k-Nearest Neighbors (k=3), C4.5 decision tree
and Support Vector Machines.
– Regression: Linear Regression, M5Rules, and k-Nearest Neighbors (k=3).
• Baselines
– Features derived from incoming and outgoing relations and values
– Features derived from graph substructures: WL and Walk-Count Kernels
Domain Specific RDF Datasets
• Datasets
Dataset  Task     #statements  #instances  #walks  depth  #sequences (walks)  WL iter.  WL depth  #sequences (WL)
AIFB     C (c=4)  30K          176         all     10     360K                4         2         346K
BGS      C (c=2)  600K         146         all     10     2.4M                4         2         5.3M
MUTAG    C (c=2)  80K          340         all     10     168K                4         2         908K
• Results (accuracy)
– Best scores per dataset
Dataset  Baseline  Walks2vec  WL2vec (SG 500)
AIFB     92.68     89.55      93.41
BGS      91.05     78.10      96.18
MUTAG    94.29     82.06      96.33
Large Cross-Domain RDF Datasets
• Datasets
Dataset   #instances  depth  #sequences  Vector size  model
DBpedia   5M          4/8    2.5B        200/500      CBOW/SG
Wikidata  17M         4      8.5B        200/500      CBOW/SG
• Evaluation datasets
Dataset            #Instances  ML Task    Original Source
Cities             212         R/C (c=3)  Mercer
Metacritic Albums  1,600       R/C (c=3)  Metacritic
Metacritic Movies  2,000       R/C (c=3)  Metacritic
AAUP               960         R/C (c=3)  JSE
Forbes             1,585       R/C (c=3)  Forbes
Results: classification
• Accuracy Results
– Best scores only
Model              Cities  Movies  Albums  AAUP   Forbes
Best Baseline      75.13   79.30   77.94   93.44  76.75
DB2vec CBOW 200 8  77.39   83.65   78.44   92.23  88.30
DB2vec CBOW 500 8  76.84   83.25   77.25   90.61  89.86
DB2vec SG 200 8    78.92   83.30   79.72   91.04  90.10
DB2vec SG 500 8    89.73   82.80   78.20   94.48  88.53
WD2vec CBOW 200 4  75.56   52.20   51.44   90.18  81.08
WD2vec CBOW 500 4  85.56   51.04   53.28   89.74  80.74
WD2vec SG 200 4    75.48   75.39   64.76   90.50  81.17
WD2vec SG 500 4    83.20   76.30   63.42   90.60  81.17
Results: regression
• RMSE Results
– Best scores only
Model              Cities  Movies  Albums  AAUP  Forbes
Best Baseline      17.04   19.19   12.81   6.16  18.32
DB2vec CBOW 200 8  12.55   15.90   11.79   6.47  17.43
DB2vec CBOW 500 8  12.54   15.81   11.30   6.54  17.62
DB2vec SG 200 8    12.85   15.12   10.90   6.22  17.85
DB2vec SG 500 8    10.19   15.45   10.89   6.26  16.61
WD2vec CBOW 200 4  17.52   23.39   14.55   6.60  21.77
WD2vec CBOW 500 4  18.33   22.18   14.00   6.08  21.92
WD2vec SG 200 4    18.69   19.10   13.51   6.52  21.59
WD2vec SG 500 4    19.23   19.19   13.23   6.05  21.58
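For readers unfamiliar with the two metrics in the result tables, the sketch below shows how they are commonly computed. It assumes accuracy is reported as a percentage (higher is better) and RMSE is the root mean squared error (lower is better); the sample values are made up and unrelated to the reported scores.

```python
import math


def accuracy(y_true, y_pred):
    """Share of correct predictions, as a percentage."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up sample values for illustration only.
print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
print(rmse([0.0, 0.0], [3.0, 4.0]))
```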
Results Summary
• RDF2vec outperforms all the baseline approaches
– Smaller feature vectors: more efficient training than with the baseline
approaches
• WL kernel sequences capture the graph structure better than walks
– Not efficient on large graphs
– Large number of sequences produced, so not scalable
• Increasing the depth of the paths increases the quality of the
embeddings
• The vector dimensionality doesn't affect the performance
• Skip-gram models consistently outperform CBOW models
• DBpedia produces higher-quality embeddings than Wikidata
Other Use-Cases
• Recommender systems
• Document modeling
– Document similarity
– Entity relatedness
• Alignment of knowledge bases
– DBpedia and Wikidata
• Knowledge base relation prediction and error detection
• Linking text and semi-structured knowledge to knowledge bases
Conclusion
• RDF2Vec: an approach for learning latent numerical representations
of entities in RDF graphs
• Preserves the graph information
• Compatible with all the traditional machine learning algorithms
• More efficient training of ML models
• Task- and dataset-independent approach
• Download the code and the models: http://data.dws.informatik.uni-mannheim.de/rdf2vec/
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 

RDF2Vec: RDF Graph Embeddings for Data Mining

  • 1. RDF2Vec: RDF Graph Embeddings for Data Mining. Petar Ristoski and Heiko Paulheim
  • 2. 11/7/2016 2 Introduction Linking Exploration / Selection Consolidation / Cleansing Graph Data Transformation Data Mining Visualization / Explanation Ristoski, Paulheim
  • 3. Motivation • Standard data mining algorithms require a propositional feature vector representation • Feature space: V = {v1, v2, …, vn} • Each instance is represented as an n-dimensional feature vector (v1, v2, …, vn), where for each 1 ≤ i ≤ n: – vi ∈ {true, false}, or vi ∈ {1, 0} – vi ∈ ℝ – vi ∈ S, where S is a finite set of symbols
  • 4. Motivation
    Name               | Person | Music Artist | Instrument | Genre
    Trent Reznor       | 1      | 1            | 1          | 0
    Wolfgang A. Mozart | 1      | 1            | 1          | 1
    Barack Obama       | 1      | 0            | 0          | 0
  • 5. Vision • Preserve the information given in the original graph • Unsupervised – task and dataset independent • Compatible with traditional data mining algorithms and tools • Efficient computation and application – Low dimensional representation
  • 7. RDF2Vec • Adaptation of neural language models – Word2vec – Latent representation of words based on a text corpus • Convert RDF graphs into sequences of entities and relations (sentences) – Graph Walks – Weisfeiler-Lehman Subtree RDF Graph Kernels • Train neural language model – Each entity and relation is represented as an N-dimensional numerical vector – Semantically similar entities appear closer in the embedded space • Use entity vectors in different ML tasks
  • 8. Word2vec – Neural Language Model • Two-layer neural net that converts raw text into vectors – Each word is represented as a numerical vector • Continuous Bag-of-Words (CBOW) – Predict target words from source context words – Tokyo is the capital of Japan • Skip-gram [1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS, 2013. [2] Rong, Xin. "word2vec parameter learning explained." 2014.
  • 10. Word Embedding • Japan • Russia • Germany • Austria • Berlin • Tokyo • Moscow • Vienna Tokyo = [f1, f2, f3, …, fn] Japan = [f1, f2, f3, …, fn] v(Japan) - v(Tokyo) + v(Berlin) ≈ v(Germany)
  • 11. Word2vec – Neural Language Model • Two-layer neural net that converts raw text into vectors – Each word is represented as a numerical vector • Continuous Bag-of-Words (CBOW) – Predict target words from source context words – Tokyo is the capital of Japan • Skip-gram – Predict context words from the target word – Tokyo is the capital of Japan [1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS, 2013. [2] Rong, Xin. "word2vec parameter learning explained." 2014.
  • 13. RDF2vec • Convert the graph into sequences of tokens (sentences) – Graph walks – Weisfeiler-Lehman Subtree RDF Graph Kernels
  • 14. Graph Walks RDF2vec • For each entity in the graph: – Extract a subgraph with depth d – Extract walks on the subgraph – Build word2vec model
    dbr:Trent_Reznor -> dbo:associatedBand -> dbr:Exotic_Birds -> dbo:bandMember -> dbr:Chris_Vrenna
    dbr:Trent_Reznor -> dbo:genre -> dbr:Dark_ambient -> dbo:instrument -> dbr:Field_recording
  • 15. Random Walks RDF2vec: V*S walks, V vectors
  • 16. Entity Embedding • dbr:Berlin • dbr:Tokyo • dbr:Moscow • dbr:Vienna • dbr:Japan • dbr:Russia • dbr:Germany • dbr:Austria dbr:Tokyo = [f1, f2, f3, …, fn] dbr:Japan = [f1, f2, f3, …, fn]
  • 18. WL Kernel RDF2vec • Construct sequences using random walks with depth d after each iteration for each entity in the graph • Graph G sequences after 1 iteration: – 1->6->11; 1->6->11->13; 1->6->11->10 … – 4->11->6; 4->11->13; 4->11->10; 4->11->10->8 … – … de Vries, Gerben KD. "A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data." ECML, 2013.
  • 19. WL Kernels RDF2vec: V*S*I sequences, V vectors
  • 21. Evaluation Setup • Datasets – 3 domain-specific RDF datasets – 2 large cross-domain RDF datasets with 5 evaluation datasets • Tasks – Classification: Naive Bayes, k-Nearest Neighbors (k=3), C4.5 decision tree, and Support Vector Machines – Regression: Linear Regression, M5Rules, and k-Nearest Neighbors (k=3) • Baselines – Features derived from incoming and outgoing relations and values – Features derived from graph substructures: WL and Walk-Count Kernels
  • 22. Domain Specific RDF Datasets
    Datasets:
    Dataset | Task    | #statements | #instances | #walks | depth | #sequences (walks) | WL iter. | WL depth | #sequences (WL)
    AIFB    | C (c=4) | 30K         | 176        | all    | 10    | 360K               | 4        | 2        | 346K
    BGS     | C (c=2) | 600K        | 146        | all    | 10    | 2.4M               | 4        | 2        | 5.3M
    MUTAG   | C (c=2) | 80K         | 340        | all    | 10    | 168K               | 4        | 2        | 908K
    Results (accuracy), best scores per dataset:
    Dataset | Baseline | Walks2vec | WL2vec (SG 500)
    AIFB    | 92.68    | 89.55     | 93.41
    BGS     | 91.05    | 78.10     | 96.18
    MUTAG   | 94.29    | 82.06     | 96.33
  • 23. Large Cross-Domain RDF Datasets
    Datasets:
    Dataset  | #instances | depth | #sequences | Vector size | Model
    DBpedia  | 5M         | 4/8   | 2.5B       | 200/500     | CBOW/SG
    Wikidata | 17M        | 4     | 8.5B       | 200/500     | CBOW/SG
    Evaluation datasets:
    Dataset           | #instances | ML task   | Original source
    Cities            | 212        | R/C (c=3) | Mercer
    Metacritic Albums | 1,600      | R/C (c=3) | Metacritic
    Metacritic Movies | 2,000      | R/C (c=3) | Metacritic
    AAUP              | 960        | R/C (c=3) | JSE
    Forbes            | 1,585      | R/C (c=3) | Forbes
  • 24. Results: classification • Accuracy results – Best scores only
    Model             | Cities | Movies | Albums | AAUP  | Forbes
    Best Baseline     | 75.13  | 79.30  | 77.94  | 93.44 | 76.75
    DB2vec CBOW 200 8 | 77.39  | 83.65  | 78.44  | 92.23 | 88.30
    DB2vec CBOW 500 8 | 76.84  | 83.25  | 77.25  | 90.61 | 89.86
    DB2vec SG 200 8   | 78.92  | 83.30  | 79.72  | 91.04 | 90.10
    DB2vec SG 500 8   | 89.73  | 82.80  | 78.20  | 94.48 | 88.53
    WD2vec CBOW 200 4 | 75.56  | 52.20  | 51.44  | 90.18 | 81.08
    WD2vec CBOW 500 4 | 85.56  | 51.04  | 53.28  | 89.74 | 80.74
    WD2vec SG 200 4   | 75.48  | 75.39  | 64.76  | 90.50 | 81.17
    WD2vec SG 500 4   | 83.20  | 76.30  | 63.42  | 90.60 | 81.17
  • 25. Results: regression • RMSE results – Best scores only
    Model             | Cities | Movies | Albums | AAUP | Forbes
    Best Baseline     | 17.04  | 19.19  | 12.81  | 6.16 | 18.32
    DB2vec CBOW 200 8 | 12.55  | 15.90  | 11.79  | 6.47 | 17.43
    DB2vec CBOW 500 8 | 12.54  | 15.81  | 11.30  | 6.54 | 17.62
    DB2vec SG 200 8   | 12.85  | 15.12  | 10.90  | 6.22 | 17.85
    DB2vec SG 500 8   | 10.19  | 15.45  | 10.89  | 6.26 | 16.61
    WD2vec CBOW 200 4 | 17.52  | 23.39  | 14.55  | 6.60 | 21.77
    WD2vec CBOW 500 4 | 18.33  | 22.18  | 14.00  | 6.08 | 21.92
    WD2vec SG 200 4   | 18.69  | 19.10  | 13.51  | 6.52 | 21.59
    WD2vec SG 500 4   | 19.23  | 19.19  | 13.23  | 6.05 | 21.58
  • 26. Results Summary • RDF2vec outperforms all the baseline approaches – Smaller feature vectors: more efficient training than the baseline approaches • WL kernel sequences capture the graph structure better than walks – Not efficient on large graphs – Large number of sequences produced, so not scalable • Increasing the depth of the paths increases the quality of the embeddings • The vector dimensionality does not affect the performance • Skip-gram models consistently outperform CBOW models • DBpedia produces higher quality embeddings than Wikidata
  • 27. Other Use-Cases • Recommender systems • Document modeling – Document similarity – Entity relatedness • Alignment of knowledge bases – DBpedia and Wikidata • Knowledge base relation prediction and error detection • Linking text and semi-structured knowledge to knowledge bases
  • 28. Conclusion • RDF2Vec: an approach for learning latent numerical representations of entities in RDF graphs • Preserves the graph information • Compatible with all the traditional machine learning algorithms • More efficient ML model training • Task and dataset independent approach • Download the code and the models: http://data.dws.informatik.uni-mannheim.de/rdf2vec/
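
The graph-walk step from slide 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the toy triples reproduce the two example walks from the slide, the helper names `build_adjacency` and `random_walks` are hypothetical, and sampling a fixed number of random walks per entity is an assumption (the small datasets in the evaluation use all walks).

```python
import random

# Toy RDF graph as (subject, predicate, object) triples.
# The triples reproduce the two example walks shown on slide 14.
TRIPLES = [
    ("dbr:Trent_Reznor", "dbo:associatedBand", "dbr:Exotic_Birds"),
    ("dbr:Exotic_Birds", "dbo:bandMember", "dbr:Chris_Vrenna"),
    ("dbr:Trent_Reznor", "dbo:genre", "dbr:Dark_ambient"),
    ("dbr:Dark_ambient", "dbo:instrument", "dbr:Field_recording"),
]


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = {}
    for subject, predicate, obj in triples:
        adjacency.setdefault(subject, []).append((predicate, obj))
    return adjacency


def random_walks(adjacency, start, depth, walks_per_entity, seed=0):
    """Sample walks of at most `depth` hops starting at `start`.

    Each walk is a token sequence: entity, relation, entity, ...
    A walk stops early when it reaches a node without outgoing edges.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_entity):
        walk = [start]
        node = start
        for _ in range(depth):
            edges = adjacency.get(node)
            if not edges:
                break
            predicate, obj = rng.choice(edges)
            walk.extend([predicate, obj])
            node = obj
        walks.append(walk)
    return walks


adjacency = build_adjacency(TRIPLES)
sentences = random_walks(adjacency, "dbr:Trent_Reznor", depth=2, walks_per_entity=4)
for sentence in sentences:
    print(" -> ".join(sentence))
```

Each resulting token list plays the role of a sentence, so the collection of walks over all entities can be passed to any word2vec implementation as its training corpus.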

Editor's Notes

  1. The figure gives an overview of a general LOD-enabled knowledge discovery process. Given a set of local data (such as a relational database), the first step is to link the data to the corresponding LOD concepts from the chosen LOD dataset. After the links are set, outgoing links to external LOD datasets can be explored. In the next step, various techniques for data consolidation and cleansing are applied. Next, the collected data must be transformed into a representation that arbitrary data analysis algorithms can process. After the data transformation is done, a suitable data mining algorithm is applied to the data. In the final step, the results of the data mining process are presented to the user.
  2. The skip-gram softmax is actually logistic regression. It is an unsupervised method.
  3. At the input layer we get the context words, and at the output layer we try to predict the target word. For each input word, its vector is retrieved from the input→hidden weight matrix, and these vectors are averaged to form the hidden layer. This implies that the link (activation) function of the hidden-layer units is simply linear (i.e., it passes the weighted sum of its inputs directly to the next layer). Using the hidden→output weights, we compute a score u_j for each word in the vocabulary. The objective of the model is to maximize the average log probability, where the posterior probability is defined using the softmax function over these scores. With the softmax we calculate the probability of the target given the context, compare it with the true label (1 or 0) to obtain the error, and backpropagate that error through the network, updating the weights by gradient descent. We take the input vectors as the embeddings, but one can also take the output vectors or the sum of both; empirical evaluation has shown that concatenation works better.
  4. If we represent the words in a low-dimensional feature space, we expect semantically similar words to be close to each other.
  5. The skip-gram softmax is actually logistic regression. It is an unsupervised method.
  6. The target word is now at the input layer, and the context words are at the output layer. The input is a one-hot encoded vector, which means the hidden layer h simply copies a row of the input→hidden weight matrix. The objective of the Skip-gram model is to maximize the average log probability of the context words given the target word, with p(w_O | w_I) = exp(v′_{w_O}ᵀ v_{w_I}) / Σ_{w=1}^{W} exp(v′_wᵀ v_{w_I}), where v_w and v′_w are the “input” and “output” vector representations of w, and W is the number of words in the vocabulary. This formulation is impractical because the cost of computing ∇ log p(w_O | w_I) is proportional to W, which is often large (10^5–10^7 terms). Hierarchical softmax and negative sampling are the efficient alternatives. The inversion of input and output might seem like an arbitrary choice, but statistically it has the effect that CBOW smooths over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be useful for smaller datasets. Skip-gram, however, treats each context-target pair as a new observation, and this tends to do better on larger datasets. We take the input vectors as the embeddings, but one can also take the output vectors or the sum of both; empirical evaluation has shown that concatenation works better.
  7. If we represent the words in a low-dimensional feature space, we expect semantically similar words to be close to each other.
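
The difference between the two models described in notes 3 and 6 can be made concrete with a small Python sketch. It is an illustration under assumptions rather than the word2vec implementation: the function names `training_pairs` and `softmax` and the window size of 2 are chosen here for the example. CBOW sees one (whole context, target) observation per position, while skip-gram sees one (target, context word) observation per pair.

```python
import math


def training_pairs(tokens, window):
    """Build training observations for CBOW and skip-gram from one sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # CBOW: the entire context is a single observation predicting the target.
        cbow.append((context, target))
        # Skip-gram: every (target, context word) pair is its own observation.
        skipgram.extend((target, word) for word in context)
    return cbow, skipgram


def softmax(scores):
    """Turn raw vocabulary scores u_j into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = "Tokyo is the capital of Japan".split()
cbow, skipgram = training_pairs(tokens, window=2)
print(cbow[0])       # context words around "Tokyo" paired with the target
print(skipgram[:2])  # the same position split into separate pairs
print(softmax([1.0, 2.0, 3.0]))
```

The `softmax` helper normalizes over every score it is given, which is why its cost grows with the vocabulary size W and why hierarchical softmax or negative sampling is used in practice.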