NJVR: The NanJing Vocabulary Repository
1. NJVR: The NanJing Vocabulary Repository
Gong Cheng, Min Liu, Yuzhong Qu
Nanjing University
2. Motivation
• Ontology-related research topics: summarization, ranking, matching
• These need a large and representative collection of real-world vocabularies
3. State of the art
• Top-down efforts
– Size: small (hundreds)
– Access: directly (via browsing)
• Bottom-up efforts
– Size: large (thousands)
– Access: indirectly (via searching)
• Our goal: large size with direct access
4. Contribution
• NJVR: A large and freely-accessible vocabulary repository
– Source: An index of 4.1B RDF triples distributed across 15.9M RDF documents crawled from 5.8K pay-level domains (PLDs)
– Constitution:
• RDF descriptions of 2,996 dereferenceable vocabularies crawled from 261 PLDs
• Document-level statistical data on their instantiations (e.g. term frequency)
– Accessibility: Publicly downloadable
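The document-level term frequency mentioned above could be computed roughly as follows. This is a minimal sketch, assuming each document has already been parsed into (subject, predicate, object) tuples; the function name `term_frequencies` and the sample data are hypothetical and do not reflect NJVR's actual pipeline or file format. It counts a class as instantiated when it is the object of rdf:type, and a property as instantiated when it is used as a predicate.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def term_frequencies(triples):
    """Count how often each term is instantiated in one document (hypothetical helper)."""
    counts = Counter()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            counts[obj] += 1        # the object of rdf:type is an instantiated class
        else:
            counts[predicate] += 1  # any other predicate is an instantiated property
    return counts

# Made-up sample document with two foaf:Person instances and one foaf:knows link.
doc = [
    ("http://example.org/alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/bob", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
]
freq = term_frequencies(doc)
```

Running this per document and storing the counters keyed by document URI would give one possible shape for document-level statistics.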
6. Crawling (2007 to May 2011)
1. Initialization (of the URI pool)
– Other freely-accessible repositories, e.g. pingthesemanticweb.com
– LOD cloud
– Search results, e.g. Swoogle, Google
2. URI Dereference and document parsing
– java.net package
– Jena
3. Pool expansion
– URIs in parsed documents
– Submissions from the users of Falcons
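The three crawling steps above can be sketched as a simple loop over a URI pool. This is an illustration only, not the authors' crawler (which the slide says used the java.net package and Jena): `crawl`, `fetch`, `extract_uris`, and `max_documents` are hypothetical names, and dereferencing and parsing are passed in as functions so that the sketch runs without network access.

```python
from collections import deque

def crawl(seed_uris, fetch, extract_uris, max_documents=1000):
    """Dereference each pooled URI once and add newly discovered URIs to the pool."""
    pool = deque(dict.fromkeys(seed_uris))  # step 1: initialise the pool, dropping duplicates
    seen = set(pool)
    documents = {}
    while pool and len(documents) < max_documents:
        uri = pool.popleft()
        content = fetch(uri)                # step 2: dereference and parse (assumed callable)
        if content is None:                 # the URI could not be dereferenced
            continue
        documents[uri] = content
        for found in extract_uris(content): # step 3: expand the pool
            if found not in seen:
                seen.add(found)
                pool.append(found)
    return documents

# Toy "web": each URI maps to the list of URIs mentioned in its document.
toy_web = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/a"],
}
docs = crawl(["http://example.org/a"], fetch=toy_web.get, extract_uris=lambda links: links)
```

In the toy run, `http://example.org/c` is discovered but cannot be dereferenced, so only the two documents for `a` and `b` are kept. User submissions would simply be further URIs appended to the pool.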
7. Vocabulary identification
• Bottom-up strategy
1. Term: a URI that identifies a class or property in the document it dereferences to
2. Vocabulary: terms sharing a common namespace are grouped into one vocabulary
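The grouping step can be illustrated with a short sketch. The slide does not say how a term's namespace is determined, so the rule used here (cut after the last '#', or after the last '/' if there is no '#') is an assumption, and `namespace_of` and `group_by_namespace` are hypothetical helper names.

```python
from collections import defaultdict

def namespace_of(term_uri):
    """Return the URI up to and including the last '#', or else the last '/' (assumed rule)."""
    if "#" in term_uri:
        return term_uri.rsplit("#", 1)[0] + "#"
    return term_uri.rsplit("/", 1)[0] + "/"

def group_by_namespace(term_uris):
    """Group term URIs into vocabularies keyed by their namespace URI."""
    vocabularies = defaultdict(set)
    for term in term_uris:
        vocabularies[namespace_of(term)].add(term)
    return dict(vocabularies)

terms = [
    "http://xmlns.com/foaf/0.1/Person",
    "http://xmlns.com/foaf/0.1/knows",
    "http://purl.org/dc/elements/1.1/creator",
    "http://www.w3.org/2000/01/rdf-schema#Class",
]
vocabularies = group_by_namespace(terms)
```

Here the two FOAF terms end up in one vocabulary, while the Dublin Core and RDFS terms each form their own.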
8. Results
• 455,718 terms
– 396,023 classes and 59,868 properties (many are in the YAGO namespace)
• 2,996 vocabularies
– From 261 PLDs (many are from w3.org)
• Instantiation found for
– 115,707 classes (29.2%), e.g. foaf:Person
– 25,963 properties (43.4%), e.g. dc:creator
– 1,874 vocabularies (62.6%)
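The instantiation percentages follow directly from the counts on this slide. A small check, using only the figures above (the helper name `share` is hypothetical):

```python
def share(found, total):
    """Percentage of `total` covered by `found`, rounded to one decimal place."""
    return round(100 * found / total, 1)

classes = share(115_707, 396_023)      # instantiated classes out of all classes
properties = share(25_963, 59_868)     # instantiated properties out of all properties
vocabularies = share(1_874, 2_996)     # vocabularies with at least one instantiation found

print(classes, properties, vocabularies)
```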
10. NJVR for vocabulary ranking
• Using NJVR as a test case for vocabulary ranking
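As a rough illustration of what a vocabulary-ranking experiment might compare, the sketch below computes two common centrality measures, in-degree and PageRank, on a tiny graph of which vocabulary refers to which. The graph, the edge semantics, and the function names are all assumptions made for illustration; the slides do not specify which measures or graph model were used.

```python
def in_degree(graph):
    """Number of vocabularies that refer to each vocabulary."""
    scores = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            scores[target] += 1
    return scores

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread their rank uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank

# Hypothetical reference graph: every vocabulary appears as a key.
graph = {
    "foaf": ["rdfs"],
    "dc": ["rdfs"],
    "sioc": ["foaf", "dc", "rdfs"],
    "rdfs": [],
}
degree_scores = in_degree(graph)
pagerank_scores = pagerank(graph)
```

On this toy graph both measures put `rdfs` first; on a real repository the rankings could diverge, which is what a comparison of centrality measures would examine.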
11. Future work
• Removal of low-quality vocabularies from NJVR
• Comparative analysis of NJVR and other repositories
• …
12. Just use it!
ws.nju.edu.cn/njvr
ws.nju.edu.cn/falcons
Editor's Notes
Evaluations in these areas require not only systematically generated benchmarks but also, more importantly, a large and representative collection of real-world vocabularies.
Search engines have indexed thousands of vocabularies, but unfortunately they provide only search services, which makes it inconvenient for researchers to fully exploit the entire indexes.
An experiment that compares different centrality measures by using NJVR.