Large knowledge bases consisting of entities and relationships between them have become vital sources of information for many applications. Most of these knowledge bases adopt the Semantic Web data model RDF as a representation model. Querying them is typically done with structured queries in graph-pattern languages such as SPARQL. However, such structured queries require some expertise from users, which limits access to these data sources. To overcome this, keyword search must be supported. In this paper, we propose a retrieval model for keyword queries over RDF graphs. Our model retrieves a set of subgraphs that match the query keywords and ranks them based on statistical language models. We show that our retrieval model outperforms state-of-the-art IR and DB models for keyword search over structured data, using experiments over two real-world datasets.
From April 1st, 2011 to March 31st, 2012, the Partnership's SmartBusiness team consulted with 251 businesses in Halifax, NS, the majority of which were small and medium-sized enterprises (SMEs). This report is a compilation of the 213 retention visits to local businesses and their experiences with the Halifax economy. Retention visits cover a variety of issues, ranging from perceptions of the local business climate to the company's local workforce, sales, and even immigration issues.
#ForoEGovAR | Foundations for Knowledge Society Policies - CESSI Argentina
Document prepared by Susana Finquelievich and Paul Hector for the Foro Argentino de Transformación Digital, organized by CESSI and the United Nations University (UNU-EGOV). Buenos Aires, March 7, 2016.
Networking is the surest way to find a job and build a career. These Networking 101 tips can help you make a great impression and connections in your community.
Entity Linking via Graph-Distance Minimization - Roi Blanco
Entity linking is a natural-language-processing task that consists of identifying strings of text that refer to a particular item in some reference knowledge base.
One instance of entity-linking can be formalized as an optimization problem on the underlying concept graph, where the quantity to be optimized is the average distance between chosen items.
Inspired by this application, we define a new graph problem that is a natural variant of the Maximum Capacity Representative Set. We prove that our problem is NP-hard for general graphs; nonetheless, it turns out to be solvable in linear time under some more restrictive assumptions. For the general case, we propose several heuristics: one of these tries to enforce the above assumptions, while the others try to optimize similar, easier objective functions. We show experimentally how these approaches perform with respect to some baselines on a real-world dataset.
Comparing Index Structures for Completeness Reasoning - Fariz Darari
Data quality is a major issue in the development of knowledge graphs. Data completeness is a key factor in data quality that concerns the breadth, depth, and scope of information contained in knowledge graphs. Given the amount of information contained in large-scale knowledge graphs (e.g., DBpedia, Wikidata), it is conceivable that they may be complete for a wide range of topics, such as children of Donald Trump, cantons of Switzerland, and presidents of Indonesia. Previous research has shown how one can augment knowledge graphs with statements about their completeness, stating which parts of the data are complete. Such meta-information can be leveraged to check query completeness, that is, whether the answer returned by a query is complete. Yet, it is still unclear how such a check can be done in practice, especially when a large number of completeness statements is involved. We devise implementation techniques to make completeness reasoning in the presence of large sets of completeness statements feasible, and experimentally evaluate their effectiveness in realistic settings based on the characteristics of real-world knowledge graphs.
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing - Maribel Acosta Deibe
Best Student Paper Award at the 8th International Conference on Knowledge Capture (K-CAP 2015).
http://tinyurl.com/hare-paper
Abstract:
Due to the semi-structured nature of RDF data, missing values affect the answer completeness of queries posed against RDF. To overcome this limitation, we present HARE, a novel hybrid query processing engine that brings together machine and human computation to execute SPARQL queries. We propose a model that exploits the characteristics of RDF in order to estimate the completeness of portions of a data set. The completeness model, complemented by crowd knowledge, is used by the HARE query engine to decide on the fly which parts of a query should be executed against the data set or via crowd computing. To evaluate HARE, we created and executed a collection of 50 SPARQL queries against the DBpedia data set. Experimental results clearly show that our solution accurately enhances answer completeness.
(The HARE logo is based on artwork by icons8: https://icons8.com/.)
Introduction to Knowledge Graphs with Grakn and Graql - Vaticle
Cognitive/AI systems process knowledge that is far too complex for current databases. They require an expressive data model and an intelligent query language to perform knowledge engineering over complex datasets.
In this talk, we will discuss how Grakn, a database to organise complex networks of data and make it queryable, provides the knowledge graph foundation for intelligent systems to manage complex data.
We will discuss how Graql, Grakn's reasoning (through OLTP) and analytics (through OLAP) query language, provides the tools required to do the job: a knowledge schema, a logical inference language, and a distributed analytics framework.
And finally, we will discuss how Graql serves as a unified representation of data for cognitive systems.
2011 Search Query Rewrites - Synonyms & Acronyms - Brian Johnson
July 27, 2011 Bay Area Search Presentation
Brian Johnson, Engineering Director, Query Services @ eBay
Query expansion is an important part of search recall for all search engines. In this talk I'll discuss some of the general trends driving Hadoop adoption within the Search Query Services team at eBay, and the types of algorithms/techniques we've moved to Hadoop at eBay. Over time we've moved from smaller, editorial data sets to large machine-generated data sets mined from behavior log data, items/listings, catalogs, etc. One common workflow is to mine large candidate rewrite/expansion data sets from multiple data sources, use crowd-sourced human judgment to classify a subset of the candidates (true positive, false positive), use machine learning techniques to discard false positives, run automated validation on the final data set, and automatically push to production.
Ravi Jammalakadaka, Senior Applied Researcher, Query Services @ eBay
Ravi is a real engineer, not a pointy-haired manager like the previous speaker. Expect some real engineering :-) He'll be doing a literature review of acronym mining and discussing a real-world implementation.
Title: Mining Acronyms From Raw Text
Abstract: A significant number of eBay products are known by their acronyms. eBay's query expansion service expands user queries with their acronym equivalents to increase recall. The challenge is to mine acronyms from either seller (e.g., item descriptions, titles) or buyer (e.g., queries) data.
Ravi will present state-of-the-art algorithms from recent conferences that mine acronyms from raw text, along with their limitations. He will then present a new acronym mining algorithm that seeks to address those limitations, and a machine learning classifier that removes the false positives generated by the mining algorithm.
Presentation of the paper titled "Leveraging Semantic Parsing for Relation Linking over Knowledge Bases" at the ISWC 2020 - Research Track.
@inproceedings{mihindu-sling-2020,
  title = "Leveraging Semantic Parsing for Relation Linking over Knowledge Bases",
  author = "Mihindukulasooriya, Nandana and Rossiello, Gaetano and Kapanipathi, Pavan and Abdelaziz, Ibrahim and Ravishankar, Srinivas and Yu, Mo and Gliozzo, Alfio and Roukos, Salim and Gray, Alexander",
  booktitle = "The Semantic Web -- ISWC 2020",
  year = "2020",
  publisher = "Springer International Publishing",
  address = "Cham",
  pages = "402--419",
  url = "https://link.springer.com/chapter/10.1007/978-3-030-62419-4_23",
  doi = "10.1007/978-3-030-62419-4_23"
}
Mining Interesting Trivia for Entities from Wikipedia PART-II - Abhay Prakash
The following presentation is on my Master's graduate thesis work, "Mining Interesting Trivia for Entities from Wikipedia".
This presentation is the second part, continuing my earlier presentation of the same title ending in 'PART-I'.
Slides used for the keynote at the event Big Data & Data Science http://eventos.citius.usc.es/bigdata/
Some slides are borrowed from random hadoop/big data presentations
Slides for the iDB summer school (Sapporo, Japan) http://db-event.jpn.org/idb2013/
Typically, Web mining approaches have focused on enhancing or learning about user seeking behavior from query log analysis and click-through usage, employing the web graph structure for ranking, or detecting spam and web page duplicates. Lately, there is a trend toward mining web content semantics and dynamics in order to enhance search capabilities, either by providing direct answers to users or by allowing for advanced interfaces and capabilities. In this tutorial we will look into different ways of mining textual information from Web archives, with a particular focus on how to extract and disambiguate entities, and how to put them to use in various search scenarios. Further, we will discuss how web dynamics affect information access and how to exploit them in a search context.
Influence of Timeline and Named-entity Components on User Engagement - Roi Blanco
Nowadays, successful applications are those with features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This is in contrast with previous studies, where the importance of these components was examined from a retrieval-effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were installed. Further, we investigate whether we can predict user-centred metrics through users' interaction with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system or not. These findings might steer systems toward a more personalised user experience, tailored to users' preferences.
Broad introduction to information retrieval and web search, used for teaching at the Yahoo Bangalore Summer School 2013. Slides are a mash-up from my own and other people's presentations.
Beyond document retrieval using semantic annotations - Roi Blanco
Traditional information retrieval approaches deal with retrieving full-text documents in response to a user's query. However, applications that go beyond the "ten blue links" and make use of additional information to display and interact with search results are becoming increasingly popular and have been adopted by all major search engines. In addition, recent advances in text extraction allow for inferring semantic information about particular items present in textual documents. This talk presents how enhancing a document with structures derived from shallow parsing can convey a different user experience in search and browsing scenarios, and what challenges we face as a consequence.
Extending BM25 with multiple query operators - Roi Blanco
Traditional probabilistic relevance frameworks for information retrieval refrain from taking positional information into account, due to the hurdles of developing a sound model while avoiding an explosion in the number of parameters. Nonetheless, the well-known BM25F extension of the successful Okapi ranking function can be seen as an embryonic attempt in that direction. In this paper, we proceed along the same line, defining the notion of a virtual region: a virtual region is a part of the document that, like a BM25F field, can provide (larger or smaller, depending on a tunable weighting parameter) evidence of relevance of the document; differently from BM25F fields, though, virtual regions are generated implicitly by applying suitable (usually, but not necessarily, position-aware) operators to the query. This technique fits nicely into the eliteness model behind BM25 and provides a principled explanation of BM25F; it specializes to BM25(F) for some trivial operators, but has a much more general appeal. Our experiments (both on standard collections, such as TREC, and on Web-like repertoires) show that the use of virtual regions is beneficial for retrieval effectiveness.
Energy-Price-Driven Query Processing in Multi-center Web Search Engines - Roi Blanco
Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes the challenge to reduce the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.
Effective and Efficient Entity Search in RDF Data - Roi Blanco
Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors combined have led to an increased interest in semantic search, i.e. access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other state-of-the-art methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the open-source MG4J framework.
Caching Search Engine Results over Incremental Indices - Roi Blanco
A Web search engine must update its index periodically to incorporate changes to the Web. We argue in this paper that index updates fundamentally impact the design of search engine result caches, a performance-critical component of modern search engines. Index updates lead to the problem of cache invalidation: invalidating cached entries of queries whose results have changed. Naive approaches, such as flushing the entire cache upon every index update, lead to poor performance and in fact, render caching futile when the frequency of updates is high. Solving the invalidation problem efficiently corresponds to predicting accurately which queries will produce different results if re-evaluated, given the actual changes to the index.
To this end, we propose a framework for developing invalidation predictors and define metrics to evaluate invalidation schemes. We describe concrete predictors using this framework and compare them against a baseline that uses a cache invalidation scheme based on time-to-live (TTL). Evaluation over Wikipedia documents using a query log from the Yahoo! search engine shows that selective invalidation of cached search results can lower the number of unnecessary query evaluations by as much as 30% compared to a baseline scheme, while returning results of similar freshness. In general, our predictors enable fewer unnecessary invalidations and fewer stale results compared to a TTL-only scheme for similar freshness of results.
We study the problem of finding sentences that explain the relationship between a named entity and an ad-hoc query, which we refer to as entity support sentences. This is an important sub-problem of entity ranking which, to the best of our knowledge, has not been addressed before. In this paper we give the first formalization of the problem, how it can be evaluated, and present a full evaluation dataset. We propose several methods to rank these sentences, namely retrieval-based, entity-ranking based and position-based. We found that traditional bag-of-words models perform relatively well when there is a match between an entity and a query in a given sentence, but they fail to find a support sentence for a substantial portion of entities. This can be improved by incorporating small windows of context sentences and ranking them appropriately.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions), and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... - Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 4. In this session, we will cover a Test Manager overview along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Pushing the limits of ePRTC: 100ns holdover for 100 days - Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
GridMate - End to end testing is a critical piece to ensure quality and avoid... - ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
3. Shady Elbassuoni, Keyword Search over RDF Graphs, CIKM 2011
Searching RDF Data
Structured triple-pattern queries (SPARQL)
Example: comedies that have won an academy award
SELECT ?m
WHERE {?m hasGenre Comedy . ?m hasWonPrize Academy_Award}
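As a rough illustration, the conjunctive pattern query above can be evaluated over a toy triple set in plain Python. The triples and names below are illustrative, mirroring the slides' example rather than any real dataset:

```python
# Hypothetical toy triple set; entity and predicate names mirror the slides.
triples = {
    ("Innerspace", "hasGenre", "Comedy"),
    ("Innerspace", "hasWonPrize", "Academy_Award"),
    ("Road_Trip", "hasGenre", "Comedy"),
    ("Traffic", "hasWonPrize", "Academy_Award"),
}

def bindings(predicate, obj):
    """Subjects ?m that satisfy the single triple pattern (?m, predicate, obj)."""
    return {s for (s, p, o) in triples if p == predicate and o == obj}

# SELECT ?m WHERE {?m hasGenre Comedy . ?m hasWonPrize Academy_Award}
# A conjunction of two patterns over ?m is the intersection of their bindings.
answers = bindings("hasGenre", "Comedy") & bindings("hasWonPrize", "Academy_Award")
print(answers)  # {'Innerspace'}
```

Only Innerspace satisfies both triple patterns, so it is the single answer binding for ?m.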
Triple-pattern queries are very expressive but are not that usable
Most users and search APIs prefer keyword queries
Support keyword search over RDF graphs
Keyword Search over RDF Data
How to process keyword queries?
Translate keyword queries into SPARQL
Directly process the queries over the RDF graph
What are the results of a keyword query?
Resources
Triples
Tuples of triples (subgraphs)
Processing Keyword Queries
Construct a document D(t) for each triple t
D(t) contains all literals in t and any text associated with the URIs in t
Example:
t: Innerspace hasGenre Comedy
innerspace USA1987 science fiction comedy film Joe Dante Michael Finnell Dennis Quaid Martin Short Meg Ryan academy award best visual effects …
We can now create triple-term indexes
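The D(t) construction and the triple-term index can be sketched as follows; the uri_text mapping is an illustrative stand-in for whatever text is associated with each URI:

```python
from collections import defaultdict

# Sketch: build a "document" D(t) for each triple from its literals and
# any text attached to its URIs, then index triples by their terms.
# uri_text is a hypothetical per-URI text mapping, not the paper's data.
uri_text = {
    "Innerspace": "innerspace 1987 science fiction comedy film",
    "hasGenre": "genre",
    "Comedy": "comedy",
}

def document(triple):
    """D(t): concatenate text for each URI (or the literal itself)."""
    return " ".join(uri_text.get(part, part.lower()) for part in triple).split()

index = defaultdict(set)  # term -> set of triples whose D(t) contains it
t = ("Innerspace", "hasGenre", "Comedy")
for term in document(t):
    index[term].add(t)

print(sorted(index["comedy"]))
```

With such an index, each query keyword maps directly to a posting list of candidate triples.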
Retrieving Query Results
Retrieve a list of triples matching a query keyword
Join the triples from different lists based on their URIs
comedy award
Innerspace hasGenre Comedy
Road_Trip hasGenre Comedy
Toy_Story hasGenre Comedy
Diner type Comedy_films
Police_Academy type Comedy_films
The_Darwin_Awards type Comedy_films
...
Traffic hasWonPrize Academy_Award
Innerspace hasWonPrize Academy_Award
Toy_Story hasWonPrize Academy_Award
Diner hasWonPrize Academy_Award
The_Darwin_Awards type Comedy_films
...
Result Ranking is crucial!!
T: Innerspace hasGenre Comedy . Innerspace hasWonPrize Academy_Award
T: Toy_Story hasGenre Comedy . Toy_Story hasWonPrize Academy_Award
T: Police_Academy type Comedy_Films . The_Darwin_Awards type Comedy_Films
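The retrieve-then-join step can be sketched like this, joining on a shared subject URI (the slides join on URIs generally; subject-subject is just the simplest case). Triple lists mirror the example above:

```python
from itertools import product

# Sketch: one triple list per query keyword (from a term index), then
# pair up triples that share a URI — here, the subject.
lists = {
    "comedy": [("Innerspace", "hasGenre", "Comedy"),
               ("Toy_Story", "hasGenre", "Comedy"),
               ("Road_Trip", "hasGenre", "Comedy")],
    "award":  [("Innerspace", "hasWonPrize", "Academy_Award"),
               ("Toy_Story", "hasWonPrize", "Academy_Award"),
               ("Traffic", "hasWonPrize", "Academy_Award")],
}

results = [(t1, t2) for t1, t2 in product(lists["comedy"], lists["award"])
           if t1[0] == t2[0]]  # join condition: shared subject URI
for t1, t2 in results:
    print(t1, ".", t2)
```

Each joined pair is one candidate subgraph result; as the slide stresses, ranking these candidates is where the real work lies.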
Language Models for Triples
t: Innerspace hasGenre Comedy
Estimate P(w|D(t)) from the triple's document D(t):
w           P(w|D(t))
innerspace  0.234
1987        0.123
science     0.012
fiction     0.020
comedy      0.111
film        0.179
classic     0.111
meg         0.019
ryan        0.019
oscar       0.148
...         ...
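One common way to estimate such a distribution is a smoothed unigram language model; the sketch below uses linear (Jelinek-Mercer-style) smoothing against a collection model, with an illustrative weight and toy documents:

```python
from collections import Counter

# Sketch: P(w|D(t)) as a maximum-likelihood estimate over the triple's
# document, linearly smoothed with the collection model. alpha and the
# documents are illustrative, not the paper's actual parameters.
doc = "innerspace 1987 science fiction comedy film comedy".split()
collection = doc + "drama thriller film film award".split()
alpha = 0.5  # smoothing weight (assumption)

tf, cf = Counter(doc), Counter(collection)

def p_w_given_doc(w):
    return (alpha * tf[w] / len(doc)
            + (1 - alpha) * cf[w] / len(collection))
```

Smoothing keeps P(w|D(t)) nonzero for words outside D(t), which matters when a result must be scored against every query keyword.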
Ranking Model
comedy award
T: Innerspace hasGenre Comedy . Innerspace hasWonPrize Academy_Award
but we treat triples as bags of words!
Ranking Model
comedy award
T: Innerspace hasGenre Comedy . Innerspace hasWonPrize Academy_Award
probability of the structure of triple t
being relevant to keyword w
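The slide's scoring formula itself is an image and is not reproduced here; as a hedged stand-in, a generic query-likelihood ranking under the bag-of-words view could look like the sketch below, with made-up per-result probabilities:

```python
import math

# Hedged sketch of query-likelihood ranking (bag-of-words view): a joined
# result T is scored by how likely its combined document generates the
# query keywords. Result names and probabilities are illustrative.
p_w_given_T = {
    ("Innerspace-result", "comedy"): 0.111,
    ("Innerspace-result", "award"): 0.148,
    ("Darwin-result", "comedy"): 0.090,
    ("Darwin-result", "award"): 0.020,
}

def score(result, query):
    # log-space product of per-keyword probabilities; tiny floor for
    # unseen keywords stands in for proper smoothing
    return sum(math.log(p_w_given_T.get((result, w), 1e-9)) for w in query)

query = ["comedy", "award"]
ranked = sorted(["Innerspace-result", "Darwin-result"],
                key=lambda r: score(r, query), reverse=True)
print(ranked)  # → ['Innerspace-result', 'Darwin-result']
```

The next slide shows what this pure bag-of-words view misses: which predicate matched the keyword.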
Estimating Structural Relevance
For each keyword, construct a probability
distribution over predicates
Example: award
r                P(r|w)
hasWonPrize      0.459
wasNominatedFor  0.387
type             0.112
directed         0.020
actedIn          0.021
producedIn       0.025
bornIn           0.008
...              ...
estimated from the whole dataset
P(Innerspace hasWonPrize Academy_Award|award) = P(hasWonPrize|award)
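Given the P(r|w) table above, the structural relevance of a triple to a keyword reduces to a lookup on its predicate; the probabilities below copy the slide's example values, and the function name is illustrative:

```python
# Sketch: structural relevance via the per-keyword predicate
# distribution. Values are the slide's example numbers for "award".
p_pred_given_kw = {
    "award": {"hasWonPrize": 0.459, "wasNominatedFor": 0.387,
              "type": 0.112, "directed": 0.020, "actedIn": 0.021,
              "producedIn": 0.025, "bornIn": 0.008},
}

def structural_relevance(triple, keyword):
    """P(triple's structure relevant to keyword) = P(predicate | keyword)."""
    return p_pred_given_kw.get(keyword, {}).get(triple[1], 0.0)

t = ("Innerspace", "hasWonPrize", "Academy_Award")
print(structural_relevance(t, "award"))  # → 0.459
```

This is why a triple matching "award" through hasWonPrize outranks one matching it only through a type edge to Comedy_films.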
Example Ranked Query Results
comedy award
Bag of Words
Combat_Academy type Comedy_films . The_Darwin_Awards type Comedy_films
Police_Academy type Comedy_films . The_Darwin_Awards type Comedy_films
Innerspace hasGenre Comedy . Innerspace hasWonPrize Academy_Award
Structure Aware
Innerspace hasGenre Comedy . Innerspace hasWonPrize Academy_Award
Toy_Story hasGenre Comedy . Toy_Story hasWonPrize Academy_Award
Shrek hasWonPrize Academy_Award_Best_Animated_Feature . Shrek hasGenre Comedy
Experimental Setup
User study over two RDF datasets:
movies from IMDB
books from LibraryThing
Models compared:
Structure Aware Approach
Bag of Words Approach
Language-model-based Object Retrieval
BANKS (keyword search over databases)
Experimental Setup
30 evaluation queries
Gathered relevance assessments for the top-50 results retrieved by each model
Conclusion
Keyword Search over RDF data is crucial
To support keyword search over RDF data
Combine structured triples with text
Construct a document for each triple
Retrieve meaningful query results
Tuples of joined triples
Can be extended to larger subgraphs of the RDF
graph
Rank the retrieved results
A language model approach that uses both text and
structure
If we zoom in on one of these datasets, it is basically just a set of triples with three fields: subject, predicate, and object. RDF is a very flexible means of encoding structured information in a machine-readable format. For instance, the first triple here states that the movie Traffic has won an Academy Award. Note that subjects and objects are URIs or literals, and predicates are URIs.
So how do we search RDF data? We use structured query languages like SPARQL, where a query is a set of triple patterns. A triple pattern is just a triple with one or more variables. Let's look at an example. Suppose we are looking for comedies that have won an Academy Award. This can be expressed using the triple-pattern query in the pink box: the ?m is a variable, and the triples in the curly braces are triple patterns. In particular, the first one has the predicate hasGenre and
So triple patterns are really powerful and can be used to find very interesting information, but do we really expect regular users to use them? Unfortunately not! And since we computer scientists are nice people who always try to make the lives of casual users easier, we need to enable them to search RDF data using keywords.
An RDF dataset can also be viewed as a graph where subjects and objects are nodes and predicates are labeled edges. For example, the triple about Traffic winning an Academy Award is represented by this edge.