The document proposes an approach called XReal to improve the effectiveness of XML keyword search by:
1. Identifying the user's desired search target and search criteria nodes in the XML data based on heuristics that consider the relatedness and informativeness of nodes.
2. Defining XML-specific term frequency (TF) and document frequency (DF) measures to extend the traditional TF-IDF model to the XML domain while accounting for hierarchical structure and semantics.
3. Designing a keyword search engine that ranks query results based on the proposed XML TF-IDF similarity measure to better capture user search intention and address keyword ambiguity issues.
Effective XML Keyword Search with Relevance Oriented Ranking
1. Effective XML Keyword Search with Relevance Oriented Ranking
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu
2. Introduction
• XML Keyword search
– Inspired by IR style keyword search on the
web
– Enables user to access information in XML
database
– XML data modeled as a rooted, labeled tree
– Recent research efforts
• Efficiency
• Effectiveness
3. Effectiveness
• Capture user’s search intention
– Identify the target that user intends to search for
– Infer the predicate constraint that user intends to
search via
• Result ranking
–Rank the query results according to their
objective relevance to user search intention
4. State of the Art
• Search semantics design
– LCA (Lowest Common Ancestor)
• Node v is a LCA of keyword set K={w1, w2,…,wk} if the sub-tree
rooted at v contains at least one occurrence of all keywords in K,
after excluding the sub-elements that already contain all
keywords in K
– SLCA (Smallest LCA)
• Node v is a SLCA of keyword set K={w1, w2,…,wk} if
– (1) v is a LCA of K
– (2) no proper descendant of v is LCA of K
– XSeek
• Infers the search intention based on the concept of objects and
an analysis of the matching between keyword and data node
5. State of the Art (cont)
• Efficient result retrieval
– Designed based on a certain search semantics
– XKSearch, Multiway SLCA etc.
• Result ranking
– XRANK, XKSearch, EASE
– They only consider
• Structural compactness of matching results
• Keyword proximity
• Similarity at node level
6. Problems Unaddressed
• The user's search intention is not adequately addressed!
– Meaningfulness of query result
• SLCA is less meaningful in many cases
– Keyword Ambiguity Problems
1. A keyword can appear both as an XML node type and as the text value of some other nodes
2. A keyword can appear in the text values of different XML node types and carry different meanings
• Neither SLCA nor XSeek addresses keyword ambiguity well
7. Meaningfulness
• Keyword query “rock music”
– Search intention: find customers interested in “rock music”
– C3
– SLCA returns: interest node of C3
[Figure: example storeDB XML tree. Books: B1 (title "Art of Customer Interest Care") and B2. Customers: C1 "Mary Smith" (interest "fashion", address street "Art Street"), C2 "John Martin" (interest "street art"), C3 "Art Smith" (interest "rock music"), C4 "Rock Davis" (interest "art").]
Problems
8. Keyword Ambiguity
• Q = “customer, interest, art”
– Ambiguity 1: customer, interest; Ambiguity 2: art
– Intention: find customer whose interest is art
– Less relevant or irrelevant results are also returned: C1, C3, and B1's title
[Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]
9. Keyword Ambiguity (cont)
• Q = “customer, art”
– “art” can be the value of an interest node (C2, C4), a name node (C3), the street node of a customer (C1), or the title node of a book (B1)
– “customer” can be the tag name of a customer node, or (part of) the value of the title of B1
– How to rank C1 to C4 and B1?
[Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]
10. Objectives & Challenges
• Challenges
I. How to decide which sub-tree(s) with appropriate node types can
capture user desired information
II. How to return sub-trees of an appropriate size (i.e. contain enough
but non-overwhelming information)
III. How to rank those sub-trees by their relevance
• Address the following as a single problem
– Search intention identification
– Query result retrieval
– Result ranking
– Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
11. Challenges
• Difficulty in applying TF*IDF to XML
– An XML DB carries semantic information, while a text DB contains pure text. XML TF*IDF must be aware of the underlying semantics.
– All contents of XML data are stored in leaf nodes only
– What is the analogy of a “flat document” in XML?
o Sub-trees classified according to their prefix path
– The normalization factor is not simply the size of the sub-tree
o The structure of sub-trees may also affect the ranks
12. TF*IDF Recap
• Rule 1: A keyword appearing in many documents should
not be regarded as more important than a keyword
appearing in a few. --- IDF
• Rule 2: A document with more occurrences of a query
keyword should not be regarded as less important for
that keyword than a document that has less. --- TF
• Rule 3: A normalization factor is needed to balance
between long and short documents
– as Rule 2 discriminates against short documents which may
have less chance to contain more occurrences of keywords.
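The three rules can be made concrete with a small flat-text scorer. This is a sketch under stated assumptions: the weighting functions (`1 + ln(tf)`, `ln(1 + N/df)`, division by the square root of the document length) are common textbook choices picked for illustration, not formulas taken from these slides, and `tf_idf_scores` is a hypothetical helper name.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Score each document against the query with a simple TF*IDF sum."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for k in query.lower().split():
            if tf[k] == 0:
                continue
            idf = math.log(1 + n_docs / df[k])   # Rule 1: rarer terms weigh more
            tf_weight = 1 + math.log(tf[k])      # Rule 2: more occurrences weigh more
            score += tf_weight * idf
        # Rule 3: normalize by a function of document length (assumed: sqrt).
        scores.append(score / math.sqrt(max(len(tokens), 1)))
    return scores

docs = [
    "rock music rock",
    "art street",
    "rock art gallery painting exhibition museum",
]
scores = tf_idf_scores("rock", docs)
print(scores)
```

The short document with two occurrences of "rock" scores highest, the long document with one occurrence scores lower, and the document without the keyword scores zero.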
13. Our Approach
– Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, in order to capture the hierarchical structure of XML documents
• by analyzing statistics of the underlying XML data
– Major Contributions
1. Identify the user’s desired search-for node and search-via node(s) in a heuristic way
• Define XML TF (term frequency) and XML DF (document frequency)
• Confidence formulas for search-for/via candidates
2. Define XML TF*IDF similarity
• Propose 3 guidelines specifically for XML keyword search
• Take keyword ambiguity problems into account
3. Design a keyword search engine, XReal
14. Data Model
• Node type - Two nodes are of the same node type if they share the same prefix path
– e.g. /storeDB/customers/customer/name vs. /storeDB/books/book/publisher/name are two different node types, although both are tagged name
[Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]
• Value node – text values contained in leaf node
• Structural node
Single-valued node type, multi-valued node type
Grouping type – all its children are of same multi-valued type
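Grouping nodes by prefix path can be sketched as follows. The sample document is a small hypothetical fragment in the spirit of the storeDB example, and `nodes_by_type` is an illustrative helper name, not part of XReal.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: two customer names and one publisher name.
XML = """
<storeDB>
  <customers>
    <customer><name>Mary Smith</name></customer>
    <customer><name>Art Smith</name></customer>
  </customers>
  <books>
    <book><publisher><name>Oxford</name></publisher></book>
  </books>
</storeDB>
"""

def nodes_by_type(root):
    """Map each prefix path (node type) to the list of nodes that have it."""
    groups = defaultdict(list)

    def visit(node, prefix):
        path = prefix + "/" + node.tag
        groups[path].append(node)
        for child in node:
            visit(child, path)

    visit(root, "")
    return groups

groups = nodes_by_type(ET.fromstring(XML))
for path, nodes in groups.items():
    print(path, len(nodes))
```

Both kinds of name element share a tag, but they land in different groups because their prefix paths differ.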
15. XML TF and IDF
• XML DF (document frequency), written $f_k^T$
– The number of T-typed nodes that contain keyword k in their sub-trees in the XML database.
• Granularity of similarity measurement is sub-trees of a certain node type T
• XML TF (term frequency), written $f_{a,k}$
– The number of occurrences of a keyword k in a given value node a in the XML database.
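The two statistics can be computed directly on a small tree. This is a sketch, assuming that a keyword matches either a tag name or a whitespace-separated word of a node's text; the sample data mirrors the customers of the storeDB example, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """
<storeDB><customers>
<customer><ID>C1</ID><name>Mary Smith</name><address><street>Art Street</street></address><interests><interest>fashion</interest></interests></customer>
<customer><ID>C2</ID><name>John Martin</name><interests><interest>street art</interest></interests></customer>
<customer><ID>C3</ID><name>Art Smith</name><interests><interest>rock music</interest></interests></customer>
<customer><ID>C4</ID><name>Rock Davis</name><interests><interest>art</interest></interests></customer>
</customers></storeDB>
"""

def subtree_terms(node):
    """All tag names and text words that occur in the sub-tree of node."""
    terms = set()
    for n in node.iter():
        terms.add(n.tag.lower())
        terms.update((n.text or "").lower().split())
    return terms

def typed_nodes(root, node_type):
    """Nodes whose prefix path equals node_type."""
    found = []

    def visit(node, prefix):
        path = prefix + "/" + node.tag
        if path == node_type:
            found.append(node)
        for child in node:
            visit(child, path)

    visit(root, "")
    return found

def xml_df(root, node_type, keyword):
    """XML DF: number of T-typed nodes containing the keyword in their sub-tree."""
    k = keyword.lower()
    return sum(1 for n in typed_nodes(root, node_type) if k in subtree_terms(n))

def xml_tf(value_text, keyword):
    """XML TF: number of occurrences of the keyword in a value node's text."""
    return value_text.lower().split().count(keyword.lower())

root = ET.fromstring(XML)
CUSTOMER = "/storeDB/customers/customer"
INTEREST = CUSTOMER + "/interests/interest"
print(xml_df(root, CUSTOMER, "art"), xml_df(root, INTEREST, "art"))
print(xml_tf("street art", "art"))
```

On this data "art" occurs somewhere under every customer, but under only two interest nodes, which is why the node type T matters for the statistic.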
16. Infer the desired search-for node
• Guidelines: A node type T is considered as a desired
search for node if
1. T is intuitively related to every query keyword
2. XML nodes of type T should be informative enough to contain
enough relevant information
3. XML nodes of type T should be not overwhelming to contain too
much irrelevant information
• Confidence of T as the search-for node w.r.t. query q:
$$C_{for}(T, q) = \log_e\Bigl(1 + \prod_{k \in q} f_k^T\Bigr) \cdot r^{depth(T)}$$
• product instead of sum is used to follow the 1st guideline
• log part designed to follow the 3rd guideline
• exponential part designed to follow the 2nd guideline
• r is a decay factor in (0,1].
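A direct transcription of the search-for confidence formula, as a minimal sketch. The XML DF values, the depth values, and the decay factor r = 0.8 below are made-up numbers for illustration; the slides do not state how depth is counted or which r is used.

```python
import math

def confidence_for(df_by_keyword, depth, r=0.8):
    """C_for(T, q) = ln(1 + product of XML DFs over the query keywords) * r**depth(T).

    df_by_keyword maps each query keyword k to its XML DF for node type T.
    depth and r are assumptions chosen for this illustration.
    """
    product = math.prod(df_by_keyword.values())
    return math.log(1 + product) * (r ** depth)

# Hypothetical statistics for the query "customer, interest, art".
c_customer = confidence_for({"customer": 4, "interest": 4, "art": 4}, depth=3)
c_interest = confidence_for({"customer": 0, "interest": 4, "art": 2}, depth=5)
print(c_customer, c_interest)
```

Because the XML DFs are multiplied, a node type that is unrelated to even one keyword (a DF of 0) gets a confidence of 0, which is how the product enforces the first guideline.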
17. Infer the Search-Via Nodes
• Infer structural node to search via
– Structural node n is a good candidate if it is related to as many
(but not necessarily all) keywords as possible
• Search via node type normally is not unique
• Infer individual value node to search via
– Statistics alone is not adequate to infer the likelihood of a value
node as (part of) search via node
– Capture keyword co-occurrence
$$C_{via}(T, q) = \log_e\Bigl(1 + \sum_{k \in q} f_k^T\Bigr)$$
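The search-via confidence differs from the search-for confidence only in using a sum, so a node type related to some but not all keywords still scores above zero. A minimal sketch, with hypothetical XML DF values for the query "rock, art":

```python
import math

def confidence_via(df_by_keyword):
    """C_via(T, q) = ln(1 + sum of XML DFs over the query keywords)."""
    return math.log(1 + sum(df_by_keyword.values()))

# Hypothetical XML DFs per candidate node type (illustrative numbers only).
candidates = {
    "interest": {"rock": 1, "art": 2},
    "name": {"rock": 1, "art": 1},
    "street": {"rock": 0, "art": 1},
}
ranking = sorted(candidates, key=lambda t: confidence_via(candidates[t]), reverse=True)
print(ranking)
```

Several node types end up with non-zero confidence here, which matches the remark that the search-via node type is normally not unique.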
18. [Figure: the storeDB example XML tree, with customers C1 to C4 and books B1 and B2]
• E.g. Q = “customer, name, rock, interest, art”
– It is easy to find that name and interest have high confidence to be the search-via nodes
– But it is hard to know whether rock is a value of name or interest, and whether art is a value of interest or name
– How to distinguish customer C4 from C3?
– Capture keyword co-occurrence
19. Capture keyword co-occurrence
• Proximity factors for a value node v of type kt
containing keyword k
– Given a query q and a certain value node v, if there are two
keywords kt and k in q, s.t. kt matches the type of an
ancestor node of v and k matches a keyword in v
– In-Query distance
• Distance between keyword k and node type kt in query q
• Favors: kt appears before k
– Structural distance
• Depth distance between v and the nearest kt typed
ancestor node of v
– Value-Type distance
• Max of the above two
$$C_{via}(q, v, k) = 1 + \sum_{k_t \in q \cap ancType(v)} \frac{1}{Dist(q, v, k, k_t)}$$
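The proximity factors can be sketched as follows for the query “customer, name, rock, interest, art”. The helper names, the node depths, and the choice to treat the in-query distance as infinite when kt does not appear before k (one way to realise "favors: kt appears before k") are assumptions for this example.

```python
import math

def in_query_distance(query, k_t, k):
    """Position gap between k_t and k; infinite unless k_t precedes k (assumed)."""
    i, j = query.index(k_t), query.index(k)
    return j - i if i < j else math.inf

def value_confidence(query, k, ancestor_depths, value_depth):
    """C_via(q, v, k) = 1 + sum over k_t in q ∩ ancType(v) of 1 / Dist(q, v, k, k_t).

    ancestor_depths maps each ancestor type of v to the depth of the nearest
    ancestor of that type; value_depth is the depth of the value node v.
    """
    total = 1.0
    for k_t in query:
        if k_t == k or k_t not in ancestor_depths:
            continue
        structural = value_depth - ancestor_depths[k_t]
        # Value-type distance: the larger of the two distances.
        dist = max(in_query_distance(query, k_t, k), structural)
        total += 1.0 / dist
    return total

query = ["customer", "name", "rock", "interest", "art"]

# "rock" inside the name of C4 ("Rock Davis"); depths are hypothetical.
c4_name = value_confidence(query, "rock", {"customer": 2, "name": 3}, value_depth=4)

# "rock" inside an interest of C3 ("rock music"); depths are hypothetical.
c3_interest = value_confidence(
    query, "rock", {"customer": 2, "interests": 3, "interest": 4}, value_depth=5
)
```

With these numbers the value node under name scores higher for "rock" than the one under interest, because "name" directly precedes "rock" in the query while "interest" comes after it.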
20. Principles of XML keyword search
• Principle 1
– When searching for D-typed nodes via a single-valued type V,
ideally only the values and structures nested in V-typed nodes
can affect the relevance, regardless of the size of other typed
nodes nested in D-typed nodes.
• However, TF*IDF similarity in IR normalizes the relevance score of
each document w.r.t. its size
• Principle 2 – address keyword Ambiguity 2
– When searching for nodes of type D via a multi-valued type V’,
the relevance of a D-typed node which contains a query
relevant V’-typed node should not be affected (i.e. normalized)
too much by other query-irrelevant V’-typed nodes.
• Example: query “art” - C4 should not be less relevant than C1
21. Principles of XML keyword search
• Principle 1 and 2
– Especially useful for interpreting pure keyword query -
find search via node correctly
• Principle 3
– The order of keywords in a query is important to
indicate the search intention
• Incorporate the search via confidence Cvia we defined
before
22. XML TF*IDF Similarity
• To calculate the similarity between the search for
node and the query q
– Base case: similarity between value node a and q
• Apply original TF*IDF directly since a contains keywords
only without any structure
– Recursive case: similarity between structural node n
and q
• Based on similarities of its children c and the confidence
level of c as the node type to search via
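$$similarity(q, a) = \frac{\sum_{k \in q \cap a} W_{q,k}^{T_a} \cdot W_{a,k}}{W_q^{T_a} \cdot W_a}$$
– IDF: $$W_{q,k}^{T_a} = C_{via}(q, a, k) \cdot \ln\big(1 + N_{T_a} / (1 + f_k^{T_a})\big)$$
– TF: $$W_{a,k} = 1 + \ln(f_{a,k})$$
– Normalization factors: $$W_q^{T_a} = \sqrt{\sum_{k \in q} \big(W_{q,k}^{T_a}\big)^2}, \qquad W_a = \sqrt{\sum_{k \in a} W_{a,k}^2}$$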
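A minimal sketch of the base case, assuming the normalization factors are Euclidean norms of the query and node weight vectors. The function name `base_similarity`, its parameters, and the default of 1.0 for C_via(q, a, k) when no proximity information is supplied are assumptions for this example.

```python
import math

def base_similarity(query, node_tf, n_type, df_type, c_via=None):
    """TF*IDF similarity between query q and value node a.

    node_tf: keyword -> f_{a,k}, occurrences of k in a
    n_type:  N_{T_a}, number of nodes of a's type
    df_type: keyword -> f_k^{T_a}
    c_via:   keyword -> C_via(q, a, k); defaults to 1.0 (assumption)
    """
    c_via = c_via or {}
    # IDF-style query weights W_{q,k}^{T_a}.
    w_q = {
        k: c_via.get(k, 1.0) * math.log(1 + n_type / (1 + df_type.get(k, 0)))
        for k in query
    }
    # TF-style node weights W_{a,k}.
    w_a = {k: 1 + math.log(f) for k, f in node_tf.items() if f > 0}
    norm_q = math.sqrt(sum(w * w for w in w_q.values()))
    norm_a = math.sqrt(sum(w * w for w in w_a.values()))
    if norm_q == 0 or norm_a == 0:
        return 0.0
    dot = sum(w_q[k] * w_a[k] for k in w_q if k in w_a)
    return dot / (norm_q * norm_a)

# Hypothetical statistics: 4 interest nodes, 2 of which contain "art".
exact = base_similarity(["art"], {"art": 1}, n_type=4, df_type={"art": 2})
partial = base_similarity(["art"], {"street": 1, "art": 1}, n_type=4, df_type={"art": 2})
miss = base_similarity(["art"], {"rock": 1, "music": 1}, n_type=4, df_type={"art": 2})
```

Here a value node consisting only of "art" scores higher than "street art", which in turn scores higher than "rock music", reflecting the normalization by the node's own weight W_a.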
23. XML TF*IDF Similarity (cont.)
• Recursive Case
– Intuition 2. An internal node n is relevant to q, if n has a
child c such that the type of c has high confidence to be
a search via node w.r.t. q (i.e. large Cvia(Tc , q)), and c is
highly relevant to q (i.e. large sim(q, c)).
– Intuition 3. An internal node n is more relevant to q if n
has more query-relevant children, all else being
equal.
$$similarity(q, n) = \frac{\sum_{c \in chd(n)} sim(q, c) \cdot C_{via}(T_c, q)}{W_n^q}$$
– Numerator: weighted sum of the similarities of all of n’s children and their confidence to be the search via node
– Denominator $W_n^q$: overall weight of node n w.r.t. query q, which essentially plays the role of a normalization factor
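The recursive case can be sketched as a bottom-up traversal. The slide does not spell out how the weight W_n^q is computed, so the sketch takes it as a caller-supplied function; `placeholder_weight` (the norm of the children's type confidences) is purely a stand-in, and the `Node` class, confidences, and base similarities are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    children: list = field(default_factory=list)
    base_sim: float = 0.0  # sim(q, a) of a leaf's value, assumed precomputed

def similarity(node, c_via, weight):
    """sim(q, n) = sum_c sim(q, c) * C_via(T_c, q) / W_n^q, evaluated bottom-up.

    c_via:  node type -> C_via(T, q)
    weight: callable returning W_n^q for a node (supplied by the caller)
    """
    if not node.children:  # base case
        return node.base_sim
    total = sum(
        similarity(c, c_via, weight) * c_via.get(c.node_type, 0.0)
        for c in node.children
    )
    w = weight(node)
    return total / w if w else 0.0

def placeholder_weight(c_via):
    """Stand-in for W_n^q (an assumption, not the slide's definition)."""
    def weight(node):
        return math.sqrt(sum(c_via.get(c.node_type, 0.0) ** 2 for c in node.children))
    return weight

# Hypothetical confidences and base similarities for the query "art".
c_via = {"name": 0.5, "interests": 1.0, "interest": 1.0}
w = placeholder_weight(c_via)
c4 = Node("customer", [Node("name"), Node("interests", [Node("interest", base_sim=1.0)])])
c3 = Node("customer", [Node("name", base_sim=0.71), Node("interests", [Node("interest")])])
sim_c4, sim_c3 = similarity(c4, c_via, w), similarity(c3, c_via, w)
```

With these made-up numbers C4 ranks above C3 because its match sits under the node type with the higher search-via confidence.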
24. Flowchart of answering a query
1. Identify user search intention
– Compute the confidence of all possible candidate
node types and choose desired search for node Tfor
2. Relevance-oriented ranking
– Compute XML TF*IDF similarity in a bottom-up
approach from value nodes containing keywords up to
nodes of type Tfor
– Return a ranked list of sub-trees rooted at nodes of
type Tfor
• If more than one search for node type has comparable
confidence, a ranked list for each search for node type is returned
25. Experimental Result
• Data set
– DBLP, XMark, WSU, eBay
• Comparison
– Compare XReal with SLCA and XSeek
• Equipment
– Implemented in Java
– Run on a 3.6GHz Pentium IV PC with 1 GB memory and
Windows XP
– Berkeley DB Java Edition for storing keyword inverted
lists and the keyword frequency table
26. Search Effectiveness
• Accuracy in inferring the search for node
– Conducted by user survey
– Tested queries contain at least one of the two
ambiguity problems
– Conclusion
• XReal works well, especially when the search for
node is not given explicitly in the query
27. Search Effectiveness
• Result effectiveness
– Measured by precision, recall, F-measure
– Observations
• XReal achieves higher precision than SLCA and
XSeek for queries that contain ambiguities
• XReal performs as well as XSeek when queries
have no ambiguity in XML data
• XReal: Top-100 precision is higher than overall
precision
• F-measure also shows good overall effectiveness
of both XReal and XSeek
28. Ranking Effectiveness
• Metrics
– Number of Top-1 answers that are relevant
– Reciprocal Rank (R-Rank)
– Mean Average Precision (MAP)
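For reference, the three ranking metrics can be computed as in the sketch below, where each query's result list is a sequence of booleans marking relevant answers. The function names and the sample judgments are made up for illustration, and average precision is normalized by the number of relevant answers in the returned list, which is a simplification.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant answer, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance):
    """Mean of precision@rank taken at each relevant answer in the list."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Hypothetical relevance judgments for two queries.
runs = [[True, False, True], [False, True]]
top1_relevant = sum(1 for r in runs if r and r[0])
mean_rr = sum(reciprocal_rank(r) for r in runs) / len(runs)
map_score = mean_average_precision(runs)
```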
29. Efficiency & Scalability
• Compare three index options for XReal, and SLCA
– Dup
• Stores only the Dewey ID and XML TF $f_{a,k}$
– DupType
• Stores an extra node type (i.e. its prefix path)
– DupTypeNorm
• Stores an extra normalization factor $W_a$ for each value
node
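One way to picture the three index variants is as progressively richer inverted-list entries. The class names, field names, Dewey IDs, paths, and weights below are hypothetical; only the stored items (Dewey ID, XML TF, node type as prefix path, normalization factor W_a) come from the slide.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DupEntry:
    dewey_id: Tuple[int, ...]  # e.g. (1, 2, 3) for node 1.2.3
    tf: int                    # XML TF f_{a,k}

@dataclass(frozen=True)
class DupTypeEntry(DupEntry):
    node_type: str             # prefix path of the node

@dataclass(frozen=True)
class DupTypeNormEntry(DupTypeEntry):
    norm: float                # precomputed normalization factor W_a

# Hypothetical inverted list for the keyword "art", kept in document order.
inverted = {
    "art": sorted(
        [
            DupTypeNormEntry((1, 4, 3, 1), 1, "/storeDB/customers/customer/interests/interest", 1.0),
            DupTypeNormEntry((1, 3, 2), 1, "/storeDB/customers/customer/name", 1.41),
        ],
        key=lambda e: e.dewey_id,
    )
}
```

Presumably the richer entries trade index size for less work at query time, since the node type and W_a no longer need to be looked up or recomputed per match.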
32. [Figure: the storeDB example XML tree (customers C1 to C4, books B1 and B2), repeated from slide 18]
Editor's Notes
The underlying reason is that none of these SLCA-based approaches adequately addresses the user's search intention.
To avoid confusion, the node type here can be understood as the node type specified in the DTD.
Note that the purpose of introducing the “value node” is just to simplify the explanation and formula design in later sections, as leaf nodes play dual roles in an XML document: (1) they contain values; (2) they carry a tag name, which can be viewed as a structural node.
An XML node with a labeled name is called a “structural node”.
1. Note that the 2nd and 3rd guidelines constrain each other, which creates a dilemma.
An internal node at an appropriate height is most preferred.
2. In the above formula, r is a reduction factor in the range (0,1]; it is set to 0.8, which showed good performance in our experiments.
Mention: Most queries actually contain only value nodes, without any structural node. So after locating the appropriate structural node SN as the search via node (even when it does not occur among the keywords), it is also important to find the value node associated with each SN, if one is specified in the query keywords (or to find which value node best matches a given structural node). However, statistics alone cannot handle this matching; that is, the search engine cannot differentiate C4 from C3 in this case. To make this possible, we take keyword co-occurrence into account when designing the confidence formula for value nodes, and this confidence formula is incorporated into the TF*IDF similarity formula for the base case.
The pattern of keyword co-occurrence in a query provides a micro-level way to measure the likelihood of an individual value node to search via, as a complement to statistics.
Given a keyword query q and a certain value node v, if there are two keywords kt and k in q, such that kt matches the type of an ancestor node of v and k matches a keyword in v, then we define the following distances.
These principles clarify the differences in designing ranking functions between text databases and XML databases.
Principle 1, in other words, denotes that the size of the subtree rooted at a D-typed node d (except the subtree rooted at the search via node) should not affect d’s relevance to the query.
Principles 1 and 2 look trivial when the search via node is explicit.
We will incorporate all three principles into the design of XML TF*IDF formulas.
Explain the reason we incorporate Cvia(q,a,k).
Normalization factors play the role of balancing between XML leaf nodes containing many keywords and those with a few keywords.
1. The weighted sum in the numerator closely follows Intuitions 2 and 3.
2. Besides, since Intuition 3 usually favors internal nodes with more children, we need to normalize the relevance of a to q. That naturally leads to the use of Wq,a (computed via Formula 14) as the denominator.