Change Detection in XML Documents
using Semantic Identifiers
BY
KAILAASH BALACHANDRAN
Outline
• Motivation
• Introduction
• The Approach
  • 2-step Algorithm
  • Identifiers
  • Axioms
• Semantic Change Detection
  • Finding Identifiers
  • Matching Nodes
• Examples
• Conclusion

Motivation(1)

Fig.1. Version 1

<author>
<name>Dan Brown</name>
<book>
<title>The Da Vinci Code</title>
<publisher>Doubleday</publisher>
<price> $35 </price>
</book>
<book>
<title>Angels and Demons</title>
<publisher>Pocket Star</publisher>
<price> $56</price>
</book>
</author>

Fig.2. Version 2

<author>
<name>Dan Brown</name>
<book>
<title>The Da Vinci Code</title>
<publisher>Doubleday</publisher>
<salesprice>$35</salesprice>
<isbn>0385504209</isbn>
</book>
<book>
<title>Angels &amp; Demons</title>
<publisher>Pocket Star</publisher>
<price>$56</price>
</book>
</author>
Motivation(2)
Fig.1. Version 1
<author>
<name>Dan Brown</name>
<book>
<title>The Da Vinci Code</title>
<publisher>Doubleday</publisher>
<price> $35 </price>
</book>
<book>
<title>Angels and Demons</title>
<publisher>Pocket Star</publisher>
<price> $56</price>
</book>
</author>

Fig.3. Version 3

<publisher>Doubleday
<book>
<title>The Da Vinci Code</title>
<author>
<name>Dan Brown</name>
</author>
<price> $35</price>
</book>
</publisher>
<publisher>Pocket Star
<book>
<title>Angels and Demons</title>
<author>
<name>Dan Brown</name>
</author>
<price> $56</price>
</book> </publisher>
Motivation(3)
Disadvantages of the structural detection approach:

• It is difficult to associate elements in different versions.
• It breaks down when the changes are significant.
• It affects incremental evaluation.
• The cost of changes to the data is high.

Introduction
What is Semantic Based Change Detection?
A process of identifying changes between successive versions of a document
based on its semantics, rather than on the structure of the document.
The Approach:
1. Find a semantic identifier for each node in the XML model.
2. Use these identifiers to associate nodes across multiple versions.

Identifiers

• The type of an element is the list of labels from the root to the element, separated by '/'.
• An identifier serves to distinguish elements of the same type.
• Two nodes x and y are semantically the same if and only if their identifiers evaluate to
the same result:

Eval(x, L) = Eval(y, L)

where
• x, y are the nodes,
• L = {E1, E2, ..., En} is the list of expressions.
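
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard `xml.etree.ElementTree` module, the function names are made up for this example, and the one-expression identifier `L = ["title"]` for book nodes is hypothetical (the slides do not give an identifier for the Dan Brown data). The sample data is a shortened form of Fig.1 and Fig.2.

```python
import xml.etree.ElementTree as ET

def node_types(root):
    """Map each element to its type: the labels from the root to it, joined by '/'."""
    types = {root: root.tag}
    for parent in root.iter():          # document order, so parents come first
        for child in parent:
            types[child] = types[parent] + "/" + child.tag
    return types

def eval_identifier(node, expressions):
    """Eval(x, L): evaluate each expression with node as context, collect the text."""
    return tuple(
        tuple((hit.text or "").strip() for hit in node.findall(expr))
        for expr in expressions
    )

def semantically_same(x, y, expressions):
    """Two nodes are the same when their identifiers evaluate to the same result."""
    return eval_identifier(x, expressions) == eval_identifier(y, expressions)

v1 = ET.fromstring(
    "<author><name>Dan Brown</name>"
    "<book><title>The Da Vinci Code</title><price> $35 </price></book>"
    "<book><title>Angels and Demons</title><price> $56</price></book>"
    "</author>"
)
v2 = ET.fromstring(
    "<author><name>Dan Brown</name>"
    "<book><title>The Da Vinci Code</title><salesprice>$35</salesprice></book>"
    "</author>"
)

L = ["title"]                 # hypothetical local identifier for book nodes
b1, b2 = v1.findall("book")   # the two books of Version 1
c1 = v2.find("book")          # the book of the shortened Version 2

print(node_types(v1)[b1])
print(semantically_same(b1, c1, L), semantically_same(b1, b2, L))
```

Under this assumed identifier, renaming `<price>` to `<salesprice>` does not affect the association, because the price is not part of the identifier.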
Identifiers

Local Identifier: An identifier is local if it evaluates to descendants of the context
node; otherwise it is non-local.

Version 1 (<name> is local):

<author>
<name>Dan Brown</name>
<book>
<title>The Da Vinci Code</title>
<publisher>Doubleday</publisher>
<price> $35 </price>
</book>
<book>
<title>Angels and Demons</title>
<publisher>Pocket Star</publisher>
<price> $56</price>
</book>
</author>

Version 3 (<name> is non-local):

<publisher>Doubleday
<book>
<title>The Da Vinci Code</title>
<author>
<name>Dan Brown</name>
</author>
<price> $35</price>
</book>
</publisher>
<publisher>Pocket Star
<book>
<title>Angels and Demons</title>
<author>
<name>Dan Brown</name>
</author>
<price> $56</price>
</book>
</publisher>
Identifying nodes based on their semantics

The Algorithm
Phase 1:
• Runs in a bottom-up fashion.
• Identifies all local identifiers.
• Identifies semantically different nodes.
Phase 2:
• Runs recursively and identifies non-local identifiers.
• Finds all semantically distinct nodes.
Any remaining node is a redundant copy of another node in the document.
Identifying nodes based on their semantics (Phase 1)
Axiom 1: Nodes that are structurally different are semantically different.
<publisher>Doubleday
<book>
<title>The Da Vinci Code</title>
<author>
<name>Dan Brown</name>
</author>
</book>
</publisher>
<publisher>Pocket Star
<book>
<title>Angels and Demons</title>
<author>
<name>Dan Brown</name>
</author>
</book> </publisher>

The two <publisher> nodes are structurally different, so they are semantically different.

The two <author> subtrees in this example are structurally identical. Are they semantically the same?

Identifying nodes based on their semantics (Phase 2)
Axiom 2: Nodes that are structurally identical are semantically identical
if and only if their respective parents are semantically identical or
they are both root nodes.

The two <author> nodes are not semantically the same, because they are in the context of two
different books.
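
The two axioms can be sketched as follows. This is a rough illustration, not the algorithm from the talk: it assumes Python's `xml.etree.ElementTree`, treats "structurally identical" as equal tag, text, and ordered children, and wraps the two `<publisher>` fragments in a hypothetical `<catalog>` root so that the input is a single well-formed document. The names `signature` and `semantic_keys` are made up for this example.

```python
import xml.etree.ElementTree as ET

def signature(el):
    """Structural form of a subtree: tag, own leading text, child signatures in order."""
    return (el.tag, (el.text or "").strip(), tuple(signature(c) for c in el))

def semantic_keys(root):
    """Key of a node = its own structure plus the key of its parent.

    Axiom 1: different structure gives different keys.
    Axiom 2: equal structure gives equal keys only if the parents' keys are
    equal (or both nodes are roots, whose parent key is None).
    """
    keys = {root: (signature(root), None)}
    for parent in root.iter():          # document order, so parents come first
        for child in parent:
            keys[child] = (signature(child), keys[parent])
    return keys

doc = ET.fromstring(
    "<catalog>"                                   # hypothetical wrapper root
    "<publisher>Doubleday<book><title>The Da Vinci Code</title>"
    "<author><name>Dan Brown</name></author></book></publisher>"
    "<publisher>Pocket Star<book><title>Angels and Demons</title>"
    "<author><name>Dan Brown</name></author></book></publisher>"
    "</catalog>"
)
keys = semantic_keys(doc)
pub1, pub2 = doc.findall("publisher")
auth1, auth2 = doc.findall("./publisher/book/author")

print(signature(pub1) == signature(pub2))    # structurally different publishers
print(signature(auth1) == signature(auth2))  # structurally identical authors
print(keys[auth1] == keys[auth2])            # but different parents (books)
```

In this sketch, two nodes with equal keys would be redundant copies of each other in the sense of the algorithm slide.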
Semantic Change Detection

How to handle structural changes?

[Figure: tree diagrams of Version 1 (nodes A, X, Y, Z) and Version 2 (nodes Y, X, Z)]

Assumption: Identifying information will remain nearby.
Semantic Change Detection
• Type Territory: The territory of a type T is the set of all text nodes that are
descendants of the least common ancestor (lca) of all type T nodes.
• Within the type territory is the territory controlled by individual nodes of that
type.
• Node Territory: The territory of a type T node p is the type territory of T,
excluding all text nodes that are descendants of other type T nodes.

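
The two territory definitions can be computed directly from a parsed document. The sketch below is an assumption-laden illustration: it uses Python's `xml.dom.minidom` (which exposes text nodes and parent links), ignores whitespace-only text nodes, and all function names are invented here. The sample document is the `<bib>` Version 1 used in the identifier examples of this talk.

```python
from xml.dom import minidom

def node_type(el):
    """Labels from the root element down to el, joined by '/'."""
    labels = []
    while el is not None and el.nodeType == el.ELEMENT_NODE:
        labels.append(el.tagName)
        el = el.parentNode
    return "/".join(reversed(labels))

def elements(node):
    """All element nodes at or below node, in document order."""
    if node.nodeType == node.ELEMENT_NODE:
        yield node
    for child in node.childNodes:
        yield from elements(child)

def text_nodes(node):
    """All non-blank text nodes at or below node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        if node.data.strip():
            yield node
        return
    for child in node.childNodes:
        yield from text_nodes(child)

def lca(nodes):
    """Least common ancestor: deepest node shared by every root-to-node path."""
    def chain(n):
        path = []
        while n is not None:
            path.append(n)
            n = n.parentNode
        return path[::-1]
    common = None
    for level in zip(*(chain(n) for n in nodes)):
        if any(n is not level[0] for n in level):
            break
        common = level[0]
    return common

def type_territory(doc, t):
    """Text nodes below the lca of all nodes of type t."""
    members = [e for e in elements(doc) if node_type(e) == t]
    return list(text_nodes(lca(members)))

def node_territory(doc, p):
    """Type territory of p's type, minus text below the other nodes of that type."""
    t = node_type(p)
    foreign = {id(x) for e in elements(doc)
               if node_type(e) == t and e is not p
               for x in text_nodes(e)}
    return [x for x in type_territory(doc, t) if id(x) not in foreign]

doc = minidom.parseString(
    "<bib><author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author></bib>"
)
books = [e for e in elements(doc) if e.tagName == "book"]
type_values = [x.data for x in type_territory(doc, "bib/author/book")]
node_values = [x.data for x in node_territory(doc, books[0])]
print(type_values)
print(node_values)
```

Note that the node territory of the first book reaches outside the book itself: it also contains both author names, because they lie below the lca of the book nodes and are not descendants of another book.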
Node and Type Territory

[Figure: a document tree showing the document root, lca(p), and the type p nodes p1, p2 and p3, with the type territory of p and the node territories of p1 and p2 marked]
Finding Identifiers

Version 1:

<bib>
<author><name>n1</name>
<book>
<title>t1</title>
<publisher>p1</publisher>
</book>
</author>
<author><name>n2</name>
<book>
<title>t2</title>
<publisher>p2</publisher>
</book>
<book>
<title>t1</title>
<publisher>p1</publisher>
</book></author>
</bib>

Version 2:

<bib>
<pub> p1
<book>
<title>t1</title>
<author>
<name>n1</name>
</author>
</book>
</pub>
<pub> p2
<book>
<title>t2</title>
<author>
<name>n2</name>
</author>
</book></pub>
</bib>
Identifiers

Identifier for <book> in Version 1:

Node    IDENTIFIER
book    (../author/name/text(), title/text())
Identifiers

Values of Identifiers for <book> in Version 1:

book (top):     n1, t1
book (middle):  n2, t2
book (bottom):  n2, t1
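
The three values above can be reproduced with a small script. This is a sketch under assumptions: it uses Python's `xml.etree.ElementTree`, which has no parent axis from a context element and no `text()` step, so the two identifier expressions are translated by hand. In particular, the first expression is read here as "the name of the author that encloses the book", which is an interpretation of the slide, and `book_identifier` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

V1 = (
    "<bib><author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author></bib>"
)

root = ET.fromstring(V1)
# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def book_identifier(book):
    """Hand translation of the identifier: (enclosing author's name, book title)."""
    author = parent_of[book]
    return (author.find("name").text, book.find("title").text)

values = [book_identifier(b) for b in root.iter("book")]
print(values)
```

The first component is non-local (it is found outside the book subtree) and the second is local, which is why the third book is still distinguishable from the first even though the two subtrees are identical.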
Identifiers
Values of Identifiers for <book> in Version 2
<bib>
<pub> p1
<book>
<title>t1</title>
<author>
<name>n1</name>
</author>
</book>
</pub>
<pub> p2
<book>
<title>t2</title>
<author>
<name>n2</name>
</author>
</book></pub>
</bib>

book (top):     p1, t1
book (bottom):  p2, t2

Identifiers

Values of Identifiers for <book> in both versions:

Version 1                      Version 2
Node           IDENTIFIER      Node             IDENTIFIER
book (top)     n1, t1          book 1 (top)     p1, t1
book (middle)  n2, t2          book 2 (bottom)  p2, t2
book (bottom)  n2, t1

How to map both?
Matching

• Consider nodes p and q that reside in different versions Vp and Vq.
• Admits: q admits p if and only if q is in the node territory of p.
• Nodes p and q are matched if and only if p and q admit each other.

[Figure: node q in Vq and node p in Vp, each shown with the values q1, q2, ..., qn]
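
The matching rule can be tried on the two `<bib>` versions. The sketch below rests on one loudly stated assumption about what "q is in the node territory of p" means: here it is read as "every value of q's identifier occurs among the text values in p's node territory". The slides do not spell this out, so treat it as one possible reading. The identifier values are copied from the tables in this talk, the code uses Python's `xml.dom.minidom`, and it relies on the fact that in both example versions the lca of all book nodes is the root.

```python
from xml.dom import minidom

V1 = (
    "<bib><author><name>n1</name>"
    "<book><title>t1</title><publisher>p1</publisher></book></author>"
    "<author><name>n2</name>"
    "<book><title>t2</title><publisher>p2</publisher></book>"
    "<book><title>t1</title><publisher>p1</publisher></book></author></bib>"
)
V2 = (
    "<bib><pub>p1<book><title>t1</title><author><name>n1</name></author></book></pub>"
    "<pub>p2<book><title>t2</title><author><name>n2</name></author></book></pub></bib>"
)
# Identifier values for the book nodes, in document order, as listed in the slides.
IDS_V1 = [("n1", "t1"), ("n2", "t2"), ("n2", "t1")]
IDS_V2 = [("p1", "t1"), ("p2", "t2")]

def text_nodes(node):
    """Non-blank text nodes at or below node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return [node] if node.data.strip() else []
    found = []
    for child in node.childNodes:
        found.extend(text_nodes(child))
    return found

def book_territories(xml):
    """Node territory of each book, as a set of text values.

    Assumes the lca of all book nodes is the root element, which holds
    for both example versions.
    """
    doc = minidom.parseString(xml)
    books = doc.getElementsByTagName("book")
    everything = text_nodes(doc.documentElement)
    result = []
    for book in books:
        foreign = {id(t) for other in books if other is not book
                   for t in text_nodes(other)}
        result.append({t.data.strip() for t in everything if id(t) not in foreign})
    return result

def admits(identifier_values, territory):
    """Assumed reading: all identifier values are found in the other node's territory."""
    return set(identifier_values) <= territory

def match(ids_p, terr_p, ids_q, terr_q):
    """Pairs (i, j) of nodes that admit each other."""
    return [(i, j)
            for i in range(len(ids_p))
            for j in range(len(ids_q))
            if admits(ids_q[j], terr_p[i]) and admits(ids_p[i], terr_q[j])]

matches = match(IDS_V1, book_territories(V1), IDS_V2, book_territories(V2))
print(matches)
```

Under this reading, the top and middle books of Version 1 pair up with the two books of Version 2, and the bottom book of Version 1 (n2, t1) has no partner, since Version 2 contains no book titled t1 near the name n2.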
Semantic Change Detection

Book matches:
[Figure: tree diagrams of Version 1 (bib, author, name, book, title, pub) and Version 2 (bib, pub, book, title, author, name) with leaf values n1, n2, t1, t2, p1, p2; successive builds mark the "admits" relation and then the resulting "Node match" between book nodes]

Author matches:
[Figure: the same Version 1 and Version 2 tree diagrams, shown for author nodes]
Conclusion
• Semantic change detection technique.
  • Find identifiers for each node in the XML document.
  • Associate nodes across versions.
• Information that identifies an element is conserved across changes.
• Time complexity is O(n*log(n)).
• We can match nodes even when structural changes are significant.

Schemaless Change detection in XML Documents using Semantic Identifiers

  • 1. 1 Change Detection in XML Documents using Semantic Identifiers BY KAILAASH BALACHANDRAN
  • 2. Outline  Motivation  Introduction  The Approach • • 2-step Algorithm •  Identifiers Axioms Semantic Change Detection • Finding Identifiers • Matching Nodes  Examples  Conclusion 2
  • 3. Motivation(1) 3 Fig.1. Version 1 Fig.2. Version 2 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <salesprice>$35</salesprice> <isbn>0385504209</isbn> </book> <book> <title>Angels & Demons</title> <publisher>Pocket Star</publisher> <price>$56</price> </book> </author>
  • 4. Motivation(1) 4 Fig.1. Version 1 Fig.2. Version 2 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <salesprice>$35</salesprice> <isbn>0385504209</isbn> </book> <book> <title>Angels & Demons</title> <publisher>Pocket Star</publisher> <price>$56</price> </book> </author>
  • 5. Motivation(2) Fig.1. Version 1 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> Fig.3. Version 3 5 <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> <price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> <price> $56</price> </book> </publisher>
  • 6. Motivation(2) 6 Fig.1. Version 1 Fig.3. Version 3 <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> <price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> <price> $56</price> </book> </publisher>
  • 7. Motivation(3) Disadvantages of Structural detection approach:  Difficult to associate elements in different versions.  Break down when the changes are significant.  Affects Incremental Evaluation.  High cost of change of data. 7
  • 8. Introduction What is Semantic Based Change Detection? A process of Identifying changes between successive versions of a document based on its semantics, rather than on the structure of the document. The Approach: 1. Find Semantic Identifier for each node in the XML model. 2. Compute these Identifiers to associate nodes across multiple versions. 8
  • 9. Identifiers 9  Type is list of labels from root to element separated by a ‘/’.  Identifier serves to distinguish elements of same type.  Two nodes x and y, are semantically the same if and only if their identifiers evaluate to the same result. Eval(x,L) = Eval(y,L) Node x Same Result Node y where, • x,y are the nodes, • List of Expressions L = { E1,E2…En}
  • 10. Identifiers 10 Local Identifier: An identifier is local if it evaluates to descendants of the context node, otherwise it is non-local. Version 1: Version 3: <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author><price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author><price> $56</price> </book> </publisher>
  • 11. Identifiers 11 Local Identifier: An identifier is local if it evaluates to descendants of the context node, otherwise it is non-local. Version 1: <name> is local <author> <name>Dan Brown</name> <book> <title>The Da Vinci Code</title> <publisher>Doubleday</publisher> <price> $35 </price> </book> <book> <title>Angels and Demons</title> <publisher>Pocket Star</publisher> <price> $56</price> </book> </author> Version 3: <name> is non-local <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author><price> $35</price> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author><price> $56</price> </book> </publisher>
  • 12. Identify nodes based on its Semantics 12 The Algorithm Phase 1:  Bottom up fashion.  Identifies all local identifiers.  Semantically different nodes are identified. Phase 2:  Runs recursively and identifies non-local identifiers.  All semantically distinct nodes are found. Any remaining node is a redundant copy of another node in the document.
  • 13. Identify nodes based on its Semantics(Phase 1) Axiom 1: Nodes that are structurally different are semantically different. <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> </book> </publisher> Semantically different. 13
  • 14. Identify nodes based on their semantics (Phase 1). Axiom 1: Nodes that are structurally different are semantically different. <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> </book> </publisher> Are they semantically the same?
  • 15. Identify nodes based on their semantics (Phase 2). Axiom 2: Nodes that are structurally identical are semantically identical if and only if their respective parents are semantically identical or if they are both root nodes. <publisher>Doubleday <book> <title>The Da Vinci Code</title> <author> <name>Dan Brown</name> </author> </book> </publisher> <publisher>Pocket Star <book> <title>Angels and Demons</title> <author> <name>Dan Brown</name> </author> </book> </publisher> No, because they’re in the context of two different books.
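The two axioms can be tried out on the Version 3 document. The following is only a minimal sketch, assuming Python's `xml.etree.ElementTree`; the `<catalog>` wrapper element and the helper names `signature` and `semantically_identical` are hypothetical and do not come from the slides. Structural identity is approximated by comparing tag, leading text, and child structure.

```python
import xml.etree.ElementTree as ET

# Version 3 from the slides, wrapped in a hypothetical <catalog> root so that
# the two <publisher> trees form one well-formed document.
VERSION_3 = """<catalog>
<publisher>Doubleday<book><title>The Da Vinci Code</title>
  <author><name>Dan Brown</name></author></book></publisher>
<publisher>Pocket Star<book><title>Angels and Demons</title>
  <author><name>Dan Brown</name></author></book></publisher>
</catalog>"""


def signature(node):
    """Structural fingerprint: tag, leading text, and child fingerprints in order."""
    text = (node.text or "").strip()
    return (node.tag, text, tuple(signature(child) for child in node))


def semantically_identical(a, b, parent_of):
    """Apply Axiom 1, then Axiom 2, as worded on the slides."""
    if signature(a) != signature(b):
        return False  # Axiom 1: structurally different => semantically different
    parent_a, parent_b = parent_of.get(a), parent_of.get(b)
    if parent_a is None or parent_b is None:
        return parent_a is None and parent_b is None  # Axiom 2: both are roots
    return semantically_identical(parent_a, parent_b, parent_of)  # Axiom 2


root = ET.fromstring(VERSION_3)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}
first_author, second_author = root.findall(".//author")

structurally_same = signature(first_author) == signature(second_author)
semantically_same = semantically_identical(first_author, second_author, parent_of)
print(structurally_same, semantically_same)  # True False
```

The two `<author>` nodes have the same fingerprint, but their parent `<book>` nodes differ in title, so the recursion reports them as semantically different, which agrees with the answer on slide 15.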
  • 16. Semantic Change Detection. How to handle structural changes? [Figure: Version 1 and Version 2 trees built from the nodes A, X, Y, Z in different arrangements.] Assumption: Identifying information will remain nearby.
  • 17. Semantic Change Detection. Type Territory: The territory of a type T is the set of all text nodes that are descendants of the least common ancestor (lca) of all of the type T nodes. Within the type territory is the territory controlled by individual nodes of that type. Node Territory: The territory of a type T node p is the type territory of T excluding all text nodes that are descendants of other type T nodes.
  • 18. Node and Type Territory. [Figure: a document root above lca(p); the type territory of p spans the subtree under lca(p), which contains the nodes p1, p2, and p3; the node territories of p1 and p2 are marked inside it.]
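The territory definitions on slide 17 can be sketched as follows. This is a minimal sketch under stated assumptions: it uses Python's `xml.etree.ElementTree`, which has no separate text-node objects, so each text node is represented by the element that carries non-blank text; tail text is ignored. The helper names are hypothetical, and the sample document is Version 1 of the `<bib>` example used on the later slides.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """<bib>
<author><name>n1</name>
  <book><title>t1</title><publisher>p1</publisher></book></author>
<author><name>n2</name>
  <book><title>t2</title><publisher>p2</publisher></book>
  <book><title>t1</title><publisher>p1</publisher></book></author>
</bib>"""


def parent_map(root):
    """Child -> parent map (ElementTree keeps no parent pointers)."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors(node, parent_of):
    """Path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


def lca(nodes, parent_of):
    """Least common ancestor: the deepest node on every node-to-root path."""
    paths = [ancestors(n, parent_of) for n in nodes]
    for candidate in paths[0]:
        if all(any(candidate is p for p in path) for path in paths[1:]):
            return candidate
    return None


def text_owners(node):
    """Elements at or below node with non-blank text (stand-ins for text nodes)."""
    return [e for e in node.iter() if (e.text or "").strip()]


def type_territory(root, tag, parent_of):
    """All text under the lca of every node of the given type."""
    return text_owners(lca(list(root.iter(tag)), parent_of))


def node_territory(root, node, parent_of):
    """Type territory minus the text below other nodes of the same type."""
    others = [n for n in root.iter(node.tag) if n is not node]
    excluded = {e for other in others for e in text_owners(other)}
    return [e for e in type_territory(root, node.tag, parent_of) if e not in excluded]


root = ET.fromstring(VERSION_1)
parent_of = parent_map(root)
books = list(root.iter("book"))
top_book_values = sorted(e.text.strip() for e in node_territory(root, books[0], parent_of))
print(top_book_values)  # ['n1', 'n2', 'p1', 't1']
```

In this document the lca of the three `<book>` nodes is `<bib>`, so the type territory covers every text value, while the node territory of the first book drops the titles and publishers that sit under the other two books.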
  • 19. Finding Identifiers. Version 1: <bib> <author><name>n1</name> <book> <title>t1</title> <publisher>p1</publisher> </book> </author> <author><name>n2</name> <book> <title>t2</title> <publisher>p2</publisher> </book> <book> <title>t1</title> <publisher>p1</publisher> </book></author> </bib> Version 2: <bib> <pub> p1 <book> <title>t1</title> <author> <name>n1</name> </author> </book> </pub> <pub> p2 <book> <title>t2</title> <author> <name>n2</name> </author> </book> </pub> </bib>
  • 21. Identifiers. Values of Identifiers for <book> in Version 1: <bib> <author><name>n1</name> <book> <title>t1</title> <publisher>p1</publisher> </book> </author> <author><name>n2</name> <book> <title>t2</title> <publisher>p2</publisher> </book> <book> <title>t1</title> <publisher>p1</publisher> </book></author> </bib> Value of Identifier for the top book = n1, t1. Value of Identifier for the middle book = n2, t2. Value of Identifier for the bottom book = n2, t1.
  • 22. Identifiers. Values of Identifiers for <book> in Version 2: <bib> <pub> p1 <book> <title>t1</title> <author> <name>n1</name> </author> </book> </pub> <pub> p2 <book> <title>t2</title> <author> <name>n2</name> </author> </book></pub> </bib>
  • 23. Identifiers. Values of Identifiers for <book> in Version 2: <bib> <pub> p1 <book> <title>t1</title> <author> <name>n1</name> </author> </book> </pub> <pub> p2 <book> <title>t2</title> <author> <name>n2</name> </author> </book></pub> </bib> Value of Identifier for the top book = p1, t1. Value of Identifier for the bottom book = p2, t2.
  • 24. Identifiers. Values of Identifiers for <book> in both versions. Version 1: book (top) = n1, t1; book (middle) = n2, t2; book (bottom) = n2, t1. Version 2: book 1 (top) = p1, t1; book 2 (bottom) = p2, t2. How can the two be mapped?
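The identifier values listed on slides 21 to 24 can be reproduced with a short script. This is a minimal sketch, assuming Python's `xml.etree.ElementTree` and assuming, as the slides' values suggest, that a `<book>` is identified by a value taken from its enclosing element plus its own title. The function `book_identifiers` and its `parent_value` callback are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """<bib>
<author><name>n1</name>
  <book><title>t1</title><publisher>p1</publisher></book></author>
<author><name>n2</name>
  <book><title>t2</title><publisher>p2</publisher></book>
  <book><title>t1</title><publisher>p1</publisher></book></author>
</bib>"""

VERSION_2 = """<bib>
<pub>p1<book><title>t1</title><author><name>n1</name></author></book></pub>
<pub>p2<book><title>t2</title><author><name>n2</name></author></book></pub>
</bib>"""


def book_identifiers(xml_text, parent_value):
    """Return (enclosing value, title) for every <book>, in document order.

    parent_value is a hypothetical callback that extracts the identifying
    value from the element that encloses the book.
    """
    root = ET.fromstring(xml_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        (parent_value(parent_of[book]), book.findtext("title").strip())
        for book in root.iter("book")
    ]


# Version 1: a book sits under an <author>, whose <name> identifies it.
v1_ids = book_identifiers(VERSION_1, lambda author: author.findtext("name").strip())
# Version 2: a book sits under a <pub>, whose leading text identifies it.
v2_ids = book_identifiers(VERSION_2, lambda pub: pub.text.strip())

print(v1_ids)  # [('n1', 't1'), ('n2', 't2'), ('n2', 't1')]
print(v2_ids)  # [('p1', 't1'), ('p2', 't2')]
```

Only the title component is shared between the two versions, so comparing the tuples for equality cannot pair the books; this is the mapping question the slide raises.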
  • 25. Matching. Consider nodes p and q that reside in different versions Vp and Vq. Admits: q admits p if and only if q is in the node territory of p. Nodes p and q are matched if and only if p and q admit each other. [Figure: node p in Vp and node q in Vq, each shown with the values q1, q2, ..., qn.]
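The mutual-admission test can be sketched on the `<bib>` example. The slide states the rule tersely, so this sketch assumes one plausible reading: q admits p when every identifier value of q occurs among the text values in the node territory of p. That reading, the identifier choice (enclosing value plus title), and all helper names are assumptions for illustration, not the authors' implementation.

```python
import xml.etree.ElementTree as ET

VERSION_1 = """<bib>
<author><name>n1</name>
  <book><title>t1</title><publisher>p1</publisher></book></author>
<author><name>n2</name>
  <book><title>t2</title><publisher>p2</publisher></book>
  <book><title>t1</title><publisher>p1</publisher></book></author>
</bib>"""

VERSION_2 = """<bib>
<pub>p1<book><title>t1</title><author><name>n1</name></author></book></pub>
<pub>p2<book><title>t2</title><author><name>n2</name></author></book></pub>
</bib>"""


def load(xml_text):
    """Parse a version and build its child -> parent map."""
    root = ET.fromstring(xml_text)
    return root, {child: parent for parent in root.iter() for child in parent}


def lca(nodes, parent_of):
    """Deepest node that lies on every node-to-root path."""
    def path(n):
        out = [n]
        while out[-1] in parent_of:
            out.append(parent_of[out[-1]])
        return out
    paths = [path(n) for n in nodes]
    for candidate in paths[0]:
        if all(any(candidate is p for p in other) for other in paths[1:]):
            return candidate
    return None


def node_territory_values(root, node, parent_of):
    """Text values in the type territory, minus text under other same-type nodes."""
    same_type = list(root.iter(node.tag))
    owners = [e for e in lca(same_type, parent_of).iter() if (e.text or "").strip()]
    excluded = {e for other in same_type if other is not node for e in other.iter()}
    return {e.text.strip() for e in owners if e not in excluded}


def book_ids(root, parent_of, parent_value):
    """Assumed identifier of a book: (value of enclosing element, title)."""
    return {book: (parent_value(parent_of[book]), book.findtext("title").strip())
            for book in root.iter("book")}


root1, parents1 = load(VERSION_1)
root2, parents2 = load(VERSION_2)
ids1 = book_ids(root1, parents1, lambda a: a.findtext("name").strip())
ids2 = book_ids(root2, parents2, lambda p: p.text.strip())

matches = []
for p, p_id in ids1.items():
    p_territory = node_territory_values(root1, p, parents1)
    for q, q_id in ids2.items():
        q_territory = node_territory_values(root2, q, parents2)
        q_admits_p = set(q_id) <= p_territory   # assumed reading of "admits"
        p_admits_q = set(p_id) <= q_territory
        if q_admits_p and p_admits_q:
            matches.append((p_id, q_id))

print(matches)  # [(('n1', 't1'), ('p1', 't1')), (('n2', 't2'), ('p2', 't2'))]
```

Under this reading the top and middle books of Version 1 pair with the two books of Version 2, and the bottom book (n2, t1) stays unmatched because n2 does not occur in the territory of either Version 2 book that carries t1.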
  • 26. Semantic Change Detection. Book matches. [Figure: tree views of Version 1 (bib, author, name, book, title, pub) and Version 2 (bib, pub, book, title, author, name) with leaf values n1, n2, t1, t2, p1, p2.]
  • 27. Semantic Change Detection. Book matches. [Figure: the same Version 1 and Version 2 trees, with an "admits" relationship marked between book nodes.]
  • 28. Semantic Change Detection. Book matches. [Figure: the same Version 1 and Version 2 trees, with a "Node match" marked between book nodes.]
  • 30. Semantic Change Detection. Author matches. [Figure: the same Version 1 and Version 2 trees, shown for matching author nodes.]
  • 31. Conclusion. Semantic change detection technique: find identifiers for each node in the XML document, then associate nodes across versions. Information that identifies an element is conserved across changes. Time complexity is O(n log n). Nodes can be matched even when structural changes are significant.