XML Schema Computations: Schema Compatibility Testing and Subschema Extraction


In this paper, we propose new models and algorithms to perform practical computations on W3C XML Schemas, namely schema minimization, schema equivalence testing, subschema testing and subschema extraction. We have conducted experiments on an e-commerce standard XSD called xCBL to demonstrate the effectiveness of our algorithms. One experiment has refuted the claim that the xCBL 3.5 XSD is compatible with the xCBL 3.0 XSD. Another experiment has shown that the xCBL XSDs can be effectively trimmed into small subschemas for specific applications, which has significantly reduced schema processing time.


  1. XML Schema Computations: Schema Compatibility Testing and Subschema Extraction. Thomas Y.T. LEE and David W.L. Cheung, Department of Computer Science, The University of Hong Kong. October 28, 2010, CIKM 2010, Toronto, Canada.
  2. Outline
       - Introduction and motivation
       - Formal models for XML data and schemas
       - Schema computational algorithms
       - Experiments and conclusions
  4. Data interoperability on web services
     In order for two web services to be interoperable, the XML schema on the message receiving end must accept all possible XML messages from the sending end. The sending schema must be a subschema of the receiving schema.
     [Figure: Web Service A (Schema A) sends XML instances to Web Service B (Schema B); the set of XML instances of Schema A is contained in the set of XML instances of Schema B.]
  5. W3C XML Schema and data standards
       1. W3C XML Schema (XSD) is the most popular schema language for defining data standards.
       2. For a new version of an XSD to be backward-compatible with the old version, the new version must be a superschema of the old version: the new schema must accept every instance of the old schema.
       3. However, a typical e-commerce standard XSD contains thousands of types and elements, which makes manual verification of compatibility impractical.
       4. When an XSD is too large, how can we extract a smaller subschema that is just enough for processing by a specific application?
  6. Schema compatibility problems
       1. Given two XSDs, how can we verify that they are equivalent, or that one is a subschema of the other?
       2. Given an XSD A, how can we extract a smaller subschema B of A such that B recognizes only a subset of the elements recognized by A?
       3. In this research, we have developed formal models for XML data and schemas, as well as algorithms to solve these problems.
  8. Data Tree (DT) to model XML data
     A DT is a tree where edges represent elements and nodes represent their contents.
     Example XML document:
       <Quote>
         <Line> <Desc>hPhone</Desc> <Price>499.9</Price> </Line>
         <Line> <Desc>iMat</Desc> <Price>999.9</Price> </Line>
       </Quote>
     [Figure: the corresponding DT. The root n0:ε has a <Quote> edge to n1:ε; n1 has two <Line> edges to n2:ε and n3:ε; n2 has <Desc> and <Price> edges to n4:"hPhone" and n5:"499.9"; n3 has <Desc> and <Price> edges to n6:"iMat" and n7:"999.9".]
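A minimal Python sketch of how such a DT could be built from an XML document. The function name `build_data_tree` and the `(values, edges)` representation are illustrative assumptions, not the authors' implementation. Nodes are numbered breadth-first so that the ids line up with n0 to n7 in the example, and the empty string stands in for ε.

```python
from collections import deque
import xml.etree.ElementTree as ET

def build_data_tree(xml_text):
    """Return (values, edges) for the DT of an XML document.

    values maps a node id to its content ("" stands for the empty value ε);
    edges is a list of (parent id, element name, child id) triples.
    """
    root = ET.fromstring(xml_text)
    values = {0: ""}                 # n0 is the DT root, above the document element
    edges = []
    queue = deque([(0, root)])       # breadth-first: (parent node id, XML element)
    next_id = 1
    while queue:
        parent, elem = queue.popleft()
        node = next_id
        next_id += 1
        # Assumption: only leaf elements carry a value; mixed content is ignored.
        values[node] = "" if len(elem) else (elem.text or "").strip()
        edges.append((parent, elem.tag, node))
        for child in elem:
            queue.append((node, child))
    return values, edges

QUOTE = """
<Quote>
  <Line><Desc>hPhone</Desc><Price>499.9</Price></Line>
  <Line><Desc>iMat</Desc><Price>999.9</Price></Line>
</Quote>
"""
values, edges = build_data_tree(QUOTE)
```

Each element becomes one edge and one node, so the two `<Line>` elements yield two distinct nodes even though their edge labels are the same.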
  9. Schema Automaton (SA) to model XML schemas
       1. An SA is a deterministic finite automaton (DFA) where each state is associated with a regular expression (RE) and a set of values called the value domain (VDom).
       2. The DFA, called the vertical language (VLang), defines how the symbols are arranged along the paths from the root to the leaves.
          2.1 Each state represents an XSD data type and each symbol represents an element name.
       3. The RE of a state, called the horizontal language (HLang), defines how child elements can be arranged under an XSD data type, i.e., its content model.
       4. The value domain defines the set of all possible values an element can contain.
  10. Example SA
     [Figure: VLang DFA over states q0 to q8, with transitions labeled <Quote>, <Order>, <Line>, <Desc>, <Price>, <Product> and <Qty>.]
       q    HLang(q)          VDom(q)
       q0   <Quote>|<Order>   { }
       q1   <Line>+           { }
       q2   <Line>+           { }
       q3   <Desc><Price>     { }
       q4   <Product><Qty>    { }
       q5   { }               STRINGS
       q6   { }               DECIMALS
       q7   <Desc><Price>     { }
       q8   { }               INTEGERS
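To make the three ingredients of an SA concrete, here is a hedged Python sketch that validates a document against the example SA. Everything about the encoding is an assumption made for illustration: HLangs are written as Python regexes over child names (each name followed by one space), VDoms are simple predicates standing in for STRINGS, DECIMALS and INTEGERS, and the transition table `DELTA` is one reading of the flattened diagram consistent with the HLang table, not a verified copy of it.

```python
import re
import xml.etree.ElementTree as ET

# VDom predicates (hypothetical stand-ins for STRINGS, DECIMALS, INTEGERS).
def is_string(text):
    return True

def is_decimal(text):
    return re.fullmatch(r"[+-]?(\d+(\.\d*)?|\.\d+)", text) is not None

def is_integer(text):
    return re.fullmatch(r"[+-]?\d+", text) is not None

# HLang per state: a regex over the sequence of child names.
HLANG = {
    "q0": r"(Quote |Order )", "q1": r"(Line )+", "q2": r"(Line )+",
    "q3": r"Desc Price ", "q4": r"Product Qty ", "q7": r"Desc Price ",
    "q5": r"", "q6": r"", "q8": r"",
}
# VDom per state; states without an entry carry no value.
VDOM = {"q5": is_string, "q6": is_decimal, "q8": is_integer}
# VLang transitions (assumed reading of the diagram).
DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}

def accepts(children, text, state):
    """Check the child sequence (HLang), the value (VDom), then recurse (VLang)."""
    word = "".join(child.tag + " " for child in children)
    if re.fullmatch(HLANG[state], word) is None:
        return False
    check = VDOM.get(state)
    if check is not None and not check((text or "").strip()):
        return False
    return all(accepts(list(c), c.text, DELTA[(state, c.tag)]) for c in children)

def validate(xml_text):
    # q0 acts as a virtual root whose only child is the document element.
    return accepts([ET.fromstring(xml_text)], None, "q0")

GOOD_ORDER = ("<Order><Line><Product><Desc>iMat</Desc><Price>999.9</Price>"
              "</Product><Qty>2</Qty></Line></Order>")
BAD_ORDER = GOOD_ORDER.replace("<Qty>2</Qty>", "<Qty>two</Qty>")
```

Note how `<Line>` leads to q3 under `<Quote>` but to q4 under `<Order>`: the state reached depends on the path from the root, which is what the vertical language captures.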
  12. Schema compatibility testing
       1. Compatibility testing covers schema equivalence testing and subschema testing.
       2. Both involve a schema minimization step.
          2.1 All useless states (data types) are removed first. A useless state is an inaccessible state or a state which does not recognize any element with a finite number of descendants.
          2.2 The process is like a DFA minimization, but the HLang and VDom of each state are considered when deciding whether two states can be merged.
       3. We have proved that two SAs (XSDs) are equivalent iff their minimized forms have isomorphic VLang DFAs and all corresponding HLangs and VDoms are equivalent.
       4. We have developed an algorithm to verify whether an SA is a subschema of another SA.
  13. Useless states
     [Figure: VLang DFA over states q0 to q9, with transitions labeled A, B and C.]
       q    HLang(q)    VDom(q)
       q0   A{2,5}BC?   STRINGS
       q1   C*          STRINGS
       q2   { }         INTEGERS
       q3   A*          STRINGS
       q4   B+          STRINGS
       q5   C           STRINGS
       q6   A+B*        INTEGERS
       q7   A?          STRINGS
       q8   B*          STRINGS
       q9   { }         DECIMALS
       1. q7 and q8 are inaccessible.
       2. q5 and q6 are irrational because they generate infinite children.
       3. q9 is useless because it is blocked by irrational states.
       4. q4 is useless because it must lead to an irrational state.
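The four kinds of useless state can be found with two fixpoint passes. The Python sketch below is one possible formulation, not the authors' algorithm: HLangs are encoded as small tuple ASTs (an assumption), a state is called rational when its HLang has some word whose symbols all lead to rational states, and a state is useful when it is reachable from the start state through symbols that can actually occur. The automaton at the bottom is a small hypothetical one, since the slide's transitions are not recoverable from the flattened diagram.

```python
def satisfiable(expr, ok):
    """True if expr has at least one word using only symbols in ok."""
    kind = expr[0]
    if kind in ("eps", "star", "opt"):
        return True                      # the empty word is always available
    if kind == "sym":
        return expr[1] in ok
    if kind == "seq":
        return all(satisfiable(e, ok) for e in expr[1:])
    if kind == "alt":
        return any(satisfiable(e, ok) for e in expr[1:])
    if kind == "plus":
        return satisfiable(expr[1], ok)
    raise ValueError(kind)

def usable(expr, ok):
    """Symbols that occur in at least one word of expr over the alphabet ok."""
    if not satisfiable(expr, ok):
        return set()
    if expr[0] == "eps":
        return set()
    if expr[0] == "sym":
        return {expr[1]}
    return set().union(*(usable(e, ok) for e in expr[1:]))

def useful_states(start, hlang, delta):
    # Pass 1 (least fixpoint): states that can have finitely many descendants.
    rational, changed = set(), True
    while changed:
        changed = False
        for q, expr in hlang.items():
            ok = {a for a, t in delta.get(q, {}).items() if t in rational}
            if q not in rational and satisfiable(expr, ok):
                rational.add(q)
                changed = True
    if start not in rational:
        return set()
    # Pass 2: reachability through symbols that can really appear.
    useful, todo = {start}, [start]
    while todo:
        q = todo.pop()
        ok = {a for a, t in delta.get(q, {}).items() if t in rational}
        for a in usable(hlang[q], ok):
            if delta[q][a] not in useful:
                useful.add(delta[q][a])
                todo.append(delta[q][a])
    return useful

# Hypothetical example: q2 needs a C child of its own type (irrational),
# q3 is only reachable through q2 (blocked), q4 has no incoming edge.
HLANG = {
    "q0": ("seq", ("sym", "A"), ("opt", ("sym", "B"))),
    "q1": ("eps",),
    "q2": ("seq", ("sym", "C"), ("sym", "D")),
    "q3": ("eps",),
    "q4": ("eps",),
}
DELTA = {"q0": {"A": "q1", "B": "q2"}, "q2": {"C": "q2", "D": "q3"}}
USEFUL = useful_states("q0", HLANG, DELTA)
```

Building the rational set upward from the leaves is what rules out a state such as q2, which would otherwise appear to justify itself through its own cycle.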
  14. Schema minimization and equivalence
     Schema A
     [Figure: VLang DFA of Schema A over states q0 to q8.]
       q    HLang(q)        VDom(q)
       q0   Quote | Order   { }
       q1   Line +          { }
       q2   Line +          { }
       q3   Desc Price      { }
       q4   Product Qty     { }
       q5   { }             STRS
       q6   { }             DECS
       q7   Desc Price      { }
       q8   { }             INTS
       1. q3 and q7 can be merged into q9.
       2. Two SAs are equivalent.
     Schema B
     [Figure: VLang DFA of Schema B over states q0, q1, q2, q4, q5, q6, q8 and q9.]
       q    HLang(q)        VDom(q)
       q0   Quote | Order   { }
       q1   Line +          { }
       q2   Line +          { }
       q9   Desc Price      { }
       q4   Product Qty     { }
       q5   { }             STRS
       q6   { }             DECS
       q8   { }             INTS
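The merge of q3 and q7 can be reproduced with a Moore-style partition refinement, sketched below in Python. This is a simplification under a stated assumption: two HLangs (or VDoms) are treated as equivalent only when they are written identically, whereas a full implementation would need a regular-expression equivalence test. The function name `merge_classes` and the transition table for Schema A are illustrative; the transitions are one reading of the diagram consistent with the table.

```python
def merge_classes(states, hlang, vdom, delta):
    """Map each state to a class id; states with equal ids can be merged."""
    def renumber(signature):
        ids = {}
        return {q: ids.setdefault(signature[q], len(ids)) for q in states}

    # Start by grouping states with the same HLang and VDom.
    block = renumber({q: (hlang[q], vdom[q]) for q in states})
    while True:
        # Split groups whose members disagree on where a symbol leads.
        signature = {
            q: (block[q],
                tuple(sorted((a, block[t]) for (s, a), t in delta.items() if s == q)))
            for q in states
        }
        refined = renumber(signature)
        if len(set(refined.values())) == len(set(block.values())):
            return refined               # no group was split: fixpoint reached
        block = refined

STATES = ["q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]
HLANG = {"q0": "Quote|Order", "q1": "Line+", "q2": "Line+", "q3": "Desc Price",
         "q4": "Product Qty", "q5": "", "q6": "", "q7": "Desc Price", "q8": ""}
VDOM = {"q0": None, "q1": None, "q2": None, "q3": None, "q4": None,
        "q5": "STRS", "q6": "DECS", "q7": None, "q8": "INTS"}
DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}
CLASSES = merge_classes(STATES, HLANG, VDOM, DELTA)
```

q1 and q2 share the HLang `Line+` but stay apart, because `<Line>` leads to different classes from each of them.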
  15. Subschema testing
     Schema A
     [Figure: VLang DFA of Schema A over states q0, q1, q2, q4, q5, q6, q8 and q9.]
       q    HLang(q)        VDom(q)
       q0   Quote | Order   { }
       q1   Line +          { }
       q2   Line +          { }
       q9   Desc Price      { }
       q4   Product Qty     { }
       q5   { }             STRS
       q6   { }             DECS
       q8   { }             INTS
     Schema B
     [Figure: VLang DFA of Schema B over states q0, q1, q9, q5 and q6.]
       q    HLang(q)        VDom(q)
       q0   Quote           { }
       q1   Line +          { }
       q9   Desc Price      { }
       q5   { }             STRS
       q6   { }             INTS
     B is a subschema of A.
       1. HLang(q0B) ⊆ HLang(q0A) and VDom(q0B) = VDom(q0A).
       2. HLang(q6B) = HLang(q6A) and VDom(q6B) ⊆ VDom(q6A).
       3. HLang(qiB) = HLang(qiA) and VDom(qiB) = VDom(qiA), for i = 1, 5, 9.
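The state-by-state comparison above can be sketched in Python as a joint traversal of the two automata. This is an illustration under several assumptions, not the authors' algorithm: each HLang is given directly as a small DFA over child names, VDoms are names ordered by a hypothetical inclusion chain (INTS within DECS within STRS), and both SAs are assumed to be free of useless states. The helper names (`word`, `choice`, `plus`, `hlang_subset`, `is_subschema`) are made up for this sketch.

```python
from collections import deque

# An HLang is modelled as a DFA: (start state, accepting states, transitions).
def word(*names):
    return (0, {len(names)}, {(i, n): i + 1 for i, n in enumerate(names)})

def choice(*names):
    return (0, {1}, {(0, n): 1 for n in names})

def plus(name):
    return (0, {1}, {(0, name): 1, (1, name): 1})

def hlang_subset(b, a):
    """True if every word accepted by DFA b is accepted by DFA a."""
    (b0, b_acc, b_delta), (a0, a_acc, a_delta) = b, a
    seen = {(b0, a0)}
    todo = deque(seen)
    while todo:
        sb, sa = todo.popleft()
        if sb in b_acc and sa not in a_acc:
            return False                     # b accepts a word that a rejects
        for (src, sym), dst in b_delta.items():
            if src == sb:
                pair = (dst, a_delta.get((sa, sym)))   # None = dead state of a
                if pair not in seen:
                    seen.add(pair)
                    todo.append(pair)
    return True

RANK = {"EMPTY": 0, "INTS": 1, "DECS": 2, "STRS": 3}   # assumed inclusion chain

def is_subschema(b, a):
    seen = {(b["start"], a["start"])}
    todo = deque(seen)
    while todo:
        qb, qa = todo.popleft()
        if not hlang_subset(b["hlang"][qb], a["hlang"][qa]):
            return False
        if RANK[b["vdom"][qb]] > RANK[a["vdom"][qa]]:
            return False
        for (src, sym), dst in b["delta"].items():
            if src == qb:
                target = a["delta"].get((qa, sym))
                if target is None:
                    return False
                if (dst, target) not in seen:
                    seen.add((dst, target))
                    todo.append((dst, target))
    return True

SCHEMA_A = {
    "start": "q0",
    "hlang": {"q0": choice("Quote", "Order"), "q1": plus("Line"), "q2": plus("Line"),
              "q9": word("Desc", "Price"), "q4": word("Product", "Qty"),
              "q5": word(), "q6": word(), "q8": word()},
    "vdom": {"q0": "EMPTY", "q1": "EMPTY", "q2": "EMPTY", "q9": "EMPTY",
             "q4": "EMPTY", "q5": "STRS", "q6": "DECS", "q8": "INTS"},
    "delta": {("q0", "Quote"): "q1", ("q0", "Order"): "q2", ("q1", "Line"): "q9",
              ("q2", "Line"): "q4", ("q9", "Desc"): "q5", ("q9", "Price"): "q6",
              ("q4", "Product"): "q9", ("q4", "Qty"): "q8"},
}
SCHEMA_B = {
    "start": "q0",
    "hlang": {"q0": choice("Quote"), "q1": plus("Line"),
              "q9": word("Desc", "Price"), "q5": word(), "q6": word()},
    "vdom": {"q0": "EMPTY", "q1": "EMPTY", "q9": "EMPTY", "q5": "STRS", "q6": "INTS"},
    "delta": {("q0", "Quote"): "q1", ("q1", "Line"): "q9",
              ("q9", "Desc"): "q5", ("q9", "Price"): "q6"},
}
```

The check is directional: B is tested against A by following only B's transitions, so the `<Order>` branch of A is never visited.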
  16. Subschema extraction
     We have developed the subschema extraction algorithm: given an SA (XSD) A and a set of symbols (element names) Z, compute an SA which accepts all instances (XML documents) of A except those containing any symbol not in Z.
     [Figure: VLang DFA over states q0 to q7, with transitions labeled <Quote>, <Order>, <Line>, <Desc>, <Price>, <Product> and <Qty>.]
       q    HLang(q)          VDom(q)
       q0   <Quote>|<Order>   { }
       q1   <Line>+           { }
       q7   <Line>+           { }
       q2   <Desc><Price>     { }
       q3   <Product><Qty>    { }
       q4   { }               STRINGS
       q5   { }               DECIMALS
       q6   { }               INTEGERS
     Z = {<Quote>, <Line>, <Desc>, <Price>, <Order>, <Qty>}, where <Product> is excluded.
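One way to picture the extraction, as a hedged Python sketch rather than the authors' algorithm: restrict every HLang to the symbols that are both in Z and lead to a state that can still be completed, then keep only the states reachable from the start. HLangs are encoded as tuple ASTs (an assumption), VDoms are left out because extraction does not change them in this sketch, and the transition table is a reading of the flattened diagram in which the target of `<Product>` is a guess.

```python
def restrict(expr, ok):
    """Restrict expr to words over ok; return None if no word remains."""
    kind = expr[0]
    if kind == "eps":
        return expr
    if kind == "sym":
        return expr if expr[1] in ok else None
    parts = [restrict(e, ok) for e in expr[1:]]
    if kind == "seq":
        return None if None in parts else ("seq", *parts)
    if kind == "alt":
        kept = [p for p in parts if p is not None]
        if not kept:
            return None
        return kept[0] if len(kept) == 1 else ("alt", *kept)
    if kind == "plus":
        return None if parts[0] is None else ("plus", parts[0])
    if kind in ("star", "opt"):
        return ("eps",) if parts[0] is None else (kind, parts[0])
    raise ValueError(kind)

def symbols(expr):
    if expr[0] == "sym":
        return {expr[1]}
    return set().union(*(symbols(e) for e in expr[1:] if isinstance(e, tuple)))

def extract(start, hlang, delta, keep):
    def allowed(q, alive):
        return {a for a, t in delta.get(q, {}).items() if a in keep and t in alive}

    # Least fixpoint: states that still accept something using only kept symbols.
    alive, changed = set(), True
    while changed:
        changed = False
        for q in hlang:
            if q not in alive and restrict(hlang[q], allowed(q, alive)) is not None:
                alive.add(q)
                changed = True
    # Rebuild the automaton from the start state with the restricted HLangs.
    new_hlang, new_delta = {}, {}
    todo = [start] if start in alive else []
    while todo:
        q = todo.pop()
        if q in new_hlang:
            continue
        new_hlang[q] = restrict(hlang[q], allowed(q, alive))
        new_delta[q] = {a: delta[q][a] for a in symbols(new_hlang[q])}
        todo.extend(new_delta[q].values())
    return new_hlang, new_delta

HLANG = {
    "q0": ("alt", ("sym", "Quote"), ("sym", "Order")),
    "q1": ("plus", ("sym", "Line")), "q7": ("plus", ("sym", "Line")),
    "q2": ("seq", ("sym", "Desc"), ("sym", "Price")),
    "q3": ("seq", ("sym", "Product"), ("sym", "Qty")),
    "q4": ("eps",), "q5": ("eps",), "q6": ("eps",),
}
DELTA = {
    "q0": {"Quote": "q1", "Order": "q7"}, "q1": {"Line": "q2"}, "q7": {"Line": "q3"},
    "q2": {"Desc": "q4", "Price": "q5"},
    "q3": {"Product": "q2", "Qty": "q6"},   # target of Product is an assumption
}
Z = {"Quote", "Line", "Desc", "Price", "Order", "Qty"}
SUB_HLANG, SUB_DELTA = extract("q0", HLANG, DELTA, Z)
```

Although `<Order>` and `<Qty>` are in Z, the whole `<Order>` branch disappears in this sketch: every order line requires a `<Product>` child, so no order instance can avoid the excluded symbol.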
  18. xCBL compatibility testing experiment
       1. Data sets: XML Common Business Library (xCBL)
            XSD        file size   no. of files   data types   element names   doc. types
            xCBL 3.0   1.8MB       413            1,290        3,728           42
            xCBL 3.5   2.0MB       496            1,476        4,473           51
       2. The subschema testing program has disproved the claim on xCBL.org: "The only modifications allowed to xCBL 3.0 documents were the additions of new optional elements and additions to code lists; to maintain interoperability between the two versions. An xCBL 3.0 instance of a document is also a valid instance in xCBL 3.5."
       3. xCBL 3.5 is not a superschema of xCBL 3.0.
       4. The experiment took only 272ms when the quick RE test was applied.
     Machine: Q6600@2.40GHz, 4GB RAM, Linux OS
  19. Schema size reduction by subschema extraction
       1. The subschema extraction program was run to extract different subschemas from xCBL. Each subschema recognizes a different element subset for a specific application, e.g., order, invoice, etc.
       2. The schema size was reduced to 6–32% of the original size.
       3. The time required by XMLBeans to compile a subschema was reduced to 34–50% of the time originally required.
       4. The time to extract such a subschema was only 2–3s.
     [Figure: Subschema extraction from xCBL 3.5. For the original schema and the invoice, order, quote, auction and catalog subschemas, the chart plots the number of element names, types and element declarations (left axis, 0 to 5000) and the XMLBeans compilation time in seconds (right axis, 0 to 35).]
  20. Conclusions
       1. We have developed formal models for XML and XSD, and algorithms for schema equivalence testing, subschema testing and subschema extraction.
       2. These algorithms are PSPACE-complete because of comparisons of regular expressions. We have developed a heuristic (quick RE test) to make these algorithms run fast on very large schemas.
       3. Our experiments have proved that xCBL 3.5 is in fact not backward-compatible with xCBL 3.0, and have extracted small subschemas from xCBL for different instance subsets, which largely reduces processing time on these subschemas.
       4. These models can be extended to other applications: a web service adaptor for legacy systems (text to XML transformation), and a schema inferrer from XML instances.