XML Schema Computations: Schema
    Compatibility Testing and Subschema Extraction

           Thomas Y.T. LEE and David W.L. Cheung

                  Department of Computer Science
                    The University of Hong Kong


                      October 28, 2010
                        CIKM 2010
                      Toronto, Canada




1
Outline



    Introduction and motivation


    Formal models for XML data and schemas


    Schema computational algorithms


    Experiments and conclusions




2
Data interoperability on web services
    For two web services to be interoperable, the XML schema on the
    message-receiving end must accept all possible XML messages from
    the sending end.
        The sending schema must be a subschema of the receiving
        schema.

    [Figure: Web Service A uses Schema A and Web Service B uses
    Schema B. A subset relation is drawn between the XML instances of
    Schema A and the XML instances of Schema B.]


4
W3C XML Schema and data standards

    1. W3C XML Schema (XSD) is the most popular schema
       language to define data standards.
    2. In order for the new version of an XSD to be
       backward-compatible with the old version, the new version
       must be a superschema of the old version.
           The new schema must accept every instance of the old
           schema.
    3. However, a typical e-commerce standard XSD contains
       thousands of types and elements, which makes manual
       verification of compatibility impractical.
    4. When an XSD is too large, how can we extract a smaller
       subschema just enough for processing by a specific
       application?



5
Schema compatibility problems



    1. Given two XSDs, how can we verify whether they are equivalent
       or whether one is a subschema of the other?
    2. Given an XSD A, how can we extract a smaller subschema B of A
       such that B recognizes only a subset of the elements
       recognized by A?
    3. In this research, we have developed formal models for XML
       data and schemas, as well as algorithms to solve these
       problems.




6
Data Tree (DT) to model XML data


    A DT is a tree where edges represent elements and nodes
    represent their contents.

    XML document:

    <Quote>
     <Line>
      <Desc>hPhone</Desc>
      <Price>499.9</Price>
     </Line>
     <Line>
      <Desc>iMat</Desc>
      <Price>999.9</Price>
     </Line>
    </Quote>

    Corresponding DT (node:content, edges labeled with element names):

    n0:ε
      <Quote> n1:ε
        <Line> n2:ε
          <Desc>  n4:"hPhone"
          <Price> n5:"499.9"
        <Line> n3:ε
          <Desc>  n6:"iMat"
          <Price> n7:"999.9"

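    The mapping from an XML document to a DT can be sketched in Python
    as follows. This is only an illustration, not the authors'
    implementation: Node, build and to_data_tree are hypothetical names,
    ε is represented by the empty string, and attributes and mixed
    content are ignored.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DT node: its content plus outgoing edges labeled with element names."""
    value: str = ""                              # "" stands for ε
    edges: list = field(default_factory=list)    # [(element name, child Node)]


def build(elem):
    """Turn an XML element into the DT node that holds its content."""
    # Assumption: only leaf elements carry a value; all other nodes get ε.
    node = Node(value=(elem.text or "").strip() if len(elem) == 0 else "")
    for child in elem:
        node.edges.append((child.tag, build(child)))
    return node


def to_data_tree(xml_text):
    """Return the root node n0, whose single edge is the document element."""
    top = ET.fromstring(xml_text)
    return Node(edges=[(top.tag, build(top))])


QUOTE_XML = """<Quote>
 <Line><Desc>hPhone</Desc><Price>499.9</Price></Line>
 <Line><Desc>iMat</Desc><Price>999.9</Price></Line>
</Quote>"""

tree = to_data_tree(QUOTE_XML)      # n0
n1 = tree.edges[0][1]               # node reached through the <Quote> edge
print(tree.edges[0][0], [name for name, _ in n1.edges])
```

    Element names are stored on the edges rather than on the nodes, so
    a node only carries the content of the element that leads to it.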



8
Schema Automaton (SA) to model XML schemas

    1. An SA is a deterministic finite automaton (DFA) where each
       state is associated with a regular expression (RE) and a set of
       values called the value domain (VDom).
    2. The DFA, called the vertical language (VLang), defines how
       the symbols are arranged along the paths from the root to the
       leaves.
       2.1 Each state represents an XSD data type and each symbol
           represents an element name.
    3. The RE of a state, called the horizontal language (HLang),
       defines how child elements can be arranged under an XSD
       data type, i.e., its content model.
    4. The value domain defines the set of all possible values an
       element can contain.



9
Example SA


     VLang (transitions):

          q0 --<Quote>-->   q1        q0 --<Order>--> q2
          q1 --<Line>-->    q3        q2 --<Line>-->  q4
          q3 --<Desc>-->    q5        q3 --<Price>--> q6
          q4 --<Product>--> q7        q4 --<Qty>-->   q8
          q7 --<Desc>-->    q5        q7 --<Price>--> q6

          q       HLang(q)          VDom(q)
          q0   <Quote>|<Order>        { }
          q1       <Line>+            { }
          q2       <Line>+            { }
          q3    <Desc><Price>         { }
          q4    <Product><Qty>        { }
          q5         { }            STRINGS
          q6         { }            DECIMALS
          q7    <Desc><Price>         { }
          q8         { }            INTEGERS
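     How such an SA accepts a data tree can be sketched in Python. This
     is a hedged illustration using the example SA above, not the
     authors' code. Assumptions: a tree is a (value, children) tuple,
     each HLang is encoded as a Python regex over child names followed
     by a space, "{ }" is read as "no value", and the value-domain
     checks are simplified predicates.

```python
import re

# VLang: state -> {element name -> next state}, taken from the example SA.
DELTA = {
    "q0": {"Quote": "q1", "Order": "q2"},
    "q1": {"Line": "q3"},
    "q2": {"Line": "q4"},
    "q3": {"Desc": "q5", "Price": "q6"},
    "q4": {"Product": "q7", "Qty": "q8"},
    "q7": {"Desc": "q5", "Price": "q6"},
}
# HLang: regex over the sequence of child names (each name plus one space).
HLANG = {
    "q0": r"(Quote |Order )", "q1": r"(Line )+", "q2": r"(Line )+",
    "q3": r"Desc Price ", "q4": r"Product Qty ", "q7": r"Desc Price ",
    "q5": r"", "q6": r"", "q8": r"",
}


def no_value(v):
    return v == ""


def any_string(v):
    return True


def decimal(v):
    return re.fullmatch(r"[+-]?\d+(\.\d+)?", v) is not None


def integer(v):
    return re.fullmatch(r"[+-]?\d+", v) is not None


# VDom: simplified membership tests for each state's value domain.
VDOM = {"q0": no_value, "q1": no_value, "q2": no_value, "q3": no_value,
        "q4": no_value, "q7": no_value,
        "q5": any_string, "q6": decimal, "q8": integer}


def accepts(tree, state="q0"):
    """Check value (VDom), child order (HLang), then each child (VLang)."""
    value, children = tree
    if not VDOM[state](value):
        return False
    word = "".join(name + " " for name, _ in children)
    if re.fullmatch(HLANG[state], word) is None:
        return False
    for name, subtree in children:
        nxt = DELTA.get(state, {}).get(name)
        if nxt is None or not accepts(subtree, nxt):
            return False
    return True


GOOD = ("", [("Quote", ("", [("Line", ("", [
    ("Desc", ("hPhone", [])), ("Price", ("499.9", []))]))]))])
BAD = ("", [("Quote", ("", [("Line", ("", [
    ("Product", ("", [])), ("Qty", ("2", []))]))]))])
print(accepts(GOOD), accepts(BAD))
```

     The second tree is rejected because <Product> and <Qty> are only
     allowed under a <Line> that was reached through <Order>; the same
     element name maps to different states depending on its path.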




10
Schema compatibility testing

     1. Two tests: schema equivalence testing and subschema testing.
     2. Schema minimization is involved.
        2.1 All useless states (data types) are removed first. A useless
            state is an inaccessible state or a state which does not
            recognize any element with a finite number of descendants.
        2.2 The process is like DFA minimization, but the HLang and
            VDom of each state are considered when deciding whether
            two states can be merged.
     3. We have proved that two SAs (XSDs) are equivalent iff their
        minimized forms have isomorphic VLang DFAs and all
        corresponding HLangs and VDoms are equivalent.
     4. We have developed an algorithm to verify whether an SA is a
        subschema of another SA.



12
Useless states

     [Figure: VLang DFA with states q0–q9 and transitions on the
     symbols A, B, and C.]

              q    HLang(q)      VDom(q)
             q0   A{2,5}BC?      STRINGS
             q1       C*         STRINGS
             q2       { }        INTEGERS
             q3       A*         STRINGS
             q4       B+         STRINGS
             q5       C          STRINGS
             q6      A+B*        INTEGERS
             q7       A?         STRINGS
             q8       B*         STRINGS
             q9       { }        DECIMALS


     1. q7 and q8 are inaccessible.
     2. q5 and q6 are irrational because any element they recognize
        would need infinitely many descendants.
     3. q9 is useless because it is blocked by irrational states.
     4. q4 is useless because it must lead to an irrational state.
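
     The removal of useless states can be sketched as a fixed-point
     computation. This is an assumption-laden illustration, not the
     authors' algorithm: the small SA below is hypothetical (it is not
     the one in the figure), each HLang is given as a tiny regex AST of
     nested tuples, and every outgoing symbol with a productive target
     is assumed to be usable in some valid child sequence.

```python
def nonempty(expr, ok):
    """True if the regex AST matches some word using only symbols in `ok`."""
    kind = expr[0]
    if kind == "eps" or kind in ("star", "opt"):
        return True                      # the empty word is always available
    if kind == "sym":
        return expr[1] in ok
    if kind == "seq":
        return all(nonempty(e, ok) for e in expr[1:])
    if kind == "alt":
        return any(nonempty(e, ok) for e in expr[1:])
    if kind == "plus":
        return nonempty(expr[1], ok)
    raise ValueError(kind)


def useful_states(start, delta, hlang):
    # Step 1: least fixed point of "productive" states, i.e. states that
    # recognize at least one element with finitely many descendants.
    productive = set()
    changed = True
    while changed:
        changed = False
        for q, expr in hlang.items():
            if q in productive:
                continue
            ok = {a for a, t in delta.get(q, {}).items() if t in productive}
            if nonempty(expr, ok):
                productive.add(q)
                changed = True
    # Step 2: keep only states reachable from the start via productive states.
    if start not in productive:
        return set()
    seen, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for t in delta.get(q, {}).values():
            if t in productive and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


# Hypothetical SA: q2 and q3 require each other forever, q4 is inaccessible.
DELTA = {"q0": {"A": "q1", "B": "q2"}, "q2": {"C": "q3"},
         "q3": {"C": "q2"}, "q4": {"A": "q1"}}
HLANG = {"q0": ("seq", ("sym", "A"), ("opt", ("sym", "B"))),
         "q1": ("eps",),
         "q2": ("plus", ("sym", "C")),
         "q3": ("sym", "C"),
         "q4": ("opt", ("sym", "A"))}
USEFUL = useful_states("q0", DELTA, HLANG)
print(sorted(USEFUL))
```

     Here q2 and q3 play the role of irrational states (each needs a
     child of the other type without end), and q4 is dropped because
     nothing reaches it from q0.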


13
Schema minimization and equivalence
Schema A

     VLang (transitions):

          q0 --<Quote>-->   q1        q0 --<Order>--> q2
          q1 --<Line>-->    q3        q2 --<Line>-->  q4
          q3 --<Desc>-->    q5        q3 --<Price>--> q6
          q4 --<Product>--> q7        q4 --<Qty>-->   q8
          q7 --<Desc>-->    q5        q7 --<Price>--> q6

          q     HLang(q)       VDom(q)
          q0   Quote | Order     { }
          q1      Line +         { }
          q2      Line +         { }
          q3    Desc Price       { }
          q4   Product Qty       { }
          q5       { }          STRS
          q6       { }          DECS
          q7    Desc Price       { }
          q8       { }          INTS

Schema B

     VLang (transitions):

          q0 --<Quote>-->   q1        q0 --<Order>--> q2
          q1 --<Line>-->    q9        q2 --<Line>-->  q4
          q4 --<Product>--> q9        q4 --<Qty>-->   q8
          q9 --<Desc>-->    q5        q9 --<Price>--> q6

          q     HLang(q)       VDom(q)
          q0   Quote | Order     { }
          q1      Line +         { }
          q2      Line +         { }
          q9    Desc Price       { }
          q4   Product Qty       { }
          q5       { }          STRS
          q6       { }          DECS
          q8       { }          INTS

          1. q3 and q7 can be merged into q9.
          2. The two SAs are equivalent.
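
     The merge of q3 and q7 can be reproduced with a partition
     refinement in the style of DFA minimization. The sketch below is
     not the authors' algorithm: it assumes useless states are already
     removed and, as a simplification, treats two HLangs (or VDoms) as
     equivalent only when their textual forms are identical. The
     function names are hypothetical.

```python
def renumber(keys):
    """Map each distinct key to a small integer block id."""
    ids = {}
    return {q: ids.setdefault(k, len(ids)) for q, k in keys.items()}


def minimize(delta, hlang, vdom):
    """Return the classes of states that can be merged."""
    # Initial partition: states agree on HLang and VDom.
    block = renumber({q: (hlang[q], vdom[q]) for q in hlang})
    while True:
        # Split blocks whose members disagree on where a symbol leads.
        sig = {q: (block[q],
                   tuple(sorted((a, block[t])
                                for a, t in delta.get(q, {}).items())))
               for q in hlang}
        refined = renumber(sig)
        if len(set(refined.values())) == len(set(block.values())):
            groups = {}
            for q in sorted(hlang):
                groups.setdefault(block[q], []).append(q)
            return sorted(groups.values())
        block = refined


# Schema A from the slide.
DELTA_A = {"q0": {"Quote": "q1", "Order": "q2"},
           "q1": {"Line": "q3"}, "q2": {"Line": "q4"},
           "q3": {"Desc": "q5", "Price": "q6"},
           "q4": {"Product": "q7", "Qty": "q8"},
           "q7": {"Desc": "q5", "Price": "q6"}}
HLANG_A = {"q0": "Quote | Order", "q1": "Line +", "q2": "Line +",
           "q3": "Desc Price", "q4": "Product Qty", "q5": "", "q6": "",
           "q7": "Desc Price", "q8": ""}
VDOM_A = {"q0": "", "q1": "", "q2": "", "q3": "", "q4": "", "q7": "",
          "q5": "STRS", "q6": "DECS", "q8": "INTS"}

CLASSES = minimize(DELTA_A, HLANG_A, VDOM_A)
print(CLASSES)
```

     q1 and q2 share the same HLang and VDom but stay separate, because
     their <Line> transitions lead to states that are not equivalent.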


  14
Subschema testing
Schema A

     VLang (transitions):

          q0 --<Quote>-->   q1        q0 --<Order>--> q2
          q1 --<Line>-->    q9        q2 --<Line>-->  q4
          q4 --<Product>--> q9        q4 --<Qty>-->   q8
          q9 --<Desc>-->    q5        q9 --<Price>--> q6

          q     HLang(q)       VDom(q)
          q0   Quote | Order     { }
          q1      Line +         { }
          q2      Line +         { }
          q9    Desc Price       { }
          q4   Product Qty       { }
          q5       { }          STRS
          q6       { }          DECS
          q8       { }          INTS

Schema B

     VLang (transitions):

          q0 --<Quote>-->   q1        q1 --<Line>-->  q9
          q9 --<Desc>-->    q5        q9 --<Price>--> q6

          q     HLang(q)       VDom(q)
          q0      Quote          { }
          q1      Line +         { }
          q9    Desc Price       { }
          q5       { }          STRS
          q6       { }          INTS

B is a subschema of A.
 1. HLang(q0B) ⊆ HLang(q0A) and VDom(q0B) = VDom(q0A).
 2. HLang(q6B) = HLang(q6A) and VDom(q6B) ⊆ VDom(q6A).
 3. HLang(qiB) = HLang(qiA) and VDom(qiB) = VDom(qiA), for i = 1, 5, 9.
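
The state-by-state comparison can be sketched by walking both SAs in
parallel. This is a hedged illustration, not the authors' algorithm or
their quick RE test: HLang inclusion is only checked on words up to a
bounded length (so it can refute inclusion but not prove it), each
HLang is encoded as a Python regex over space-terminated names, and the
inclusion order INTS ⊆ DECS ⊆ STRS between value domains is an
assumption.

```python
import re
from itertools import product

# Assumed supersets of each value domain; "NONE" stands for "{ }".
SUPERSETS = {"NONE": {"NONE", "INTS", "DECS", "STRS"},
             "INTS": {"INTS", "DECS", "STRS"},
             "DECS": {"DECS", "STRS"},
             "STRS": {"STRS"}}
ALPHABET = ["Quote", "Order", "Line", "Desc", "Price", "Product", "Qty"]


def hlang_included(sub, sup, max_len=3):
    """Bounded check: each word up to max_len matched by sub is matched by sup."""
    for n in range(max_len + 1):
        for word in product(ALPHABET, repeat=n):
            text = "".join(name + " " for name in word)
            if re.fullmatch(sub, text) and not re.fullmatch(sup, text):
                return False
    return True


def is_subschema(b, a):
    """Walk pairs of corresponding states of b and a from the start states."""
    stack, seen = [(b["start"], a["start"])], set()
    while stack:
        qb, qa = stack.pop()
        if (qb, qa) in seen:
            continue
        seen.add((qb, qa))
        if a["vdom"][qa] not in SUPERSETS[b["vdom"][qb]]:
            return False
        if not hlang_included(b["hlang"][qb], a["hlang"][qa]):
            return False
        for name, tb in b["delta"].get(qb, {}).items():
            ta = a["delta"].get(qa, {}).get(name)
            if ta is None:               # conservative: no matching transition
                return False
            stack.append((tb, ta))
    return True


SCHEMA_A = {
    "start": "q0",
    "delta": {"q0": {"Quote": "q1", "Order": "q2"}, "q1": {"Line": "q9"},
              "q2": {"Line": "q4"}, "q4": {"Product": "q9", "Qty": "q8"},
              "q9": {"Desc": "q5", "Price": "q6"}},
    "hlang": {"q0": "(Quote |Order )", "q1": "(Line )+", "q2": "(Line )+",
              "q9": "Desc Price ", "q4": "Product Qty ",
              "q5": "", "q6": "", "q8": ""},
    "vdom": {"q0": "NONE", "q1": "NONE", "q2": "NONE", "q9": "NONE",
             "q4": "NONE", "q5": "STRS", "q6": "DECS", "q8": "INTS"},
}
SCHEMA_B = {
    "start": "q0",
    "delta": {"q0": {"Quote": "q1"}, "q1": {"Line": "q9"},
              "q9": {"Desc": "q5", "Price": "q6"}},
    "hlang": {"q0": "Quote ", "q1": "(Line )+", "q9": "Desc Price ",
              "q5": "", "q6": ""},
    "vdom": {"q0": "NONE", "q1": "NONE", "q9": "NONE",
             "q5": "STRS", "q6": "INTS"},
}
print(is_subschema(SCHEMA_B, SCHEMA_A), is_subschema(SCHEMA_A, SCHEMA_B))
```

The reverse test fails at the start states, since A allows <Order>
where B does not.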



  15
Subschema extraction

     We have developed the subschema extraction algorithm:
         Given an SA (XSD) A and a set of symbols (element names) Z,
         compute an SA that accepts exactly those instances (XML
         documents) of A that contain only symbols in Z.

     Example SA:

     VLang (transitions):

          q0 --<Quote>-->   q1        q0 --<Order>--> q7
          q1 --<Line>-->    q2        q7 --<Line>-->  q3
          q2 --<Desc>-->    q4        q2 --<Price>--> q5
          q3 --<Product>--> q2        q3 --<Qty>-->   q6

          q       HLang(q)          VDom(q)
          q0   <Quote>|<Order>        { }
          q1       <Line>+            { }
          q7       <Line>+            { }
          q2    <Desc><Price>         { }
          q3    <Product><Qty>        { }
          q4         { }            STRINGS
          q5         { }            DECIMALS
          q6         { }            INTEGERS

         Z = {<Quote>, <Line>, <Desc>, <Price>, <Order>, <Qty>}, where <Product> is
         excluded.
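
     One way to picture the extraction on this example is to restrict
     every HLang to the symbols in Z and then drop states that can no
     longer recognize anything. The sketch below is an assumption, not
     the authors' algorithm: HLangs are small regex ASTs of nested
     tuples (only eps, sym, seq, alt and plus are handled), and the
     function names are hypothetical.

```python
def restrict(expr, ok):
    """Restrict a regex AST to symbols in `ok`; None means the empty language."""
    kind = expr[0]
    if kind == "eps":
        return expr
    if kind == "sym":
        return expr if expr[1] in ok else None
    parts = [restrict(e, ok) for e in expr[1:]]
    if kind == "seq":
        return None if None in parts else ("seq", *parts)
    if kind == "alt":
        kept = [p for p in parts if p is not None]
        if not kept:
            return None
        return kept[0] if len(kept) == 1 else ("alt", *kept)
    if kind == "plus":
        return None if parts[0] is None else ("plus", parts[0])
    raise ValueError(kind)


def symbols(expr):
    """Collect the element names that occur in a regex AST."""
    if expr[0] == "sym":
        return {expr[1]}
    found = set()
    for part in expr[1:]:
        found |= symbols(part)
    return found


def extract(start, delta, hlang, z):
    """Return {state: restricted HLang} for the states kept in the subschema."""
    productive, changed = {}, True
    while changed:                       # grow the set of still-usable states
        changed = False
        for q in hlang:
            ok = {a for a, t in delta.get(q, {}).items()
                  if a in z and t in productive}
            r = restrict(hlang[q], ok)
            if r is not None and productive.get(q) != r:
                productive[q] = r
                changed = True
    keep = {}
    stack = [start] if start in productive else []
    while stack:                         # keep what is still reachable
        q = stack.pop()
        if q not in keep:
            keep[q] = productive[q]
            stack.extend(delta[q][a] for a in symbols(productive[q]))
    return keep


DELTA = {"q0": {"Quote": "q1", "Order": "q7"}, "q1": {"Line": "q2"},
         "q7": {"Line": "q3"}, "q2": {"Desc": "q4", "Price": "q5"},
         "q3": {"Product": "q2", "Qty": "q6"}}
HLANG = {"q0": ("alt", ("sym", "Quote"), ("sym", "Order")),
         "q1": ("plus", ("sym", "Line")), "q7": ("plus", ("sym", "Line")),
         "q2": ("seq", ("sym", "Desc"), ("sym", "Price")),
         "q3": ("seq", ("sym", "Product"), ("sym", "Qty")),
         "q4": ("eps",), "q5": ("eps",), "q6": ("eps",)}
Z = {"Quote", "Line", "Desc", "Price", "Order", "Qty"}
SUB = extract("q0", DELTA, HLANG, Z)
print(sorted(SUB), SUB["q0"])
```

     Under these assumptions, excluding <Product> makes q3 unusable,
     which in turn removes q7 and the <Order> branch, so only the
     <Quote> part of the schema remains.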



16
xCBL compatibility testing experiment

     1. Data sets: XML Common Business Library (xCBL)

         XSD        file size   no. of files   data types   element names   doc. types
         xCBL 3.0   1.8 MB      413            1,290        3,728           42
         xCBL 3.5   2.0 MB      496            1,476        4,473           51

     2. The subschema testing program has disproved the claim on
        xCBL.org:
       The only modifications allowed to xCBL 3.0 documents were the
       additions of new optional elements and additions to code lists; to
       maintain interoperability between the two versions. An xCBL 3.0
       instance of a document is also a valid instance in xCBL 3.5.
     3. xCBL 3.5 is not a superschema of xCBL 3.0.
     4. The experiment took only 272ms when the quick RE test
        was applied.
            Machine: Q6600@2.40GHz, 4GB RAM, Linux OS


18
Schema size reduction by subschema extraction
     1. The subschema extraction program was run to extract
        different subschemas from xCBL. Each subschema
        recognizes a different element subset for a specific
        application, e.g., order, invoice, etc.
     2. The schema size was reduced to 6–32% of the original size.
     3. The time required by XMLBeans to compile a subschema was
        reduced to 34–50% of the time originally required.
     4. The time to extract such a subschema was only 2–3s.
     [Figure: Subschema extraction from xCBL 3.5. For the original
     schema and the invoice, order, quote, auction, and catalog
     subschemas, the chart plots the number of element names, types,
     and element declarations (left axis, 0 to 5000) and the XMLBeans
     compilation time in seconds (right axis, 0 to 35).]

19
Conclusions
     1. We have developed:
            formal models for XML and XSD, and
            algorithms for schema equivalence and subschema testing,
            and subschema extraction.
     2. The underlying problems are PSPACE-complete because they
        involve comparisons of regular expressions.
            We have developed a heuristic (quick RE test) to make these
            algorithms run fast on very large schemas.
     3. Our experiments:
            have shown that xCBL 3.5 is in fact not backward-compatible
            with xCBL 3.0, and
            have extracted small subschemas from xCBL for different
            instance subsets, which greatly reduces the processing time
            on these subschemas.
     4. These models can be extended for other applications:
            web service adaptor for legacy systems (text to XML
            transformation), and
            schema inferrer from XML instances.
20

 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 

XML Schema Computations: Schema Compatibility Testing and Subschema Extraction

  • 1. XML Schema Computations: Schema Compatibility Testing and Subschema Extraction. Thomas Y.T. LEE and David W.L. Cheung, Department of Computer Science, The University of Hong Kong. CIKM 2010, Toronto, Canada, October 28, 2010.
  • 2. Outline: Introduction and motivation; Formal models for XML data and schemas; Schema computational algorithms; Experiments and conclusions.
  • 3. Outline: Introduction and motivation; Formal models for XML data and schemas; Schema computational algorithms; Experiments and conclusions.
  • 4. Data interoperability on web services
      For two web services to be interoperable, the XML schema on the message-receiving end must accept every XML message that the sending end can produce. In other words, the sending schema must be a subschema of the receiving schema.
      [Figure: Web Service A with Schema A and Web Service B with Schema B; a subset relation is drawn between their two sets of XML instances.]
  • 5. W3C XML Schema and data standards
      1. W3C XML Schema (XSD) is the most popular schema language for defining data standards.
      2. For a new version of an XSD to be backward-compatible with the old version, the new version must be a superschema of the old one: the new schema must accept every instance of the old schema.
      3. However, a typical e-commerce standard XSD contains thousands of types and elements, which makes manual verification of compatibility impractical.
      4. When an XSD is too large, how can we extract a smaller subschema that is just large enough for a specific application to process?
  • 6. Schema compatibility problems
      1. Given two XSDs, how can we verify that they are equivalent, or that one is a subschema of the other?
      2. Given an XSD A, how can we extract a smaller subschema B of A such that B recognizes only a subset of the elements recognized by A?
      3. In this research, we have developed formal models for XML data and schemas, as well as algorithms to solve these problems.
  • 7. Outline: Introduction and motivation; Formal models for XML data and schemas; Schema computational algorithms; Experiments and conclusions.
  • 8. Data Tree (DT) to model XML data
      A DT is a tree whose edges represent elements and whose nodes represent their contents.
      XML document:
          <Quote>
            <Line>
              <Desc>hPhone</Desc>
              <Price>499.9</Price>
            </Line>
            <Line>
              <Desc>iMat</Desc>
              <Price>999.9</Price>
            </Line>
          </Quote>
      Corresponding DT (ε denotes empty content):
          n0:ε
            <Quote> n1:ε
              <Line> n2:ε
                <Desc> n4:"hPhone"
                <Price> n5:"499.9"
              <Line> n3:ε
                <Desc> n6:"iMat"
                <Price> n7:"999.9"
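The data tree on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, `build_data_tree` and `show` are hypothetical names, attributes are ignored, and node ids are assigned breadth-first so that they line up with n0 to n7 on the slide.

```python
import xml.etree.ElementTree as ET
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                                   # node id, e.g. "n0" (hypothetical naming)
    value: str = ""                             # content; "" stands for ε
    edges: list = field(default_factory=list)   # ordered (element name, child Node) pairs


def build_data_tree(xml_text: str) -> Node:
    """Edges carry element names, nodes carry contents; ids follow breadth-first order."""
    root = Node("n0")                           # n0 sits above the document element
    queue = deque([(root, [ET.fromstring(xml_text)])])
    next_id = 1
    while queue:
        parent, elements = queue.popleft()
        for element in elements:
            children = list(element)
            # Only leaf elements keep their text; whitespace between child elements is dropped.
            text = "" if children else (element.text or "").strip()
            child = Node(f"n{next_id}", text)
            next_id += 1
            parent.edges.append((element.tag, child))
            queue.append((child, children))
    return root


def show(node: Node, depth: int = 0) -> None:
    """Print one line per edge, indented by depth."""
    for tag, child in node.edges:
        print("  " * depth + f"<{tag}> {child.name}:{child.value!r}")
        show(child, depth + 1)


QUOTE_XML = """<Quote>
  <Line><Desc>hPhone</Desc><Price>499.9</Price></Line>
  <Line><Desc>iMat</Desc><Price>999.9</Price></Line>
</Quote>"""

tree = build_data_tree(QUOTE_XML)
show(tree)
```

Keeping the edges as an ordered list matters because the order of child elements is what the horizontal language of a schema later constrains.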
  • 9. Schema Automaton (SA) to model XML schemas
      1. An SA is a deterministic finite automaton (DFA) in which each state is associated with a regular expression (RE) and a set of values called the value domain (VDom).
      2. The DFA, called the vertical language (VLang), defines how symbols are arranged along the paths from the root to the leaves.
         2.1 Each state represents an XSD data type and each symbol represents an element name.
      3. The RE of a state, called the horizontal language (HLang), defines how child elements can be arranged under an XSD data type, i.e., its content model.
      4. The value domain defines the set of all values that an element can contain.
  • 10. Example SA
      [Figure: state diagram of the example SA.] Transitions as read from the diagram:
          q0 --<Quote>--> q1      q0 --<Order>--> q2
          q1 --<Line>--> q3       q2 --<Line>--> q4
          q3 --<Desc>--> q5       q3 --<Price>--> q6
          q4 --<Product>--> q7    q4 --<Qty>--> q8
          q7 --<Desc>--> q5       q7 --<Price>--> q6
      q    HLang(q)          VDom(q)
      q0   <Quote>|<Order>   { }
      q1   <Line>+           { }
      q2   <Line>+           { }
      q3   <Desc><Price>     { }
      q4   <Product><Qty>    { }
      q5   { }               STRINGS
      q6   { }               DECIMALS
      q7   <Desc><Price>     { }
      q8   { }               INTEGERS
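To make the roles of VLang, HLang and VDom concrete, here is a minimal validation sketch for the example SA. It rests on several assumptions that are not in the slides: the transition table is my reading of the diagram, each HLang is written as a Python regex over element names followed by one space, a `{ }` value domain is treated as "empty content only" (`EMPTY`), and the lexical checks for decimals and integers are simplified. All names (`HLANG`, `VDOM`, `DELTA`, `accepts`, `validate`) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Simplified value-domain checks (assumption: lexical tests only).
VDOM_CHECKS = {
    "EMPTY": lambda text: text == "",
    "STRINGS": lambda text: True,
    "DECIMALS": lambda text: re.fullmatch(r"[+-]?(\d+(\.\d*)?|\.\d+)", text) is not None,
    "INTEGERS": lambda text: re.fullmatch(r"[+-]?\d+", text) is not None,
}

# HLang per state: a regex over child element names, each name followed by one space.
HLANG = {
    "q0": r"(Quote |Order )", "q1": r"(Line )+", "q2": r"(Line )+",
    "q3": r"Desc Price ", "q4": r"Product Qty ",
    "q5": r"", "q6": r"", "q7": r"Desc Price ", "q8": r"",
}
VDOM = {
    "q0": "EMPTY", "q1": "EMPTY", "q2": "EMPTY", "q3": "EMPTY", "q4": "EMPTY",
    "q5": "STRINGS", "q6": "DECIMALS", "q7": "EMPTY", "q8": "INTEGERS",
}
# VLang: the DFA transitions (state, element name) -> state.
DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}


def accepts(state, children, text):
    """Check the child sequence against HLang, the content against VDom, then recurse."""
    word = "".join(child.tag + " " for child in children)
    if re.fullmatch(HLANG[state], word) is None:
        return False
    if not VDOM_CHECKS[VDOM[state]](text):
        return False
    for child in children:
        target = DELTA.get((state, child.tag))
        if target is None:
            return False
        grandchildren = list(child)
        child_text = "" if grandchildren else (child.text or "").strip()
        if not accepts(target, grandchildren, child_text):
            return False
    return True


def validate(xml_text):
    # The initial state q0 sits above the document element.
    return accepts("q0", [ET.fromstring(xml_text)], "")


GOOD = "<Quote><Line><Desc>hPhone</Desc><Price>499.9</Price></Line></Quote>"
BAD = "<Quote><Line><Desc>hPhone</Desc><Qty>2</Qty></Line></Quote>"
print(validate(GOOD), validate(BAD))  # prints: True False
```

The second document is rejected at the state reached by `<Line>` under `<Quote>`, whose HLang requires `<Desc>` followed by `<Price>`.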
  • 11. Outline: Introduction and motivation; Formal models for XML data and schemas; Schema computational algorithms; Experiments and conclusions.
  • 12. Schema compatibility testing
      1. Two tests are considered: schema equivalence testing and subschema testing.
      2. Schema minimization is involved.
         2.1 All useless states (data types) are removed first. A useless state is an inaccessible state, or a state that does not recognize any element with a finite number of descendants.
         2.2 The process resembles DFA minimization, but the HLang and VDom of each state are also considered when deciding whether two states can be merged.
      3. We have proved that two SAs (XSDs) are equivalent iff their minimized forms have isomorphic VLang DFAs and all corresponding HLangs and VDoms are equivalent.
      4. We have developed an algorithm to verify whether one SA is a subschema of another.
  • 13. Useless states
      [Figure: state diagram over states q0 to q9 with edges labelled A, B and C.]
      q    HLang(q)     VDom(q)
      q0   A{2,5}BC?    STRINGS
      q1   C*           STRINGS
      q2   { }          INTEGERS
      q3   A*           STRINGS
      q4   B+           STRINGS
      q5   C            STRINGS
      q6   A+B*         INTEGERS
      q7   A?           STRINGS
      q8   B*           STRINGS
      q9   { }          DECIMALS
      1. q7 and q8 are inaccessible.
      2. q5 and q6 are irrational because they generate an infinite number of descendants.
      3. q9 is useless because it is blocked by irrational states.
      4. q4 is useless because it must lead to an irrational state.
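One way to compute useless states is a least fixpoint over "productive" states, followed by a reachability pass. The sketch below is an assumption-laden illustration, not the authors' algorithm: HLangs are small tuple-based ASTs, the function names are hypothetical, and the automaton is a made-up example (the edges of the slide's diagram are not recoverable from the transcript). In it, q2 and q3 are irrational, q4 is blocked by them, and q5 is inaccessible.

```python
# HLang AST (hypothetical encoding): ("eps",), ("sym", a), ("seq", ...), ("alt", ...),
# ("star", x), ("plus", x), ("opt", x)

def live_symbols(hlang, ok):
    """Symbols occurring in at least one word of hlang built only from `ok`.
    Returns None if hlang has no such word at all."""
    kind = hlang[0]
    if kind == "eps":
        return set()
    if kind == "sym":
        return {hlang[1]} if hlang[1] in ok else None
    if kind == "seq":
        parts = [live_symbols(p, ok) for p in hlang[1:]]
        return None if any(p is None for p in parts) else set().union(*parts)
    if kind == "alt":
        parts = [p for p in (live_symbols(b, ok) for b in hlang[1:]) if p is not None]
        return set().union(*parts) if parts else None
    if kind in ("star", "opt"):
        return live_symbols(hlang[1], ok) or set()   # the empty word always works
    if kind == "plus":
        return live_symbols(hlang[1], ok)
    raise ValueError(f"unknown node kind: {kind}")


def useful_states(start, delta, hlang):
    def outgoing(q, targets):
        return {a for (p, a), t in delta.items() if p == q and t in targets}

    # Least fixpoint: a state is productive if some word of its HLang uses only
    # symbols that lead to states already known to be productive.
    productive, changed = set(), True
    while changed:
        changed = False
        for q in hlang:
            if q not in productive and live_symbols(hlang[q], outgoing(q, productive)) is not None:
                productive.add(q)
                changed = True

    if start not in productive:
        return set()
    # Reachability, following only symbols that can occur in a finite instance.
    useful, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for a in live_symbols(hlang[q], outgoing(q, productive)):
            t = delta[(q, a)]
            if t not in useful:
                useful.add(t)
                stack.append(t)
    return useful


HLANG = {
    "q0": ("seq", ("sym", "A"), ("opt", ("sym", "B"))),
    "q1": ("eps",),
    "q2": ("sym", "C"),                                   # needs q3, which needs q2 again
    "q3": ("seq", ("plus", ("sym", "C")), ("opt", ("sym", "D"))),
    "q4": ("eps",),                                       # reachable only through q3
    "q5": ("opt", ("sym", "A")),                          # nothing leads here
}
DELTA = {
    ("q0", "A"): "q1", ("q0", "B"): "q2", ("q2", "C"): "q3",
    ("q3", "C"): "q2", ("q3", "D"): "q4", ("q5", "A"): "q1",
}

useful = useful_states("q0", DELTA, HLANG)
print(sorted(useful), sorted(set(HLANG) - useful))
```

Starting from the empty set rather than from all states is what rules out cycles such as q2 and q3, which can only justify each other.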
  • 14. Schema minimization and equivalence
      Schema A:
      [Figure: state diagram of Schema A.] Transitions as read from the diagram:
          q0 --<Quote>--> q1      q0 --<Order>--> q2
          q1 --<Line>--> q3       q2 --<Line>--> q4
          q3 --<Desc>--> q5       q3 --<Price>--> q6
          q4 --<Product>--> q7    q4 --<Qty>--> q8
          q7 --<Desc>--> q5       q7 --<Price>--> q6
      q    HLang(q)        VDom(q)
      q0   Quote | Order   { }
      q1   Line+           { }
      q2   Line+           { }
      q3   Desc Price      { }
      q4   Product Qty     { }
      q5   { }             STRS
      q6   { }             DECS
      q7   Desc Price      { }
      q8   { }             INTS
      1. q3 and q7 can be merged into q9.
      2. The two SAs are equivalent.
      Schema B:
      [Figure: state diagram of Schema B.] Transitions as read from the diagram:
          q0 --<Quote>--> q1      q0 --<Order>--> q2
          q1 --<Line>--> q9       q2 --<Line>--> q4
          q4 --<Product>--> q9    q4 --<Qty>--> q8
          q9 --<Desc>--> q5       q9 --<Price>--> q6
      q    HLang(q)        VDom(q)
      q0   Quote | Order   { }
      q1   Line+           { }
      q2   Line+           { }
      q9   Desc Price      { }
      q4   Product Qty     { }
      q5   { }             STRS
      q6   { }             DECS
      q8   { }             INTS
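The merge of q3 and q7 can be reproduced with a Moore-style partition refinement in which the initial partition groups states by (HLang, VDom) instead of by accepting/non-accepting. This is a sketch under assumptions rather than the authors' procedure: HLangs are compared as literal strings (a real test needs regular-expression equivalence), useless states are assumed to have been removed already, the transition table is my reading of the Schema A diagram, and `merge_classes` is a hypothetical name.

```python
def merge_classes(states, delta, hlang, vdom):
    """Return the groups of states that can be merged, as sorted lists."""
    def renumber(keys):
        # Map each distinct key to a small integer block id.
        ids = {}
        return {q: ids.setdefault(k, len(ids)) for q, k in keys.items()}

    # Initial partition: same HLang (compared textually here) and same VDom.
    block = renumber({q: (hlang[q], vdom[q]) for q in states})
    while True:
        # Split blocks whose members disagree on where some symbol leads.
        refined = renumber({
            q: (block[q],
                tuple(sorted((a, block[t]) for (p, a), t in delta.items() if p == q)))
            for q in states
        })
        if len(set(refined.values())) == len(set(block.values())):
            break                       # no block was split: the partition is stable
        block = refined

    classes = {}
    for q in states:
        classes.setdefault(block[q], []).append(q)
    return sorted(classes.values())


STATES = ["q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8"]
HLANG = {
    "q0": "Quote | Order", "q1": "Line+", "q2": "Line+", "q3": "Desc Price",
    "q4": "Product Qty", "q5": "", "q6": "", "q7": "Desc Price", "q8": "",
}
VDOM = {"q0": "", "q1": "", "q2": "", "q3": "", "q4": "",
        "q5": "STRS", "q6": "DECS", "q7": "", "q8": "INTS"}
DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q2",
    ("q1", "Line"): "q3", ("q2", "Line"): "q4",
    ("q3", "Desc"): "q5", ("q3", "Price"): "q6",
    ("q4", "Product"): "q7", ("q4", "Qty"): "q8",
    ("q7", "Desc"): "q5", ("q7", "Price"): "q6",
}

classes = merge_classes(STATES, DELTA, HLANG, VDOM)
print(classes)
```

Note that q1 and q2 share the HLang `Line+` yet stay separate, because their `<Line>` transitions lead to states with different content models.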
  • 15. Subschema testing
      Schema A:
      [Figure: state diagram of Schema A.] Transitions as read from the diagram:
          q0 --<Quote>--> q1      q0 --<Order>--> q2
          q1 --<Line>--> q9       q2 --<Line>--> q4
          q4 --<Product>--> q9    q4 --<Qty>--> q8
          q9 --<Desc>--> q5       q9 --<Price>--> q6
      q    HLang(q)        VDom(q)
      q0   Quote | Order   { }
      q1   Line+           { }
      q2   Line+           { }
      q9   Desc Price      { }
      q4   Product Qty     { }
      q5   { }             STRS
      q6   { }             DECS
      q8   { }             INTS
      Schema B:
      [Figure: state diagram of Schema B.] Transitions as read from the diagram:
          q0 --<Quote>--> q1      q1 --<Line>--> q9
          q9 --<Desc>--> q5       q9 --<Price>--> q6
      q    HLang(q)     VDom(q)
      q0   Quote        { }
      q1   Line+        { }
      q9   Desc Price   { }
      q5   { }          STRS
      q6   { }          INTS
      B is a subschema of A:
      1. HLang(q0_B) ⊆ HLang(q0_A) and VDom(q0_B) = VDom(q0_A).
      2. HLang(q6_B) = HLang(q6_A) and VDom(q6_B) ⊆ VDom(q6_A).
      3. HLang(qi_B) = HLang(qi_A) and VDom(qi_B) = VDom(qi_A), for i = 1, 5, 9.
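The three conditions above suggest walking both automata in lockstep and checking HLang and VDom inclusion at each pair of states. The sketch below does that, with loud caveats: it is not the authors' algorithm, the HLang inclusion test is only a bounded approximation (it enumerates child sequences up to `max_len` instead of deciding regular-expression inclusion), the VDom ordering INTS ⊆ DECS ⊆ STRS is an assumption, B is assumed to have no useless states, and all names are hypothetical. HLangs are Python regexes over element names followed by one space.

```python
import re
from itertools import product

# Assumed value-domain ordering: for each VDom, the set of VDoms that contain it.
VDOM_SUPERSETS = {
    "": {""},
    "INTS": {"INTS", "DECS", "STRS"},
    "DECS": {"DECS", "STRS"},
    "STRS": {"STRS"},
}


def hlang_subset(sub, sup, alphabet, max_len=3):
    """Bounded approximation: every word up to max_len matched by sub is matched by sup."""
    for n in range(max_len + 1):
        for names in product(alphabet, repeat=n):
            word = "".join(name + " " for name in names)
            if re.fullmatch(sub, word) and not re.fullmatch(sup, word):
                return False
    return True


def is_subschema(b, a, alphabet):
    """Walk pairs of states reachable by the same element path in b and a."""
    seen, stack = set(), [(b["start"], a["start"])]
    while stack:
        qb, qa = stack.pop()
        if (qb, qa) in seen:
            continue
        seen.add((qb, qa))
        if a["vdom"][qa] not in VDOM_SUPERSETS[b["vdom"][qb]]:
            return False
        if not hlang_subset(b["hlang"][qb], a["hlang"][qa], alphabet):
            return False
        for (p, sym), tb in b["delta"].items():
            if p == qb:
                ta = a["delta"].get((qa, sym))
                if ta is None:
                    return False
                stack.append((tb, ta))
    return True


ALPHABET = ["Quote", "Order", "Line", "Desc", "Price", "Product", "Qty"]
SCHEMA_A = {
    "start": "q0",
    "hlang": {"q0": r"(Quote |Order )", "q1": r"(Line )+", "q2": r"(Line )+",
              "q9": r"Desc Price ", "q4": r"Product Qty ", "q5": r"", "q6": r"", "q8": r""},
    "vdom": {"q0": "", "q1": "", "q2": "", "q9": "", "q4": "",
             "q5": "STRS", "q6": "DECS", "q8": "INTS"},
    "delta": {("q0", "Quote"): "q1", ("q0", "Order"): "q2", ("q1", "Line"): "q9",
              ("q2", "Line"): "q4", ("q4", "Product"): "q9", ("q4", "Qty"): "q8",
              ("q9", "Desc"): "q5", ("q9", "Price"): "q6"},
}
SCHEMA_B = {
    "start": "q0",
    "hlang": {"q0": r"Quote ", "q1": r"(Line )+", "q9": r"Desc Price ", "q5": r"", "q6": r""},
    "vdom": {"q0": "", "q1": "", "q9": "", "q5": "STRS", "q6": "INTS"},
    "delta": {("q0", "Quote"): "q1", ("q1", "Line"): "q9",
              ("q9", "Desc"): "q5", ("q9", "Price"): "q6"},
}

print(is_subschema(SCHEMA_B, SCHEMA_A, ALPHABET), is_subschema(SCHEMA_A, SCHEMA_B, ALPHABET))
# prints: True False
```

A complete test would replace `hlang_subset` with an exact inclusion check on regular expressions, which is where the PSPACE cost mentioned in the conclusions comes from.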
  • 16. Subschema extraction
      We have developed a subschema extraction algorithm: given an SA (XSD) A and a set of symbols (element names) Z, compute an SA that accepts all instances (XML documents) of A except those containing any symbol not in Z.
      [Figure: state diagram of the SA.] Transitions as read from the diagram:
          q0 --<Quote>--> q1      q0 --<Order>--> q7
          q1 --<Line>--> q2       q7 --<Line>--> q3
          q2 --<Desc>--> q4       q2 --<Price>--> q5
          q3 --<Product>--> q2    q3 --<Qty>--> q6
      q    HLang(q)          VDom(q)
      q0   <Quote>|<Order>   { }
      q1   <Line>+           { }
      q7   <Line>+           { }
      q2   <Desc><Price>     { }
      q3   <Product><Qty>    { }
      q4   { }               STRINGS
      q5   { }               DECIMALS
      q6   { }               INTEGERS
      Z = {<Quote>, <Line>, <Desc>, <Price>, <Order>, <Qty>}, where <Product> is excluded.
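A possible reading of the extraction step is: restrict every HLang to the kept symbols, then discard states that can no longer recognize any finite element or can no longer be reached. The sketch below follows that reading and is not taken from the paper; the tuple-based HLang AST, the function names, and the transition table (my reading of the diagram) are all assumptions, and VDoms are omitted because they are left unchanged.

```python
NULL = ("null",)   # the empty language: matches nothing
EPS = ("eps",)     # matches only the empty child sequence


def restrict(h, allowed):
    """Rewrite HLang h so that it only matches child sequences over `allowed`."""
    kind = h[0]
    if kind in ("eps", "null"):
        return h
    if kind == "sym":
        return h if h[1] in allowed else NULL
    if kind == "seq":
        parts = [restrict(p, allowed) for p in h[1:]]
        if NULL in parts:
            return NULL
        parts = [p for p in parts if p != EPS]
        return EPS if not parts else parts[0] if len(parts) == 1 else ("seq", *parts)
    if kind == "alt":
        parts = [p for p in (restrict(b, allowed) for b in h[1:]) if p != NULL]
        return NULL if not parts else parts[0] if len(parts) == 1 else ("alt", *parts)
    if kind == "plus":
        inner = restrict(h[1], allowed)
        return inner if inner in (NULL, EPS) else ("plus", inner)
    if kind in ("star", "opt"):
        inner = restrict(h[1], allowed)
        return EPS if inner in (NULL, EPS) else (kind, inner)
    raise ValueError(f"unknown node kind: {kind}")


def symbols(h):
    """All symbols mentioned in an HLang AST."""
    if h[0] == "sym":
        return {h[1]}
    return set().union(*(symbols(p) for p in h[1:]))


def extract(start, delta, hlang, keep):
    def allowed(q, ok_states):
        return {a for (p, a), t in delta.items() if p == q and a in keep and t in ok_states}

    # Least fixpoint of states that still recognize some finite element.
    productive, changed = set(), True
    while changed:
        changed = False
        for q in hlang:
            if q not in productive and restrict(hlang[q], allowed(q, productive)) != NULL:
                productive.add(q)
                changed = True
    if start not in productive:
        return {}, {}

    # Keep only the states reachable through the restricted HLangs.
    new_hlang, new_delta, stack = {}, {}, [start]
    while stack:
        q = stack.pop()
        if q in new_hlang:
            continue
        new_hlang[q] = restrict(hlang[q], allowed(q, productive))
        for a in symbols(new_hlang[q]):
            new_delta[(q, a)] = delta[(q, a)]
            stack.append(delta[(q, a)])
    return new_hlang, new_delta


HLANG = {
    "q0": ("alt", ("sym", "Quote"), ("sym", "Order")),
    "q1": ("plus", ("sym", "Line")), "q7": ("plus", ("sym", "Line")),
    "q2": ("seq", ("sym", "Desc"), ("sym", "Price")),
    "q3": ("seq", ("sym", "Product"), ("sym", "Qty")),
    "q4": EPS, "q5": EPS, "q6": EPS,
}
DELTA = {
    ("q0", "Quote"): "q1", ("q0", "Order"): "q7", ("q1", "Line"): "q2", ("q7", "Line"): "q3",
    ("q2", "Desc"): "q4", ("q2", "Price"): "q5", ("q3", "Product"): "q2", ("q3", "Qty"): "q6",
}
Z = {"Quote", "Line", "Desc", "Price", "Order", "Qty"}

sub_hlang, sub_delta = extract("q0", DELTA, HLANG, Z)
print(sorted(sub_hlang), sub_hlang["q0"])
```

With `<Product>` excluded, the state reached by `<Line>` under `<Order>` can no longer be satisfied, so in this sketch the whole `<Order>` branch disappears and the initial state accepts only `<Quote>`.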
  • 17. Outline: Introduction and motivation; Formal models for XML data and schemas; Schema computational algorithms; Experiments and conclusions.
  • 18. xCBL compatibility testing experiment
      1. Data sets: XML Common Business Library (xCBL)
          XSD        File size   No. of files   Data types   Element names   Doc. types
          xCBL 3.0   1.8MB       413            1,290        3,728           42
          xCBL 3.5   2.0MB       496            1,476        4,473           51
      2. The subschema testing program has disproved the claim on xCBL.org: "The only modifications allowed to xCBL 3.0 documents were the additions of new optional elements and additions to code lists; to maintain interoperability between the two versions. An xCBL 3.0 instance of a document is also a valid instance in xCBL 3.5."
      3. xCBL 3.5 is not a superschema of xCBL 3.0.
      4. The experiment took only 272 ms when the quick RE test was applied.
      Machine: Q6600@2.40GHz, 4GB RAM, Linux OS.
  • 19. Schema size reduction by subschema extraction
      1. The subschema extraction program was run to extract different subschemas from xCBL. Each subschema recognizes a different element subset for a specific application, e.g., order, invoice, etc.
      2. The schema size was reduced to 6–32% of the original size.
      3. The time required by XMLBeans to compile a subschema was reduced to 34–50% of the time originally required.
      4. The time to extract such a subschema was only 2–3 s.
      [Figure: Subschema extraction from xCBL 3.5. For the original schema and the invoice, order, quote, auction and catalog subschemas, the chart shows the number of element names, types and element declarations (left axis: number) and the XMLBeans compilation time (right axis: time in seconds).]
  • 20. Conclusions
      1. We have developed formal models for XML and XSD, and algorithms for schema equivalence testing, subschema testing, and subschema extraction.
      2. These problems are PSPACE-complete because they involve comparisons of regular expressions. We have developed a heuristic (the quick RE test) that makes the algorithms run fast on very large schemas.
      3. Our experiments have shown that xCBL 3.5 is in fact not backward-compatible with xCBL 3.0, and have extracted small subschemas from xCBL for different instance subsets, which greatly reduces the processing time on these subschemas.
      4. These models can be extended to other applications: web service adaptors for legacy systems (text-to-XML transformation) and schema inference from XML instances.