Best Practices and
Performance Tuning of
XML Queries in SQL Server
AD-501-M

Michael Rys
Principal Program Manager
Microsoft Corp

mrys@microsoft.com
@SQLServerMike




                            October 11-14, Seattle, WA
Session Objectives
• Understand when and how
  to use XML in SQL Server
• Understand and correct common
  performance problems with XML and
  XQuery
Session Agenda

XML Scenarios and when to store XML


XML Design Optimizations


General Optimizations


XML Datatype method Optimizations


XQuery Optimizations


XML Index Optimizations

                                      AD-501-M| XQuery Performance   3
XML Scenarios

Data Exchange between loosely-coupled systems
•   XML is ubiquitous, extensible, platform independent transport format
•   Message Envelope in XML
    Simple Object Access Protocol (SOAP), RSS, REST
•   Message Payload/Business Data in XML
•   Vertical Industry Exchange schemas
Document Management
•   XHTML, DocBook, Home-grown, domain-specific markup (e.g.
    contracts), OpenOffice, Microsoft Office XML (both default and user-
    extended)
Ad-hoc modeling of semistructured data
•   Storing and querying heterogeneous complex objects
•   Semistructured data with sparse, highly-varying
    structure at the instance level
•   XML provides self-describing format and extensible schemas

     →Transport, Store, and Query XML data
Decision Tree: Processing XML In SQL Server
1. Does the data fit the relational model?
   • Yes: Shred the XML into relations
   • No: go to question 2
2. Is the data semi-structured?
   • Yes: Shred the structured XML into relations, store semistructured
     aspects as XML and/or sparse columns
       - Known sparse: Shred known sparse data into sparse columns
       - Open schema: treat as XML (branch below)
   • No: go to question 3
3. Is the data a document?
   [Figure: the boxes of this branch overlap in the slide; recoverable labels only]
   • Questions: Search within the XML? Query into the XML?
     Is the XML constrained by schemas?
   • Outcomes:
       - Store as varbinary(max)
       - Define a full-text index
       - Constrain XML if validation cost is ok
       - Store as XML
       - Promote frequently queried properties relationally
       - Use primary and secondary XML indexes as needed
SQL Server XML Data Type Architecture

[Figure: architecture diagram. XML side: XML Parser, Validation against the Schema Collection (XML Schemata), XML data type (binary XML), XML-DML, XQuery, FOR XML with TYPE directive. Relational side: OpenXML/nodes() producing Rowsets, PRIMARY XML INDEX (Node Table), and the secondary PATH, PROP, and VALUE indexes.]
General Impacts
Concurrency Control
•   Locks on both XML data type and relevant
    rows in primary and secondary XML Indices
•   Lock escalation on indices
•   Snapshot Isolation reduces locks and lock contention
Transaction Logs
•   Bulk insert into XML Indices may fill transaction log
•   Delay the creation of the XML indexes and use the SIMPLE recovery
    model
•   Preallocate database file instead of dynamically growing
•   Place log on different disk
In-Row/Out-of-Row of XML large object
•   Moving XML into side table or out-of-row if
    mixed with relational data reduces scan time
Due to clustering, insertion into XML Index may not be linear
•   Choose integer/bigint identity column as key
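
The recommendations above could be combined roughly as follows. This is a minimal T-SQL sketch, not the presenter's script: the table `dbo.Docs`, its columns, and the index name are hypothetical.

```sql
-- Hypothetical table: a narrow integer identity key keeps the primary
-- XML index (clustered on this key plus the node ID) compact.
CREATE TABLE dbo.Docs (
    DocID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Title nvarchar(200)     NOT NULL,
    xDoc  xml               NOT NULL
);

-- Store large value types (including xml) out of row so that scans over
-- the relational columns do not have to read the XML payload.
EXEC sp_tableoption N'dbo.Docs', 'large value types out of row', 1;

-- ... bulk load the data here, before any XML index exists ...

-- Create the XML index only after the load has finished.
CREATE PRIMARY XML INDEX PXML_Docs_xDoc ON dbo.Docs (xDoc);
```

Loading first and indexing afterwards means the index is built in one pass instead of being maintained row by row during the load.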
Choose The Right XML Model
• Element-centric versus attribute-centric
     <Customer><name>Joe</name></Customer>
     <Customer name="Joe" />
  +: Attributes → often better-performing queries
  –: Parsing attributes → uniqueness check

• Generic element names with type attribute vs Specific
  element names
     <Entity type="Customer">
       <Prop type="Name">Joe</Prop>
     </Entity>
  <Customer><name>Joe</name></Customer>
  +: Specific names → shorter path expressions
  +: Specific names → no filter on type attribute
  /Entity[@type="Customer"]/Prop[@type="Name"] vs /Customer/name

• Wrapper elements
     <Orders><Order id="1"/></Orders>
  +: No wrapper elements → smaller XML, shorter path expressions
Use an XML Schema Collection?

Using no XML Schema (untyped XML)
•   Can still use XQuery and XML Index!
•   Atomic values are always weakly typed strings
    → compare as strings to avoid runtime
    conversions and loss of index usage
•   No schema validation overhead
•   No schema evolution revalidation costs

XML Schema provides structural information
•   Atomic typed elements now use only one row instead of two
    in the node table/XML index (closer to attributes)
•   Static typing can detect cardinality and feasibility of expression

XML Schema provides semantic information
•   Elements/attributes have correct atomic
    type for comparison and order semantics
•   No runtime casts required and better use of index for value lookup

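
To make the typed/untyped trade-off concrete, here is a small sketch. The schema collection `PersonSchema` and the two tables are hypothetical names chosen for illustration.

```sql
-- Hypothetical schema collection declaring <person><age>...</age></person>
CREATE XML SCHEMA COLLECTION PersonSchema AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="age" type="xs:int" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';
GO

CREATE TABLE dbo.PeopleTyped   (id int PRIMARY KEY, x xml(DOCUMENT PersonSchema));
CREATE TABLE dbo.PeopleUntyped (id int PRIMARY KEY, x xml);
GO

-- Typed: age is xs:int, so the numeric comparison needs no runtime cast.
SELECT id FROM dbo.PeopleTyped
WHERE x.exist('/person[age = 42]') = 1;

-- Untyped: values are untyped strings, so compare against a string literal.
SELECT id FROM dbo.PeopleUntyped
WHERE x.exist('/person/age/text()[. = "42"]') = 1;
```

Note that the string comparison on untyped data is a string match: "42" does not match "042" or "42.0".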
XQuery Methods

query() creates new, untyped XML data type
instance

exist() returns 1 if the XQuery expression returns
at least one item, 0 otherwise

value() extracts an XQuery value into the SQL
value and type space
• Expression has to statically be a singleton
• String value of atomized XQuery item is cast to
  SQL type
• SQL type has to be SQL scalar type
  (no XML or CLR UDT)
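
A short sketch of the three methods side by side, using a made-up XML variable:

```sql
-- Illustrative data only
DECLARE @x xml = N'<a><b id="1">10</b><b id="2">20</b></a>';

SELECT
    @x.query('/a/b[@id = 2]')           AS QueryResult,  -- new untyped xml instance
    @x.exist('/a/b[@id = 2]')           AS ExistResult,  -- bit: 1 if at least one item
    @x.value('(/a/b/text())[1]', 'int') AS ValueResult;  -- SQL scalar; the [1] makes
                                                         -- the expression a static singleton
```

Without the `(...)[1]` ordinal, `value()` is rejected at compile time on untyped XML because the path could return more than one item.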
XQuery: nodes()

Returns a row per selected node as a special
XML data type instance
• Preserves the original structure and types
• Can only be used with the XQuery methods (but not
  modify()), count(*), and IS (NOT) NULL

Appears as Table-valued Function (TVF) in
query plan if no index present
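
As a minimal sketch of the row-per-node behavior, with made-up sample data:

```sql
-- Illustrative data only
DECLARE @x xml = N'<doc>
  <customer id="1" name="Joe"/>
  <customer id="2" name="Ann"/>
</doc>';

-- One row per <customer>; c is the context node for the XQuery methods.
SELECT c.value('@id',   'int')          AS CustID,
       c.value('@name', 'nvarchar(50)') AS CName,
       c.query('.')                     AS CustomerXml
FROM @x.nodes('/doc/customer') AS N(c);
```

The column `c` cannot be selected directly; it has to be consumed through `value()`, `query()`, `exist()`, or `nodes()`.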
sql:column()/sql:variable()

Map SQL value and type into XQuery values and types in context of XQuery or
XML-DML
• sql:variable(): accesses a SQL variable/parameter
  declare @value int
  set @value=42
  select * from T
  where
  T.x.exist('/a/b[@id=sql:variable("@value")]')=1
• sql:column(): accesses another column value

   tables: T([key] int, x xml), S([key] int, val int)

   select * from T join S on T.[key]=S.[key]
   where T.x.exist('/a/b[@id=sql:column("S.val")]')=1

• Restrictions in SQL Server:
   No XML, CLR UDT, datetime, or deprecated text/ntext/image
Improving Slow XQueries, Bad
FOR XML
demo




Optimal Use Of Methods
How to Cast from XML to SQL

BAD:
CAST( CAST(xmldoc.query('/a/b/text()') as
      nvarchar(500)) as int)
GOOD:
xmldoc.value('(/a/b/text())[1]', 'int')
BAD:
node.query('.').value('@attr',
                      'nvarchar(50)')
GOOD:
node.value('@attr', 'nvarchar(50)')



Optimal Use Of Methods
Grouping value() method
Group value() methods on same XML instance next to
each other if the path expressions in the value()
methods are
• Simple path expressions that only use child and attribute axis
  and do not contain wildcards, predicates, node tests, ordinals
• The path expressions infer statically a singleton

The singleton can be statically inferred from
• the DOCUMENT and XML Schema Collection
• Relative paths on the context node provided by the nodes()
  method

Requires XML index to be present
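
A sketch of what grouped `value()` calls might look like. The table `dbo.CustDocs(id int primary key, x xml)` with a primary XML index on `x` is an assumption for illustration.

```sql
-- All value() calls work on the same context node c, use simple relative
-- attribute paths (statically singletons), and sit next to each other.
SELECT c.value('@id',   'int')          AS CustID,
       c.value('@name', 'nvarchar(50)') AS CName,
       c.value('@city', 'nvarchar(50)') AS City
FROM dbo.CustDocs
CROSS APPLY x.nodes('/doc/customer') AS N(c);
```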
Optimal Use of Methods
Using the right method to join and compare

  Use exist() method, sql:column()/sql:variable() and an
  XQuery comparison for checking for a value or joining
  if secondary XML indices present
    BAD:*
    select doc
    from doc_tab join authors
    on doc.value('(/doc/mainauthor/lname/text())[1]',
    'nvarchar(50)') = lastname
    GOOD:
    select doc
    from doc_tab join authors
    on 1 = doc.exist('/doc/mainauthor/lname/text()[. =
    sql:column("lastname")]')
  * If applied on XML variable/no index present, value()
  method is most of the time more efficient
Optimal Use of Methods
Avoiding bad costing with nodes()
nodes() without XML index is a Table-valued function (details later)
Bad cardinality estimates can lead to bad plans
   •   BAD:
       select c.value('@id', 'int') as CustID
            , c.value('@name', 'nvarchar(50)') as CName
       from Customer, @x.nodes('/doc/customer') as N(c)
       where Customer.ID = c.value('@id', 'int')
   •   BETTER (if only one wrapper doc element):
      select c.value('@id', 'int') as CustID
           , c.value('@name', 'nvarchar(50)') as CName
      from Customer, @x.nodes('/doc[1]') as D(d)
      cross apply d.nodes('customer') as N(c)
      where Customer.ID = c.value('@id', 'int')
Use temp table (insert into #temp select … from nodes()) or Table-
valued parameter instead of XML to get better estimates
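
The temp-table suggestion could look roughly like this. The temp table `#cust` and the sample XML are illustrative; `Customer` is the table from the slide.

```sql
DECLARE @x xml =
    N'<doc><customer id="1" name="Joe"/><customer id="2" name="Ann"/></doc>';

-- Shred once into a temp table; it carries real row counts and statistics,
-- unlike the fixed estimate of the XML Reader TVF.
CREATE TABLE #cust (CustID int NOT NULL PRIMARY KEY, CName nvarchar(50) NULL);

INSERT INTO #cust (CustID, CName)
SELECT c.value('@id', 'int'), c.value('@name', 'nvarchar(50)')
FROM @x.nodes('/doc/customer') AS N(c);

-- The join is now costed against the actual number of shredded rows.
SELECT cu.*
FROM Customer AS cu
JOIN #cust AS t ON cu.ID = t.CustID;

DROP TABLE #cust;
```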
Optimal Use Of Methods
Avoiding multiple method evaluations
Use subqueries
   • BAD:
     SELECT CASE isnumeric (doc.value(
       '(/doc/customer/order/price)[1]', 'nvarchar(32)'))
      WHEN 1 THEN doc.value(
       '(/doc/customer/order/price)[1]', 'decimal(5,2)')
      ELSE 0 END
     FROM T
   • GOOD:
     SELECT CASE isnumeric (Price)
       WHEN 1 THEN CAST(Price as decimal(5,2))
       ELSE 0 END
     FROM (SELECT doc.value(
             '(/doc/customer/order/price)[1]',
             'nvarchar(32)') as Price FROM T) X

Use subqueries also with NULLIF()
Combined SQL And XQuery/DML Processing
         SELECT x.query('…'), y FROM T WHERE …

Static Phase
  1. SQL Parser and XQuery Parser
  2. Static Typing (SQL and XQuery), using Metadata and the XML Schema Collection
  3. Algebrization (SQL and XQuery)
  4. Static Optimization of combined Logical and Physical Operation Tree
Dynamic Phase
  5. Runtime Optimization and Execution of physical Op Tree, using XML and rel. Indices
New XQuery Algebra Operators
XML Reader TVF
Table-Valued Function XML Reader UDF with XPath Filter
Used if no Primary XML Index is present
Creates node table rowset in query flow
Multiple XPath filters can be pushed in to reduce node table
to subtree
Base cardinality estimate is always 10,000 rows!
Some adjustment based on pushed path filters

XMLReader node table format example (simplified)

 ID      TAG ID      Node      Type-ID         VALUE            HID
 1.3.1   4 (title)   Element   2 (xs:string)   Bad Bugs         #title#section#book
New XQuery Algebra Operators
UDX

• Serializer UDX
  serializes the query result as XML
• XQuery String UDX
  evaluates the XQuery string() function
• XQuery Data UDX
  evaluates the XQuery data() function
• Check UDX
  validates XML being inserted

•   UDX name visible in SSMS properties window
Optimal Use Of XQuery
Atomization of nodes
Value comparisons, XQuery casts and value() method
casts require atomization of item
  • attribute:
    /person[@age = 42]  →  /person[data(@age) = 42]
  • Atomic typed element:
    /person[age = 42]  →  /person[data(age) = 42]
  • Untyped, mixed content typed element (adds UDX):
    /person[age = 42]  →  /person[data(age) = 42]
                       →  /person[string(age) = 42]
  • If only one text node for untyped element (better):
    /person[age/text() = 42]  →  /person[data(age/text()) = 42]
  • value() method on untyped elements:
    value('/person/age', 'int')  →  value('/person/age/text()', 'int')

string() aggregates all text nodes, prohibits index use
Optimal Use Of XQuery
Casting Values
Value comparisons require casts and type promotion
  • Untyped attribute:
    /person[@age = 42]  →  /person[xs:decimal(@age) = 42]
  • Untyped text node():
    /person[age/text() = 42]  →  /person[xs:decimal(age/text()) = 42]
  • Typed element (typed as xs:int):
    /person[salary = 3e4]  →  /person[xs:double(salary) = 3e4]

Casting is expensive and prohibits index lookup

Tips to avoid casting
  • Use appropriate types for comparison (string for untyped)
  • Use schema to declare type
Optimal Use Of XQuery
Maximize XPath expressions

Single paths are more efficient than twig paths
Avoid predicates in the middle of path expressions
    /book[@ISBN = "1-8610-0157-6"]/author[first-name = "Davis"]
    /book[@ISBN = "1-8610-0157-6"] "∩"
    /book/author[first-name = "Davis"]

Move ordinals to the end of path expressions
  • Make sure you get the same semantics!
  • /a[1]/b[1] ≠ (/a/b)[1] ≠ /a/b[1]
  • (/book/@isbn)[1] is better than /book[1]/@isbn
Optimal Use Of XQuery
Maximize XPath expressions in exist()
Use context item in predicate to lengthen path in exist()
   • Existential quantification makes returned node irrelevant

• BAD:
     SELECT * FROM docs WHERE 1 = xCol.exist
       ('/book/subject[text() = "security"]')
• GOOD:
     SELECT * FROM docs WHERE 1 = xCol.exist
       ('/book/subject/text()[. = "security"]')
• BAD:
     SELECT * FROM docs WHERE 1 = xCol.exist
       ('/book[@price > 9.99 and @price < 49.99]')
• GOOD:
     SELECT * FROM docs WHERE 1 = xCol.exist
       ('/book/@price[. > 9.99 and . < 49.99]')

This does not work with or-predicate
Optimal Use Of XQuery
Inefficient operations: Parent axis

Most frequent offender: parent axis with nodes()

• BAD:
  select o.value('../@id', 'int') as CustID
       , o.value('@id', 'int') as OrdID
  from T
  cross apply x.nodes('/doc/customer/orders') as N(o)

• GOOD:
  select c.value('@id', 'int') as CustID
       , o.value('@id', 'int') as OrdID
  from T cross apply x.nodes('/doc/customer') as N1(c)
         cross apply c.nodes('orders') as N2(o)
Optimal Use Of XQuery
Inefficient operations
Avoid descendant axes and // in the middle of path
expressions if the data structure is known.
  • // still can use the HID lookup, but is less efficient

XQuery construction performs worse than FOR XML
  • BAD:
     SELECT notes.query('
       <Customer cid="{sql:column(''cid'')}">{
         <name>{sql:column("name")}</name>, /
       }</Customer>')
     FROM Customers WHERE cid=1
  • GOOD:
     SELECT cid as "@cid", name, notes as "*"
     FROM Customers WHERE cid=1
     FOR XML PATH('Customer'), TYPE
Optimal Use Of FOR XML
Use TYPE directive when assigning result to XML
  • BAD:
    declare @x xml;
    set @x =
         (select * from Customers for xml raw);
  • GOOD:
    declare @x xml;
    set @x =
         (select * from Customers for xml raw,
          type);

Use FOR XML PATH for complex grouping and additional
hierarchy levels over FOR XML EXPLICIT

Use FOR XML EXPLICIT for complex nesting if FOR XML PATH
performance is not appropriate

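
For the FOR XML PATH recommendation, an extra hierarchy level can be sketched with a nested subquery. The tables `Customers(cid, name)` and `Orders(oid, cid, total)` are hypothetical.

```sql
SELECT c.cid  AS "@cid",            -- becomes an attribute of <Customer>
       c.name AS "name",            -- becomes a child element
       (SELECT o.oid   AS "@id",
               o.total AS "total"
        FROM Orders AS o
        WHERE o.cid = c.cid
        FOR XML PATH('order'), TYPE) AS "orders"  -- nested xml, wrapped in <orders>
FROM Customers AS c
FOR XML PATH('Customer'), ROOT('Customers'), TYPE;
```

The inner `TYPE` directive matters: without it the nested result is returned as a string and would be entitized instead of embedded as XML.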
XML Indices
Create XML index on XML column
        CREATE PRIMARY XML INDEX idx_1 ON docs (xDoc)
Create secondary indexes on tags, values, paths
Creation:
  • Single-threaded only for primary XML index
  • Multi-threaded for secondary XML indexes
Uses:
  •     Primary Index will always be used if defined (not a cost
        based decision)
  •     Results can be served directly from index
  •     SQL’s cost based optimizer will consider secondary indexes
Maintenance:
  •     Primary and Secondary Indices will be efficiently maintained
        during updates
  •     Only subtree that changes will be updated
  •     No online index rebuild 
  •     Clustered key may lead to non-linear maintenance cost 
Schema revalidation still checks whole instance
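
The secondary indexes mentioned above are created on top of the primary XML index. A sketch, assuming the primary index `idx_1` on `docs (xDoc)` from this slide already exists; the secondary index names are made up.

```sql
-- One secondary index per access pattern; create only the ones the workload needs.
CREATE XML INDEX idx_1_path  ON docs (xDoc) USING XML INDEX idx_1 FOR PATH;
CREATE XML INDEX idx_1_value ON docs (xDoc) USING XML INDEX idx_1 FOR VALUE;
CREATE XML INDEX idx_1_prop  ON docs (xDoc) USING XML INDEX idx_1 FOR PROPERTY;
```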
Example Index Contents

insert into Person values (42,
'<book ISBN="1-55860-438-3">
    <section>
      <title>Bad Bugs</title>
      Nobody loves bad bugs.
    </section>
    <section>
      <title>Tree Frogs</title>
      All right-thinking people
      <bold>love</bold> tree frogs.
    </section>
</book>')
Primary XML Index
 CREATE PRIMARY XML INDEX PersonIdx ON Person (Pdesc)
PK   XID     TAG ID        Node        Type-ID         VALUE                       HID
42   1       1 (book)      Element     1 (bookT)       null                        #book
42   1.1     2 (ISBN)      Attribute   2 (xs:string)   1-55860-438-3               #@ISBN#book
42   1.3     3 (section)   Element     3 (sectionT)    null                        #section#book
42   1.3.1   4 (title)     Element     2 (xs:string)   Bad Bugs                    #title#section#book
42   1.3.3   --            Text        --              Nobody loves bad bugs.      #text()#section#book
42   1.5     3 (section)   Element     3 (sectionT)    null                        #section#book
42   1.5.1   4 (title)     Element     2 (xs:string)   Tree Frogs                  #title#section#book
42   1.5.3   --            Text        --              All right-thinking people   #text()#section#book
42   1.5.5   7 (bold)      Element     4 (boldT)       love                        #bold#section#book
42   1.5.7   --            Text        --              tree frogs                  #text()#section#book

 Assumes typed data; Columns and Values are simplified, see VLDB 2004 paper for details
Secondary XML Indices

[Figure: the XML column in table T(id, x), each row holding Binary XML, maps to the Primary XML Index (1 per XML column), clustered on Primary Key (of table T), XID, with columns PK, XID, NID, TID, VALUE, LVALUE, HID, xsinil, … Non-clustered Secondary Indices (n per primary Index) are built on it: Value Index, Property Index, Path Index.]
XQueries And XML
Indices
demo




Takeaway: XML Indices

PRIMARY XML Index – Use when lots of XQuery
FOR VALUE – Useful for queries where values are
more selective than paths such as
//*[.="Seattle"]
FOR PATH – Useful for Path expressions: avoids
joins by mapping paths to hierarchical index
(HID) numbers. Example: /person/address/zip
FOR PROPERTY – Useful when optimizer chooses
other index (for example, on relational column,
or FT Index) in addition so row is already known



Shredding Approaches
Approach                                Complex Shapes    Bulkload   Server vs Midtier   Business logic                               Programming                                     Scale/Performance
SQLXML Bulkload with annotated schema   Yes with limits   Yes        midtier             staging tables on server, XSLT on midtier    annotated XSD and small API                     very good/very good
ADO.Net DataSet                         No                No         midtier             midtier, SSIS                                DataSet API or SSIS                             good/good
CLR Table-valued function               Yes               No         Server or midtier   Server or midtier                            C#, VB custom code                              limited/good
OpenXML                                 Yes               No         Server              T-SQL                                        declarative T-SQL, XPath against variable       limited/good
nodes()                                 Yes               No         Server              T-SQL                                        declarative SQL, XQuery against var or table    good/careful
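
The two server-side T-SQL approaches in the table can be compared on the same input. A minimal sketch with made-up data:

```sql
DECLARE @x xml =
    N'<doc><customer id="1" name="Joe"/><customer id="2" name="Ann"/></doc>';

-- OpenXML: parse into an in-memory document, query it, then release the handle.
DECLARE @h int;
EXEC sp_xml_preparedocument @h OUTPUT, @x;

SELECT id, name
FROM OPENXML(@h, '/doc/customer', 1)      -- flag 1 = attribute-centric mapping
     WITH (id int, name nvarchar(50));

EXEC sp_xml_removedocument @h;            -- always free the parsed document

-- nodes(): same rows, expressed with XQuery directly against the variable.
SELECT c.value('@id',   'int')          AS id,
       c.value('@name', 'nvarchar(50)') AS name
FROM @x.nodes('/doc/customer') AS N(c);
```

OpenXML only works against a prepared document handle, while `nodes()` can also be applied to an XML column with CROSS APPLY.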
To Promote or Not Promote…
Promotion pre-calculates paths
Requires relational query
•    XQuery does not know about promotion

Promotion during loading of the data
•    Using any of the shredding mechanisms
•    1-to-1 or 1-to-many relationships

Promotion using computed columns
•    1-to-1 only
•    Persist computed column: Fast lookup and retrieval
•    Relational index on persisted computed column: Fast lookup
•    Have to be precise

Promotion using Triggers
•    1-to-1 or 1-to-many relationships
•    Trigger overhead

Relational View over XML data
•    Filters on relational view are not pushed down due to different type/value system
Promotion using computed columns
Use a schema-bound UDF that encapsulates XQuery

Persist computed column
 •   Fast lookup and retrieval


Relational index on persisted computed column
 •   Fast lookup


Query will have to use the schema-bound UDF to match

CAVEAT: No parallel plans with a persisted computed
column based on a UDF


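
A sketch of promotion through a persisted computed column. The function name `dbo.udf_GetIsbn`, the column `Isbn`, the index name, and the table `dbo.docs(xCol xml)` are assumptions for illustration.

```sql
-- Schema-bound scalar UDF that encapsulates the XQuery.
CREATE FUNCTION dbo.udf_GetIsbn (@doc xml)
RETURNS nvarchar(20)
WITH SCHEMABINDING
AS
BEGIN
    RETURN @doc.value('(/book/@ISBN)[1]', 'nvarchar(20)');
END;
GO

-- Promote the value into a persisted computed column and index it relationally.
ALTER TABLE dbo.docs ADD Isbn AS dbo.udf_GetIsbn(xCol) PERSISTED;
GO
CREATE INDEX IX_docs_Isbn ON dbo.docs (Isbn);
GO

-- The query uses the same UDF expression so it can be matched to the column.
SELECT * FROM dbo.docs
WHERE dbo.udf_GetIsbn(xCol) = N'1-55860-438-3';
```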
Use of Full-Text Index for Optimization

 Can provide improvement for XQuery contains() queries

 Query for documents where section title contains “optimization”

 Use Fulltext index to prefilter candidates (includes false positives)




 SELECT * FROM docs
 WHERE contains(xCol, 'optimization')
 AND 1 = xCol.exist('
 /book/section/title/text()[contains(.,"optimization")]
 ')
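
The prefilter only helps if a full-text index exists on the XML column. A sketch of the setup, assuming `docs` has a unique, single-column, non-nullable key index; the names `PK_docs` and `ftDocs` are hypothetical.

```sql
-- Full-text catalog to hold the index (name is illustrative).
CREATE FULLTEXT CATALOG ftDocs AS DEFAULT;

-- Full-text index over the xml column; element content is indexed,
-- markup is not, which is why false positives remain possible.
CREATE FULLTEXT INDEX ON docs (xCol)
    KEY INDEX PK_docs
    ON ftDocs;
```

The `exist()` predicate in the query then removes the false positives, for example documents where "optimization" occurs outside a section title.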
Futures: Selective XML Index
CREATE SELECTIVE XML INDEX pxi_index ON Tbl(xmlcol)
FOR (
-- the first four match XQuery predicates
-- in all XML data type methods

-- simple flavor - default mapping (xs:untypedAtomic),
-- no optimization hints
node42 = '/a/b',
pathatc = '/a/b/c/@atc',

-- advanced flavor - use of optimization hints
path02 = '/a/b/c' as XQUERY 'xs:string' MAXLENGTH(25),
node13 = '/a/b/d' as XQUERY 'xs:double' SINGLETON,

-- the next two match value() method
-- require regular SQL Server type semantics
-- they can be mixed with the XQUERY ones
-- specifying a type is mandatory for the SQL type semantics

pathfloat = '/a/b/c' as SQL FLOAT,
pathabd = '/a/b/d' as SQL VARCHAR(200)
)
Session Takeaways

• Understand when and how
  to use XML in SQL Server
• Understand and correct common
  performance problems with XML and
  XQuery
• Shred “relational” XML to relations
• Use XML datatype for semistructured
  and markup scenarios
• Write your XQueries so that XML
  Indices can be used
• Use persisted computed columns to
  promote XQuery results (with caveat)
Related Content
Optimization whitepapers
http://msdn2.microsoft.com/en-us/library/ms345118.aspx
http://msdn2.microsoft.com/en-us/library/ms345121.aspx
General XML and Databases whitepapers
http://msdn2.microsoft.com/en-us/xml/bb190603.aspx
Online WebCasts
http://www.microsoft.com/events/series/msdnsqlserver2005.mspx#SQLXML
Newsgroups & Forum:
microsoft.public.sqlserver.xml
http://communities.microsoft.com/newsgroups/default.asp?ICP=sqlserver2005&sLCID=us
http://forums.microsoft.com/msdn/ShowForum.aspx?ForumID=89

My E-mail: mrys@microsoft.com
My Weblog: http://sqlblog.com/blogs/michael_rys/
Thank you
for attending this session and the
2011 PASS Summit in Seattle





SQLPASS AD501-M XQuery MRys

  • 1.
    Best Practices and PerformanceTuning of XML Queries in SQL Server AD-501-M Michael Rys Principal Program Manager Microsoft Corp mrys@microsoft.com @SQLServerMike October 11-14, Seattle, WA
  • 2.
    Session Objectives • Understandwhen and how to use XML in SQL Server • Understand and correct common performance problems with XML and XQuery
  • 3.
    Session Agenda XML Scenariosand when to store XML XML Design Optimizations General Optimizations XML Datatype method Optimizations XQuery Optimizations XML Index Optimizations AD-501-M| XQuery Performance 3
  • 4.
  • 5.
    XML Scenarios Data Exchangebetween loosely-coupled systems • XML is ubiquitous, extensible, platform independent transport format • Message Envelope in XML Simple Object Access Protocol (SOAP), RSS, REST • Message Payload/Business Data in XML • Vertical Industry Exchange schemas Document Management • XHTML, DocBook, Home-grown, domain-specific markup (e.g. contracts), OpenOffice, Microsoft Office XML (both default and user- extended) Ad-hoc modeling of semistructured data • Storing and querying heterogeneous complex objects • Semistructured data with sparse, highly-varying structure at the instance level • XML provides self-describing format and extensible schemas →Transport, Store, and Query XML data AD-501-M| XQuery Performance 5
  • 6.
    Decision Tree: ProcessingXML In SQL Server Does the data fit Shred the XML the relational Yes into relations model? No structured Known sparse Shred the structured XML into relations, store Shred known Is the data semi- semistructured aspects sparse data into structured? Yes as XML and/or sparse sparse columns col No Open schema Is the XML Promote Yes Is the data a Search within constrainedthe Query into by frequently queried document? the XML? XML? properties Yes schemas? relationally No Yes Use primary and Constrain XML if Store as Define a full-text secondary XML validation XML is Store as cost varbinary(max) index indexes as ok AD-501-M| needed 6 XQuery Performance
  • 7.
    SQL Server XMLData Type Architecture XML Relational XML XML Parser XML Schemata Schema Validation Collection OpenXML/nodes() PATH XML-DML XML data type Rowsets Index (binary XML) PRIMARY Node Table PROP XML INDEX with FOR XML Index TYPE directive VALUE XQuery Index AD-501-M| XQuery Performance 7
  • 8.
    General Impacts Concurrency Control • Locks on both XML data type and relevant rows in primary and secondary XML Indices • Lock escalation on indices • Snapshot Isolation reduces locks and lock contention Transaction Logs • Bulkinsert into XML Indices may fill transaction log • Delay the creation of the XML indexes and use the SIMPLE recovery model • Preallocate database file instead of dynamically growing • Place log on different disk In-Row/Out-of-Row of XML large object • Moving XML into side table or out-of-row if mixed with relational data reduces scan time Due to clustering, insertion into XML Index may not be linear • Chose integer/bigint identity column as key AD-501-M| XQuery Performance 8
  • 9.
    Choose The RightXML Model • Element-centric versus attribute-centric <Customer><name>Joe</name></Customer> <Customer name="Joe" /> +: Attributes often better performing querying –: Parsing Attributes uniqueness check • Generic element names with type attribute vs Specific element names <Entity type="Customer"> <Prop type="Name">Joe</Prop> </Entity> <Customer><name>Joe</name></Customer> +: Specific names shorter path expressions +: Specific names no filter on type attribute /Entity[@type="Customer"]/Prop[@type="Name"] vs /Customer/name • Wrapper elements <Orders><Order id="1"/></Orders> +: No wrapper elements smaller XML, shorter path expressions AD-501-M| XQuery Performance 9
  • 10.
    Use an XML Schema Collection?
    Using no XML Schema (untyped XML)
    • Can still use XQuery and XML Index!!!
    • Atomic values are always weakly typed strings: compare as strings to avoid runtime conversions and loss of index usage
    • No schema validation overhead
    • No schema evolution revalidation costs
    XML Schema provides structural information
    • Atomic typed elements now use only one instead of two rows in node table/XML index (closer to attributes)
    • Static typing can detect cardinality and feasibility of expression
    XML Schema provides semantic information
    • Elements/attributes have correct atomic type for comparison and order semantics
    • No runtime casts required and better use of index for value lookup
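    A minimal sketch of typed XML, assuming a hypothetical schema collection PersonSchema and table dbo.People (neither appears in the deck). It illustrates the point that a declared type (here xs:int on @age) lets a numeric comparison run without a runtime cast.

```sql
-- Hypothetical names: PersonSchema, dbo.People.
CREATE XML SCHEMA COLLECTION PersonSchema AS
N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="person">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="age" type="xs:int"/>
      </xs:complexType>
    </xs:element>
  </xs:schema>';
GO
CREATE TABLE dbo.People (
    id int IDENTITY(1,1) PRIMARY KEY,
    p  xml(DOCUMENT PersonSchema)   -- DOCUMENT: exactly one top-level element
);
GO
-- Validated on insert; an instance that violates the schema is rejected.
INSERT INTO dbo.People (p) VALUES (N'<person age="42"><name>Joe</name></person>');

-- @age is statically typed as xs:int, so it is compared as a number.
SELECT id FROM dbo.People WHERE p.exist('/person[@age = 42]') = 1;
```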
  • 11.
    XQuery Methods
    query() creates new, untyped XML data type instance
    exist() returns 1 if the XQuery expression returns at least one item, 0 otherwise
    value() extracts an XQuery value into the SQL value and type space
    • Expression has to statically be a singleton
    • String value of atomized XQuery item is cast to SQL type
    • SQL type has to be SQL scalar type (no XML or CLR UDT)
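    As a small illustration of the three methods, the sketch below runs them against an XML variable. The sample document is a shortened form of the book example used later in the deck; the column aliases are made up for this example.

```sql
DECLARE @x xml =
    N'<book ISBN="1-55860-438-3"><section><title>Bad Bugs</title></section></book>';

SELECT
    -- query(): returns the selected nodes as a new untyped XML instance
    @x.query('/book/section/title')                              AS TitleXml,
    -- exist(): 1 if the expression returns at least one item, 0 otherwise
    @x.exist('/book[@ISBN = "1-55860-438-3"]')                   AS HasIsbn,
    -- value(): the trailing [1] makes the expression a static singleton
    @x.value('(/book/section/title/text())[1]', 'nvarchar(50)')  AS FirstTitle;
```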
  • 12.
    XQuery: nodes()
    Returns a row per selected node as a special XML data type instance
    • Preserves the original structure and types
    • Can only be used with the XQuery methods (but not modify()), count(*), and IS (NOT) NULL
    Appears as Table-valued Function (TVF) in query plan if no index present
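    A short sketch of nodes() shredding an XML variable into rows. The document shape (/doc/customer with id and name attributes) follows the examples used elsewhere in the deck; the data values are invented.

```sql
DECLARE @x xml = N'<doc>
  <customer id="1" name="Joe"/>
  <customer id="2" name="Ann"/>
</doc>';

-- One row per <customer>; c is the context node for the value() calls.
SELECT c.value('@id',   'int')          AS CustID,
       c.value('@name', 'nvarchar(50)') AS CName
FROM @x.nodes('/doc/customer') AS N(c);
```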
  • 13.
    sql:column()/sql:variable()
    Map SQL value and type into XQuery values and types in context of XQuery or XML-DML
    • sql:variable(): accesses a SQL variable/parameter
        declare @value int
        set @value=42
        select * from T
        where T.x.exist('/a/b[@id=sql:variable("@value")]')=1
    • sql:column(): accesses another column value
        tables: T([key] int, x xml), S([key] int, val int)
        select * from T join S on T.[key]=S.[key]
        where T.x.exist('/a/b[@id=sql:column("S.val")]')=1
    • Restrictions in SQL Server: No XML, CLR UDT, datetime, or deprecated text/ntext/image
  • 14.
    Improving Slow XQueries, Bad FOR XML
    demo
  • 15.
    Optimal Use Of Methods
    How to Cast from XML to SQL
    BAD:
        CAST( CAST(xmldoc.query('/a/b/text()') as nvarchar(500)) as int)
    GOOD:
        xmldoc.value('(/a/b/text())[1]', 'int')
    BAD:
        node.query('.').value('@attr', 'nvarchar(50)')
    GOOD:
        node.value('@attr', 'nvarchar(50)')
  • 16.
    Optimal Use Of Methods
    Grouping value() method
    Group value() methods on same XML instance next to each other if the path expressions in the value() methods are
    • Simple path expressions that only use child and attribute axis and do not contain wildcards, predicates, node tests, ordinals
    • The path expressions infer statically a singleton
    The singleton can be statically inferred from
    • the DOCUMENT and XML Schema Collection
    • Relative paths on the context node provided by the nodes() method
    Requires XML index to be present
  • 17.
    Optimal Use of Methods
    Using the right method to join and compare
    Use exist() method, sql:column()/sql:variable() and an XQuery comparison for checking for a value or joining if secondary XML indices present
    BAD:*
        select doc from doc_tab join authors
        on doc.value('(/doc/mainauthor/lname/text())[1]', 'nvarchar(50)') = lastname
    GOOD:
        select doc from doc_tab join authors
        on 1 = doc.exist('/doc/mainauthor/lname/text()[. = sql:column("lastname")]')
    * If applied on XML variable/no index present, value() method is most of the time more efficient
  • 18.
    Optimal Use of Methods
    Avoiding bad costing with nodes()
    nodes() without XML index is a Table-valued function (details later)
    Bad cardinality estimates can lead to bad plans
    • BAD:
        select c.value('@id', 'int') as CustID
             , c.value('@name', 'nvarchar(50)') as CName
        from Customer, @x.nodes('/doc/customer') as N(c)
        where Customer.ID = c.value('@id', 'int')
    • BETTER (if only one wrapper doc element):
        select c.value('@id', 'int') as CustID
             , c.value('@name', 'nvarchar(50)') as CName
        from Customer, @x.nodes('/doc[1]') as D(d)
        cross apply d.nodes('customer') as N(c)
        where Customer.ID = c.value('@id', 'int')
    Use temp table (insert into #temp select … from nodes()) or Table-valued parameter instead of XML to get better estimates
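    The temp-table suggestion could look roughly like the following. The Customer table with an ID column comes from the slide's own example; the temp table name #cust and the sample XML are assumptions made for illustration. Once the rows sit in a temp table, the optimizer can use real row counts and statistics rather than the fixed estimate for the XML Reader TVF.

```sql
DECLARE @x xml =
    N'<doc><customer id="1" name="Joe"/><customer id="2" name="Ann"/></doc>';

-- Hypothetical staging table for the shredded rows.
CREATE TABLE #cust (CustID int PRIMARY KEY, CName nvarchar(50));

-- Shred once.
INSERT INTO #cust (CustID, CName)
SELECT c.value('@id', 'int'), c.value('@name', 'nvarchar(50)')
FROM @x.nodes('/doc/customer') AS N(c);

-- Join against the staged rows instead of against nodes() directly.
SELECT t.CustID, t.CName
FROM Customer AS cu
JOIN #cust AS t ON cu.ID = t.CustID;

DROP TABLE #cust;
```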
  • 19.
    Optimal Use Of Methods
    Avoiding multiple method evaluations
    Use subqueries
    • BAD:
        SELECT CASE isnumeric (doc.value('(/doc/customer/order/price)[1]', 'nvarchar(32)'))
               WHEN 1 THEN doc.value('(/doc/customer/order/price)[1]', 'decimal(5,2)')
               ELSE 0 END
        FROM T
    • GOOD:
        SELECT CASE isnumeric (Price)
               WHEN 1 THEN CAST(Price as decimal(5,2))
               ELSE 0 END
        FROM (SELECT doc.value('(/doc/customer/order/price)[1]', 'nvarchar(32)') as Price
              FROM T) X
    Use subqueries also with NULLIF()
  • 20.
    Combined SQL And XQuery/DML Processing
        SELECT x.query('…'), y FROM T WHERE …
    [Figure: processing pipeline]
    Static Phase
    • SQL Parser and XQuery Parser
    • Static Typing (uses Metadata and the XML Schema Collection)
    • Algebrization
    • Static Optimization of combined Logical and Physical Operation Tree
    Dynamic Phase
    • Runtime Optimization and Execution of physical Op Tree (uses XML and rel. Indices)
  • 21.
    New XQuery Algebra Operators
    XML Reader TVF
    Table-Valued Function XML Reader UDF with XPath Filter
    Used if no Primary XML Index is present
    Creates node table rowset in query flow
    Multiple XPath filters can be pushed in to reduce node table to subtree
    Base cardinality estimate is always 10,000 rows!
    • Some adjustment based on pushed path filters
    XMLReader node table format example (simplified)
        ID      TAG ID     Node     Type-ID        VALUE     HID
        1.3.1   4 (TITLE)  Element  2 (xs:string)  Bad Bugs  #title#section#book
  • 22.
    New XQuery Algebra Operators
    UDX
    • Serializer UDX serializes the query result as XML
    • XQuery String UDX evaluates the XQuery string() function
    • XQuery Data UDX evaluates the XQuery data() function
    • Check UDX validates XML being inserted
    • UDX name visible in SSMS properties window
  • 23.
    Optimal Use Of XQuery
    Atomization of nodes
    Value comparisons, XQuery casts and value() method casts require atomization of item
    • Attribute:
        /person[@age = 42] → /person[data(@age) = 42]
    • Atomic typed element:
        /person[age = 42] → /person[data(age) = 42]
    • Untyped, mixed content typed element (adds UDX):
        /person[age = 42] → /person[data(age) = 42] → /person[string(age) = 42]
    • If only one text node for untyped element (better):
        /person[age/text() = 42] → /person[data(age/text()) = 42]
    • value() method on untyped elements:
        value('/person/age', 'int') → value('/person/age/text()', 'int')
    string() aggregates all text nodes, prohibits index use
  • 24.
    Optimal Use Of XQuery
    Casting Values
    Value comparisons require casts and type promotion
    • Untyped attribute:
        /person[@age = 42] → /person[xs:decimal(@age) = 42]
    • Untyped text node():
        /person[age/text() = 42] → /person[xs:decimal(age/text()) = 42]
    • Typed element (typed as xs:int):
        /person[salary = 3e4] → /person[xs:double(salary) = 3e4]
    Casting is expensive and prohibits index lookup
    Tips to avoid casting
    • Use appropriate types for comparison (string for untyped)
    • Use schema to declare type
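    To make the "string for untyped" tip concrete, the sketch below compares an untyped attribute first with a numeric literal and then with a string literal. The sample XML is invented. Note that the two predicates are not strictly equivalent: a stored value such as "042" would match the numeric comparison but not the string one, so the string form only fits when the lexical form of the data is known.

```sql
DECLARE @x xml = N'<person age="42"><name>Joe</name></person>';

SELECT
    -- Numeric literal: the untyped @age has to be cast at runtime.
    @x.exist('/person[@age = 42]')   AS NumericCompare,
    -- String literal: compared as a string, no cast required.
    @x.exist('/person[@age = "42"]') AS StringCompare;
```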
  • 25.
    Optimal Use Of XQuery
    Maximize XPath expressions
    Single paths are more efficient than twig paths
    Avoid predicates in the middle of path expressions
        book[@ISBN = "1-8610-0157-6"]/author[first-name = "Davis"]
        /book[@ISBN = "1-8610-0157-6"] "∩" /book/author[first-name = "Davis"]
    Move ordinals to the end of path expressions
    • Make sure you get the same semantics!
    • /a[1]/b[1] ≠ (/a/b)[1] ≠ /a/b[1]
    • (/book/@isbn)[1] is better than /book[1]/@isbn
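    The warning about ordinal semantics can be checked with a small experiment. The XML fragment below is invented so that the three expressions return different results; the column aliases are only labels.

```sql
-- The first <a> is empty, the second has two <b> children, the third has one.
DECLARE @x xml = N'<a/><a><b>1</b><b>2</b></a><a><b>3</b></a>';

SELECT
    @x.query('/a[1]/b[1]') AS FirstBOfFirstA,  -- first <a> has no <b>: empty result
    @x.query('(/a/b)[1]')  AS FirstBOverall,   -- first <b> in document order: <b>1</b>
    @x.query('/a/b[1]')    AS FirstBOfEachA;   -- first <b> under every <a>: <b>1</b><b>3</b>
```

    Moving an ordinal to the end of a path is therefore only safe when the data guarantees that both forms select the same node.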
  • 26.
    Optimal Use Of XQuery
    Maximize XPath expressions in exist()
    Use context item in predicate to lengthen path in exist()
    • Existential quantification makes returned node irrelevant
    • BAD:
        SELECT * FROM docs WHERE 1 = xCol.exist('/book/subject[text() = "security"]')
    • GOOD:
        SELECT * FROM docs WHERE 1 = xCol.exist('/book/subject/text()[. = "security"]')
    • BAD:
        SELECT * FROM docs WHERE 1 = xCol.exist('/book[@price > 9.99 and @price < 49.99]')
    • GOOD:
        SELECT * FROM docs WHERE 1 = xCol.exist('/book/@price[. > 9.99 and . < 49.99]')
    This does not work with or-predicate
  • 27.
    Optimal Use Of XQuery
    Inefficient operations: Parent axis
    Most frequent offender: parent axis with nodes()
    • BAD:
        select o.value('../@id', 'int') as CustID
             , o.value('@id', 'int') as OrdID
        from T cross apply x.nodes('/doc/customer/orders') as N(o)
    • GOOD:
        select c.value('@id', 'int') as CustID
             , o.value('@id', 'int') as OrdID
        from T cross apply x.nodes('/doc/customer') as N1(c)
        cross apply c.nodes('orders') as N2(o)
  • 28.
    Optimal Use Of XQuery
    Inefficient operations
    Avoid descendant axes and // in the middle of path expressions if the data structure is known.
    • // still can use the HID lookup, but is less efficient
    XQuery construction performs worse than FOR XML
    • BAD:
        SELECT notes.query('
          <Customer cid="{sql:column(''cid'')}">{
            <name>{sql:column("name")}</name>, /
          }</Customer>')
        FROM Customers WHERE cid=1
    • GOOD:
        SELECT cid as "@cid", name, notes as "*"
        FROM Customers WHERE cid=1
        FOR XML PATH('Customer'), TYPE
  • 29.
    Optimal Use Of FOR XML
    Use TYPE directive when assigning result to XML
    • BAD:
        declare @x xml;
        set @x = (select * from Customers for xml raw);
    • GOOD:
        declare @x xml;
        set @x = (select * from Customers for xml raw, type);
    Use FOR XML PATH for complex grouping and additional hierarchy levels over FOR XML EXPLICIT
    Use FOR XML EXPLICIT for complex nesting if FOR XML PATH performance is not appropriate
  • 30.
    XML Indices
    Create XML index on XML column
        CREATE PRIMARY XML INDEX idx_1 ON docs (xDoc)
    Create secondary indexes on tags, values, paths
    Creation:
    • Single-threaded only for primary XML index
    • Multi-threaded for secondary XML indexes
    Uses:
    • Primary Index will always be used if defined (not a cost based decision)
    • Results can be served directly from index
    • SQL's cost based optimizer will consider secondary indexes
    Maintenance:
    • Primary and Secondary Indices will be efficiently maintained during updates
    • Only subtree that changes will be updated
    • No online index rebuild
    • Clustered key may lead to non-linear maintenance cost
    Schema revalidation still checks whole instance
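    The slide shows only the primary index statement. Secondary XML indexes are created on top of it with USING XML INDEX, as sketched below. This assumes the primary index idx_1 on docs (xDoc) from the slide already exists; the secondary index names are made up.

```sql
-- Each secondary index is built over the primary XML index idx_1.
CREATE XML INDEX idx_1_path  ON docs (xDoc) USING XML INDEX idx_1 FOR PATH;
CREATE XML INDEX idx_1_value ON docs (xDoc) USING XML INDEX idx_1 FOR VALUE;
CREATE XML INDEX idx_1_prop  ON docs (xDoc) USING XML INDEX idx_1 FOR PROPERTY;
```

    Because every secondary index adds maintenance cost on updates, it is usually worth creating only the kinds that match the workload's query patterns.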
  • 31.
    Example Index Contents
        insert into Person values (42,
        '<book ISBN="1-55860-438-3">
           <section>
             <title>Bad Bugs</title>
             Nobody loves bad bugs.
           </section>
           <section>
             <title>Tree Frogs</title>
             All right-thinking people
             <bold>love</bold>
             tree frogs.
           </section>
         </book>')
  • 32.
    Primary XML Index
        CREATE PRIMARY XML INDEX PersonIdx ON Person (Pdesc)

        PK  XID    TAG ID       Node       Type-ID        VALUE                      HID
        42  1      1 (book)     Element    1 (bookT)      null                       #book
        42  1.1    2 (ISBN)     Attribute  2 (xs:string)  1-55860-438-3              #@ISBN#book
        42  1.3    3 (section)  Element    3 (sectionT)   null                       #section#book
        42  1.3.1  4 (title)    Element    2 (xs:string)  Bad Bugs                   #title#section#book
        42  1.3.3  --           Text       --             Nobody loves bad bugs.     #text()#section#book
        42  1.5    3 (section)  Element    3 (sectionT)   null                       #section#book
        42  1.5.1  4 (title)    Element    2 (xs:string)  Tree frogs                 #title#section#book
        42  1.5.3  --           Text       --             All right-thinking people  #text()#section#book
        42  1.5.5  7 (bold)     Element    4 (boldT)      love                       #bold#section#book
        42  1.5.7  --           Text       --             tree frogs                 #text()#section#book

    Assumes typed data; columns and values are simplified, see VLDB 2004 paper for details
  • 33.
    Secondary XML Indices
    [Figure: table T(id, x) whose XML column x is stored as binary XML, next to its index structures]
    • Primary XML Index (1 per XML column), clustered on Primary Key (of table T), XID
      Columns: PK, XID, NID, TID, VALUE, LVALUE, HID, xsinil, …
    • Non-clustered Secondary Indices (n per primary Index): Value Index, Property Index, Path Index
  • 34.
    XQueries And XML Indices
    demo
  • 35.
    Takeaway: XML Indices
    • PRIMARY XML Index: Use when lots of XQuery
    • FOR VALUE: Useful for queries where values are more selective than paths such as //*[.="Seattle"]
    • FOR PATH: Useful for Path expressions: avoids joins by mapping paths to hierarchical index (HID) numbers. Example: /person/address/zip
    • FOR PROPERTY: Useful when optimizer chooses other index (for example, on relational column, or FT Index) in addition so row is already known
  • 36.
    Shredding Approaches

    Approach                   Complex Shapes   Bulkload  Server vs Midtier  Business logic                             Programming                                   Scale/Performance
    SQLXML Bulkload with       Yes with limits  Yes       midtier            staging tables on server, XSLT on midtier  annotated XSD and small API                   very good/very good
      annotated schema
    ADO.Net DataSet            No               No        midtier            midtier, SSIS                              DataSet API or SSIS                           good/good
    CLR Table-valued function  Yes              No        Server or midtier  Server or midtier                          C#, VB custom code                            limited/good
    OpenXML                    Yes              No        Server             T-SQL                                      declarative T-SQL, XPath against variable     limited/good
    nodes()                    Yes              No        Server             T-SQL                                      declarative SQL, XQuery against var or table  good/careful
  • 37.
    To Promote or Not Promote…
    Promotion pre-calculates paths
    Requires relational query
    • XQuery does not know about promotion
    Promotion during loading of the data
    • Using any of the shredding mechanisms
    • 1-to-1 or 1-to-many relationships
    Promotion using computed columns
    • 1-to-1 only
    • Persist computed column: Fast lookup and retrieval
    • Relational index on persisted computed column: Fast lookup
    • Have to be precise
    Promotion using Triggers
    • 1-to-1 or 1-to-many relationships
    • Trigger overhead
    Relational View over XML data
    • Filters on relational view are not pushed down due to different type/value system
  • 38.
    Promotion using computed columns
    Use a schema-bound UDF that encapsulates XQuery
    Persist computed column
    • Fast lookup and retrieval
    Relational index on persisted computed column
    • Fast lookup
    Query will have to use the schema-bound UDF to match
    CAVEAT: No parallel plans with a persisted computed column based on a UDF
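    A minimal sketch of this promotion pattern, assuming the docs table with XML column xCol used on other slides and the book/@ISBN shape from the index example. The function name dbo.udf_BookIsbn, the column name ISBN and the index name are hypothetical.

```sql
-- Schema-bound scalar UDF that wraps the XQuery (hypothetical name).
CREATE FUNCTION dbo.udf_BookIsbn (@x xml)
RETURNS nvarchar(20)
WITH SCHEMABINDING
AS
BEGIN
    RETURN @x.value('(/book/@ISBN)[1]', 'nvarchar(20)');
END;
GO
-- Promote the value into a persisted computed column (1-to-1 only).
ALTER TABLE docs ADD ISBN AS dbo.udf_BookIsbn(xCol) PERSISTED;
GO
-- A relational index on the promoted value gives fast lookup.
CREATE INDEX ix_docs_ISBN ON docs (ISBN);
GO
-- Query through the same UDF expression so the optimizer can match the computed column.
SELECT * FROM docs WHERE dbo.udf_BookIsbn(xCol) = N'1-55860-438-3';
```

    The caveat from the slide still applies: plans that touch a persisted computed column based on a UDF will not run in parallel.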
  • 39.
    Use of Full-Text Index for Optimization
    Can provide improvement for XQuery contains() queries
    Query for documents where section title contains "optimization"
        SELECT * FROM docs
        WHERE 1 = xCol.exist('/book/section/title/text()[contains(.,"optimization")]')
    Use Fulltext index to prefilter candidates (includes false positives)
        SELECT * FROM docs
        WHERE contains(xCol, 'optimization')
        AND 1 = xCol.exist('/book/section/title/text()[contains(.,"optimization")]')
  • 40.
    Futures: Selective XML Index
        CREATE SELECTIVE XML INDEX pxi_index ON Tbl(xmlcol)
        FOR (
          -- the first four match XQuery predicates
          -- in all XML data type methods
          -- simple flavor - default mapping (xs:untypedAtomic),
          -- no optimization hints
          node42 = '/a/b',
          pathatc = '/a/b/c/@atc',
          -- advanced flavor - use of optimization hints
          path02 = '/a/b/c' as XQUERY 'xs:string' MAXLENGTH(25),
          node13 = '/a/b/d' as XQUERY 'xs:double' SINGLETON,
          -- the next two match value() method
          -- require regular SQL Server type semantics
          -- they can be mixed with the XQUERY ones
          -- specifying a type is mandatory for the SQL type semantics
          pathfloat = '/a/b/c' as SQL FLOAT,
          pathabd = '/a/b/d' as SQL VARCHAR(200)
        )
  • 41.
    Session Takeaways
    • Understand when and how to use XML in SQL Server
    • Understand and correct common performance problems with XML and XQuery
    • Shred "relational" XML to relations
    • Use XML datatype for semistructured and markup scenarios
    • Write your XQueries so that XML Indices can be used
    • Use persisted computed columns to promote XQuery results (with caveat)
  • 42.
  • 43.
    Related Content
    Optimization whitepapers
    • http://msdn2.microsoft.com/en-us/library/ms345118.aspx
    • http://msdn2.microsoft.com/en-us/library/ms345121.aspx
    General XML and Databases whitepapers
    • http://msdn2.microsoft.com/en-us/xml/bb190603.aspx
    Online WebCasts
    • http://www.microsoft.com/events/series/msdnsqlserver2005.mspx#SQLXML
    Newsgroups & Forum:
    • microsoft.public.sqlserver.xml
    • http://communities.microsoft.com/newsgroups/default.asp?ICP=sqlserver2005&sLCID=us
    • http://forums.microsoft.com/msdn/ShowForum.aspx?ForumID=89
    My E-mail: mrys@microsoft.com
    My Weblog: http://sqlblog.com/blogs/michael_rys/
  • 45.
    Thank you for attending this session and the 2011 PASS Summit in Seattle