Best practices for XML performance in Db2

By observing certain best practices you can help to improve the performance of XML data that is stored in Db2 for z/OS®.

Choose the granularity of XML documents carefully

When you design your XML application, and your XML document structure in particular, you might have a choice to define which business data is kept together in a single XML document.

Begin general-use programming interface information.

For example, each XML document in the sample department data contains information for one department.

CREATE TABLE DEPT(UNITID CHAR(8), DEPTDOC XML);

Figure 1. Sample data for the DEPT table

unitID	deptdoc
WWPR	`<dept deptID="PR27"> <employee id="901"> <name>Jim Qu</name> <phone>408 555 1212</phone> </employee> <employee id="902"> <name>Peter Pan</name> <office>216</office> </employee> </dept>`
WWPR	`<dept deptID="V15"> <employee id="673"> <name>Matt Foreman</name> <phone>416 891 7301</phone> <office>216</office> </employee> <description>This dept supports sales world wide</description> </dept>`
S-USE	...
...	...

End general-use programming interface information.

This intermediate granularity is a reasonable choice if a department is the predominant granularity at which your application accesses and processes the data. Alternatively, you might decide to combine multiple departments into a single XML document, such as those that belong to one unit. This coarse granularity, however, is sub-optimal if your application typically processes only one department at a time.

You might also choose one XML document per employee with an additional "dept" attribute for each employee to indicate which department he or she belongs to. This fine granularity would be a very good choice if employees use business objects of interest, which are often accessed and processed independently from the other employees in the same department. However, if the application typically processes all employees in one department together, one XML document per department might be the better choice.

Use attributes and elements appropriately in XML

A common question related to XML document design, is when to use attributes instead of elements, and how that choice affects performance.

This question is much more relevant to data modeling than to performance. However, as a general rule, XML elements are more flexible than attributes because they can be repeated and nested.

For example, the department documents shown in the preceding example, use an element "phone" which allows multiple occurrences of "phone" for an employee who has multiple numbers. This design is also extensible in case we later need to break phone numbers into fragments, such as child elements for country code, area code, extension, and so on.

By contrast, if "phone" is an attribute of the employee element instead, it can exist only once per employee, and you could not add child elements. Such limitations might hinder future schema evolution.

Although you can probably model all of your data without using attributes, they can be a very intuitive choice for data items that are known not to repeat for a single element, nor have any sub-fields. Attributes can reduce the size of XML data slightly because they have only a single name-value pair, as opposed to elements, which have both a start tag and an end tag.

In Db2, you can use attributes in queries, predicates, and index definitions just as easily as elements. Because attributes are less extensible than elements, Db2 can apply certain storage and access optimizations. However, these advantages should be considered an extra performance bonus rather than an incentive to convert elements to attributes for the sake of performance, especially when data modeling considerations call for elements.

Be aware of XML schema validation overhead

XML schema validation is an optional activity during XML parsing. Performance studies have shown that XML parsing in general is significantly more CPU-intensive if schema validation is enabled.

This overhead can vary drastically depending on the structure and size of your XML documents, and particularly on the size and complexity of the XML Schema used. For example, you might find 50% higher CPU consumption because of schema validation with moderately complex schemas. Unless your XML inserts are heavily I/O bound, the increased CPU consumption typically translates to reduced insert throughput.

An XML schema defines the structure, elements and attributes, data types, and value ranges, that are allowed in a set of XML documents. Db2 allows you to validate XML documents against XML schemas. If you choose to validate documents, you typically do so at insert time. Validation ensures that data inserted into the database is compliant with the schema definition, meaning that you improve the integrity of data entering your tables.

Consider the impact to performance, when you determine whether your application needs the stricter type checking for XML queries and XML schema compliance. For example, if you are using an application server which receives, validates, and processes XML documents before they are stored in the database, the documents probably do not need to be validated again in Db2. At that point you already know they are valid. Likewise, if the database receives XML documents from a trusted application, maybe even one that you control, and you know that the XML data is always valid, avoid schema validation for the benefit of higher insert performance. If, however, your Db2 database receives XML data from untrusted sources and you need to ensure schema compliance at the Db2 level, then you need to spend some extra CPU cycles on that.

For more information, see:

Specify full paths in XPath expressions when possible

When you know where in the structure of an XML document the element is located, provide that information in the form of a fully specified path to avoid unneeded overhead.

Consider the table that is created by the following SQL statement.

CREATE TABLE customer(info XML);

The following figure shows sample data in the info column.

Figure 2. Sample data in a customerinfo XML document

<customerinfo Cid="1004">
    <name>Matt Foreman</name>
    <addr country="Canada">
          <street>1596 Baseline</street>
          <city>Toronto</city>
          <state/>Ontario
          <pcode>M3Z-5H9</pcode>
    </addr>
    <phone type="work">905-555-4789</phone>
    <phone type="home">416-555-3376</phone>
</customerinfo>

If you want to retrieve customer phone numbers or the cities where they live, you can choose from several possible path expressions to get that data.

Both /customerinfo/phone and //phone would get you the phone numbers. Likewise, /customerinfo/addr/city and /customerinfo/*/city both return the city. For best performance, the fully specified path is preferred over using either * or // because the fully specified path enables Db2 to navigate directly to the elements, skipping over non-relevant parts of the document. If you ask for //phone instead of /customerinfo/phone, you ask for phone elements anywhere in the document. This requires Db2 to navigate down into the "addr" subtree of the document to look for phone elements at any level of the document.

Using * and // also can lead to unexpected query results (for example, if some of the "customerinfo" documents also contain "assistant" information, as shown in the following figure). The path //phone would return the customer phones and the assistant phone numbers, without distinguishing them. From the query result, you might mistakenly process the assistant's phone as a customer phone number.

Figure 3. Sample data with phone and name elements at multiple levels

<customerinfo Cid="1004">
    <name>Matt Foreman</name>
    <addr country="Canada">
          <street>1596 Baseline</street>
          <city>Toronto</city>
          <state/>Ontario
          <pcode>M3Z-5H9</pcode>
    </addr>
    <phone type="work">905-555-4789</phone>
    <phone type="home">416-555-3376</phone>
    <assistant>
          <name>Peter Smith</name>
          <phone type="home">416-555-3426</phone>
     </assistant>
</customerinfo>

Define lean indexes for XML data to avoid extra overhead

Assume that queries often search for "customerinfo" XML documents by customer name. An index on the customer name element, as shown in the following statements, can greatly improve the performance of such queries.

CREATE TABLE CUSTOMER (info XML);

CREATE INDEX custname1 ON customer(info) 
GENERATE KEY USING XMLPATTERN '/customerinfo/name' as sql varchar(20);

CREATE INDEX custname2 ON customer(info) 
GENERATE KEY USING XMLPATTERN '//name' as sql varchar(20);

SELECT * FROM customer
WHERE XMLEXISTS('$i/customerinfo[name = "Matt Foreman"]' passing info as $i);

Both of the indexes defined above are eligible to evaluate the XMLEXISTS predicate on the customer name. However, the custname2 index might be substantially larger than the custname1 index because it contains index entries not only for customer names but also for assistant names. This is because the XML pattern //name matches name elements anywhere in the document. However, if we never search by assistant name then we don't need them indexed.

For read operations, the custname1 index is smaller and therefore potentially better performing. For insert, update and delete operations, the custname1 index incurs index maintenance overhead only for customer names, while the custname2 index requires index maintenance for customer and assistant names. You certainly don't want to pay that extra price if you require maximum insert, update, and delete performance and you don't need indexed access based on assistant names.

For more information, see:

Use XMLEXISTS for predicates that filter at the document level

Consider the following table and sample data:

CREATE TABLE customer(info XML);

Figure 4. Sample data in the customer table

<customerinfo>
    <name>Matt Foreman</name>
    <phone>905-555-4789</phone>
</customerinfo>

<customerinfo>
    <name>Peter Jones</name>
    <phone>905-123-9065</phone>
</customerinfo>

<customerinfo>
    <name>Mary Clark</name>
    <phone>905-890-0763</phone>
</customerinfo>

Assume, for example, that you want to return the names of customers which have the phone number "905-555-4789". You might be tempted to write the following query.

SELECT XMLQUERY('$i/customerinfo[phone = "905-555-4789"]/name' passing info as "i") 
FROM customer;

However, this query is not what you want for several reasons:

It returns the following result set which has as many rows as there are rows in the table. This is because the SQL statement has no where clause and therefore cannot eliminate any rows. The result is shown in the following figure.
Figure 5. Result for the preceding example query
```
<name>Matt Foreman</name>

3 record(s) selected
```
For each row in the table which doesn't match the predicate, a row containing an empty XML sequence is returned. This is because the XQuery expression in the XMLQUERY function is applied to one row, or document, at a time and never removes a row from the result set, only modifies its value. The value produced by that XQuery is either the customer's name element if the predicate is true, or the empty sequence otherwise. These empty rows are semantically correct, according to the SQL/XML standard, and must be returned if the query is written as shown.
The performance of the query is poor. First, an index which might exist on /customerinfo/phone cannot be used because this query is not allowed to eliminate any rows. Secondly, returning many empty rows makes this query needlessly slow.

To resolve the performance issues and get the output that you want, use the XMLQUERY function in the select clause only to extract the customer names, and move the search condition, which should eliminate rows, into an XMLEXISTS predicate in the WHERE clause. Doing so will allow index usage, row filtering, and avoid the overhead of empty results rows. You could write the query as shown in the following figure.

SELECT XMLQUERY('$i/customerinfo/name' passing info as "i") 
FROM customer
WHERE XMLEXISTS('$i/customerinfo[phone = "905-555-4789"]' passing info as "i")

For more information, see:

Use square brackets to avoid Boolean predicates in XMLEXISTS

A common error is to write the previous query without the square brackets in the XMLEXISTS function, as shown in the following query.

SELECT XMLQUERY('$i/customerinfo/name' passing info as "i") 
FROM customer
WHERE XMLEXISTS('$i/customerinfo/phone = "905-555-4789"' passing info as "i")

Writing the query this way produces the following results shown in the following figure.

Figure 6. Sample results for the preceding example query

<name>Matt Foreman</name>
<name>Peter Jones</name>
<name>Mary Clark</name>

3 record(s) selected

The expression in the XMLEXISTS predicate is written such that XMLEXISTS always evaluates to true. Hence, no rows are eliminated. For a given row, the XMLEXISTS predicate evaluates to false only if the XQuery expression inside returns the empty sequence. However, without the square brackets the XQuery expression is a Boolean expression which always returns a Boolean value and never the empty sequence. Note that XMLEXISTS truly checks for the existence of a value and evaluates to true if a value exists, even if that value happens to be the Boolean value "false". This behavior is correct according to the SQL/XML standard, although it is probably not what you intended to express.

The impact is again that an index on phone cannot be used because no rows will be eliminated, and you receive a lot more rows than you actually want. Also, beware not to make this same mistake when using two or more predicates, as shown in the following query.

SELECT XMLQUERY('$i/customerinfo/name' passing info as "i") 
FROM customer
WHERE XMLEXISTS('$i/customerinfo[phone = "905-555-4789"] and 
		 $i/customerinfo[name = "Matt Foreman"]' 
      passing info as "i")

The XQuery expression is still a Boolean expression because it has the form "exp1 and exp2." You would write the query as shown in the following query to filter rows and allow for index usage.

SELECT XMLQUERY('$i/customerinfo/name' passing info as "i") 
from customer
WHERE XMLEXISTS('$i/customerinfo[phone = "905-555-4789" and name = "Matt Foreman"]' 
      passing info as "i")

For more information, see Logical expressions.

Use RUNSTATS to collect statistics for XML data and indexes

The RUNSTATS utility has been extended to collect statistics on XML tables and indexes, and Db2 optimizer uses these statistics to generate efficient execution plan for XML queries. Consequently, continue to use RUNSTATS on XML tables and indexes as you would for relational data. You need to specify XML table space names explicitly or use LISTDEF to include ALL or XML objects to obtain the XML table statistics.

For more information, see:

Use SQL/XML publishing views to expose relational data as XML

You can include relational columns in a SQL/XML publishing view, and when querying the view, express any predicates on those columns rather than on the constructed XML.

SQL/XML publishing functions allow you to convert relational data into XML format. Hiding the SQL/XML publishing functions in a view definition can be beneficial. Applications or other queries can simply select the constructed XML documents from the view, instead of dealing with the publishing functions themselves. The following statements creates a view that contains hidden SQL/XML publishing functions.

CREATE TABLE unit( unitID char(8), name char(20), manager varchar(20));

CREATE VIEW UnitView(unitID, name, unitdoc) as
   SELECT unitID, name, 
          XMLELEMENT(NAME "Unit",
             XMLELEMENT(NAME "ID", u,unitID),
             XMLELEMENT(NAME "UnitName", u.name),
             XMLELEMENT(NAME "Mgr", u.manager)
                  )
   FROM unit u;

Note that the view definition includes relational columns. This does not create any physical redundancy because it is only a view, not a materialized view. Exposing the relational columns helps to query this view efficiently.

The following query uses a relational predicate to ensure that only the XML document for "WWPR" is constructed, resulting in a shorter run time, especially on a large data set.

SELECT unitdoc
FROM UnitView
WHERE unitID = "WWPR";

Use XMLTABLE views to expose XML data in relational format

You might also want to use a view to expose XML data in relational format. Similar caution needs to be applied as before, but in the reverse way. In the following example the SQL/XML function XMLTABLE returns values from XML documents in tabular format.

CREATE TABLE customer(info XML);

CREATE VIEW myview(CustomerID, Name, Zip, Info) AS 
SELECT T.*, info 
FROM customer, XMLTABLE ('$c/customerinfo' passing info as "c" 
   COLUMNS 
   "CID"     INTEGER      PATH './@Cid',
   "Name"    VARCHAR(30)  PATH './name',
   "Zip"     CHAR(12)     PATH './addr/pcode' ) as T;

The view definition includes XML column info to help query the view efficiently. Assume that you want to retrieve a tabular list of customer IDs and names for a given ZIP code. Both of the following queries can do that, but the second one tends to perform better than the first.

In the first query, the filtering predicate is expressed on the CHAR column "Zip" generated by the XMLTABLE function. However, not all the relational predicates can be applied to the underlying XML column or indexes. Consequently, the query requires the view to generate rows for all customers and then picks out the one for zip code "95141".

SELECT CustomerID, Name 
FROM myview
WHERE Zip = '95141';

The second query uses an XML predicate to ensure that only the rows for "95141" get generated, resulting in a shorter run time, especially on a large data set.

SELECT CustomerID, Name
FROM myView
WHERE xmlexists('$i/customerinfo[addr/pcode = "95141"]' passing info as "i");

Use SQL and XML statements with parameter markers for short queries and OLTP applications

The SQL/XML functions XMLQUERY, XMLTABLE and XMLEXISTS support external parameters.

Very short database queries often execute so fast that the time to compile and optimize them is a substantial portion of their total response time. Consequently, you might want to compile, or "prepare," them just once and only pass predicate literal values for each execution. This technique is recommended for applications with short and repetitive queries. The following query shows how you can use parameter markers to achieve the result of the preceding example.

SELECT info 
FROM customer
WHERE xmlexists('$i/customerinfo[phone = $p]' 
                passing info as "i", cast(? as varchar(12)) as "p")

Avoid code page conversion during XML insert and retrieval

XML is different from other types of data in Db2 because it can be internally and externally encoded. Internally encoded means that the encoding of your XML data can be derived from the data itself. Externally encoded means that the encoding is derived from external information.

The data type of the application variables that you use to exchange XML data with Db2 determines how the encoding is derived. If your application uses character type variables for XML, then it is externally encoded. If you use binary application data types, then the XML data is considered internally encoded.

Internally encoded means that the encoding is determined by either a Unicode Byte-Order mark (BOM) or an encoding declaration in the XML document itself, such as: <?xml version="1.0" encoding="UTF-8" ?>

From a performance point of view, the goal is to avoid code page conversions as much as possible because they consume extra CPU cycles. Internally encoded XML data is preferred over externally encoded data because it can prevent unnecessary code page conversion.

This means that in your application you should prefer binary data types over character types. For example, in ODBC when you use SQLBindParameter() to bind parameter markers to input data buffers, you should use SQL_C_BINARY data buffers rather than SQL_C_CHAR, SQL_C_DBCHAR, or SQL_C_WCHAR. In host applications, use XML AS BLOB as the host variable type.

When inserting XML data from Java™ applications, reading in the XML data as a binary stream (setBinaryStream) is better than as a string (setString). Similarly, if your Java application receives XML from Db2 and writes it to a file, code page conversion may occur if the XML is written as non-binary data.

When you retrieve XML data from Db2 into your application, it is serialized. Serialization is the inverse operation of XML parsing. It is the process that Db2 uses to convert internal XML format, which is a parsed, tree-like representation, into the textual XML format that your application can understand. In most cases it is best to let Db2 perform implicit serialization. This means your SQL/XML statements simply select XML-type values as shown in the following example, and that Db2 performs the serialization into your application variables as efficiently as possible.

CREATE TABLE customer(info XML);

SELECT info FROM customer WHERE...;

SELECT XMLQUERY('$i/customerinfo/name' passing info as "i")
FROM customer
WHERE...;

If your application deals with very large XML documents, it might benefit from using LOB locators for data retrieval. This requires explicit serialization to a LOB type, preferably BLOB, because explicit serialization into a character type such as CLOB can introduce encoding issues and unnecessary code page conversion. Explicit serialization uses the XMLSERIALIZE function as shown in the following query.

SELECT XMLSERIALIZE(info as BLOB(1M)) FROM customer WHERE...;

Use the XMLMODIFY statement to update part of an XML document

When you need to modify only part of an XML document, you can use the XMLMODIFY function to make the changes to the XML data more efficiently than by replacing the entire XML document, especially in cases of large XML documents. For small XML documents, no performance advantage is provided by the XMLMODIFY statement for documents that fit within a single record.

When an application does not use the XMLMODIFY statement to update an XML column, the XML document from the XML column is entirely deleted and replaced by a new XML document. When the XMLMODIFY is used to update an XML column, only the rows in the XML table space that are modified by the XMLMODIFY function need to be deleted or replaced.

For more information, see:

Use the Extensible Dynamic Binary XML Db2 Client/Server Binary XML Format for input data for parsed XML documents

The parsing of XML data is one of the most significant factors that affect performance during INSERT, LOAD, and UPDATE operations for XML data. If you use Extensible Dynamic Binary XML Db2 Client/Server Binary XML Format when you insert, update, or load data, CPU overhead is reduced. Db2 DRDA zIIP redirect is not affected by binary XML, but z/OS XML System Services zIIP and zAAP are affected by binary XML because parsing is not needed.

Sending Extensible Dynamic Binary XML Db2 Client/Server Binary XML Format data from a Java application on a client reduces the CPU time needed on the Db2 server. However, parsing the XML data into binary form is handled by the IBM® Data Server Driver for JDBC and SQLJ running on the client. Class 1 elapsed time on the client might increase when compared to sending textual XML. Use binary XML to reduce CPU time on the Db2 server if the increase of elapsed time does not impact your environment.

Consider performance when you decide when to use XPath or XQuery

In general, if you perform similar operations using XQuery and XPath, your performance should be similar. However, in some situations, XPath might provide better performance.

For more information, see:

Set the XML_RANDOMIZE_DOCID subsystem parameter for best performance.

You can reduce wait times for inserting XML data by randomizing DOCID values for tables that contain XML columns. When the DOCID values are inserted in sequential order, hot spot situations might occur, in which multiple threads must wait for latches on the same data pages while inserting XML data concurrently.

However, when the value of the XML_RANDOMIZE_DOCID subsystem parameter is YES, Db2 randomizes the DOCID values at CREATE TABLE for any new table that contains XML columns, and at ALTER TABLE for any existing table when the first XML column is added. Changing the value of the XML_RANDOMIZE_DOCID subsystem parameter has no effect on existing tables that contain XML columns. Any table that contains randomized DOCID values cannot be converted to use sequential DOCID values. Similarly, any table that already contains sequential DOCID values cannot be converted to use randomized DOCID values.

You can check the value of the ORDER column in the SYSIBM.SYSSEQUENCES catalog table to find out whether a particular table has randomized DOCID values.

For more information, see:

Access XML data quickly by using FETCH WITH CONTINUE

Use the FETCH WITH CONTINUE statement to improve the performance of some queries that reference XML columns with unknown or very large maximum lengths.

Use binary XML to improve LOAD performance

You can reduce the load time of XML data when the input data is created by the UNLOAD utility using the BINARYXML option. In addition, validation of the XML data can be avoided if it has previously been validated and the following conditions are true:

Only one XML schema is defined for the XML column being loaded.
In the beginning of the XML document being loaded, the root element namespace and schema location hint match that of the XML schema.
The root element namespace matches, but xsi:schemaLocation does not exist or the first xsi:schemaLocation attribute pair does not contain a namespace that matches the root element namespace.

For more information, see: