XML parsing

XML parsing is the process of converting XML data from its serialized string format to its hierarchical format.

You can let the Db2® database server perform parsing implicitly, or you can perform XML parsing explicitly.

Implicit XML parsing occurs in the following cases:
  • When you pass data to the database server using a host variable of type XML, or use a parameter marker of type XML

    The database server does the parsing when it binds the value for the host variable or parameter marker for use in statement processing.

    You must use implicit parsing in this case.

  • When you assign a host variable, parameter marker, or SQL expression with a string data type (character, graphic or binary) to an XML column in an INSERT, UPDATE, DELETE, or MERGE statement. The parsing occurs when the SQL compiler implicitly adds an XMLPARSE function to the statement.

You perform explicit XML parsing when you invoke the XMLPARSE function on the input XML data. You can use the result of XMLPARSE in any context that accepts an XML data type. For example, you can use assign the result to an XML column or use it as a stored procedure parameter of type XML.

The XMLPARSE function takes a non-XML, character or binary data type as input. For embedded dynamic SQL applications, you need to cast the parameter marker that represents the input document for XMLPARSE to the appropriate data type. For example:
INSERT INTO MyCustomer (Cid, Info) 
 VALUES (?, xmlparse(document cast(? as clob(1k)) preserve whitespace))

For static embedded SQL applications, a host variable argument of the XMLPARSE function cannot be declared as an XML type (XML AS BLOB, XML AS CLOB, or XML AS DBCLOB type).

XML parsing and whitespace handling

During implicit or explicit XML parsing, you can control the preservation or stripping of boundary whitespace characters when you store the data in the database.

According to the XML standard, whitespace is space characters (U+0020), carriage returns (U+000D), line feeds (U+000A), or tabs (U+0009) that are in the document to improve readability. When any of these characters appear as part of a text string, they are not considered to be whitespace.

Boundary whitespace is whitespace characters that appear between elements. For example, in the following document, the spaces between <a> and <b> and between </b> and </a> are boundary whitespace.
<a> <b> and between </b> </a>

With explicit invocation of XMLPARSE, you use the STRIP WHITESPACE or PRESERVE WHITESPACE option to control preservation of boundary whitespace. The default is stripping of boundary whitespace.

With implicit XML parsing:
  • If the input data type is not an XML type or is not cast to an XML data type, the Db2 database server always strips whitespace.
  • If the input data type is an XML data type, you can use the CURRENT IMPLICIT XMLPARSE OPTION special register to control preservation of boundary whitespace. You can set this special register to STRIP WHITESPACE or PRESERVE WHITESPACE. The default is stripping of boundary whitespace.
If you use XML validation, the Db2 database server ignores the CURRENT IMPLICIT XMLPARSE OPTION special register and uses only the validation rules to determine stripping or preservation of whitespace in the following cases:
xmlvalidate(? ACCORDING TO XMLSCHEMA ID schemaname)
xmlvalidate(?)
xmlvalidate(:hvxml ACCORDING TO XMLSCHEMA ID schemaname)
xmlvalidate(:hvxml)
xmlvalidate(cast(? as xml) ACCORDING TO XMLSCHEMA ID schemaname)
xmlvalidate(cast(? as xml)) 
In these cases, ? represents XML data, and :hvxml is an XML host variable.

See XML validation for information on how XML validation influences whitespace handling.

The XML standard specifies an xml:space attribute that controls the stripping or preservation of whitespace within XML data. xml:space attributes override any whitespace settings for implicit or explicit XML parsing.

For example, in the following document, the spaces immediately before and after <b> are always preserved, regardless of any XML parsing options, because the spaces are within a node with the attribute xml:space="preserve":
<a xml:space="preserve"> <b> <c>c</c>b </b></a>
However, in the following document, the spaces immediately before and after <b> can be controlled by the XML parsing options, because the spaces are within a node with the attribute xml:space="default":
<a xml:space="default"> <b> <c>c</c>b </b></a>

XML parsing in a non-Unicode database

When an XML document is passed into a non-Unicode database, code page conversion can occur first as the document is passed from the client to the target database server, and then as the document is passed to the Db2 XML parser. Passing an XML document using a host variable or parameter marker of type XML prevents code page conversion from occurring. If an XML document is passed using a character data type (CHAR, VARCHAR, CLOB, or LONG VARCHAR), code page conversion can result in the introduction of substitution characters for any character in the XML data that is not part of the target database code page.

To prevent substitution characters from being introduced, and from potentially degrading the inserted XML data, ensure that if XML data is parsed using a character data type, all code points in the source document are a part of the target database code page. For any character that is not part of this code page you can use a decimal or hexadecimal character entity reference specifying the correct Unicode code point. For example, either &#x003E or &#0062 can be used to specify the > (greater-than) sign character.

You can also use the ENABLE_XMLCHAR configuration parameter to control whether or not XML parsing is enabled for character data types. Setting ENABLE_XMLCHAR to "NO" blocks both explicit and implicit XML parsing when character data types are used.

XML parsing and DTDs

If the input data contains an internal document type declaration (DTD) or references an external DTD, the XML parsing process also checks the syntax of those DTDs. In addition, the parsing process:
  • Applies default values that are defined by the internal and external DTDs
  • Expands entity references and parameter entities

Examples

The following examples demonstrate how whitespace in an XML document is handled in different situations.

Example: File c8.xml contains the following document:
<customerinfo xml:space="preserve" Cid='1008'>
  <name>Kathy Smith</name>
  <addr country='Canada'>
    <street>14 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <phone type='work'>416-555-3333</phone>
</customerinfo>
In a JDBC application, read the XML document from the file, and insert the data into XML column INFO of table MYCUSTOMER, which is a copy of the sample Customer table. Let the Db2 database server perform an implicit XML parse operation.
PreparedStatement insertStmt = null;
String sqls = null;
int cid = 1008;
sqls = "INSERT INTO MyCustomer (Cid, Info) VALUES (?, ?)";
insertStmt = conn.prepareStatement(sqls);
insertStmt.setInt(1, cid);
File file = new File("c8.xml");
insertStmt.setBinaryStream(2, new FileInputStream(file), (int)file.length());
insertStmt.executeUpdate();
No whitespace handling is specified, so the default behavior of stripping whitespace is assumed. However, the document contains the xml:space="preserve" attribute, so whitespace is preserved. This means that the carriage returns, line feeds, and spaces between the elements in the document remain.
If you retrieve the stored data, content looks like this:
<customerinfo xml:space="preserve" Cid='1008'>
  <name>Kathy Smith</name>
  <addr country='Canada'>
    <street>14 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <phone type='work'>416-555-3333</phone>
</customerinfo>
Example: Assume that the following document is in BLOB host variable blob_hostvar.
<customerinfo xml:space="default" Cid='1009'>
  <name>Kathy Smith</name>
  <addr country='Canada'>
    <street>15 Rosewood</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M6W 1E6</pcode-zip>
  </addr>
  <phone type='work'>416-555-4444</phone>
</customerinfo>
In a static embedded C application, insert the document from the host variable into XML column Info of table MyCustomer. The host variable is not an XML type, so you need to execute XMLPARSE explicitly. Specify STRIP WHITESPACE to remove any boundary whitespace.
EXEC SQL BEGIN DECLARE SECTION;
   SQL TYPE BLOB (10K) blob_hostvar;
EXEC SQL END DECLARE SECTION;
...
EXEC SQL INSERT INTO MyCustomer (Cid, Info) 
  VALUES (1009,
  XMLPARSE(DOCUMENT :blob_hostvar STRIP WHITESPACE));
The document contains the xml:space="default" attribute, so the XMLPARSE specification of STRIP WHITESPACE controls whitespace handling. This means that the carriage returns, line feeds, and spaces between the elements in the document are removed.
If you retrieve the stored data, you see a single line with the following content:
<customerinfo xml:space="default" Cid='1009'>
<name>Kathy Smith</name><addr country='Canada'><street>15 Rosewood</street>
<city>Toronto</city><prov-state>Ontario</prov-state><pcode-zip>M6W 1E6</pcode-zip>
</addr><phone type='work'>416-555-4444</phone></customerinfo>
Example: In a C language application, host variable clob_hostvar contains the following document, which contains an internal DTD:
<!DOCTYPE prod [<!ELEMENT description (name,details,price,weight)> 
   <!ELEMENT name (#PCDATA)> 
   <!ELEMENT details (#PCDATA)> 
   <!ELEMENT price (#PCDATA)> 
   <!ELEMENT weight (#PCDATA)> 
   <!ENTITY desc  "Anvil">
]>
<product pid=''110-100-01'' >
 <description>
 <name>&desc;</name>
 <details>Very heavy</details>
 <price>         9.99            </price>
 <weight>1 kg</weight>
 </description>
</product>'
Insert the data into table MYPRODUCT, which is a copy of the sample PRODUCT table:
EXEC SQL BEGIN DECLARE SECTION;
   SQL TYPE CLOB (10K) clob_hostvar;
EXEC SQL END DECLARE SECTION;
...
EXEC SQL insert into 
  Product ( pid, name, Price, PromoPrice, PromoStart, PromoEnd, description ) 
  values ( '110-100-01','Anvil', 9.99, 7.99, '11-02-2004','12-02-2004', 
  XMLPARSE ( DOCUMENT :clob_hostvar STRIP WHITESPACE ));
XMLPARSE specifies stripping of whitespace, so boundary whitespace within the document is removed. In addition, when the database server executes XMLPARSE, it replaces the entity reference &desc; with its value.
If you retrieve the stored data, you see a single line with the following content:
<product pid="110-100-01"><description><name>Anvil
</name><details>Very heavy</details><price>         9.99            </price>
<weight>1 kg</weight></description></product>