Calling the collation service

This topic describes how the z/OS® support for the Unicode Standard collation service is called.

Collation works under two basic schemes: the binary comparison of two Unicode strings, and the generation of a sort key vector. This topic first describes how the service is called for each scheme, and then explains the uses of the two types of calls.

Binary comparison:

The 31-bit caller has to provide:

Source1 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Source2 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Target1 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Target2 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Collation level
Work1 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Work2 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Dynamic data area pointer (DDA) (31-bit pointer), ALET (4 byte), and length (8 byte)
Flag1 (handle options)
Collation mask options (sort key option=0)

The 64-bit caller has to provide:

Source1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Source2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Target1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Target2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Collation level
Work1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Work2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Dynamic data area pointer (DDA) (64-bit pointer), ALET (4 byte), and length (8 byte)
Flag1 (handle options)
Collation mask options (sort key option=0)

For collation features (UCA400R1, UCA410, and UCA600), there are two ways to set the APIs as part of Unicode Dynamic Capabilities:

Long Path. This way of setting the Collation API continues to use the existing collation settings, plus the new ones.
Short Path. This new way of setting the Collation API is simple and applies to all the supported collation features.

Another option is to use the SETUNI or SET UNI=xx commands as part of a static initialization. For more information, see SETUNI command in z/OS MVS System Commands.

Long Path:

The 31-bit caller has to provide:

Set parameter area version2
Source1 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Source2 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Target1 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Target2 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Collation level
Work1 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Work2 buffer pointer (31-bit pointer), ALET (4 byte), and length (8 byte)
Dynamic data area pointer (DDA) (31-bit pointer), ALET (4 byte), and length (8 byte)
Flag1 (handle options)
Collation mask options (sort key option=0)
Case Options Flags
Hiragana support
Locale or User Collation Rules file + DSN + Vol

The 64-bit caller has to provide:

Set parameter area version2
Source1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Source2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Target1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Target2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Collation level
Work1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Work2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Dynamic data area pointer (DDA) (64-bit pointer), ALET (4 byte), and length (8 byte)
Flag1 (handle options)
Collation mask options (sort key option=0)
Case Options Flags
Hiragana support
Locale or User Collation Rules file + DSN + Vol

Short Path:

The 31-bit caller has to provide:

Set parameter area version2
Source1 buffer pointer (31-bit pointer), ALET (4 byte), and length (4 byte)
Source2 buffer pointer (31-bit pointer), ALET (4 byte), and length (4 byte)
Target1 buffer pointer (31-bit pointer), ALET (4 byte), and length (4 byte)
Target2 buffer pointer (31-bit pointer), ALET (4 byte), and length (4 byte)
Work2 buffer pointer (31-bit pointer), ALET (4 byte), and length (4 byte)
Dynamic data area pointer (DDA) (31-bit pointer), ALET (4 byte), and length (4 byte)
Collation Keyword

The 64-bit caller has to provide:

Set parameter area version2
Source1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Source2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Target1 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Target2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Work2 buffer pointer (64-bit pointer), ALET (4 byte), and length (8 byte)
Dynamic data area pointer (DDA) (64-bit pointer), ALET (4 byte), and length (8 byte)
Collation Keyword

Note: Short path settings take priority over long path settings.
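
The precedence in the preceding note can be pictured from the caller's side with a small sketch. This is only an illustration, not the service's actual logic: the type settings_path, the function effective_path, and the idea that an empty keyword means "short path not used" are all assumptions made for this example and are not part of the Unicode Services interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef enum { PATH_NONE, PATH_LONG, PATH_SHORT } settings_path;

/* Returns which group of settings would win under the documented rule
   that short path settings take priority over long path settings.
   Assumption: a NULL or empty collation keyword means that the short
   path is not in use. */
settings_path effective_path(const char *collation_keyword,
                             bool long_path_fields_set)
{
    if (collation_keyword != NULL && collation_keyword[0] != '\0')
        return PATH_SHORT;
    return long_path_fields_set ? PATH_LONG : PATH_NONE;
}

/* Usage check; "KEYWORD" is a placeholder, not a real collation keyword. */
void demo_effective_path(void)
{
    assert(effective_path("KEYWORD", true) == PATH_SHORT);
    assert(effective_path("", true) == PATH_LONG);
}
```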

Sort key vector:

How you generate the sort key vector depends on how you set the sourceX buffer length. For example, you can use any of the following input combinations:

Source1
Source2
Source1 and source2

In the first two cases, you need to provide pointers only for the applicable source, work, and target buffers. In the third case, you must provide pointers for both sets of buffers.

You must always provide the following, regardless of which of the three cases applies:

Collation level
Dynamic data area pointer (DDA), ALET, and length
Flag1 (handle options)
Collation mask options (sort key option=1)
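
The three input combinations can be summarized in a short C sketch. It assumes, as this topic states, that a source buffer length of zero marks that source as unused. The enumeration and function names are hypothetical and exist only for this example; the both-zero case is included for completeness and is not described in this topic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    SORT_KEY_NONE,      /* both lengths zero; not described in this topic */
    SORT_KEY_SOURCE1,   /* only source1 is used */
    SORT_KEY_SOURCE2,   /* only source2 is used */
    SORT_KEY_BOTH       /* source1 and source2 are used */
} sort_key_request;

/* Classifies a call by its two source buffer lengths. For SOURCE1 or
   SOURCE2, only that source's work and target buffers are needed;
   for BOTH, both sets of buffers must be provided. */
sort_key_request classify_sort_key_request(size_t source1_len,
                                           size_t source2_len)
{
    if (source1_len != 0 && source2_len != 0)
        return SORT_KEY_BOTH;
    if (source1_len != 0)
        return SORT_KEY_SOURCE1;
    if (source2_len != 0)
        return SORT_KEY_SOURCE2;
    return SORT_KEY_NONE;
}

/* Usage check covering the three documented combinations. */
void demo_classify_sort_key_request(void)
{
    assert(classify_sort_key_request(20, 0) == SORT_KEY_SOURCE1);
    assert(classify_sort_key_request(0, 20) == SORT_KEY_SOURCE2);
    assert(classify_sort_key_request(20, 12) == SORT_KEY_BOTH);
}
```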

Following is an explanation of the two types of calls to the collation service.

Binary comparison:
This is the most common use of the collation service. The caller supplies two Unicode strings to be compared (collated) in a culturally correct manner. Before calling the service, the caller must set the desired collation level and, optionally, the alternate weighting and other options in the collation parameter area to specify a particular comparison type. The collation service returns a compare result, a return code, and a reason code. For two Unicode input strings A and B, the compare result shows how one string relates to the other:
- -1, if A < B
- 0, if A = B
- 1, if A > B
The compare result and return codes are returned in the fields CUNBOPRM_Result, CUNBOPRM_Return_Code, and CUNBOPRM_Reason_code (for 31-bit), or CUN4BOPR_Result, CUN4BOPR_Return_Code and CUN4BOPR_Reason_code (for 64-bit), respectively. To set alternate weighting options and a collation level, parameter fields CUNBOPRM_Mask and CUNBOPRM_Coll_Level (for 31-bit) or CUN4BOPR_Mask and CUN4BOPR_Coll_Level (for 64-bit) are used, respectively.
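
As a minimal sketch of how a caller might consume the compare result, the following C fragment maps the three documented values to a relation between A and B. The names relation and relation_from_result are hypothetical. The sketch takes a plain integer and does not read the CUNBOPRM or CUN4BOPR fields, whose layout is defined in the interface definition file.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
typedef enum { REL_LESS, REL_EQUAL, REL_GREATER } relation;

/* Maps the documented compare result (-1, 0, or 1) to the relation
   between input strings A and B. Any other value is treated as a
   programming error in this sketch. */
relation relation_from_result(int result)
{
    assert(result >= -1 && result <= 1);
    if (result < 0)
        return REL_LESS;      /* A < B */
    if (result == 0)
        return REL_EQUAL;     /* A = B */
    return REL_GREATER;       /* A > B */
}
```

A caller would typically check the return and reason codes before trusting the compare result.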

For more information on how to use these fields, see Description of parameters in area CUNBOPRM.

The two input Unicode strings to be compared are set in the same way as the other Unicode Services source buffers. A buffer pointer, length, and ALET are set for each source buffer.

The target buffers that hold the converted bytes in the other Unicode services do not need to be set in this case, because no bytes are converted unless the CUNBOPRM_Norm_Type or CUN4BOPR_Norm_Type field is set to NFD, NFKD, NFC, or NFKC.

For the UCA400R1, UCA410, and UCA600 versions, only NFD is supported. If the Collation API is set with version 2 and a normalization form (NF) other than NFD is specified, the NF is ignored and no normalization is performed. In addition, RC = CUN_RC_WARN and RS = CUN_RS_INVALID_NORMALIZATION_VALUE are set, although processing continues without any normalization form.
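
The version 2 normalization rule can be modeled as a small decision function. This is a sketch of the rule as stated here, not of the service's implementation. The types and names are hypothetical, the real return and reason code values are not reproduced, and the sketch assumes that requesting no normalization at all does not raise the warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
typedef enum { NORM_NONE, NORM_NFD, NORM_NFKD, NORM_NFC, NORM_NFKC } norm_form;

typedef struct {
    norm_form effective;  /* normalization that is actually applied */
    bool warning;         /* true when the requested form was ignored */
} norm_decision;

/* Models the version 2 rule: only NFD is honored. Any other requested
   form is ignored, processing continues without normalization, and a
   warning is reported. */
norm_decision resolve_normalization_v2(norm_form requested)
{
    norm_decision d = { requested, false };
    if (requested != NORM_NONE && requested != NORM_NFD) {
        d.effective = NORM_NONE;
        d.warning = true;
    }
    return d;
}

/* Usage check: NFC is ignored with a warning, NFD is kept. */
void demo_resolve_normalization_v2(void)
{
    assert(resolve_normalization_v2(NORM_NFC).warning);
    assert(resolve_normalization_v2(NORM_NFD).effective == NORM_NFD);
}
```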

The results of the comparison are returned in the result, return code, and reason code fields, as described above. The work buffers are auxiliary buffers that hold data during the collation process. Always set the work buffers on each collation call with a length sufficient for the collation process; otherwise, a work buffer error results.

For more information about the target and work buffers, see Target buffer length considerations and Work buffer length considerations.

Sort key:
A sort key, or sort key vector, is a collection of weights for a given Unicode string which can be binary compared against another sort key to produce a compare result.

Sort keys can result from the collation process if the user sets the parameter area field CUNBOPRM_Coll_Mask or CUN4BOPR_Coll_Mask with constant CUNBOPRM_MASK_SK (see call samples). An associated comparison level and alternate weighting option can be specified by the user to form a particular sort key. Also, as part of new settings for Collation versions UCA400R1, UCA410, and UCA600, consider the long and short path for sort key generation settings.

The sort key can be considered a "compare file", because it can be created as a data set if properly specified by the user. The usefulness of a sort key is that once created for an input string, it can be kept and used repeatedly by the caller in binary comparisons with other sort keys. This can represent a performance advantage for the caller, because in this case there would be no need to call the collation services, but only perform a binary comparison with the caller's preferred compare routine.
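
A caller-side compare routine for two previously generated sort keys might look like the following sketch. It assumes that the keys can be compared byte by byte as unsigned values and that, when one key is a prefix of the other, the shorter key sorts first. Verify both assumptions against Sort key vector format before relying on them. The function name compare_sort_keys is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compares two sort keys bytewise and returns -1, 0, or 1, mirroring
   the compare result values of the collation service.
   Assumption: a key that is a strict prefix of the other sorts first. */
int compare_sort_keys(const unsigned char *key_a, size_t len_a,
                      const unsigned char *key_b, size_t len_b)
{
    size_t common = len_a < len_b ? len_a : len_b;
    int diff = 0;

    if (common != 0) {
        assert(key_a != NULL && key_b != NULL);
        diff = memcmp(key_a, key_b, common);  /* compares unsigned bytes */
    }
    if (diff != 0)
        return diff < 0 ? -1 : 1;
    if (len_a == len_b)
        return 0;
    return len_a < len_b ? -1 : 1;
}
```

Because the routine touches only the stored keys, it can run repeatedly without another call to the collation service.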

A sort key for a given Unicode character is formed by reading and processing the level weights found in the AllKeys.txt file provided by the Unicode Consortium at: http://www.Unicode.org/Unicode/reports/tr10/allkeys.txt. Collation version 3.0.1 follows the sort key generation described in Unicode Consortium TR#10, whereas the more recent Collation versions UCA400R1, UCA410, and UCA600 do not, because of their tailoring features.

To use this collation functionality, the caller must set the target buffers in addition to the source and work buffers. Each target buffer holds the resulting sort key for its respective source buffer. One or both sort keys can be generated on each call to the collation services. To indicate that a source buffer is not being used, set its length to zero.

If you plan to use your own binary compare algorithms for sort keys, it is important that you can interpret the sort key format, which is explained in Sort key vector format. The size of the sort key is determined by the collation level chosen. The greater the collation level, the longer the sort key.

z/OS Unicode Services collation does not provide a way to binary compare a pair of sort keys provided by the user; performing the binary comparison is the user's responsibility. If a call to z/OS collation returns a zero return code, you can retrieve the sort key left in the target buffer or buffers. Otherwise, you must interpret the return and reason codes, take the appropriate steps, and retry the collation call.

For Collation versions UCA400R1, UCA410, and UCA600, sort key weights have different values than their respective versions from the DUCET (Default Unicode Collation Element Table - http://www.unicode.org/Public/UCA/latest/allkeys.txt) because they were modified for tailoring reasons (Locales or User Collation Rules - UCR).

Depending on the UCA (Unicode Collation Algorithm) version and the settings (Locales or UCR), sort keys might contain different weights. Comparing sort keys that were generated under different UCA versions, in combination with certain Locales or UCR, might therefore return an undesired comparison result. To avoid this with previously generated sort keys, compare sort keys only if they come from the same settings: the same UCA version, Locale, collation level, case options, and so on. Otherwise, results might be inconsistent.
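
One way to enforce this practice is to store the generation settings next to each sort key and refuse to compare keys whose settings differ. The following C sketch assumes a caller-defined record. The structure, its field sizes, and the function name are hypothetical, and the locale string in the usage note is only a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record kept by the caller alongside each stored sort key. */
typedef struct {
    char uca_version[16];   /* for example "UCA410" */
    char locale[32];        /* Locale or UCR identifier; empty if none */
    int collation_level;
    unsigned case_options;
} sort_key_settings;

/* Returns true only when both keys were generated under identical
   settings, which is the condition for a meaningful comparison. */
bool sort_keys_comparable(const sort_key_settings *a,
                          const sort_key_settings *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a->uca_version, b->uca_version) == 0
        && strcmp(a->locale, b->locale) == 0
        && a->collation_level == b->collation_level
        && a->case_options == b->case_options;
}
```

For example, two records that both hold "UCA410", the placeholder locale "xx", level 3, and case options 0 are comparable, whereas changing only the version to "UCA600" in one of them makes the check fail.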

General considerations:

A successful call to collation always returns a valid collation handle. This handle can be used as a fast path when calling the collation services again, because it gives direct access to the collation tables. IBM® recommends providing the collation handle if successive collation calls are to be performed. If the caller wants only to request a collation handle, the field CUNBOPRM_Get_New_Handle or CUN4BOPR_Get_New_Handle must be set to X'80'. See the description of the field CUNBOPRM_Flag1 in Description of parameters in area CUNBOPRM. A sample program, CUNSOSMC, is provided in SYS1.SAMPLIB.

The caller can put the source parameters in any data space. To allow the service to access data not in primary space, an ALET must be specified. An ALET of 0 indicates that the data is in the primary address space (default value), which is the case for most callers.

A dynamic data area (DDA) must always be specified. The required length is defined by constant CUNBOPRM_DDA_Req or CUN4BOPR_DDA_Req. Refer to the interface definition file (CUNBOIDF).