Processing string data by the natural size of each character
String data can have characters of different sizes.
- UTF-8 data can have characters with 1, 2, 3, or 4 bytes.
For example, the character 'a' has one byte, and the character 'á' has two bytes.
UTF-8 data is defined as alphanumeric with CCSID(*UTF8) or CCSID(1208).
- UTF-16 data can have characters with 2 or 4 bytes.
UTF-16 data is defined as UCS-2 with CCSID(*UTF16) or CCSID(1200).
- EBCDIC mixed SBCS/DBCS data can have characters with 1 or 2 bytes.
Additionally, double-byte data is surrounded by shift bytes. The shift-out byte x'0E'
begins a section of DBCS data and the shift-in byte x'0F' ends the section of DBCS data.
EBCDIC mixed SBCS/DBCS data is defined as alphanumeric. The CCSID can be one of the following:
- CCSID(*JOBRUNMIX).
This is the mixed SBCS/DBCS CCSID related to the job CCSID.
Warning: This is the default CCSID for alphanumeric data if Control keyword CCSID(*CHAR) is not specified. Specify Control keyword CCSID(*CHAR:*JOBRUN) to prevent RPG from making this assumption about definitions that do not have an explicit CCSID keyword.
When you specify Control keyword CCSID(*CHAR:*JOBRUN), RPG assumes that data in the job CCSID is mixed SBCS/DBCS only when the job CCSID itself is mixed SBCS/DBCS.
When the job CCSID only supports SBCS data, RPG can assume that all characters have only one byte, and avoid additional processing to check for DBCS characters.
- CCSID(*JOBRUN) when the job CCSID supports mixed SBCS/DBCS data.
- A CCSID that represents mixed SBCS/DBCS data such as 937.
- ASCII mixed SBCS/DBCS data can have characters with 1 or 2 bytes.
ASCII mixed SBCS/DBCS data is defined as alphanumeric with a CCSID that represents mixed SBCS/DBCS data such as 950.
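The character sizes listed above can be checked with a short, language-neutral sketch. The example below is written in Python rather than RPG, purely as an illustration. It assumes that Python's `cp950` codec is an acceptable stand-in for an ASCII-based mixed SBCS/DBCS CCSID; it may not match CCSID 950 byte for byte. The sample characters and the helper name `encoded_size` are chosen for this illustration only.

```python
# Encoded size, in bytes, of individual characters in several encodings.
samples = ["a", "á", "€", "😀"]

def encoded_size(ch: str, encoding: str) -> int:
    """Return the number of bytes used to encode one character."""
    return len(ch.encode(encoding))

utf8_sizes = [encoded_size(ch, "utf-8") for ch in samples]
# "utf-16-be" is used so that no byte order mark is added to the result.
utf16_sizes = [encoded_size(ch, "utf-16-be") for ch in samples]
# Assumption: cp950 stands in for an ASCII mixed SBCS/DBCS encoding.
cp950_sizes = [encoded_size(ch, "cp950") for ch in ["A", "中"]]

print(utf8_sizes)   # [1, 2, 3, 4]
print(utf16_sizes)  # [2, 2, 2, 4]
print(cp950_sizes)  # [1, 2]
```

EBCDIC mixed data is not shown because the Python standard library has no codec for it.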
Default behaviour, CHARCOUNT STDCHARSIZE
By default, data is processed using the standard-character-size mode. The compiler processes string data by bytes or double bytes without regard for the size of each character.
- The start position for the %SCAN built-in function represents the byte to start the scan for alphanumeric data, or the double-byte for UCS-2 or graphic data.
- The length for the %SUBST built-in function represents the number of bytes or double-bytes to return.
- When the target variable for an assignment is shorter than the source data, the additional bytes or double-bytes are truncated without regard for whether the result ends with a complete character.
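The effect of counting by bytes can be sketched outside RPG. The following Python illustration is an analogy only, not the compiler's implementation. It treats UTF-8 data as raw bytes, so a position or length refers to bytes and a truncation can end in the middle of a character. The sample value and the helper name `is_complete_utf8` are assumptions made for this sketch.

```python
data = "cáb".encode("utf-8")   # b'c\xc3\xa1b': 4 bytes, 3 characters

# A scan position or substring length counts bytes, not characters.
byte_pos = data.find("b".encode("utf-8")) + 1   # 1-based position, as in RPG
first_two = data[:2]                            # the first 2 bytes

def is_complete_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as complete, valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(byte_pos)                     # 4
print(first_two)                    # b'c\xc3'
print(is_complete_utf8(first_two))  # False
```

The 2-byte result ends with the first byte of 'á', which is the kind of incomplete character that byte-oriented truncation can leave behind.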
CHARCOUNT NATURAL
- The compiler processes string operations by the natural size of each character.
- The compiler sets the CHARCOUNT NATURAL mode for a file if the CHARCOUNT keyword is not specified for the file.
The CHARCOUNT mode for the file affects the movement of data from RPG fields to the output buffer and key buffer used for the file operations. See CHARCOUNT(*NATURAL | *STDCHARSIZE).
- The start position for the %SCAN built-in function represents the character to start the scan.
- The length for the %SUBST built-in function represents the number of characters to return.
- When the target variable for an assignment is shorter than the source data, only complete characters are truncated, so the result does not end with a partial character.
- The length prefix for a varying-length variable always represents the number of bytes for alphanumeric data, or the number of double bytes for UCS-2 or graphic data.
- The %LEN built-in function always returns the number of bytes for alphanumeric data, or the number of double bytes for UCS-2 or graphic data.
Use the %CHARCOUNT built-in function to obtain the number of characters in a string.
- The length parameter for the %STR built-in function always represents the number of bytes.
- When data is processed by characters, it is more likely that invalid data will be discovered. For example, if the source data for the %SUBST built-in function contains an invalid UTF-8 character, the invalid data might not be noticed using the STDCHARSIZE mode, but the invalid data would cause an exception with status code 125 using the NATURAL mode.
- However, there is no guarantee that invalid data will always be discovered using the NATURAL mode.
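The character-oriented behaviour described in this list can be approximated with a Python sketch. It is an analogy under stated assumptions, not RPG and not the compiler's algorithm: decoded Python strings stand in for processing by characters, a strict decode stands in for the validity checking that can raise status code 125, and the helper names `truncate_whole_chars` and `decodes_cleanly` are hypothetical. `truncate_whole_chars` assumes that its input is valid UTF-8.

```python
data = "cáb".encode("utf-8")   # 4 bytes, 3 characters

# Strict decoding validates the data; invalid bytes raise an error here.
text = data.decode("utf-8")

char_pos = text.find("b") + 1  # 1-based character position
first_two = text[:2]           # the first 2 characters
byte_len = len(data)           # bytes, comparable to %LEN
char_len = len(text)           # characters, comparable to %CHARCOUNT

def truncate_whole_chars(raw: bytes, max_bytes: int) -> bytes:
    """Truncate valid UTF-8 to at most max_bytes, dropping only whole characters."""
    end = min(max_bytes, len(raw))
    # Back up while the byte at the cut point is a continuation byte
    # (10xxxxxx), which means the cut would fall inside a character.
    while 0 < end < len(raw) and (raw[end] & 0xC0) == 0x80:
        end -= 1
    return raw[:end]

def decodes_cleanly(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 under strict decoding."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

bad = b"c\xc3b"   # x'C3' must be followed by a continuation byte

print(char_pos, first_two, byte_len, char_len)  # 3 cá 4 3
print(truncate_whole_chars(data, 2))            # b'c'
print(bad[:1])                                  # b'c'
print(decodes_cleanly(bad))                     # False
```

Slicing `bad` by bytes succeeds without complaint, while decoding it by characters fails. This mirrors how invalid data is more likely to surface when data is processed by characters.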
How to control the CHARCOUNT processing mode
- Use the CHARCOUNTTYPES Control keyword to select the data types that are affected by the CHARCOUNT processing mode. See CHARCOUNTTYPES(*UTF8 *UTF16 *JOBRUN *MIXEDEBCDIC *MIXEDASCII).
- Use the CHARCOUNT Control keyword to set the default processing mode for the module. See CHARCOUNT(*NATURAL | *STDCHARSIZE).
- Use the /CHARCOUNT compiler directive to set the default mode for file definitions or calculation statements following the directive. See /CHARCOUNT NATURAL | STDCHARSIZE.
- The CHARCOUNT mode for a file affects the movement of data from RPG fields to the output buffer and key buffer used for the file operations.
- Specify *NATURAL or *STDCHARSIZE as the final operand of one of the following built-in functions to override the current CHARCOUNT mode for the statement.
- %CHECK (Check Characters)
- %CHECKR (Check Reverse)
- %CONCAT (Concatenate with Separator)
- %CONCATARR (Concatenate Array Elements with Separator)
- %LOWER (Convert to Lower Case)
- %REPLACE (Replace Character String)
- %SCAN (Scan for Characters)
- %SCANR (Scan Reverse for Characters)
- %SCANRPL (Scan and Replace Characters)
- %SPLIT (Split String into Substrings)
- %SUBST (Get Substring)
- %TRIM (Trim Characters at Edges)
- %UPPER (Convert to Upper Case)
- %XLATE (Translate)
Warning: *NATURAL has no effect if any operand of the built-in function has CCSID(*HEX), including hexadecimal literals.