Processing string data by the natural size of each character

String data can have characters of different sizes.

  • UTF-8 data can have characters with 1, 2, 3, or 4 bytes. For example, the character 'a' has one byte, and the character 'á' has two bytes.

    UTF-8 data is defined as alphanumeric with CCSID(*UTF8) or CCSID(1208).

  • UTF-16 data can have characters with 2 or 4 bytes.

    UTF-16 data is defined as UCS-2 with CCSID(*UTF16) or CCSID(1200).

  • EBCDIC mixed SBCS/DBCS data can have characters with 1 or 2 bytes. Additionally, double-byte data is surrounded by shift bytes. The shift-out byte x'0E' begins a section of DBCS data and the shift-in byte x'0F' ends the section of DBCS data.

    EBCDIC mixed SBCS/DBCS data is defined as alphanumeric. The CCSID can be one of the following:
    • CCSID(*JOBRUNMIX).

      This is the mixed SBCS/DBCS CCSID related to the job CCSID.

      Warning: This is the default CCSID for alphanumeric data if Control keyword CCSID(*CHAR) is not specified.

      Specify Control keyword CCSID(*CHAR:*JOBRUN) to prevent RPG from making this assumption about definitions that do not have an explicit CCSID keyword.

      When you specify Control keyword CCSID(*CHAR:*JOBRUN), RPG assumes that data in the job CCSID is mixed SBCS/DBCS only when the job CCSID itself is mixed SBCS/DBCS.

      When the job CCSID only supports SBCS data, RPG can assume that all characters have only one byte, and avoid additional processing to check for DBCS characters.

    • CCSID(*JOBRUN) when the job CCSID supports mixed SBCS/DBCS data.
    • A CCSID that represents mixed SBCS/DBCS data such as 937.
  • ASCII mixed SBCS/DBCS data can have characters with 1 or 2 bytes.

    ASCII mixed SBCS/DBCS data is defined as alphanumeric with a CCSID that represents mixed SBCS/DBCS data such as 950.
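
The character sizes listed above can be checked outside RPG. The following is a minimal Python sketch, not RPG code. It assumes that Python's "utf-8", "utf-16-be", and "cp950" codecs are reasonable stand-ins for CCSID 1208, CCSID 1200, and CCSID 950; the helper name encoded_size is hypothetical. Python has no standard codec for EBCDIC mixed data, so the shift-out and shift-in bytes are not shown.

```python
# Illustration only: byte sizes of single characters in several encodings.
# Assumption: these Python codecs approximate the CCSIDs named in the text.

def encoded_size(ch: str, encoding: str) -> int:
    """Return the number of bytes used to encode one character."""
    return len(ch.encode(encoding))

# UTF-8: 1 to 4 bytes per character ('a', 'á', '€', and an emoji).
utf8_sizes = {ch: encoded_size(ch, "utf-8")
              for ch in ["a", "\u00e1", "\u20ac", "\U0001F600"]}

# UTF-16: 2 or 4 bytes per character (4 bytes means a surrogate pair).
utf16_sizes = {ch: encoded_size(ch, "utf-16-be")
               for ch in ["a", "\U0001F600"]}

# ASCII mixed SBCS/DBCS (Big5): 1 or 2 bytes per character.
big5_sizes = {ch: encoded_size(ch, "cp950")
              for ch in ["a", "\u4e2d"]}
```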

Default behaviour, CHARCOUNT STDCHARSIZE

By default, data is processed using the standard-character-size mode. The compiler processes string data by bytes or double bytes without regard for the size of each character.

For example:
  • The start position for the %SCAN built-in function is the byte position at which the scan starts for alphanumeric data, or the double-byte position for UCS-2 or graphic data.
  • The length for the %SUBST built-in function represents the number of bytes or double bytes to return.
  • When the target variable for an assignment is shorter than the source data, the additional bytes or double bytes are truncated without regard for whether the result ends with a complete character.
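
The effect of byte-based processing can be sketched in Python. This is an illustration of the concept, not RPG code, and the helper names scan_bytes, subst_bytes, and is_valid_utf8 are hypothetical. It assumes UTF-8 data and 1-based positions, as in RPG.

```python
# Illustration only: byte-oriented scan and substring on UTF-8 data.
text = "\u00e1b"                 # 'á' followed by 'b': two characters
data = text.encode("utf-8")      # three bytes: C3 A1 62

def scan_bytes(buf: bytes, needle: bytes) -> int:
    """Return the 1-based byte position of needle, or 0 if not found."""
    return buf.find(needle) + 1

def subst_bytes(buf: bytes, start: int, length: int) -> bytes:
    """Return length bytes beginning at the 1-based byte position start."""
    return buf[start - 1:start - 1 + length]

def is_valid_utf8(buf: bytes) -> bool:
    """Report whether buf decodes as complete UTF-8 characters."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 'b' is the second character but the third byte.
b_position = scan_bytes(data, b"b")

# A one-byte substring splits 'á' and leaves an incomplete character.
first_byte = subst_bytes(data, 1, 1)
first_byte_is_complete = is_valid_utf8(first_byte)
```

In this sketch a one-byte substring returns only the first half of 'á', which matches the warning that the result might not end with a complete character.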

CHARCOUNT NATURAL

When CHARCOUNT NATURAL is in effect:
  • The compiler processes string operations by the natural size of each character.
  • The compiler sets the CHARCOUNT NATURAL mode for a file if the CHARCOUNT keyword is not specified for the file.

    The CHARCOUNT mode for the file affects the movement of data from RPG fields to the output buffer and key buffer used for the file operations. See CHARCOUNT(*NATURAL | *STDCHARSIZE).

Note: The CHARCOUNT NATURAL mode is only in effect for data whose type and CCSID are listed in the CHARCOUNTTYPES keyword.

For example:
  • The start position for the %SCAN built-in function represents the character to start the scan.
  • The length for the %SUBST built-in function represents the number of characters to return.
  • When the target variable for an assignment is shorter than the source data, the data is truncated at a character boundary, so that the result contains only complete characters.
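
Character-based processing can be sketched in Python in a similar way. This is an illustration of the concept, not RPG code. It assumes UTF-8 data, treats each Unicode code point as one character, and does not model blank padding of the target. The helper names subst_chars and assign_natural are hypothetical.

```python
# Illustration only: character-oriented substring and truncation.
text = "a\u00e1b"                # 'a', 'á', 'b': three characters
data = text.encode("utf-8")      # four bytes: 61 C3 A1 62

def subst_chars(s: str, start: int, length: int) -> str:
    """Return length characters beginning at the 1-based character start."""
    return s[start - 1:start - 1 + length]

def assign_natural(s: str, target_bytes: int) -> bytes:
    """Copy whole characters into a target that holds target_bytes bytes."""
    out = bytearray()
    for ch in s:
        encoded = ch.encode("utf-8")
        if len(out) + len(encoded) > target_bytes:
            break                # the next character does not fit
        out += encoded
    return bytes(out)

# The second character is 'á', regardless of how many bytes it uses.
second_char = subst_chars(text, 2, 1)

# A 2-byte target keeps only 'a'; 'á' is dropped as a whole.
natural_result = assign_natural(text, 2)

# Byte truncation to 2 bytes would instead keep half of 'á'.
byte_result = data[:2]
```

The design point is that the loop never copies part of a character: when the next character does not fit, the copy stops.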
Note:
  • The length prefix for a varying-length variable always represents the number of bytes for alphanumeric data, or the number of double bytes for UCS-2 or graphic data.
  • The %LEN built-in function always represents the number of bytes for alphanumeric data, or the number of double bytes for UCS-2 or graphic data.

    Use the %CHARCOUNT built-in function to obtain the number of characters in a string.

  • The length parameter for the %STR built-in function always represents the number of bytes.
  • When data is processed by characters, it is more likely that invalid data will be discovered. For example, if the source data for the %SUBST built-in function contains an invalid UTF-8 character, the invalid data might not be noticed using the STDCHARSIZE mode, but the invalid data would cause an exception with status code 125 using the NATURAL mode.
  • However, there is no guarantee that invalid data will always be discovered using the NATURAL mode.
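
The difference between a length in bytes or double bytes and a count of characters, and the way character processing can expose invalid data, can be sketched in Python. This is an illustration of the concept, not RPG code. The helper names are hypothetical, and the Python UnicodeDecodeError is only an analogy for the RPG exception described above.

```python
# Illustration only: storage length versus character count.
text = "a\u00e1b"                          # three characters

def byte_length(s: str) -> int:
    """Length in bytes of the UTF-8 form of s."""
    return len(s.encode("utf-8"))

def double_byte_length(s: str) -> int:
    """Length in 2-byte units of the UTF-16 form of s."""
    return len(s.encode("utf-16-be")) // 2

def count_chars_checked(buf: bytes) -> int:
    """Count characters in UTF-8 bytes; raises UnicodeDecodeError if invalid."""
    return len(buf.decode("utf-8"))

utf8_bytes = byte_length(text)             # 4 bytes for 3 characters
emoji_units = double_byte_length("\U0001F600")  # 2 units for 1 character

# Invalid data: 'a' followed by the first byte of a 2-byte character.
invalid = b"a\xc3"

# Byte-oriented slicing does not look at character structure.
unchecked = invalid[:1]

# Character-oriented counting has to decode, so the bad byte is detected.
try:
    count_chars_checked(invalid)
    detected = False
except UnicodeDecodeError:
    detected = True
```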

How to control the CHARCOUNT processing mode