IBM Support

Character truncation when SELECTing data outside the invariant character set

Troubleshooting


Problem

When querying fields containing characters outside the invariant character set, characters at the end of the data may be truncated.

Environment

Linux OS
IBM i Access Client Solutions - Application Package for Linux

Diagnosing The Problem

The problem is easily recreated using unixODBC's "isql" utility:
amarquis@LAPTOP-Ubuntu18:~$ isql myODBCDSN myUSRPRF myPASSWD
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL> create table testchar (c varchar(14))
[ISQL]INFO: SQLExecute returned SQL_SUCCESS_WITH_INFO
SQLRowCount returns 0
SQL> insert into testchar values('Champs-Élysées')
SQLRowCount returns 1
SQL> select * from testchar
+---------------+
| C             |
+---------------+
| Champs-Élysé|
+---------------+
SQLRowCount returns -1
1 rows fetched
SQL> quit
Notice that the results of the SELECT statement truncated the 'es' at the end of the string.
Running the same query from the IBM i Access Client Solutions "Run SQL Scripts" function will return the complete CHAR string because it is a Unicode application. 
The unixODBC project also provides a Unicode-enabled command that should not result in CHAR truncation - 'iusql'.

Resolving The Problem

The problem is that the character string contains 14 bytes of EBCDIC CHAR data with non-7bit ASCII characters.
When converted to UTF-8 (the default encoding on all modern Linux distributions) it will be more than 14 bytes.
If the application uses the column description to determine the buffer size to allocate (ie. 14 bytes), there will be truncation.
In Windows, the ODBC driver only provides wide APIs and UTF-16 can store all Unicode values inside the Basic Multilingual Plan in 2 bytes (or one "character"). If the application is not wide, it will get converted to the default ANSI code page which, 99.9% of the time, is a single-byte code page like CP1252 (not UTF-8) and so 1 EBCDIC "character" becomes 1 ANSI/UTF-16 "character".
There is no standardized way to determine an adequate buffer size to allocate when there is varying-length encodings involved. Even UTF-16 has issues as there are Unicode code points which require 4 bytes in UTF-16, though there are very few EBCDIC code pages which contain such characters.
One option may be to set the application CCSID to 1252 or 923 (ISO-8859-15) in the DSN (ie. CCSID=923), but this may require UTF-8 conversion in customer applications.
Another option is to cast the column to a UTF-8 column of appropriate size. For example:
SQL> select cast(c as CHAR(16)) from testchar
+-----------------+
| 00001           |
+-----------------+
| Champs-Élysées|
+-----------------+
SQLRowCount returns -1
1 rows fetched
SQL>
Creating a view over the table with CAST and then querying the view may result in fewer application changes being needed. For example:
SQL> Create view testcharOK as (select CAST(c as CHAR(16)) as c from testchar)
SQLRowCount returns 0
SQL> select * from testcharOK
+-----------------+
| C               |
+-----------------+
| Champs-Élysées|
+-----------------+
SQLRowCount returns -1
1 rows fetched
SQL>

Document Location

Worldwide

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SWG60","label":"IBM i"},"ARM Category":[{"code":"a8m0z0000000CTzAAM","label":"Data Access->Access for Linux"}],"ARM Case Number":"TS003572527","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB57","label":"Power"}}]

Document Information

Modified date:
23 April 2020

UID

ibm16198368