Text encoding and the Illegal Character Detection tool

Before upgrading from Rational® Synergy 7.0 or 7.1, install and run the Illegal Character Detection tool on your database. This tool finds characters that might be modified incorrectly during the conversion to UTF-8.

Before you begin

Verify that your current version of the software requires the completion of this task.
Current Rational® Synergy version    Conversion required
7.0                                  Yes
7.1                                  Yes
7.1a                                 No

Rational Synergy 7.1a already uses UTF-8 encoding, so no encoding conversion is performed; users upgrading from 7.1a can skip this section.

About this task

When a Rational Synergy database is upgraded from 7.0 or 7.1 to Rational Synergy 7.2 or later, all text metadata (type and object names, string and text attribute values, and similar items) in that database is converted from the Windows CP1252 encoding used in previous releases to the UTF-8 encoding used in 7.2 or later.

No changes are made to the contents of files controlled in the Rational Synergy database; only the text metadata stored in Informix® or Oracle is re-encoded.

In releases 7.0 and 7.1, Rational Synergy expects text data to be encoded in CP1252 (or its Latin-1 subset). However, it is possible that some characters might have been entered in other encodings, perhaps using the Classic clients where encoding was not checked.

The following hexadecimal byte values are undefined in CP1252:
  • 0x81
  • 0x8D
  • 0x8F
  • 0x90
  • 0x9D
If any of these byte values are encountered in text metadata during database upgrade, they are converted into the sequence “\x” followed by the hexadecimal value as a string. For example, if the byte value 0x81 is encountered during upgrade, it is converted to the string “\x81”. Each such byte value encountered during upgrade is noted in the upgrade log, and a list of such occurrences is stored in the database for later retrieval.
Illegal byte value    Converted string literal
0x81                  “\x81”
0x8D                  “\x8D”
0x8F                  “\x8F”
0x90                  “\x90”
0x9D                  “\x9D”
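
The conversion rule described above can be illustrated with a short Python sketch. This is not the upgrade program's code; the function name convert_metadata is hypothetical, and the sketch assumes that Python's cp1252 codec matches the CP1252 mapping used by Rational Synergy.

```python
# Byte values that have no character assigned in CP1252 (from the list above).
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def convert_metadata(raw: bytes) -> str:
    """Decode CP1252 bytes, replacing undefined bytes with a literal \\xNN.

    Hypothetical helper for illustration only.
    """
    parts = []
    for value in raw:
        if value in UNDEFINED_CP1252:
            # Emit a backslash, 'x', and two uppercase hexadecimal digits.
            parts.append("\\x%02X" % value)
        else:
            parts.append(bytes([value]).decode("cp1252"))
    return "".join(parts)


# The resulting text can then be stored as UTF-8.
converted = convert_metadata(b"caf\xe9 \x81")
utf8_bytes = converted.encode("utf-8")
```

In this sketch, the legal byte 0xE9 becomes the character é (two bytes in UTF-8), while the undefined byte 0x81 becomes the four-character string \x81.
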
By default, the tool scans for and reports the byte values listed in the table. If users of a database have run the Classic clients in encodings other than Windows CP1252 or ISO Latin-1, you must identify the code points that differ between CP1252 and that encoding, and add them to the scan as described in step 6 of the procedure. You must include these code points because such data is not converted to the correct UTF-8 values during upgrade to 7.2 or later. For instance, if your database contains data in Latin-2 (ISO-8859-2), the following table shows some of the code points that differ between CP-1252 and ISO-8859-2.
Table 1. Code points that differ between CP-1252 and ISO-8859-2
Code point    CP-1252    ISO-8859-2
0xB1          ±          ą
0xB3          ³          ł
0xB6          ¶          ś
0xC0          À          Ŕ
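
One way to find the differing code points for another encoding is to compare the two code pages byte by byte. The following Python sketch does this with the standard codecs; the function names are hypothetical, and the scan is limited to 0xA0 through 0xFF on the assumption that Latin-2 text does not use 0x80 through 0x9F, which Python's iso8859_2 codec maps to control characters.

```python
def decode_byte(value, encoding):
    """Return the character for one byte value, or None if it is undefined."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def differing_code_points(other_encoding, base_encoding="cp1252"):
    """List byte values in 0xA0-0xFF that map to different characters."""
    return [
        value
        for value in range(0xA0, 0x100)
        if decode_byte(value, base_encoding) != decode_byte(value, other_encoding)
    ]


latin2_differences = differing_code_points("iso8859_2")
for value in latin2_differences:
    print("0x%02X  %s  %s" % (
        value,
        decode_byte(value, "cp1252"),
        decode_byte(value, "iso8859_2"),
    ))
```

The printed list includes the four code points shown in Table 1, along with the other positions where the two encodings disagree.
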

Procedure

Before upgrading to Rational Synergy 7.2 or later, use the Illegal Character Detection tool:

  1. Download the detection library db_illegal.a.

    Download the library from https://www.ibm.com/support/docview.wss?uid=swg27021595.

  2. In Rational Synergy 7.0 or 7.1, start a Classic CLI session.
  3. Run the following command to load the library:
    ccm load -a detection_library_location
    For example:
    ccm load -a /tmp/db_illegal.a
  4. Define the command for the Illegal Character Detection tool:
    ccm define detection db_illegal_detection cmd
  5. Run the Illegal Character Detection tool to begin scanning:
    ccm detection html_output_file_location
    For example:
    ccm detection /tmp/database1.html
  6. Optional: You can add your own set of illegal byte values by adding the -a or -additional option followed by one or more hexadecimal values.
    Note: Do not include spaces between multiple hexadecimal values.
    For example, to add four Latin-2 code points, use the command:
    ccm detection -a B1B3B6C0 /tmp/database1_scan.html
    This command detects the five illegal CP1252 characters and the four listed Latin-2 code points.
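
If you script the scan, the argument for the -a option in step 6 can be assembled from a list of byte values. The following Python sketch only builds the command line shown in steps 5 and 6; it does not run it. The helper name build_detection_command is hypothetical and is not part of Rational Synergy.

```python
def build_detection_command(report_path, additional=()):
    """Return the 'ccm detection' argument list for the given report file.

    additional is an optional sequence of extra byte values (0-255) to scan
    for, in addition to the five undefined CP1252 bytes scanned by default.
    """
    command = ["ccm", "detection"]
    if additional:
        for value in additional:
            if not 0 <= value <= 0xFF:
                raise ValueError("not a single byte value: %r" % (value,))
        # Two hexadecimal digits per value, with no spaces between values.
        command += ["-a", "".join("%02X" % value for value in additional)]
    command.append(report_path)
    return command


latin2_scan = build_detection_command(
    "/tmp/database1_scan.html", [0xB1, 0xB3, 0xB6, 0xC0]
)
print(" ".join(latin2_scan))
```

Building the hexadecimal string in code avoids accidentally inserting spaces between values, which the note in step 6 warns against.
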

Results

The time taken for the scan depends on the size of your database and speed of your system. It is not unusual for the scan to take several hours on a large database. If possible, run the scan when the database is less busy. The scan is read only and does not require shutting down the server or protecting the database.

The output of the Illegal Character Detection tool is in HTML and can be viewed in a browser. Each object containing illegal data as defined during the command is listed in a header block. Click the header block to view the attributes containing illegal values. The illegal characters are highlighted in red. Undefined CP1252 characters show as their hexadecimal value surrounded by angle brackets. For example, if the character hexadecimal value 81 is encountered, the report shows “<81>” in a large red font.
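
To make the report notation concrete, the following Python sketch flags illegal bytes in a single attribute value and renders undefined CP1252 bytes in the angle-bracket form described above. It is an illustration only, not the tool's implementation; the function names are hypothetical, and how the tool itself displays additional (user-supplied) byte values is not modeled here.

```python
# Byte values that have no character assigned in CP1252.
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def find_illegal_bytes(raw, additional=b""):
    """Return (offset, byte value) pairs for every flagged byte in raw."""
    flagged = UNDEFINED_CP1252 | set(additional)
    return [(offset, value) for offset, value in enumerate(raw) if value in flagged]


def render_undefined(raw):
    """Decode raw as CP1252, showing undefined bytes as <NN>."""
    parts = []
    for value in raw:
        if value in UNDEFINED_CP1252:
            parts.append("<%02X>" % value)
        else:
            parts.append(bytes([value]).decode("cp1252"))
    return "".join(parts)


hits = find_illegal_bytes(b"\xb1x\x90", additional=bytes.fromhex("B1B3B6C0"))
shown = render_undefined(b"a\x81b")
```
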

What to do next

After running the Illegal Character Detection tool, review the report to see if your database contains illegal text data. If it does, inspect the objects and their attribute values in an appropriate Rational Synergy or Change interface. You might decide to remove or correct the data manually, write a script to fix a repeating error, or leave the data without making further edits.

During database upgrade to Rational Synergy 7.2 or later, any remaining byte values that are not legal CP1252 characters are converted to the “\xNN” sequence described earlier. All other text data is assumed to be in the CP1252 encoding and is converted from that encoding to UTF-8.

