IBM Support

UTF-8 Special Characters Being Replaced by Question Marks (�) in Pipelines

Troubleshooting


Problem

  1. This behavior indicates that your Java Runtime Environment (JRE) is using a non-UTF-8 character set that cannot represent these special characters, so they are replaced by a question mark (?) or by � (U+FFFD REPLACEMENT CHARACTER).

  2. By default, SDC (IBM StreamSets Data Collector) tries to set your JAVA_OPTS to use UTF-8 encoding. However, JAVA_OPTS may also be set outside of SDC, overriding either of the following JVM parameters, which tell the JRE what encoding to use at runtime: file.encoding and sun.jnu.encoding.

    1. The JRE has a default character set which it will use if these two parameters are not specified.

    2. The default character set can vary between JDKs and can also depend on the locale configured in your operating system.
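
To see whether an options string set outside of SDC changes either encoding parameter, the string can be scanned for the relevant -D flags. The following is a minimal sketch, not part of SDC: the class JvmOptionLookup and the method effectiveProperty are hypothetical names, the parsing is a naive whitespace split that does not handle quoted values, and it assumes that when the same -D property appears more than once, the last occurrence is the one the JVM uses.

```java
import java.util.Optional;

class JvmOptionLookup {

    // Returns the value of the last -D<key>=<value> token found in the options string.
    // Assumption: tokens are separated by whitespace and contain no quoted spaces.
    static Optional<String> effectiveProperty(String javaOpts, String key) {
        String prefix = "-D" + key + "=";
        Optional<String> result = Optional.empty();
        for (String token : javaOpts.trim().split("\\s+")) {
            if (token.startsWith(prefix)) {
                // A later occurrence replaces an earlier one.
                result = Optional.of(token.substring(prefix.length()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Reads the JAVA_OPTS environment variable of the current shell, if any.
        String opts = System.getenv().getOrDefault("JAVA_OPTS", "");
        for (String key : new String[] {"file.encoding", "sun.jnu.encoding"}) {
            System.out.println(key + ": "
                    + effectiveProperty(opts, key).orElse("(not set in JAVA_OPTS)"));
        }
    }
}
```

If a parameter is reported as not set, the JRE falls back to its default character set as described above.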

Symptom

  1. When running an SDC pipeline that processes records with strings containing UTF-8 special characters, these special characters are replaced by question marks (? or �) in your pipeline’s records.

Example

Input record:

{"productName": "My Product™"}

Output record:

{"productName": "My Product�"}
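
The substitution can be reproduced outside of SDC. The sketch below is illustrative only (the class and method names are hypothetical, and the exact point where SDC or a given stage performs the conversion is not stated in this document). It shows the two ways a non-UTF-8 character set can damage the trademark sign: encoding it to US-ASCII yields a plain question mark, while decoding its UTF-8 bytes as US-ASCII yields the U+FFFD replacement character, one per undecodable byte.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {

    // Encode with a charset that cannot represent the character:
    // the encoder substitutes the byte for '?'.
    static String encodeAsAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    // Decode UTF-8 bytes with the wrong charset:
    // every byte outside the ASCII range becomes U+FFFD.
    static String decodeUtf8AsAscii(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The trademark sign (U+2122) is written as an escape so that the
        // source file compiles the same way under any compiler encoding.
        String input = "My Product\u2122";
        System.out.println(encodeAsAscii(input));
        System.out.println(decodeUtf8AsAscii(input));
    }
}
```

How the printed result looks on a terminal also depends on the terminal's own encoding, so the return values of the two methods are the more reliable thing to inspect.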

Resolving The Problem

  1. Check what JRE character encoding is being used at runtime

    1. You can run this script to see what encoding is being set when SDC starts:

      # set your JAVA_OPTS and SDC_JAVA_OPTS env variables which are used when starting SDC
      source $SDC_HOME/libexec/sdcd-env.sh
      echo $SDC_JAVA_OPTS
      source $SDC_HOME/libexec/sdc-env.sh
      echo $SDC_JAVA_OPTS
      export JAVA_OPTS="${SDC_JAVA_OPTS}"
      echo $JAVA_OPTS
      
      # compile and run a test Java program which prints what encoding is being used by your JRE
      # (JAVA_OPTS is passed explicitly because the java launcher does not read this variable on its own)
      pushd /tmp > /dev/null &&
      echo 'import java.nio.charset.Charset;
      
      class CharsetTest {
      
          public static void main(String[] args) {
              System.out.println("file.encoding:\t" + System.getProperty("file.encoding"));
              System.out.println("sun.jnu.encoding:\t" + System.getProperty("sun.jnu.encoding"));
              System.out.println("default charset:\t" + Charset.defaultCharset().displayName());
          }
      
      }' > CharsetTest.java && javac CharsetTest.java && java ${JAVA_OPTS} CharsetTest && popd > /dev/null

    2. This may output something like:

      file.encoding: ANSI_X3.4-1968
      sun.jnu.encoding: ANSI_X3.4-1968
      default charset: US-ASCII

      The above example indicates that your JRE is using ASCII rather than UTF-8.

  2. To force your JRE to use your desired character set:

    1. Append the following parameters to SDC_JAVA_OPTS in your $SDC_HOME/libexec/sdc-env.sh and $SDC_HOME/libexec/sdcd-env.sh files to force UTF-8 encoding in your JVM:

      • -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 

    2. You can check that these parameters are being applied correctly by running the above script again. Your output should now look like this (if the default character set still is not UTF-8, it shouldn’t matter as long as the other two parameters are set to UTF-8):

      • file.encoding: UTF-8
        sun.jnu.encoding: UTF-8
        default charset: UTF-8

    3. Restart your SDC service to apply these changes.
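
As an additional check beyond reading the property values, a sample string can be round-tripped through the JRE's default character set to confirm that special characters survive. This is a minimal sketch with hypothetical names (RoundTripCheck, survivesRoundTrip), not an SDC utility. It assumes the program is started with the same JVM parameters as SDC, for example by passing the contents of SDC_JAVA_OPTS on the java command line.

```java
import java.nio.charset.Charset;

class RoundTripCheck {

    // True if the string is unchanged after being encoded to bytes
    // and decoded back using the given charset.
    static boolean survivesRoundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        // U+2122 is the trademark sign used in the example record.
        String sample = "My Product\u2122";
        Charset cs = Charset.defaultCharset();
        System.out.println(cs.displayName() + " preserves sample: "
                + survivesRoundTrip(sample, cs));
    }
}
```

A result of false suggests the default character set still cannot represent the sample and that the encoding parameters are not reaching the JVM.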

Document Location

Worldwide

Line of Business: Data Platform
Business Unit: IBM Software
Product: IBM StreamSets Data Collector
Platform: Platform Independent
Version: All Version(s)

Document Information

Modified date:
15 March 2025

UID

ibm17186082