UTF-8 Special Characters being replaced by Question Marks (�) in Pipelines

Troubleshooting

Problem

This behavior indicates that your Java Runtime Environment (JRE) is using a non-UTF-8 character set that cannot represent these special characters, so each one is replaced by a question mark (?) or by � (the U+FFFD REPLACEMENT CHARACTER).
By default, StreamSets Data Collector (SDC) tries to set your JAVA_OPTS so that the JVM uses UTF-8. However, JAVA_OPTS may also be set outside of SDC, overriding one or both of the JVM parameters that tell the JRE which encoding to use at runtime: file.encoding and sun.jnu.encoding.
1. If these two parameters are not specified, the JRE falls back to its default character set.
2. The default character set can vary between JDKs and can also depend on the locale configured in your operating system.
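
Because the default can depend on the operating system locale, it can help to check which character map the shell that launches the JVM is using. The following is a minimal sketch, assuming a POSIX shell and the standard `locale` utility; the helper name `is_utf8_charmap` is hypothetical and not part of SDC.

```shell
# Hypothetical helper (not part of SDC): succeeds when a charmap name denotes UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Show the locale variables a JVM started from this shell would inherit.
echo "LANG=${LANG:-<unset>} LC_ALL=${LC_ALL:-<unset>}"

# `locale charmap` prints the character map of the current locale.
# Fall back to "unknown" if the utility is missing or fails.
if command -v locale > /dev/null 2>&1; then
    charmap="$(locale charmap 2> /dev/null || echo unknown)"
else
    charmap="unknown"
fi

if is_utf8_charmap "$charmap"; then
    echo "Locale charmap is UTF-8"
else
    echo "Locale charmap is '$charmap'; a JVM started here may not default to UTF-8"
fi
```

A non-UTF-8 result here does not prove that SDC is affected, since explicit file.encoding and sun.jnu.encoding parameters take precedence over the locale-derived default.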

Symptom

When running an SDC pipeline that processes records with Strings containing UTF-8 special characters, these special characters are replaced by question marks (? or �) in your pipeline’s records.

Example

Input record:

{"productName": "My Product™"}

Output record:

{"productName": "My Product�"}

Resolving The Problem

Check what JRE character encoding is being used at runtime

You can run this script to see what encoding is being set when SDC starts:

# load the JAVA_OPTS and SDC_JAVA_OPTS environment variables that are used when starting SDC
source "$SDC_HOME/libexec/sdcd-env.sh"
echo "$SDC_JAVA_OPTS"
source "$SDC_HOME/libexec/sdc-env.sh"
echo "$SDC_JAVA_OPTS"
export JAVA_OPTS="${SDC_JAVA_OPTS}"
echo "$JAVA_OPTS"

# compile and run a small Java program that prints the encoding used by your JRE
pushd /tmp > /dev/null &&
echo 'import java.nio.charset.Charset;

class CharsetTest {

    public static void main(String[] args) {
        System.out.println("file.encoding:\t" + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding:\t" + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset:\t" + Charset.defaultCharset().displayName());
    }

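}' > CharsetTest.java && javac CharsetTest.java && java $JAVA_OPTS CharsetTest && popd > /dev/null
# JAVA_OPTS is passed to java explicitly because the java launcher does not read that variable from the environment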

This may output something like:

```
file.encoding: ANSI_X3.4-1968
sun.jnu.encoding: ANSI_X3.4-1968
default charset: US-ASCII
```

The above example indicates that your JRE is using ASCII rather than UTF-8.
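
If the output is not UTF-8, it can be useful to see whether an options string already carries encoding parameters that might be overriding the SDC defaults. The sketch below is one way to list them, assuming a POSIX shell; the function name `list_encoding_flags` is hypothetical, and `JAVA_TOOL_OPTIONS` is included as an assumption because it is another environment variable the JVM reads, not because the original article mentions it.

```shell
# Hypothetical helper (not part of SDC): print every -Dfile.encoding and
# -Dsun.jnu.encoding flag found in an options string, one per line, in
# order of appearance. Assumption: when a property is repeated, the later
# occurrence typically wins, so the order is worth inspecting.
list_encoding_flags() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -E '^-D(file|sun\.jnu)\.encoding=' || true
}

# Inspect the variables that may carry JVM options in the current shell.
for var in JAVA_OPTS SDC_JAVA_OPTS JAVA_TOOL_OPTIONS; do
    eval "value=\${$var:-}"
    echo "$var:"
    list_encoding_flags "$value"
done
```

An empty list under every variable suggests that no encoding parameters are being passed at all, in which case the JRE falls back to its default character set.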

To force your JRE to use your desired character set:
1. Append the following parameters to SDC_JAVA_OPTS in your $SDC_HOME/libexec/sdc-env.sh and $SDC_HOME/libexec/sdcd-env.sh files to force UTF-8 encoding in your JVM:
  - -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
2. Check that these parameters are being applied correctly by running the above script again. Your output should now look like this (if the default character set is still not UTF-8, it shouldn’t matter as long as the other two parameters are set to UTF-8):
  - file.encoding: UTF-8
  - sun.jnu.encoding: UTF-8
  - default charset: UTF-8
3. Restart your SDC service for these changes to be applied.
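
The append in step 1 can be sketched as a small shell helper that avoids adding the same parameter twice if the environment file is sourced more than once. This is an illustration under assumptions rather than the way SDC itself manages its options: the function name `with_utf8_flags` is hypothetical, and the sample `-Xmx`/`-Xms` values are placeholders.

```shell
# Hypothetical helper (not part of SDC): echo the given options string with
# the two UTF-8 flags appended, skipping any flag already present verbatim.
with_utf8_flags() {
    opts="$1"
    for flag in -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8; do
        case " $opts " in
            *" $flag "*) ;;                        # already present, leave as-is
            *) opts="${opts:+$opts }$flag" ;;      # append, with a space if needed
        esac
    done
    printf '%s\n' "$opts"
}

# A line like the following could be placed near the end of sdc-env.sh and sdcd-env.sh:
# export SDC_JAVA_OPTS="$(with_utf8_flags "${SDC_JAVA_OPTS:-}")"

# Demonstration with placeholder options:
with_utf8_flags "-Xmx1024m -Xms1024m"
```

A plain `export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"` achieves the same result; the helper only guards against duplicates. Note that it does not remove an existing flag with a different value, it only appends the UTF-8 one after it.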

Document Location

Worldwide

