Protecting HDFS data

By default, business data contained in raw events, time series, and summary events is written to the HDFS data lake without any obfuscation. You can define rules to obfuscate or remove sensitive data from the events before this data is written to the data lake.

Objectives

Protecting your data can help you meet your objectives for GDPR compliance.

Important: You can protect only business data and user names. Any rule that is defined on any other field is invalid and causes all rules to be ignored.

You can define three types of rules.

Anonymization: Any means of identifying the business data is irreversibly destroyed.
Pseudonymization: Identifiable business data is replaced with a reversible and consistent value in such a way that additional information is required to re-identify the data.
Data removal: The business data is removed from the event that is being processed.

Rule definitions

The rules are provided as a JSON file with the following format.

{
   "actions": [ 
      rule1, 
      rule2, 
      ..., 
      ruleN 
   ]
}

Each rule is a JSON object. Its content depends on the protection mode you choose for the business data.

Anonymization

In the following rule, each field is a quoted string that contains the field path in JSON path format. To protect the data, the values of the fields that are listed in the fields property are hashed.

Note: The listed fields might not all be present in each event at run time. Only the listed fields that can be matched in the event are hashed.

{
  "type": "hash",
  "fields": [
    fieldPath1,
    fieldPath2,
    ...,
    fieldPathN
  ]
}
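
As an illustration only, the following Python sketch shows how a hash rule could behave on an event that is held as a nested dictionary. This topic does not name the hash algorithm, so SHA-256 is an assumption. The helper names parse_path and apply_hash_rule, the sample event, and the simplified bracket-only path syntax are also hypothetical.

```python
import hashlib
import re

# Matches one ['segment'] step of a bracket-notation JSON path.
_SEGMENT = re.compile(r"\['([^']*)'\]")


def parse_path(path):
    """Split "$['a']['b']" into ["a", "b"] (bracket notation only)."""
    if not path.startswith("$"):
        raise ValueError("path must start with $: " + path)
    return _SEGMENT.findall(path)


def apply_hash_rule(event, rule):
    """Hash, in place, every listed field that exists in the event."""
    for path in rule["fields"]:
        keys = parse_path(path)
        if not keys:
            continue
        parent = event
        for key in keys[:-1]:
            parent = parent.get(key) if isinstance(parent, dict) else None
            if parent is None:
                break
        if not isinstance(parent, dict) or keys[-1] not in parent:
            continue  # field not present in this event: nothing to hash
        value = str(parent[keys[-1]]).encode("utf-8")
        # SHA-256 is an assumption, chosen only to illustrate one-way hashing.
        parent[keys[-1]] = hashlib.sha256(value).hexdigest()
    return event


event = {"trackedFields": {"MyStringField.string": "Jane Doe"}}
rule = {
    "type": "hash",
    "fields": [
        "$['trackedFields']['MyStringField.string']",
        "$['trackedFields']['Absent.string']",
    ],
}
apply_hash_rule(event, rule)
```

The second path does not match anything in the sample event, so it is skipped without error, which mirrors the behavior described in the note.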

Pseudonymization

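In the following rule, each field is a quoted string that contains the field path in JSON path format. To protect the data, the values of the fields that are listed in the fields property are encrypted with the AES key and initialization vector that are set in the params property.

Note: The listed fields might not all be present in each event at run time. Only the listed fields that can be matched in the event are encrypted.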

{
  "type": "encrypt",
  "params": {
    "key": "AES_key",
    "iv": "Initialization Vector"
  },
  "fields": [
    fieldPath1,
    fieldPath2,
    ...,
    fieldPathN
  ]
}
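
The sample rules in "Example of data protection rules" use key and iv values that decode from Base64 to 16-byte strings. Assuming that this is the expected format (this topic does not state it), a random key and initialization vector could be generated as in the following Python sketch. The function name new_encrypt_params is hypothetical.

```python
import base64
import secrets


def new_encrypt_params(num_bytes=16):
    """Return a params object with a random Base64-encoded key and IV.

    The 16-byte length and the Base64 encoding are assumptions inferred
    from the sample values, not a documented requirement.
    """
    key = secrets.token_bytes(num_bytes)
    iv = secrets.token_bytes(num_bytes)
    return {
        "key": base64.b64encode(key).decode("ascii"),
        "iv": base64.b64encode(iv).decode("ascii"),
    }


params = new_encrypt_params()
```

Because pseudonymized values are reversible, anyone who holds the key can re-identify the data, so the generated values should be stored as carefully as the data itself.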

Data removal

In the following rule, each field is a quoted string that contains the field path in JSON path format. To protect the data, the listed fields are removed from the event.

Note: The listed fields might not all be present in each event at run time. Only the listed fields that can be matched in the event are removed.

{
  "type": "remove",
  "fields": [
    fieldPath1,
    fieldPath2,
    ...,
    fieldPathN
  ]
}
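
Writing the rules file by hand makes it easy to forget the quotation marks around a field path. One way to avoid that, sketched here in Python, is to build the three rule shapes as dictionaries and let a JSON serializer produce the file content. The function names and the field name Secret.string are hypothetical.

```python
import json


def hash_rule(fields):
    """Anonymization rule: listed fields are hashed."""
    return {"type": "hash", "fields": list(fields)}


def remove_rule(fields):
    """Data removal rule: listed fields are removed."""
    return {"type": "remove", "fields": list(fields)}


def encrypt_rule(key, iv, fields):
    """Pseudonymization rule: listed fields are encrypted."""
    return {
        "type": "encrypt",
        "params": {"key": key, "iv": iv},
        "fields": list(fields),
    }


def build_rules(*rules):
    """Wrap rule objects in the top-level "actions" array and serialize."""
    return json.dumps({"actions": list(rules)}, indent=2)


rules_json = build_rules(
    remove_rule(["$['trackedFields']['Secret.string']"]),
    hash_rule(["$['trackedFields']['MyStringField.string']"]),
)
```

The serializer quotes every field path, so the resulting text is always well-formed JSON.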

Example of data protection rules

{
   "actions": [
   {
       "type": "remove",
       "fields": [
         "$['trackedFields']['MyStringField.string']"
       ]
   },
   {
       "type": "hash",
       "fields": [
         "$['trackedFields']['MyStringField.string']"
       ]
   },
   {
       "type": "encrypt",
       "params": {
         "key": "MTIzNDU2NzgxMjM0NTY3OA==",
         "iv": "MTIzNDU2Nzg4NzY1NDMyMQ=="
       },
       "fields": [
         "$['trackedFields']['MyStringField.string']",
         "$['data.at1523259151214']['MyString.string']"
       ]
   }
   ]
}
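
Because one invalid rule causes all rules to be ignored, a local check before upload can be useful. The following Python sketch tests only the structure described in this topic: valid JSON, an actions array, a known rule type, fields given as strings, and params for encrypt rules. It is not the product's own validation and it cannot tell whether a path points to business data. The function name check_rules is hypothetical.

```python
import json

ALLOWED_TYPES = {"hash", "encrypt", "remove"}


def check_rules(text):
    """Return a list of problems found in a rules document (empty if none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return ["not valid JSON: " + str(err)]
    if not isinstance(doc, dict) or not isinstance(doc.get("actions"), list):
        return ['top level must be an object with an "actions" array']
    problems = []
    for i, rule in enumerate(doc["actions"]):
        if not isinstance(rule, dict):
            problems.append("rule %d: not an object" % i)
            continue
        rule_type = rule.get("type")
        if not isinstance(rule_type, str) or rule_type not in ALLOWED_TYPES:
            problems.append("rule %d: unknown type %r" % (i, rule_type))
        fields = rule.get("fields")
        if not isinstance(fields, list) or not all(
            isinstance(f, str) for f in fields
        ):
            problems.append("rule %d: fields must be an array of strings" % i)
        if rule_type == "encrypt":
            params = rule.get("params")
            if not isinstance(params, dict) or not {"key", "iv"} <= set(params):
                problems.append(
                    "rule %d: encrypt needs params.key and params.iv" % i
                )
    return problems
```

An unquoted field path or a doubled opening brace is reported as invalid JSON, which catches the most common hand-editing mistakes.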

Uploading rules

You can upload data protection rules by using a JSON file.

New in 18.0.2

The API endpoint for uploading rules uses NodePort services.

https://<ICP_IP>:<Admin_Node_Port>/api/actions

curl -k -X POST -u <username>:<password> \
  https://mycluster.icp:<admin_node_port>/api/actions \
  -H 'content-type: multipart/form-data' \
  -F "file=@/users/bai/validRules.json;type=application/json"

For 18.0.0 and 18.0.1

The file size is limited to 4096 bytes. The API endpoint for uploading rules differs depending on whether you use Ingress or NodePort services.

Ingress: https://<Ingress_IP>/ibm-bai-<release name>/admin/api/actions

NodePort: https://<ICP_IP>:<Admin_Node_Port>/api/actions

curl -k -X POST -u <username>:<password> \
  https://mycluster.icp/ibm-bai-<release name>/admin/api/actions \
  -H 'content-type: multipart/form-data' \
  -F "file=@/users/bai/validRules.json;type=application/json"
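
For 18.0.0 and 18.0.1, a rules file that is close to the 4096-byte limit can often be reduced by removing indentation. The following Python sketch measures the UTF-8 size of the rules text and re-serializes it without whitespace. It assumes that the limit is inclusive and counts the bytes of the file as uploaded. The function names are hypothetical.

```python
import json

MAX_RULES_BYTES = 4096  # limit stated for 18.0.0 and 18.0.1


def encoded_size(text):
    """Size of the text in bytes when written as UTF-8."""
    return len(text.encode("utf-8"))


def compact(text):
    """Re-serialize JSON text without indentation or extra spaces."""
    return json.dumps(
        json.loads(text), separators=(",", ":"), ensure_ascii=False
    )


def fits_limit(text, limit=MAX_RULES_BYTES):
    """Return True if the text is no larger than limit bytes (assumed inclusive)."""
    return encoded_size(text) <= limit
```

Compacting changes only the whitespace, so the rules themselves are unchanged.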