Protecting HDFS data
By default, business data contained in raw events, time series, and summary events is written to the HDFS data lake without any obfuscation. You can define rules to obfuscate or remove sensitive data from the events before this data is written to the data lake.
Objectives
Protecting your data can help you meet your objectives for GDPR compliance.
Important: You can protect only business data and user names. Any rule that is defined on any other field is
invalid and causes all rules to be ignored.
You can define three types of rules.
- Anonymization
- Any means of identifying the business data is irreversibly destroyed.
- Pseudonymization
- Identifiable business data is replaced with a reversible and consistent value in such a way that additional information is required to re-identify the data.
- Data removal
- The business data is removed from the event that is being processed.
Rule definitions
The rules are provided as a JSON file with the following
format.
{
  "actions": [
    rule1,
    rule2,
    ...,
    ruleN
  ]
}
Each rule is a JSON object. Its content depends on the protection mode you choose for the
business data.
- Anonymization
- In the following rule, each field is a string that is enclosed in quotation marks and contains the field path in JSON path format. To protect the data, the values of the fields that are listed in the fields property are hashed.
Note: Not all of the listed fields might be present in each event at run time. Only the listed fields that can be matched in the event are hashed.
{ "type": "hash", "fields": [ fieldPath1, fieldPath2, ... fieldPathN ] }
- Pseudonymization
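- In the following rule, each field is a string that is enclosed in quotation marks and contains the field path in JSON path format. To protect the data, the values of the listed fields are encrypted with the AES key and initialization vector that are given in the params property.
Note: Not all of the listed fields might be present in each event at run time. Only the listed fields that can be matched in the event are encrypted.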
{ "type": "encrypt", "params": { "key": "AES_key", "iv": "Initialization Vector" }, "fields": [ fieldPath1, fieldPath2, ... fieldPathN ] }
- Data removal
- In the following rule, each field is a string that is enclosed in quotation marks and contains the field path in JSON path format. To protect the data, the listed fields are removed.
Note: Not all of the listed fields might be present in each event at run time. Only the listed fields that can be matched in the event are removed.
{ "type": "remove", "fields": [ fieldPath1, fieldPath2, ... fieldPathN ] }
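Because an invalid rule causes all rules to be ignored, it can help to check a rules document locally before you upload it. The following Python sketch checks only the structure that is described in this section: a top-level actions list, a type of hash, encrypt, or remove, a fields list of strings, and key and iv parameters for encrypt rules. The function name check_rules and the message wording are hypothetical. The sketch does not verify that a field path points to business data or a user name.

```python
ALLOWED_TYPES = {"hash", "encrypt", "remove"}


def check_rules(document):
    """Return a list of structural problems found in a parsed rules document."""
    actions = document.get("actions") if isinstance(document, dict) else None
    if not isinstance(actions, list):
        return ['top-level "actions" must be a list of rules']

    problems = []
    for index, rule in enumerate(actions):
        label = f"rule {index + 1}"
        if not isinstance(rule, dict):
            problems.append(f"{label}: must be a JSON object")
            continue
        rule_type = rule.get("type")
        if rule_type not in ALLOWED_TYPES:
            problems.append(f"{label}: unknown type {rule_type!r}")
        fields = rule.get("fields")
        if not isinstance(fields, list) or not all(isinstance(f, str) for f in fields):
            problems.append(f'{label}: "fields" must be a list of strings')
        if rule_type == "encrypt":
            params = rule.get("params")
            if not isinstance(params, dict) or not {"key", "iv"}.issubset(params):
                problems.append(f'{label}: "params" must contain "key" and "iv"')
    return problems


# Illustrative input: the second rule lacks the "iv" parameter.
rules = {
    "actions": [
        {"type": "hash", "fields": ["$['trackedFields']['MyStringField.string']"]},
        {"type": "encrypt", "params": {"key": "placeholder"}, "fields": []},
    ]
}
for problem in check_rules(rules):
    print(problem)
```

An empty list of problems means only that the document has the expected shape. The server might apply further checks that this sketch does not cover.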
Example of data protection rules
{
  "actions": [
    {
      "type": "remove",
      "fields": [
        "$['trackedFields']['MyStringField.string']"
      ]
    },
    {
      "type": "hash",
      "fields": [
        "$['trackedFields']['MyStringField.string']"
      ]
    },
    {
      "type": "encrypt",
      "params": {
        "key": "MTIzNDU2NzgxMjM0NTY3OA==",
        "iv": "MTIzNDU2Nzg4NzY1NDMyMQ=="
      },
      "fields": [
        "$['trackedFields']['MyStringField.string']",
        "$['data.at1523259151214']['MyString.string']"
      ]
    }
  ]
}
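The key and iv values in the example are Base64 strings that each decode to 16 bytes. The following Python sketch shows one way to produce random values of the same shape for your own encrypt rule. It assumes that the product expects a Base64-encoded 16-byte AES key and a Base64-encoded 16-byte initialization vector. That assumption is inferred from the sample values only and is not stated in this topic. The function name new_encrypt_params is hypothetical.

```python
import base64
import json
import secrets


def new_encrypt_params(key_bytes=16, iv_bytes=16):
    """Return random "key" and "iv" values as Base64 text.

    The 16-byte sizes mirror the sample values in the example; they are an
    assumption, not a documented requirement.
    """
    return {
        "key": base64.b64encode(secrets.token_bytes(key_bytes)).decode("ascii"),
        "iv": base64.b64encode(secrets.token_bytes(iv_bytes)).decode("ascii"),
    }


# The sample values from the example decode to 16 bytes each.
sample_key = base64.b64decode("MTIzNDU2NzgxMjM0NTY3OA==")
sample_iv = base64.b64decode("MTIzNDU2Nzg4NzY1NDMyMQ==")
print(len(sample_key), len(sample_iv))

# Build an encrypt rule with fresh values and the field path from the example.
rule = {
    "type": "encrypt",
    "params": new_encrypt_params(),
    "fields": ["$['trackedFields']['MyStringField.string']"],
}
print(json.dumps({"actions": [rule]}, indent=2))
```

Pseudonymization is reversible only for someone who holds the key and the initialization vector, so store the generated values securely and separately from the data lake.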
Uploading rules
You can upload data protection rules by using a JSON file.
- New in 18.0.2
- The API endpoint for uploading rules uses NodePort services.
https://<ICP_IP>:<Admin_Node_Port>/api/upload
curl -k -X POST -u <username>:<password> https://mycluster.icp:<admin_node_port>/api/actions -H 'content-type: multipart/form-data' -F "file=@/users/bai/validRules.json;type=application/json"
- For 18.0.0 and 18.0.1
- The file size is limited to 4096 bytes. The API endpoint for uploading rules differs depending on whether you use Ingress or NodePort services.
- Ingress: https://<Ingress_IP>/ibm-bai-<release name>/admin/api/actions
- NodePort: https://<ICP_IP>:<Admin_Node_Port>/api/upload
curl -k -X POST -u <username>:<password> https://mycluster.icp/ibm-bai-<release name>/admin/api/actions -H 'content-type: multipart/form-data' -F "file=@/users/bai/validRules.json;type=application/json"
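Before you run the curl command, you can check locally that the rules file parses as JSON and fits within the 4096-byte limit that applies to 18.0.0 and 18.0.1. The following Python sketch is one way to do that. The function name ready_for_upload is hypothetical, and treating a file of exactly 4096 bytes as acceptable is an assumption.

```python
import json
import os
import tempfile

MAX_RULES_BYTES = 4096  # limit stated for 18.0.0 and 18.0.1


def ready_for_upload(path, max_bytes=MAX_RULES_BYTES):
    """Return (ok, message) after checking the file size and JSON syntax."""
    size = os.path.getsize(path)
    if size > max_bytes:  # assumption: exactly max_bytes is still accepted
        return False, f"{size} bytes exceeds the {max_bytes}-byte limit"
    try:
        with open(path, encoding="utf-8") as handle:
            json.load(handle)
    except json.JSONDecodeError as error:
        return False, f"not valid JSON: {error}"
    return True, f"{size} bytes, valid JSON"


# Demonstration with a temporary file that stands in for validRules.json.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "validRules.json")
    demo_rules = {
        "actions": [
            {"type": "remove", "fields": ["$['trackedFields']['MyStringField.string']"]}
        ]
    }
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump(demo_rules, handle)
    demo_result = ready_for_upload(demo_path)
    print(demo_result)
```

If the check passes, pass the same file path to the -F "file=@..." option of the curl command.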