HDFS data paths

In an HDFS data lake, each type of data follows specific naming patterns.

Dates

All dates in the paths follow the YYYY-MM-DD format, with Sunday as the first day of the week. Dates are computed from the event time. (In releases earlier than 18.0.2, the current system date was erroneously used.)
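
As a minimal sketch, the date segment of a path could be derived from an event timestamp as shown here. The function name date_segment is hypothetical, and interpreting event time in UTC is an assumption, because the time zone applied to event time is not stated in this topic.

```python
from datetime import datetime, timezone


def date_segment(event_time: datetime) -> str:
    """Return the YYYY-MM-DD path segment for an event timestamp.

    Assumption: event time is interpreted in UTC. Adjust the conversion
    if your deployment uses another time zone.
    """
    if event_time.tzinfo is None:
        # Treat naive timestamps as already expressed in UTC (assumption).
        event_time = event_time.replace(tzinfo=timezone.utc)
    return event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")


print(date_segment(datetime(2018, 3, 2, 14, 30, tzinfo=timezone.utc)))  # 2018-03-02
```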

Raw events

Raw events are stored in the following path.
[path_to_your_hdfs]/ibm-bai/events/[business-events-extensions-version value]/[category value]/[date]

Example of a raw event path

ibm-bai/events/bpmn-1.0.0/bpmnx-BPD/2018-03-02/part-1-1
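
To illustrate how the segments of a raw event path map to the placeholders above, the following sketch splits such a path into its parts. The names RawEventPath and parse_raw_event_path are hypothetical helpers, not part of the product.

```python
from typing import NamedTuple


class RawEventPath(NamedTuple):
    extensions_version: str  # [business-events-extensions-version value]
    category: str            # [category value]
    date: str                # [date], in YYYY-MM-DD format
    file_name: str           # part file inside the date directory


def parse_raw_event_path(path: str) -> RawEventPath:
    """Split a raw event file path into its four trailing segments."""
    parts = path.strip("/").split("/")
    # Locate the fixed "ibm-bai/events" segments, wherever the HDFS root ends.
    for i in range(len(parts) - 1):
        if parts[i] == "ibm-bai" and parts[i + 1] == "events":
            rest = parts[i + 2:]
            break
    else:
        raise ValueError(f"not a raw event path: {path!r}")
    if len(rest) != 4:
        raise ValueError(f"expected version/category/date/file, got: {rest!r}")
    return RawEventPath(*rest)


info = parse_raw_event_path("ibm-bai/events/bpmn-1.0.0/bpmnx-BPD/2018-03-02/part-1-1")
print(info.category, info.date)  # bpmnx-BPD 2018-03-02
```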

BPMN time series

Time series paths are defined as follows.
Process time series
[path_to_your_hdfs]/ibm-bai/bpmn-timeseries/[processApplicationId value]/[processApplicationVersionId value]/process/[processId value]/[date]
Activity time series
[path_to_your_hdfs]/ibm-bai/bpmn-timeseries/[processApplicationId value]/[processApplicationVersionId value]/activity/[processId value]/[activityId value]/[date]
Tracking events time series
[path_to_your_hdfs]/ibm-bai/bpmn-timeseries/[processApplicationId value]/[processApplicationVersionId value]/tracking/[trackingGroupId value]/[date]
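
The three BPMN time series patterns can be assembled programmatically. The sketch below is one possible way to do so; the function names and the sample identifiers (app-1, v1, proc-1, and so on) are hypothetical values chosen for illustration.

```python
import posixpath

# Fixed segments shared by all BPMN time series paths.
BPMN_TIMESERIES = "ibm-bai/bpmn-timeseries"


def process_timeseries_path(root, app_id, app_version_id, process_id, date):
    """Directory that holds the process time series for one date."""
    return posixpath.join(root, BPMN_TIMESERIES, app_id, app_version_id,
                          "process", process_id, date)


def activity_timeseries_path(root, app_id, app_version_id, process_id,
                             activity_id, date):
    """Directory that holds the activity time series for one date."""
    return posixpath.join(root, BPMN_TIMESERIES, app_id, app_version_id,
                          "activity", process_id, activity_id, date)


def tracking_timeseries_path(root, app_id, app_version_id,
                             tracking_group_id, date):
    """Directory that holds the tracking events time series for one date."""
    return posixpath.join(root, BPMN_TIMESERIES, app_id, app_version_id,
                          "tracking", tracking_group_id, date)


# Hypothetical HDFS root and identifiers.
print(process_timeseries_path("/data/lake", "app-1", "v1", "proc-1", "2018-03-02"))
# /data/lake/ibm-bai/bpmn-timeseries/app-1/v1/process/proc-1/2018-03-02
```

Identifier values are passed through unchanged, so a value that contains a slash would alter the directory depth.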

BPMN summaries

Summary paths are defined as follows.
Process summary paths
[path_to_your_hdfs]/ibm-bai/bpmn-summaries-completed/[processApplicationId value]/[processApplicationVersionId value]/process/[processId value]/[date]
Activity summary paths
[path_to_your_hdfs]/ibm-bai/bpmn-summaries-completed/[processApplicationId value]/[processApplicationVersionId value]/activity/[processId value]/[activityId value]/[date]

Case time series

Time series paths are defined as follows.
Case time series
[path_to_your_hdfs]/ibm-bai/icm-timeseries/[solutionName_value]/[caseType_value]/case/[caseName_value]/[caseId_value]/[date]
Task time series
[path_to_your_hdfs]/ibm-bai/icm-timeseries/[solutionName_value]/[caseType_value]/task/[caseName_value]/[caseId_value]/[taskType_value]/[taskId_value]/[date]

Case summaries

Summary paths are defined as follows.
Case summary paths
[path_to_your_hdfs]/ibm-bai/icm-summaries-completed/[solutionName_value]/[caseType_value]/case/[caseName_value]/[caseId_value]/[date]
Task summary paths
[path_to_your_hdfs]/ibm-bai/icm-summaries-completed/[solutionName_value]/[caseType_value]/task/[caseName_value]/[caseId_value]/[taskType_value]/[taskId_value]/[date]
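
Because the case time series and case summary paths share one layout, a single pattern can recognize both. The following sketch is an illustration only: parse_case_path is a hypothetical helper, the field names are invented for readability, and the sample values (sol, ctype, cname, and so on) are placeholders.

```python
import re
from typing import Dict, Optional

_CASE_PATH = re.compile(
    r"ibm-bai/icm-(?P<dataset>timeseries|summaries-completed)/"
    r"(?P<solution_name>[^/]+)/(?P<case_type>[^/]+)/"
    r"(?P<level>case|task)/(?P<case_name>[^/]+)/(?P<case_id>[^/]+)"
    # Task paths carry two extra segments before the date.
    r"(?:/(?P<task_type>[^/]+)/(?P<task_id>[^/]+))?"
    r"/(?P<date>\d{4}-\d{2}-\d{2})(?:/|$)"
)


def parse_case_path(path: str) -> Dict[str, Optional[str]]:
    """Extract the path fields from a case or task path."""
    match = _CASE_PATH.search(path)
    if match is None:
        raise ValueError(f"not a case path: {path!r}")
    fields = match.groupdict()
    has_task_parts = fields["task_id"] is not None
    # A "task" path must include task segments; a "case" path must not.
    if (fields["level"] == "task") != has_task_parts:
        raise ValueError(f"segments do not match level {fields['level']!r}: {path!r}")
    return fields


fields = parse_case_path(
    "/lake/ibm-bai/icm-summaries-completed/sol/ctype/task/cname/c1/ttype/t1/2018-03-02"
)
print(fields["dataset"], fields["task_id"], fields["date"])
# summaries-completed t1 2018-03-02
```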

File names

Because data flow graphs in Flink are run in parallel, bucketer operators are parallelized into one or more parallel instances, which are called subtasks, and streams are split into one or more stream partitions.

Files under the raw, time series, and summary data paths are written inside the date directory, and each file name is a part label. The label identifies the subtask that writes the data and carries a rolling counter that references the bucket created by that subtask. The part label follows the part-[subtaskNumber]-[rollingCounter] format. For example, the label part-1-17 indicates that the data is written by subtask 1 of the Flink Bucketer sink and is the 17th bucket that is created by that subtask. Based on that scheme, the file path for a raw event is: ibm-bai/events/bpmn-1.0.0/bpmnx-BPD/2018-03-02/part-1-17
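
The part label format can be decoded with a short pattern match. This is a minimal sketch; PartLabel and parse_part_label are hypothetical names, and the function expects a label without any state suffix.

```python
import posixpath
import re
from typing import NamedTuple


class PartLabel(NamedTuple):
    subtask: int          # [subtaskNumber]
    rolling_counter: int  # [rollingCounter]


_PART_LABEL = re.compile(r"part-(\d+)-(\d+)")


def parse_part_label(path: str) -> PartLabel:
    """Decode part-[subtaskNumber]-[rollingCounter] from a file name or path."""
    name = posixpath.basename(path)
    match = _PART_LABEL.fullmatch(name)
    if match is None:
        raise ValueError(f"not a part label: {name!r}")
    return PartLabel(int(match.group(1)), int(match.group(2)))


label = parse_part_label("ibm-bai/events/bpmn-1.0.0/bpmnx-BPD/2018-03-02/part-1-17")
print(label.subtask, label.rolling_counter)  # 1 17
```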

Part file names reflect the state of the file.
  1. The part file that is being written to in each folder bears the .in-progress suffix.
  2. After a part file is closed for writing, it receives the .pending suffix.
  3. After the state of the file is correctly saved, the pending file is moved to the finished state and has no suffix.
    Note: In very rare cases, after a job manager failure, you might observe that some old part files remain in the .in-progress or .pending state. You can ignore them because the data is correctly saved in the part file that is created after the job manager recovers.
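
When reading data from a folder, a consumer typically wants only the finished part files. The following sketch classifies files by the suffixes described above; the function names are hypothetical, and only the file name suffix is inspected.

```python
from typing import Iterable, List


def part_file_state(file_name: str) -> str:
    """Return "in-progress", "pending", or "finished" based on the suffix."""
    if file_name.endswith(".in-progress"):
        return "in-progress"
    if file_name.endswith(".pending"):
        return "pending"
    # Finished part files carry no suffix.
    return "finished"


def finished_files(file_names: Iterable[str]) -> List[str]:
    """Keep only the part files that are safe to read."""
    return [name for name in file_names if part_file_state(name) == "finished"]


names = ["part-1-15", "part-1-16.pending", "part-1-17.in-progress"]
print(finished_files(names))  # ['part-1-15']
```

Filtering this way also skips any stale .in-progress or .pending files left behind after a job manager failure.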

Decisions time series

New in 18.0.2
Time series paths for IBM® Operational Decision Manager events are defined as follows.
[path_to_your_hdfs]/ibm-bai/odm-timeseries/ruleset/[ruleAppName_value]/[ruleAppVersion_value]/[rulesetName_value]/[rulesetVersion_value]/[date]