HDFS data paths
In an HDFS data lake, each type of data follows specific naming patterns.
Dates
All dates in the paths follow the YYYY-MM-DD format, with Sunday as the first day of the week. Dates are computed based on event time. (In releases earlier than 18.0.2, the current system date was erroneously used.)
Raw events
[path_to_your_hdfs]/ibm-bai/events/[business-events-extensions-version value]/[category value]/[date]
Example of a raw event path
ibm-bai/events/bpmn-1.0.0/bpmnx-BPD/2018-03-02/part-1-1
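As an illustration of the raw event pattern, the following minimal sketch assembles the folder path for one category and one event date. The function name raw_event_path and the /data root are hypothetical and are not part of the product. The sketch assumes that the date segment is the event date in YYYY-MM-DD format, as described in the Dates section.

```python
from datetime import date


def raw_event_path(hdfs_root: str, extensions_version: str,
                   category: str, event_date: date) -> str:
    """Build the raw event folder path for one category and one event date.

    hdfs_root stands for [path_to_your_hdfs]; the helper itself is hypothetical.
    """
    return "/".join([
        hdfs_root.rstrip("/"),
        "ibm-bai",
        "events",
        extensions_version,      # [business-events-extensions-version value]
        category,                # [category value]
        event_date.isoformat(),  # YYYY-MM-DD
    ])


if __name__ == "__main__":
    print(raw_event_path("/data", "bpmn-1.0.0", "bpmnx-BPD", date(2018, 3, 2)))
```

The part files that hold the events are located inside the returned folder.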
BPMN time series
- Process time series
- [path_to_your_hdfs]/ibm-bai/bpmn-timeseries/[processApplicationId value]/[processApplicationVersionId value]/process/[processId value]/[date]
- Activity time series
- [path_to_your_hdfs]/ibm-bai/bpmn-timeseries/[processApplicationId value]/[processApplicationVersionId value]/activity/[processId value]/[activityId value]/[date]
- Tracking events time series
- [path_to_your_hdfs]/ibm-bai/bpmn-timeseries/[processApplicationId value]/[processApplicationVersionId value]/tracking/[trackingGroupId value]/[date]
BPMN summaries
- Process summary paths
- [path_to_your_hdfs]/ibm-bai/bpmn-summaries-completed/[processApplicationId value]/[processApplicationVersionId value]/process/[processId value]/[date]
- Activity summary paths
- [path_to_your_hdfs]/ibm-bai/bpmn-summaries-completed/[processApplicationId value]/[processApplicationVersionId value]/activity/[processId value]/[activityId value]/[date]
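The BPMN time series and summary patterns share the same layout and differ only in the top-level folder and in the identifiers that follow the process, activity, or tracking segment. The following sketch builds such paths under that reading. The helper name bpmn_path, its parameters, and the sample identifiers are hypothetical.

```python
from datetime import date

# Number of identifier segments that follow each kind, per the patterns above.
ID_COUNTS = {"process": 1, "activity": 2, "tracking": 1}
DATASETS = ("bpmn-timeseries", "bpmn-summaries-completed")


def bpmn_path(hdfs_root, dataset, app_id, app_version_id, kind, ids, day):
    """Build a BPMN time series or summary folder path (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if kind not in ID_COUNTS:
        raise ValueError(f"unknown kind: {kind}")
    # Tracking paths are listed for time series only, not for summaries.
    if kind == "tracking" and dataset != "bpmn-timeseries":
        raise ValueError("tracking paths exist only under bpmn-timeseries")
    if len(ids) != ID_COUNTS[kind]:
        raise ValueError(f"{kind} paths need {ID_COUNTS[kind]} identifier(s)")
    return "/".join([hdfs_root.rstrip("/"), "ibm-bai", dataset,
                     app_id, app_version_id, kind, *ids, day.isoformat()])


if __name__ == "__main__":
    # Activity time series: [processId value]/[activityId value] follow "activity".
    print(bpmn_path("/data", "bpmn-timeseries", "app1", "v1",
                    "activity", ["proc1", "act1"], date(2018, 3, 2)))
```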
Case time series
- Case time series
[path_to_your_hdfs]/ibm-bai/icm-timeseries/[solutionName_value]/[caseType_value]/case/[caseName_value]/[caseId_value]/[date]
- Task time series
[path_to_your_hdfs]/ibm-bai/icm-timeseries/[solutionName_value]/[caseType_value]/task/[caseName_value]/[caseId_value]/[taskType_value]/[taskId_value]/[date]
Case summaries
- Case summary paths
[path_to_your_hdfs]/ibm-bai/icm-summaries-completed/[solutionName_value]/[caseType_value]/case/[caseName_value]/[caseId_value]/[date]
- Task summary paths
[path_to_your_hdfs]/ibm-bai/icm-summaries-completed/[solutionName_value]/[caseType_value]/task/[caseName_value]/[caseId_value]/[taskType_value]/[taskId_value]/[date]
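Going in the other direction, a case or task folder path can be split back into its named segments. The following sketch is one possible way to do so. It assumes that no segment value contains a slash and that the path ends at the date folder. The function name parse_icm_path and the sample values are hypothetical.

```python
CASE_FIELDS = ("solutionName", "caseType", "kind", "caseName", "caseId")
TASK_EXTRA = ("taskType", "taskId")
FIELDS_BY_KIND = {
    "case": CASE_FIELDS + ("date",),
    "task": CASE_FIELDS + TASK_EXTRA + ("date",),
}
DATASETS = ("icm-timeseries", "icm-summaries-completed")


def parse_icm_path(path):
    """Split a case or task folder path into named segments (hypothetical helper)."""
    parts = path.strip("/").split("/")
    start = parts.index("ibm-bai")  # raises ValueError if the segment is absent
    dataset, *rest = parts[start + 1:]
    if dataset not in DATASETS:
        raise ValueError(f"not a case path: {path}")
    # The third segment after the dataset is the literal "case" or "task".
    kind = rest[2] if len(rest) > 2 else None
    names = FIELDS_BY_KIND.get(kind)
    if names is None or len(rest) != len(names):
        raise ValueError(f"unexpected layout: {path}")
    return {"dataset": dataset, **dict(zip(names, rest))}


if __name__ == "__main__":
    info = parse_icm_path(
        "/data/ibm-bai/icm-timeseries/Claims/Auto/task/Claim/C1/Review/T9/2018-03-02")
    print(info["taskId"], info["date"])
```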
File names
Because data flow graphs in Flink are run in parallel, bucketer operators are parallelized into one or more parallel instances, which are called subtasks, and streams are split into one or more stream partitions.
The file names under the raw, time series, and summary data paths consist of a part label that follows the date label. The part label holds both the number of the subtask that is writing the data and a rolling counter that references the created bucket. The part information follows the part-[subtaskNumber]-[rollingCounter] format. For example, the label part-1-17 indicates that the data is written from subtask 1 of the Flink Bucketer sink and is the 17th bucket that is created by that subtask. Based on that scheme, the file name for a raw event path is: ibm-bai/events/bpmn-1.0.0/bpmnx-BPD/2018-03-02/part-1-17
- The part file that is being written to in each folder bears the .in-progress suffix.
- After a part file is closed for writing, it receives the .pending suffix.
- After the state of the file is correctly saved, the pending files are moved to finished state without any suffix.
Note: In some very rare cases, after a job manager failure, you might observe that some old part files stayed in .in-progress or .pending state. You can ignore them because the data is correctly saved in the part file that is created after the job manager recovers.
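The naming scheme and the three file states can be checked programmatically. The following sketch parses a part file name into its subtask number, rolling counter, and state. It covers only the part-[subtaskNumber]-[rollingCounter] format and the two suffixes described in this section. The function name parse_part_file is hypothetical.

```python
import re

# part-[subtaskNumber]-[rollingCounter], optionally followed by a state suffix.
PART_RE = re.compile(r"^part-(\d+)-(\d+)(\.in-progress|\.pending)?$")

STATES = {None: "finished", ".in-progress": "in-progress", ".pending": "pending"}


def parse_part_file(name):
    """Return (subtask, rolling_counter, state) for a part file name."""
    match = PART_RE.match(name)
    if match is None:
        raise ValueError(f"not a part file name: {name}")
    subtask, counter, suffix = match.groups()
    return int(subtask), int(counter), STATES[suffix]


if __name__ == "__main__":
    for name in ("part-1-17", "part-1-18.pending", "part-1-19.in-progress"):
        print(name, parse_part_file(name))
```

A consumer that reads the data lake would typically process only the files that are reported as finished.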
Decisions time series
New in 18.0.2
[path_to_your_hdfs]/ibm-bai/odm-timeseries/ruleset/[ruleAppName_value]/[ruleAppVersion_value]/[rulesetName_value]/[rulesetVersion_value]/[date]