Root causes are the end of the road. A root cause should be the original event in the chain of causation, the first domino, as it were, and it should largely explain the issue. If that root cause doesn’t occur, neither should any of the proximal causes: it is the direct cause of all of them.
Root causes, of course, aren’t always clear, and correlations aren’t always exact. If you aren’t feeling confident about your answer, a probabilistic way to tease out your true confidence score is to try this thought experiment: Say your boss tells you your team will go all-in on your hypothesis and nobody’s going to check it before it goes into production, and your name will be all over it. If it’s wrong, it’s all your fault. What 0-100 confidence score would you give your hypothesis? If it’s lower than 70, keep investigating.
Common root causes of data issues include:
1. User error: We’ll start with user errors because they’re common. Perhaps someone entered the wrong schema or the wrong value, so the pipeline either can’t read the data or faithfully processes incorrect values, and now you have a task failure. (A minimal schema-check sketch appears after this list.)
2. Improperly labeled data: Sometimes rows shift in a table and the right labels end up on the wrong columns; the schema check sketched after this list will often surface this as a type mismatch.
3. Data partner missed a delivery: Also very common. You can build a bulletproof system, but you can’t control what you can’t see; if the issues live in the source data, perfectly good pipelines will misbehave. A freshness check (sketched after this list) quickly confirms whether a delivery actually arrived.
4. There’s a bug in the code: This is common when a new version of the pipeline ships. You can figure this out pretty quickly with version control such as Git (or a hosted platform like GitLab): compare the production code to a prior version and run the tests against that prior version, as sketched after this list.
5. OCR data error: Your optical character recognition system misreads the source, leading to strange (or missing) values.
6. Decayed data issue: The dataset is so out of date that it is no longer valid (the same freshness check from item 3 applies here).
7. Duplicate data issue: Often a vendor was unable to deliver new data, so the pipeline ran again over last week’s data and loaded the same rows twice. (A duplicate-detection sketch appears after this list.)
8. Permission issue: The pipeline failed because the system lacked permission to pull the data or to run a transformation.
9. Infrastructure error: Perhaps you maxed out your available memory or API call limit, your Apache Spark cluster didn’t run, or your data warehouse was uncharacteristically slow, causing the run to proceed without the data.
10. Schedule changes: Someone (or something) changed the schedule, causing the pipeline to run out of order, or not run at all.
11. Biased data set: Very tricky to sort out. There’s no good way to suss this out except by running statistical tests to see whether the data is anomalous compared to a similar, trusted data set, or by figuring out how it was collected or generated. (A distribution-comparison sketch appears after this list.)
12. Orchestrator failure: Your pipeline scheduler failed to schedule or run the job.
13. Ghost in the machine (data ex machina): It’s truly unknowable. It’s tough to admit that’s the case, but it’s true for some things. The best you can do is document what you saw and be ready for next time, when you can gather more data and start to draw correlations.
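To make items 1 and 2 concrete, here is a minimal pre-flight schema check. It is a sketch, not a standard: the file name, expected columns, and sanity bounds are all assumptions about a hypothetical orders pipeline. The point is to fail loudly when a column is missing, values don’t parse under the header they sit beneath, or numbers fall outside a plausible range, instead of letting the run proceed.

```python
import pandas as pd

# Hypothetical expected layout for an incoming orders file; adjust to your pipeline.
EXPECTED_COLUMNS = ["order_id", "order_date", "amount_usd"]


def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means the file looks healthy."""
    problems = []

    # 1. Every expected column is present (catches renamed or dropped columns).
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems  # no point checking values in columns that aren't there

    # 2. Values parse as the types we expect. Shifted or mislabeled columns
    #    usually show up here, because text lands under a numeric header.
    amounts = pd.to_numeric(df["amount_usd"], errors="coerce")
    if amounts.isna().any():
        problems.append("amount_usd contains non-numeric or missing values")
    if pd.to_datetime(df["order_date"], errors="coerce").isna().any():
        problems.append("order_date contains values that do not parse as dates")

    # 3. Simple sanity bounds catch fat-fingered entries.
    if (amounts < 0).any():
        problems.append("amount_usd contains negative values")

    return problems


if __name__ == "__main__":
    orders = pd.read_csv("orders.csv")  # hypothetical input file
    issues = validate_orders(orders)
    if issues:
        raise ValueError("Pre-flight check failed: " + "; ".join(issues))
```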
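Items 3 and 6 are both, at bottom, freshness problems: either the delivery never arrived or it is too old to trust. Here is a minimal sketch, assuming the delivered file carries an event timestamp column; the column name, file name, and 24-hour threshold are assumptions, not a standard.

```python
from datetime import timedelta

import pandas as pd

# Assumed threshold: a delivery whose newest record is older than this is suspect.
MAX_AGE = timedelta(hours=24)


def check_freshness(df: pd.DataFrame, timestamp_column: str = "event_ts") -> None:
    """Raise if the delivery is empty or its newest record is older than MAX_AGE."""
    if df.empty:
        raise ValueError("Delivery is empty: the partner may have missed the drop.")

    newest = pd.to_datetime(df[timestamp_column], utc=True).max()
    age = pd.Timestamp.now(tz="UTC") - newest
    if age > MAX_AGE:
        raise ValueError(
            f"Newest record is {age} old (threshold {MAX_AGE}): "
            "likely a missed delivery or decayed data."
        )


if __name__ == "__main__":
    delivery = pd.read_csv("partner_delivery.csv")  # hypothetical file
    check_freshness(delivery)
```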
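For item 4, the comparison is mechanical enough to script. The tag name, pipeline path, and use of pytest below are assumptions about your repository layout; the idea is simply to diff production against the last known-good version and rerun that version’s tests in a disposable worktree.

```python
import subprocess

# Hypothetical last-known-good tag and pipeline path; adjust to your repo.
GOOD_VERSION = "v1.4.0"
PIPELINE_PATH = "pipelines/orders_daily"
WORKTREE = "/tmp/pipeline-known-good"

# 1. Show what changed in the pipeline code since the last good release.
subprocess.run(
    ["git", "diff", GOOD_VERSION, "HEAD", "--", PIPELINE_PATH], check=True
)

# 2. Check the prior version out into a throwaway worktree, leaving production alone.
subprocess.run(["git", "worktree", "add", WORKTREE, GOOD_VERSION], check=True)

# 3. Run that version's tests; if they pass on the same inputs, suspect the new code.
subprocess.run(["pytest", f"{WORKTREE}/{PIPELINE_PATH}/tests"], check=True)

# 4. Clean up the worktree when you're done.
subprocess.run(["git", "worktree", "remove", WORKTREE], check=True)
```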
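For item 7, a quick duplicate check on the delivered batch tells you whether the same rows arrived twice. The business-key columns and file name here are hypothetical.

```python
import pandas as pd

# Hypothetical business key: one row per order per day.
KEY_COLUMNS = ["order_id", "order_date"]


def find_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return every row that shares its business key with another row."""
    return df[df.duplicated(subset=KEY_COLUMNS, keep=False)].sort_values(KEY_COLUMNS)


if __name__ == "__main__":
    batch = pd.read_csv("partner_delivery.csv")  # hypothetical file
    dupes = find_duplicates(batch)
    if not dupes.empty:
        print(f"{len(dupes)} duplicated rows, e.g. a rerun over last week's data:")
        print(dupes.head())
```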
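For item 11, a two-sample test against a data set you trust is one of the few quantitative footholds. This sketch compares a single numeric column with SciPy’s Kolmogorov-Smirnov test; the file names, column name, and 0.01 significance level are all assumptions.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical inputs: the suspect delivery and a reference sample you trust.
suspect = pd.read_csv("suspect_delivery.csv")
reference = pd.read_csv("trusted_reference.csv")

COLUMN = "amount_usd"   # hypothetical numeric column to compare
ALPHA = 0.01            # assumed significance threshold

# Two-sample Kolmogorov-Smirnov test: are the two samples plausibly drawn from
# the same distribution? A tiny p-value suggests the suspect data is skewed
# relative to the reference, i.e., possibly biased in collection or generation.
statistic, p_value = ks_2samp(suspect[COLUMN].dropna(), reference[COLUMN].dropna())

if p_value < ALPHA:
    print(f"Distributions differ (KS={statistic:.3f}, p={p_value:.4f}): investigate bias.")
else:
    print(f"No strong evidence of a shift (KS={statistic:.3f}, p={p_value:.4f}).")
```

A small p-value doesn’t prove bias, but it tells you the suspect data is distributed differently from the reference, which is exactly the anomaly you’d want to investigate next.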
And then, of course, there’s the reality where the root cause isn’t entirely clear. Many things are correlated, and they’re probably interdependent, but there’s no one neat answer; after making changes, you’ve fixed the data issue, though you’re not sure why.
In those cases, as with any incident, note your hypothesis in the log; when you can return to it, keep testing against historical data, and be on the lookout for new issues and more explanatory causes.