Say you monitor metrics daily and have anomaly detection jobs that run every day at 4 am, pulling data from your data warehouse.
Let’s say your data warehouse is refreshed twice a day, at 1 am and 1 pm. Last night, during the 1 am run, some pipelines failed. As a result, some tables hold only partial data for yesterday.
Now when the anomaly jobs run at 4 am, they generate lots of false anomalies. When these jobs compare actual values against expected values for yesterday, many metrics will sit far below expectations simply because the data is incomplete.
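To see why this happens, here is a minimal sketch of a naive threshold-style detector flagging a partially loaded day as an anomaly. The function name, metric, and tolerance are illustrative assumptions, not Cuebook's actual implementation.

```python
# Hypothetical illustration: a naive detector mistakes partial data for an anomaly.
# Names and the tolerance value are assumptions, not Cuebook's actual logic.

def is_anomaly(actual: float, expected: float, tolerance: float = 0.2) -> bool:
    """Flag a metric when it deviates from expectation by more than `tolerance`."""
    return abs(actual - expected) / expected > tolerance

expected_orders = 10_000   # expected value learned from historical data
actual_orders = 4_200      # 1 am pipeline failed mid-run: only partial data loaded

print(is_anomaly(actual_orders, expected_orders))  # flags an "anomaly" -> false alarm
```

The metric itself is healthy; only the load is incomplete, yet the detector cannot tell the difference.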
We faced this problem early on. One option was to do nothing and treat data quality as a prerequisite for anomaly detection. We didn’t take that option.
We reminded ourselves of the following:
- Design for failure. Assume pipelines will fail, not once but repeatedly.
- Our job doesn’t end at generating anomalies; it starts there. It is our job to deliver a high signal-to-noise ratio for anomalies.
And that’s what we did. Cuebook anomaly detection gracefully handles scenarios like the one described above.
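The post doesn't detail the mechanism, but one common way to handle this is a completeness check before the anomaly run: compare yesterday's row count against a trailing baseline and hold back anomalies for days that look partially loaded. The sketch below assumes you can query row counts per day; the function name, table, and threshold are hypothetical.

```python
# A minimal sketch of a pre-flight completeness guard, assuming per-day row
# counts are queryable. Names and the 0.8 threshold are assumptions, not
# Cuebook's actual implementation.

from statistics import mean

def looks_complete(yesterday_rows: int, trailing_rows: list,
                   min_ratio: float = 0.8) -> bool:
    """Treat yesterday as complete if its row count reaches at least
    `min_ratio` of the trailing average (e.g. the previous 7 days)."""
    baseline = mean(trailing_rows)
    return yesterday_rows >= min_ratio * baseline

trailing = [98_000, 101_000, 99_500, 100_200, 97_800, 102_000, 100_500]

print(looks_complete(100_000, trailing))  # normal day: safe to run detection
print(looks_complete(45_000, trailing))   # partial load: suppress anomalies
```

A guard like this lets the 4 am job skip or defer anomalies for incomplete partitions and pick them up after the 1 pm refresh backfills the data.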