In my previous post, What is an Anomaly, we looked at why it is not simple to define what is normal for a metric. What’s normal for a metric depends on at least two factors – granularity and the amount of data. In this post, we’ll look at how the metric type also matters in anomaly detection.
Let’s classify metrics into two types – technical and business.
Technical metrics are the ones that engineering teams monitor. Examples include CPU and memory utilization and the number of API requests.
Business metrics are the ones that product and business teams monitor. Examples include daily active users and revenue. We’ll now look at why anomaly detection for a business metric is very different from that for a technical metric.
Anomaly Detection for Technical Metrics
Engineering teams use technical metrics for realtime system monitoring. System Reliability is the primary objective here – minimize system outages. Production issues must be either prevented or fixed. Engineering teams aim to minimize Mean Time To Detection (MTTD) and Mean Time To Recovery (MTTR) of production issues.
Realtime anomaly detection on technical metrics lets engineering teams identify potential issues before they become actual production issues. Anomaly detection here acts like an early warning system.
Anomaly Detection for Business Metrics
Business Significance matters
In the technical world, every production issue must be fixed. But business metrics are less about issues, and more about growth opportunities or missed opportunities. Action is optional. It’s for the business team to decide if the opportunity is worth pursuing.
Let’s say I run an online fashion store. My store has a collection of 1000 SKUs. One morning, I get two anomaly alerts:
- Yesterday’s sales for SKU-1 were 20% below normal. SKU-1 contributes 3% to my total sales.
- Yesterday’s sales for SKU-2 were 50% below normal. SKU-2 contributes 0.1% to my total sales.
Which anomaly should I act upon?
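One way to answer this is to weigh each deviation by the SKU's share of total sales. The sketch below is only an illustration of that idea, not a prescribed scoring method: the `Anomaly` class and its `impact` property are hypothetical names, and the numbers are the ones from the two alerts above.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    sku: str
    deviation: float    # fractional deviation from normal, e.g. -0.20 for 20% below
    sales_share: float  # SKU's share of total sales, e.g. 0.03 for 3%

    @property
    def impact(self) -> float:
        # Approximate change in total sales attributable to this anomaly
        return self.deviation * self.sales_share

alerts = [
    Anomaly("SKU-1", deviation=-0.20, sales_share=0.03),
    Anomaly("SKU-2", deviation=-0.50, sales_share=0.001),
]

# Rank alerts by the size of their effect on total sales, largest first
ranked = sorted(alerts, key=lambda a: abs(a.impact), reverse=True)
for a in ranked:
    print(f"{a.sku}: {a.impact:.2%} of total sales")
```

Under this assumption, SKU-1 moves total sales by about 0.6% while SKU-2 moves it by about 0.05%, so the smaller percentage drop is the more significant one for the business.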
In the technical world, the emphasis is on realtime anomaly detection to prevent production issues. If the action is not taken in time, it might result in an actual production issue. Business metrics, on the other hand, are typically reviewed on a weekly or monthly basis. Since action is optional and for the team to decide, realtime anomaly detection doesn’t fit in here. Metric review frequency actually defines the granularity at which anomaly detection should be done.
Let’s see this with an example from my previous post. Below is the hourly data of a metric. If I am monitoring this metric on a daily or a weekly basis, I won’t be interested in anomaly detection on an hourly basis. Hourly data would be too noisy for me.
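If the metric is reviewed daily, the hourly readings can be rolled up before any anomaly detection runs. Here is a minimal sketch using only the Python standard library. The `aggregate_daily` helper and the sample data are made up for illustration, and summing is an assumption that fits count-like metrics such as sales (an average might suit other metrics better).

```python
from collections import defaultdict
from datetime import datetime, timedelta

def aggregate_daily(hourly):
    """Sum (timestamp, value) pairs into one total per calendar day."""
    daily = defaultdict(float)
    for ts, value in hourly:
        daily[ts.date()] += value
    return dict(sorted(daily.items()))

# Hypothetical data: 48 hourly readings spanning two days
start = datetime(2024, 1, 1)
hourly = [(start + timedelta(hours=h), 10.0 + (h % 5)) for h in range(48)]

daily = aggregate_daily(hourly)
for day, total in daily.items():
    print(day, total)
```

Anomaly detection would then run on the two daily totals rather than on the 48 noisy hourly points.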
My anomaly detection algorithm should consider seasonalities – daily, weekly, monthly, holidays, etc. Let’s say 60% of my weekly sales happen over the weekend. Also, 40% of my monthly sales happen over the first 5 days of the month. My anomaly detection algorithm must factor this in.
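To make the weekly case concrete, here is a rough sketch that compares a day against the same weekday in previous weeks instead of against all recent days. It is one simple way to account for weekly seasonality, not a full seasonal model. The function names and the data are hypothetical, and monthly or holiday effects would need their own handling.

```python
from datetime import date, timedelta
from statistics import mean

def same_weekday_baseline(daily_sales, day, weeks=4):
    """Average sales on the same weekday over up to `weeks` previous weeks."""
    history = []
    for w in range(1, weeks + 1):
        past = day - timedelta(weeks=w)
        if past in daily_sales:
            history.append(daily_sales[past])
    return mean(history) if history else None

def deviation_from_baseline(daily_sales, day, weeks=4):
    """Fractional deviation of `day` from its same-weekday baseline."""
    baseline = same_weekday_baseline(daily_sales, day, weeks)
    if not baseline:  # no history, or a zero baseline we cannot divide by
        return None
    return (daily_sales[day] - baseline) / baseline

# Hypothetical data: five weeks starting Monday 2024-01-01. Each weekday
# sells 100 and each weekend day 375, so 60% of weekly sales fall on the weekend.
start = date(2024, 1, 1)
sales = {}
for offset in range(35):
    d = start + timedelta(days=offset)
    sales[d] = 375 if d.weekday() >= 5 else 100

saturday = date(2024, 2, 3)
naive_avg = mean(v for d, v in sales.items() if d < saturday)
naive_deviation = sales[saturday] / naive_avg - 1
seasonal_deviation = deviation_from_baseline(sales, saturday)
print(naive_deviation, seasonal_deviation)
```

Against a plain average of all earlier days, a perfectly ordinary Saturday looks like a huge spike. Against previous Saturdays, it shows no deviation at all.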
Let’s say I am a category manager for t-shirts in my fictitious online store. One morning, I see two anomalies:
- Yesterday’s sales for SKU-99 were 20% below normal. SKU-99 falls under my t-shirt category and contributes 1% to total sales.
- Yesterday’s sales for SKU-55 were 50% below normal. SKU-55 falls under the shoes category and contributes 2% to total sales.
Which alert is relevant for me as a category manager?
As a category manager, I’d want to define what’s relevant for me. I want to monitor at the sub-category level instead of at the SKU level. My colleague might want to monitor sales by brand. Tomorrow I might want to monitor sales of a new SKU that we launched last week. Next week I might want to monitor sales of yellow t-shirts in the $5 – $10 price range.
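One way to picture this flexibility is to treat each monitor as a user-defined filter over the catalog. The sketch below is purely illustrative: the `Monitor` class, the item fields, the sales figures and the extra SKU-12 row are all hypothetical, and a real product would likely expose this through a UI rather than code.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Monitor:
    """A user-defined slice of the catalog to watch (hypothetical structure)."""
    name: str
    filters: dict = field(default_factory=dict)        # attribute -> required value
    price_range: Optional[Tuple[float, float]] = None  # inclusive (low, high)

    def matches(self, item: dict) -> bool:
        # Every attribute filter must match exactly
        if any(item.get(key) != value for key, value in self.filters.items()):
            return False
        if self.price_range is not None:
            low, high = self.price_range
            return low <= item["price"] <= high
        return True

def monitored_sales(monitor, items):
    """Total sales across the items a monitor covers."""
    return sum(item["sales"] for item in items if monitor.matches(item))

# Made-up catalog rows for illustration
catalog = [
    {"sku": "SKU-99", "category": "t-shirts", "color": "yellow", "price": 8.0, "sales": 120},
    {"sku": "SKU-12", "category": "t-shirts", "color": "blue", "price": 9.0, "sales": 80},
    {"sku": "SKU-55", "category": "shoes", "color": "yellow", "price": 45.0, "sales": 200},
]

tshirts = Monitor("All t-shirts", {"category": "t-shirts"})
yellow_budget = Monitor(
    "Yellow t-shirts, $5 to $10",
    {"category": "t-shirts", "color": "yellow"},
    price_range=(5, 10),
)

for m in (tshirts, yellow_budget):
    print(m.name, monitored_sales(m, catalog))
```

Each monitor produces its own series, and anomaly detection runs on that series. Adding or dropping a monitor is then just a matter of creating or deleting a filter, with no change to the detection logic.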
When it comes to business metrics, it’s less about the anomaly detection algorithm. It’s about offering every business user the flexibility to monitor what’s relevant and significant for her. We can’t expect business users to reach out to the engineering team every time they want to monitor or un-monitor something.