Author: Dr Joël Henry
Read Time: 5 mins
Summary: This article outlines the steps for evaluating anomaly detection systems to ensure their effectiveness in real-world applications.
In previous articles, we have explained how our Reconstruction-based Anomaly Detector works (see Figure 1 below):
Figure 1: Reconstruction-based Anomaly Detector workflow
The questions we haven't addressed yet are: How do I evaluate my Anomaly Detector? How do I know I can trust that the process described above was correctly set up? How do I know I can use these methods in production?
For more details on anomaly detection methods, please refer to our previous articles.
Reconstruction error can tell you how well you can learn a nominal signal (see point 1 in Figure 1).
It’s helpful, and it can even give you some indication of whether more data would improve your model's learning. However, it has an important limitation for an anomaly detection workflow: it only assesses the reconstruction model itself, not the score calculation or the threshold choice that turn reconstructions into anomaly decisions.
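To make this concrete, here is a minimal sketch of what reconstruction error does measure. The `model` here is a placeholder for any trained reconstruction model (for example, an autoencoder) with a `predict` method; it is not the Monolith implementation.

```python
import numpy as np

def reconstruction_error(model, signals: np.ndarray) -> np.ndarray:
    """Mean squared error between each signal and its reconstruction."""
    # `model` is assumed to be any trained reconstruction model (e.g. an
    # autoencoder) exposing a predict() method; a placeholder, not the
    # platform's API.
    reconstructions = model.predict(signals)                   # shape: (n_signals, n_samples)
    return np.mean((signals - reconstructions) ** 2, axis=1)   # one error value per signal
```

A low average error on held-out nominal data suggests the nominal behaviour has been learned well, but it says nothing yet about how reliably anomalies will be flagged.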
In the end, you want to evaluate your model as you plan to use it: by assessing how well it classifies different signals as anomalous or non-anomalous (see point 2 in Figure 1). The immediate benefit of this approach is that it evaluates the entire workflow, including the model, the score calculation, and the threshold choice, which is exactly what will run in production.
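As a minimal illustration of that decision step (with placeholder names and values, not the platform's API), the whole classification boils down to comparing an anomaly score against a threshold:

```python
import numpy as np

def classify(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag a signal as anomalous (True) when its anomaly score exceeds the threshold."""
    return scores > threshold

# Illustrative scores (e.g. reconstruction errors) and a threshold chosen beforehand.
scores = np.array([0.02, 0.03, 0.41, 0.05, 0.87])
print(classify(scores, threshold=0.30))  # [False False  True False  True]
```

Evaluating this output against ground truth exercises the model, the score calculation, and the threshold all at once.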
A common way to evaluate classification methods is to use the classification matrix shown in Figure 2 below, which compares your method's predictions with the actual outcomes. Ideally, most of your predictions are true positives or true negatives, meaning you correctly predict whether a signal was anomalous or not. These counts can be normalised, in which case they are referred to as rates.
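For a small, self-contained example of these counts and rates, here is a sketch using scikit-learn and made-up labels (1 = anomalous, 0 = non-anomalous):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for eight signals.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # True Positive Rate: share of anomalies correctly flagged
fpr = fp / (fp + tn)  # False Positive Rate: share of nominal signals raising false alarms
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}, TPR={tpr:.2f}, FPR={fpr:.2f}")
```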
A practical way to use these metrics when comparing different models and thresholds is to examine ROC (Receiver Operating Characteristic) curves (see Figures 3 and 4).
These curves visualise the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) across different thresholds. A perfect model has a curve that hugs the top-left corner, indicating a high TPR and a low FPR. The closer the curve is to the diagonal line, the less predictive the model. The area under the curve (AUC) quantifies overall performance, with a value of 1 indicating perfect performance and 0.5 indicating no better than random guessing.
Figure 3: The plot compares 3 models; the closer the curve is to the top-left corner, the better the model.
Figure 4: The threshold increases from 0 (TPR = 1, FPR = 1) to 1 (TPR = 0, FPR = 0). The highlighted point has a threshold of 0.7 (TPR = 0.8, FPR = 0.2).
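Here is a brief sketch of how such a curve and its AUC can be computed with scikit-learn; the scores and labels are synthetic stand-ins for real evaluation data:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic anomaly scores: nominal signals tend to score low, anomalous ones higher.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(200), np.ones(50)])  # 0 = nominal, 1 = anomalous
scores = np.concatenate([rng.normal(0.2, 0.1, 200), rng.normal(0.6, 0.2, 50)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(f"AUC = {roc_auc_score(y_true, scores):.3f}")    # 1.0 = perfect, 0.5 = random guessing
# Plotting tpr against fpr yields curves like those in Figures 3 and 4.
```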
These curves help you identify the best models and inform you of the trade-offs involved when choosing a "good" threshold. This depends on your specific use case. Do you prefer a solution that identifies all anomalies, albeit with a high rate of false alarms, or would it be better to miss a few anomalies in exchange for fewer false alarms?
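One illustrative (not prescriptive) way to encode that preference is to pick the threshold that catches the most anomalies while keeping false alarms under a budget you can live with:

```python
import numpy as np
from sklearn.metrics import roc_curve

def pick_threshold(y_true, scores, max_fpr=0.1):
    """Return the threshold with the highest TPR whose FPR stays within the budget."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    allowed = fpr <= max_fpr        # operating points within the false-alarm budget
    best = np.argmax(tpr[allowed])  # catch as many anomalies as possible under that budget
    return thresholds[allowed][best]
```

Swapping the constraint (for example, requiring a minimum TPR instead) expresses the opposite preference.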
This all looks great, I hear you say, but what’s the catch? Well, to evaluate if a signal or channel is correctly labelled by the solution, you need… labelled data.
Not only this, but you also need labels at the level at which you plan to use the method. For example, if the solution flags whole signals or channels as anomalous, you need labels per signal or channel; if it flags individual time windows, you need labels at that finer level.
Now, to quickly wrap up, how do you accomplish this in the Monolith platform? It’s fairly simple. You use the AD Evaluation step to select all the models you want to evaluate, choose your labelled data, and adjust a few setup parameters.
That's it.
You will gain access to each model's ROC curve and additional information on its performance (see Figure 5).
Figure 5: Evaluating anomaly detection models using Monolith. The more labelled data is available, the more granular the curves will be.
When it comes to anomaly detection, the goal isn’t just accuracy — it’s impact. A model that performs well on abstract metrics but fails in the real world isn’t useful. That’s why evaluation should mirror reality.
Label some data. Simulate actual scenarios. Invest the effort to test your models in the way you will use them. It’s the difference between theoretical performance and practical value.
After all, you wouldn’t judge a firefighter’s readiness by how fast they can run — you'd judge them by how they handle a fire.
The same goes for anomaly detection: evaluate what matters.