
How To Evaluate an Anomaly Detection Model?

Written by Dr Joël Henry | Apr 22, 2025 12:46:29 PM


Read Time: 5 mins

Summary: This article outlines the steps for evaluating anomaly detection systems to ensure their effectiveness in real-world applications.

 

How does reconstruction-based AD work?

In previous articles, we have explained how our Reconstruction-based Anomaly Detector works (see Figure 1 below):

Figure 1: Reconstruction-based Anomaly Detector workflow

 

The questions we haven't addressed yet are: How do I evaluate my Anomaly Detector? How do I know I can trust that the process described above was correctly set up? How do I know I can use these methods in production?

For more details on anomaly detection methods, please refer to our previous articles.

 

 

Beyond Reconstruction Error: Why Model Accuracy Isn't Enough

 

Reconstruction error tells you how well the model has learned a nominal signal (see point 1 in Figure 1).

It is useful, and it can even indicate to some degree whether more data would improve your model's learning. However, there are a few limitations to relying on it for an anomaly detection workflow:

 

  • It primarily focuses on nominal data and does not reveal how well the model will detect anomalies.

  • Signal reconstruction quality on nominal data does not directly correlate with how well the model will find anomalies. A model may have learned noise or other artefacts that do not show up in the reconstruction error but still prevent it from identifying anomalies.

  • It only evaluates the Reconstruction part of the process and does not evaluate the anomaly score calculation or the threshold applied.
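
To make these three stages concrete, here is a minimal sketch of a reconstruction-based workflow. It is not the Monolith implementation: PCA stands in for the reconstruction model, the data is synthetic, and the scoring and threshold choices are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: nominal training signals plus a mixed test set.
nominal = rng.normal(0.0, 1.0, size=(200, 50))
test = np.vstack([
    rng.normal(0.0, 1.0, size=(40, 50)),        # nominal signals
    rng.normal(0.0, 1.0, size=(10, 50)) + 3.0,  # anomalous signals (shifted)
])

# Stage 1 - reconstruction: PCA stands in for the reconstruction model,
# trained on nominal data only.
model = PCA(n_components=5).fit(nominal)

def reconstruct(x):
    return model.inverse_transform(model.transform(x))

test_error = np.mean((test - reconstruct(test)) ** 2, axis=1)
nominal_error = np.mean((nominal - reconstruct(nominal)) ** 2, axis=1)

# Stage 2 - anomaly score: normalise errors against the nominal training set.
score = (test_error - nominal_error.mean()) / nominal_error.std()

# Stage 3 - threshold: a fixed cut-off turns scores into anomaly flags.
threshold = 3.0                                  # illustrative value
is_anomalous = score > threshold

print(f"{is_anomalous.sum()} of {len(test)} test signals flagged as anomalous")
```

The point of the sketch is that reconstruction error only validates the first stage; the score calculation and the threshold are separate choices, and the evaluation described next has to cover them too.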

 

Evaluate As You Implement: The Classification Approach

 

In the end, you want to evaluate your model as you plan to use it.

This is why the best way to evaluate your Anomaly Detector is to assess how well it classifies different signals as anomalous or non-anomalous (see point 2 in Figure 1). The immediate benefit of this approach is that it evaluates the entire workflow used in production: the model, the score calculation, and the threshold choice.

A common way to evaluate classification methods is to use the confusion matrix shown in Figure 2 below, which compares your method's predictions with the actual results. Ideally, most of your predictions should be either true positives or true negatives, meaning that you accurately predict whether a signal was anomalous or not. These values can be normalised, in which case they are referred to as rates.

 
Figure 2: Example of a confusion matrix showing classification metrics. 45 out of 50 anomalies were identified, but an additional 20 false alarms were raised.
 
Want to learn about applying the confusion matrix to your model training? Check out this blog.
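
As a rough sketch of how those counts and rates can be computed from labelled data (using scikit-learn here as one possible tool; the labels and predictions below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 1 = anomalous, 0 = nominal. Illustrative ground-truth labels and predictions.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])

# scikit-learn returns the binary matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Normalised versions of these counts are the rates mentioned above.
tpr = tp / (tp + fn)   # true positive rate: anomalies correctly flagged
fpr = fp / (fp + tn)   # false positive rate: false alarms among nominal signals

print(f"TP={tp} FP={fp} FN={fn} TN={tn}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```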

 

ROC Curves

 

One way to utilise these metrics practically when assessing different models and thresholds is to examine the ROC (Receiver Operating Characteristic) curves (see Figures 3 and 4).

These curves visualise the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) across different thresholds. A perfect model has a curve that hugs the top-left corner, indicating a high TPR and a low FPR. The closer the curve is to the diagonal line, the less predictive the model. The area under the curve (AUC) quantifies overall performance, with a value of 1 indicating perfect performance and 0.5 indicating no better than random guessing.

 
 

 

 

Figure 3: The plot compares 3 models. The closer the curve is to the top-left corner, the better the model.

Figure 4: The threshold increases from 0 (TPR=1 and FPR=1) to 1 (TPR=0 and FPR=0). The highlighted point has a threshold of 0.7 (TPR=0.8 and FPR=0.2).
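
If you already have anomaly scores and ground-truth labels, an ROC curve and its AUC can be computed in a few lines. The sketch below uses scikit-learn and made-up scores for two hypothetical models:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# 1 = anomalous, 0 = nominal, with illustrative anomaly scores per model.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
model_scores = {
    "model_a": np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.80, 0.70, 0.90, 0.60]),
    "model_b": np.array([0.30, 0.50, 0.20, 0.60, 0.40, 0.70, 0.60, 0.50, 0.80, 0.40]),
}

for name, scores in model_scores.items():
    # Each (fpr, tpr) pair corresponds to one candidate threshold.
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    auc = roc_auc_score(y_true, scores)
    print(f"{name}: AUC = {auc:.2f}")
```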

These curves help you identify the best models and show the trade-offs involved in choosing a "good" threshold, which depends on your specific use case. Do you prefer a solution that identifies all anomalies, albeit with a high rate of false alarms, or would it be better to miss a few anomalies in exchange for fewer false alarms?
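
One way to make that trade-off explicit, sketched below, is to pick from the ROC points the threshold that maximises the TPR while keeping the FPR under a limit you can live with. The 10% cap here is purely illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative labels and anomaly scores (1 = anomalous, 0 = nominal).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.80, 0.70, 0.90, 0.60])

fpr, tpr, thresholds = roc_curve(y_true, scores)

max_false_alarm_rate = 0.10               # illustrative constraint
acceptable = fpr <= max_false_alarm_rate  # keep only points with few false alarms
best = np.argmax(tpr[acceptable])         # highest detection rate among them
print(f"threshold={thresholds[acceptable][best]:.2f}  "
      f"TPR={tpr[acceptable][best]:.2f}  FPR={fpr[acceptable][best]:.2f}")
```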

 

For This, You Need Labelled Data

 

This all looks great, I hear you say, but what’s the catch? Well, to evaluate whether a signal or channel is correctly classified by the solution, you need… labelled data.

Not only this, but you also need labels at the level at which you plan to use the method. Let’s take a few examples:

  • When performing ageing tests on a battery cell, to determine when a cycle is anomalous, you will need examples of both good and anomalous cycles to evaluate the method. Knowing which ageing block was anomalous will not be enough.

  • In a car racing context, if you want to know at which point an anomaly occurred on a track, you will need data labelled at the timestamp level, and knowing which lap is anomalous will not be enough.

To reiterate, because this is important: these labels are not used to train the models, but they are necessary to evaluate them realistically. Treat them as nearly as important as the data you would need to train the model.
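
For the racing example, here is a small, hypothetical sketch of what "labels at the level you plan to use the method" means: the anomaly is pinned to a timestamp, not just to a lap (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical timestamp-level labels: one row per sample, not one per lap.
labels = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-04-22 10:00:00", "2025-04-22 10:00:01",
        "2025-04-22 10:00:02", "2025-04-22 10:00:03",
    ]),
    "lap": [1, 1, 1, 1],
    "is_anomalous": [0, 0, 1, 0],  # the anomaly is tied to a specific timestamp
})

# Collapsing to lap level ("lap 1 was anomalous") discards the information
# needed to check whether the detector flags the right point in time.
lap_level = labels.groupby("lap")["is_anomalous"].max()
print(lap_level)
```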
 
 

Reducing Complexity: Evaluating AD Models in Monolith


Now, to quickly wrap up, how do you accomplish this in the Monolith platform? It’s fairly simple. You use the AD Evaluation step to select all the models you want to evaluate, choose your labelled data, and adjust a few setup parameters.

That's it.

You will gain access to each model's ROC curve and additional information on its performance (see Figure 5).

 

Figure 5: Evaluating anomaly detection models using Monolith. The more labelled data is available, the more granular the curves will be.

 

Conclusion

When it comes to anomaly detection, the goal isn’t just accuracy — it’s impact. A model that performs well on abstract metrics but fails in the real world isn’t useful. That’s why evaluation should mirror reality.

Label some data. Simulate actual scenarios. Invest the effort to test your models in the way you will use them. It’s the difference between theoretical performance and practical value.

After all, you wouldn’t judge a firefighter’s readiness by how fast they can run — you'd judge them by how they handle a fire.

 

The same goes for anomaly detection: evaluate what matters.