In December 2023, the EleutherAI team published Eliciting Latent Knowledge from Quirky Language Models. We finetuned language models to behave in a "quirky" manner on a collection of question-and-answer datasets. When a prompt began with "Alice:", these models were trained to answer as accurately as possible; when it instead began with "Bob:", they would answer according to an unreliable heuristic (Bob would not always be wrong, but would consistently use the same fallible method to answer questions). One problem we investigated was detecting, without supervision, when the model was behaving in an "Alice"-like way versus a "Bob"-like way. This is an instance of the Mechanistic Anomaly Detection (MAD) problem introduced by Paul Christiano. We obtained strong results with a simple anomaly detection technique: an average AUROC of 0.95 across all quirky datasets, and close to 0.99 for the datasets where the models learned to be strongly quirky.
The version of the "quirky" mechanistic anomaly detection problem we investigated had an additional challenge: we trained detectors only on easy examples and tested their accuracy on hard examples. We did this to measure how well detectors hold up under a distribution shift.
Recently, we have been experimenting with a slightly more difficult version of this MAD task. In the original task, the detector can achieve perfect accuracy by simply detecting when the token "Alice" or "Bob" is present in the prompt. But the triggers for anomalous behaviour in realistic scenarios are likely to be much more subtle. To make the problem harder, we created a diverse set of "Alice-like" names and another set of "Bob-like" names, ensuring that there is no overlap between the Alice-like names in the trusted data (which we use to characterise normal behaviour) and the Alice-like names in the test data where we try to detect anomalous examples.
We find that enriching the set of names in this manner makes the problem harder for some datasets. We have been experimenting with several approaches to anomaly detection: a collection of detectors that flag anomalies based on model activations, and detectors that use attribution patching instead of activations to distinguish Alice-like from Bob-like operation. We have found:
- For some datasets, hard examples yielded a significantly different distribution of activations from easy examples, so we succeeded in inducing a distribution shift between training and test examples.
- Adding additional names makes the problem harder.
- We can achieve very high classification accuracy on some datasets if we have access to test set statistics (AUROC >0.999), and moderately high accuracy without them (AUROC >0.95).
- For some datasets, high accuracy depends on hyperparameters that may be difficult to tune, and on others we do not achieve high accuracy at all.
- Achieving uniformly high classification accuracy is difficult.
- We compared methods based on model activations and on attribution patching. Activation-based methods offered the highest classification accuracy, though some methods based on attribution patching were less sensitive to layer choice.
- Combining activations and effects estimated from attribution patching did not significantly improve detection accuracy.
- Visualisations suggest that attribution patching may benefit from more sophisticated scoring methods.
- Activation-based anomaly detection can also detect adversarial examples in non-robust vision models, but not in robust models.
- Attribution-based anomaly detection can detect adversarial examples in robust image models at a rate slightly above chance.
Experimental setup
We fine-tuned Mistral 7B v0.1 on 12 quirky datasets with LoRA, early-stopping on overall validation loss computed on both Alice and Bob examples. The maximum number of epochs for each dataset is listed in the table below. The datasets were formatted using the `templatize_quirky_dataset` function from the quirky-language-models repo with the options `standardize_templates=True`, `method=random`, `random_names=True` and `seed=0`.
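For reference, the preparation step looks roughly like the sketch below. The import path and dataset identifier are assumptions for illustration; the keyword options are the ones listed above.

```python
# Sketch only: the import path and dataset id below are assumptions,
# not the repo's actual layout; the keyword options are those named above.
from datasets import load_dataset
from quirky_language_models import templatize_quirky_dataset  # hypothetical path

ds = templatize_quirky_dataset(
    load_dataset("EleutherAI/quirky_population_raw"),  # hypothetical dataset id
    standardize_templates=True,
    method="random",
    random_names=True,
    seed=0,
)
```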
Dataset | Max Epochs | Notes |
---|---|---|
capitals | 120 | |
hemisphere | 15 | |
population | 30 | |
sciq | 30 | Alternate version provided by Eric Jenner with additional examples and no examples missing support |
sentiment | 60 | |
nli | 60 | |
authors | 60 | |
addition | 15 | |
subtraction | 15 | |
multiplication | 15 | |
modularaddition | 30 | |
squaring | 15 | |
We also fine-tuned Llama 3 and Llama 3.1 8B on these datasets; we plan to use these models as held-out test models for our anomaly detection methods.
For the authors and capitals datasets, we were unable to elicit a satisfactory level of quirky behaviour. Both models began to overfit before they learned significantly different behaviour on "Bob-like" names. We therefore exclude them from our analysis.
Each dataset was partitioned four ways (a sketch of the split logic follows the list):
- Trusted: Datapoints with Alice behaviour and difficulty below the 25th percentile of the whole dataset. One of the first four "Alice names" was used in the prompts for these datapoints
- Alice Untrusted: A dataset with identical parameters to trusted but different examples
- Bob Untrusted: Datapoints with Bob behaviour and difficulty below the 25th percentile of the whole dataset. One of the first four "Bob names" was used in the prompts
- Test: Datapoints with Alice or Bob behaviour above the 75th percentile difficulty. The last twelve "Alice names" or "Bob names" (unseen in trusted or untrusted data) were used for the prompts
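A minimal sketch of this partitioning, assuming illustrative field names (`difficulty`, `character`, `name_idx`) and 16 names per character rather than the repo's actual schema:

```python
import numpy as np

def partition(examples, seed=0):
    # examples: dicts with "difficulty", "character" ("Alice"/"Bob") and
    # "name_idx" (index into that character's name list); these field names
    # and the assumption of 16 names per character are illustrative.
    rng = np.random.default_rng(seed)
    diffs = np.array([e["difficulty"] for e in examples])
    easy = diffs < np.percentile(diffs, 25)
    hard = diffs > np.percentile(diffs, 75)

    def select(mask, character, names):
        return [e for e, m in zip(examples, mask)
                if m and e["character"] == character and e["name_idx"] in names]

    first_four, last_twelve = range(0, 4), range(4, 16)
    alice_easy = select(easy, "Alice", first_four)
    rng.shuffle(alice_easy)
    half = len(alice_easy) // 2
    return {
        "trusted": alice_easy[:half],
        "alice_untrusted": alice_easy[half:],  # same parameters, different examples
        "bob_untrusted": select(easy, "Bob", first_four),
        "test": select(hard, "Alice", last_twelve) + select(hard, "Bob", last_twelve),
    }
```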
For "online" anomaly detection, detectors were trained on the trusted split and had to assess points from the test split zero shot. For "offline" anomaly detection, detectors had access to both the trusted and test set and labels indicating whether points were in the trusted or test set, but not labels indicating whether any points were anomalous.
Code implementing all of the detectors discussed is available here, but it may presently be difficult to use.
Additional names make the problem harder
In Eliciting Latent Knowledge, we had a two-step process for mechanistic anomaly detection:
- We trained a collection of probes on the activations of each layer to predict the model output given only Alice's easy examples
- We used the set of output predictions of the probes as features for an anomaly scoring algorithm (a minimal sketch follows)
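A minimal sketch of the two-step pipeline, assuming per-layer activation matrices and the model's binary outputs have already been extracted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_prediction_features(train_acts_by_layer, model_outputs, eval_acts_by_layer):
    # Step 1: one probe per layer, trained on Alice's easy examples to
    # predict the model's output from that layer's activations.
    probes = [LogisticRegression(max_iter=1000).fit(acts, model_outputs)
              for acts in train_acts_by_layer]
    # Step 2: the per-layer predicted probabilities become the feature
    # vector fed to the anomaly scorer.
    return np.stack([p.predict_proba(acts)[:, 1]
                     for p, acts in zip(probes, eval_acts_by_layer)], axis=1)
```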
We might expect this approach to be relatively unaffected by additional random names, since intuitively a good probe ought to ignore the specific name that appears in the prompt. In practice, however, we find significantly worse performance on the randomised-name dataset, as shown in Figure 1.
Figure 1: Comparison of MAD AUC with randomised and nonrandomised names
We wanted higher anomaly detection accuracy than was offered by this method, so we tried some alternatives.
Methods
There are two different settings where we could imagine employing MAD:
- An "online" setting, where we want to detect anomalous behaviour as soon as it occurs and take appropriate action
- An "offline" setting, where we are able to collect many examples of untrusted behaviour and use the collection to identify anomalies
If a system has a small number of anomalous modes of operation, then the offline problem could be easier, because we could gather many examples of each mode. Intuitively, this might not hold if the number of anomalous modes of operation is larger than the number of examples of anomalous behaviour; in that case, each anomaly could be of a new type. In our quirky datasets, Bob has a single unusual answering strategy, which plausibly corresponds to a single anomalous mode of operation, so the offline problem may be easier.
We tried anomaly detection methods suited to each setting. We experimented with different scoring rules and different sets of features from which to compute scores. The feature sets we tried included:
- Activations: We take "activations" from the residual stream at various internal layers of the model
- MLP activations: For a subset of scores, we experimented with taking activations from the MLP outputs rather than the residual stream
- Attribution: We measure the approximate effect of ablating the model's activations on the final difference in logits between "Yes" and "No" tokens using attribution patching (a minimal sketch follows this list). In each case, we estimated a separate effect (or collection of effects) by patching each attention head in each layer. We experiment with several different ablations:
- Mean: We replace the activations of each attention head in a given layer with the mean activation of that attention head
- Principal components: We ablate each attention head's activations along the top 10 principal components, subtracting the deviation of the activation from the mean activation along each component
- Grad norm: We take the norm of the gradient for each attention head in a layer as a collection of features
- Probe shift: Using attribution patching, we take the difference between the estimated effect of an intervention on the network's output, and the estimated effect on the output of a linear probe trained to predict the network's output on an intermediate layer. All of the above ablations are also applicable to this method. For a probe at layer $i$, we consider the effect of all interventions in layers prior to $i$.
- Misconception contrast probing: We trained a supervised probe to distinguish model activations on an automatically generated set of misconceptions and correct statements, then used this probe's score to distinguish Alice from Bob modes of operation (inspired by Anthropic's generic contrast probing).
- Activation + attribution: We concatenate the activations and approximate effects computed from attribution patching
- Confidence probing via iterative prompt rephrasing: We examined how sensitive the model's answers were to few-shot rephrasing. Specifically, we automatically generated rephrasings of the form "{Question} one answer is No. {Repeat question}" and measured how much the model's answers shifted under such rephrasings. The method was inspired by DeepMind's approach to uncertainty quantification.
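As a rough illustration of the attribution features, here is a minimal sketch of the first-order effect estimate for mean-ablating one layer's attention heads on a single example; the names and shapes are illustrative:

```python
import torch

def head_mean_ablation_effects(head_out, grad_head_out, trusted_mean):
    # head_out, grad_head_out: (n_heads, d_head) tensors for one layer and one
    # example, where grad_head_out is the gradient of the "Yes" minus "No"
    # logit difference with respect to the head outputs; trusted_mean is the
    # mean head output over the trusted split.
    # Attribution patching is a first-order Taylor estimate of the ablation:
    #   delta_S ~= (ablated - original) . dS/d(output)
    return torch.einsum("hd,hd->h", trusted_mean - head_out, grad_head_out)
```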
More details on attribution patching are given further down. We also tried combining attribution patching and activations in two ways:
- Merge: We concatenate the features before computing anomaly scores
- Ensemble: We compute anomaly scores for each set of features and then average the scores
Finally, we experimented with cross layer concatenation of effects estimated from attribution patching:
- Concatenate: Features from each layer are concatenated before the score is calculated
Given these features, we experimented with a number of different scoring methods, some of them "online" (not requiring batch statistics for the test set) and some "offline" (requiring them). The online scores we experimented with:
- Raw average: Relevant to contrast probing and prompt rephrasing; we simply average the confidence the probe assigns to the example being misleading
- Mahalanobis distance: We computed the mean and covariance of the activations in each layer for the trusted training set, and used these to compute the Mahalanobis distance of test point activations from the trusted data mean (a minimal sketch follows this list).
- Mahalanobis-PCA: We also tried a variant of the Mahalanobis distance that computed the distance only along the top principal components.
- Local outlier factor: We saved the activations in each layer for the trusted training set, and computed a local outlier factor score for the activations of each test point.
- Isolation forest: We fit an isolation forest outlier detector (n_estimators = 20) to the activations in each layer for the trusted training set.
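A minimal sketch of the Mahalanobis score for one layer; the ridge term is our choice for numerical stability:

```python
import numpy as np

def fit_mahalanobis(trusted_acts):
    # trusted_acts: (n, d) activations from the trusted split at one layer.
    mu = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False)
    # Small ridge for stability when n is small relative to d.
    prec = np.linalg.pinv(cov + 1e-6 * np.eye(cov.shape[0]))

    def score(acts):
        # Squared Mahalanobis distance of each row from the trusted mean.
        deltas = acts - mu
        return np.einsum("...i,ij,...j->...", deltas, prec, deltas)

    return score
```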
For the offline setting, we experimented with:
- Quantum entropy: Based on the SPECTRE paper, we calculated a "quantum entropy" score that interpolates between the Mahalanobis distance and the distance along the largest principal component of the test data covariance matrix, which we calculate after the test data has been whitened using the covariance matrix of the trusted data (credit to Eric Jenner for this method).
- Likelihood ratio: We calculated the ratio of the likelihood of the test data under a Gaussian fit to the trusted data only, to its likelihood under a Gaussian fit to the test data. Because the datasets are small compared to the data dimension, we first projected the data down to the nonzero eigenbasis of the trusted data covariance matrix and then calculated the likelihood ratio (a sketch follows this list).
- Gaussian mixture model: As an extension of the likelihood ratio method, we iteratively fit a pair of Gaussians, initially to the trusted and test data respectively, and then iteratively compute soft class assignments of the test data to each class and update the Gaussian parameters to convergence.
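A minimal sketch of the likelihood-ratio score with the projection step described above; the tolerance and the use of scipy are our choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood_ratio_scores(trusted, test, rank_tol=1e-8):
    # Project onto the non-degenerate eigenbasis of the trusted covariance,
    # since the feature dimension can exceed the number of trusted points.
    mu = trusted.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(trusted, rowvar=False))
    basis = eigvecs[:, eigvals > rank_tol * eigvals.max()]
    tr, te = (trusted - mu) @ basis, (test - mu) @ basis

    g_trusted = multivariate_normal(tr.mean(0), np.cov(tr, rowvar=False),
                                    allow_singular=True)
    g_test = multivariate_normal(te.mean(0), np.cov(te, rowvar=False),
                                 allow_singular=True)
    # Higher = less likely under the trusted fit relative to the test fit,
    # i.e. more anomalous.
    return g_test.logpdf(te) - g_trusted.logpdf(te)
```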
In addition, we visualised activations and estimated effects per layer using UMAP and PCA plots for the top four principal components.
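The projections were computed along these lines (a sketch assuming the standard scikit-learn and umap-learn APIs):

```python
import umap  # umap-learn
from sklearn.decomposition import PCA

def project_features(feats):
    # feats: (n, d) activations or attribution effects for one layer.
    top_pcs = PCA(n_components=4).fit_transform(feats)    # for pairwise PC plots
    embedding = umap.UMAP(n_components=2).fit_transform(feats)
    return top_pcs, embedding
```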
Results
Among online scores, the highest average AUROC (0.87), with scores aggregated across layers, was achieved by the Mahalanobis score with activation features. Among offline scores, the highest average AUROC (0.94) was achieved by the quantum entropy score together with activation features.
Averages mask substantial structure in scores:
- All methods struggled to consistently distinguish Alice from Bob on the `sciq` and `sentiment` datasets
- Activation-based methods performed well on all datasets except `nli`
- Where activation-based methods performed well, they typically performed very well: activations-quantum entropy achieved an AUROC of ~1 on the remaining datasets, while the lowest AUROC for activations-Mahalanobis distance on the remaining datasets was 0.988, on `hemisphere`
- Attribution-based methods typically achieved AUROCs of around 0.9 on "easy" datasets, but performed notably better than activation-based methods on `nli`
- Probe-shift based methods were particularly insensitive to the choice of layer
The `sciq` dataset exhibited extreme label imbalance in the training and test sets. Among the easy examples used for training, over 95% of the ground truth labels were "true", while among the hard test examples, over 95% were "false". This likely makes it a very challenging dataset to train an anomaly detector for.
We did not have success with contrast probing.
Online detectors
Aggregated AUROC by online score and features: all datasets
features | score | aggregated_mean_auc_roc | aggregated_mean_auc_roc_agree | aggregated_mean_auc_roc_disagree |
---|---|---|---|---|
activations | lof | 0.846 | 0.841 | 0.807 |
activations | mahalanobis | 0.871 | 0.866 | 0.856 |
activations | pca mahalanobis | 0.796 | 0.799 | 0.800 |
attribution | isoforest mean | 0.756 | 0.782 | 0.780 |
attribution | lof mean | 0.849 | 0.841 | 0.841 |
attribution | lof pcs | 0.852 | 0.846 | 0.839 |
attribution | mahalanobis concat mean | 0.755 | 0.762 | 0.777 |
attribution | mahalanobis mean | 0.805 | 0.797 | 0.798 |
attribution | mahalanobis pcs | 0.661 | 0.646 | 0.687 |
attribution activations | mahalanobis mean | 0.850 | 0.853 | 0.869 |
attribution activations | mahalanobis mean ensemble | 0.843 | 0.839 | 0.837 |
| iterative rephrase | 0.506 | 0.482 | 0.478 |
| misconception contrast | 0.642 | 0.629 | 0.639 |
probe | lof concat mean | 0.840 | 0.857 | 0.810 |
probe | lof concat pcs | 0.809 | 0.820 | 0.788 |
probe | lof mean | 0.853 | 0.855 | 0.845 |
probe | mahalanobis concat mean | 0.837 | 0.845 | 0.821 |
probe | mahalanobis mean | 0.848 | 0.847 | 0.842 |
probe activations | mahalanobis mean ensemble | 0.854 | 0.854 | 0.847 |
Aggregated AUROC by online score and features: by dataset
Figure 2: Performance of different online detectors where scores are aggregated across layers
Layerwise AUROC by online score and features: by dataset
Figure 3: Performance of different online detectors by layer
Offline detectors
Aggregated AUROC by offline score and features: all datasets
features | score | aggregated_mean_auc_roc | aggregated_mean_auc_roc_agree | aggregated_mean_auc_roc_disagree |
---|---|---|---|---|
activations | que | 0.936 | 0.926 | 0.964 |
attribution | que concat mean | 0.791 | 0.785 | 0.837 |
attribution | que grad norm | 0.688 | 0.713 | 0.714 |
attribution | que pcs | 0.840 | 0.837 | 0.856 |
probe | que concat mean | 0.864 | 0.886 | 0.858 |
probe | que mean | 0.860 | 0.878 | 0.854 |
probe | que pcs | 0.860 | 0.866 | 0.872 |
Aggregated AUROC by offline score and features: by dataset
Figure 4: Performance of different offline detectors where scores are aggregated across layers
Layerwise AUROC by offline score and features: by dataset
Figure 5: Performance of different offline detectors by layer
Adversarial image detection
We conducted an exploratory experiment applying the same detectors to the task of detecting adversarial images for image classifiers. We focused on the Mahalanobis distance because it worked well in comparison to other detectors on the quirky models task and because it is quite simple. We found that an activation-Mahalanobis detector was able to detect adversarial images with an AUROC of around 0.8 (layer-dependent) in a standard image classifier. It could not perform above chance for a robust classifier, however. Using attribution patching to approximate mean ablation along top principal components, on the other hand, had highly layer-dependent accuracy for the standard classifier. Interestingly, exactly the same approach performed marginally better than chance, in a relatively layer-independent manner, for the tested robust model. We did not test how robust the detectors themselves were to adversarial attack.
Figure 6: Performance of different adversarial image detectors by layer and image model. Standard is a non-robust classifier, while Carmon2019Unlabeled is a robust model
Visualising features
In addition to testing various anomaly detectors, we also visualised per-layer activations and attribution-patching-based effect estimates using UMAP and principal component plots. For "easy" dataset-feature combinations (such as activations on the `population` dataset), we saw clear separation between normal and anomalous points among the top principal components in middle to late layers. We often (though not always) saw similar cluster separations in both principal component and UMAP plots.
Population
Activations
Layer 1
At layer 1, activations show a strong clustering pattern in both PCA and UMAP plots that does not correspond to any of the classes we have distinguished (true/false, trusted/test, normal/anomalous, agree/disagree).
Figure 7: PCA plot of population activations, layer 1
Figure 8: UMAP plot of population activations, layer 1
Layer 16
In the layer 16 activations, the PCA plots show clear separation of anomalous and normal examples in the direction of the top principal component. In the direction of the second principal component, there is substantial separation between the trusted data used to train the detector and the normal test examples, showing that there is a distribution shift between the easy and hard examples. The population dataset consists of questions like `input: "Does {character} think that {city} has a population greater than {population}?"`, and the difficulty metric is `-log(population)`, so it is possible that the second principal component tracks log population.
While the PCA plots for layer 16 activations show different directions that seem to correspond to normal/anomalous, easy/hard and true/false, the UMAP plot separates all of these classes at once.
Figure 9: PCA plot of population activations, layer 16
Figure 10: UMAP plot of population activations, layer 16
Layer 28
By layer 28, clusters visible at layer 16 appear to have separated further. We also see the emergence of further structure that does not correspond to any of the classes we distinguish by colour.
Figure 11: PCA plot of population activations, layer 28
Figure 12: UMAP plot of population activations, layer 28
Attention head mean ablations
Layer 1
For estimated mean ablation effects at layer 1, we see that anomalous examples are far more variable along the top principal component than normal examples, though they do not separate into different classes along any of the top principal components.
The UMAP plot, on the other hand, clearly distinguishes normal from anomalous examples. The anomalous examples have an interesting "double spike" shape. This might reflect the fact that interventions in the first layer can ablate the "Bob" label, and perhaps the spiky shapes reflect the sizes of the errors Bob made (we haven't checked this). Given this structure, it is surprising that the local outlier factor score with attribution-based features does not perform better for the population dataset in layer 1, since both local outlier factor and UMAP depend on the neighbourhoods of points.
Figure 13: PCA plot of population mean ablations, layer 1
Figure 14: UMAP plot of population mean ablations, layer 1
Layer 16
While clustering by the classes we have coloured is evident in layer 16 mean ablation effect estimates, it is far less defined than the clustering evident in layer 16 activations.
Figure 15: PCA plot of population mean ablations, layer 16
Figure 16: UMAP plot of population mean ablations, layer 16
Layer 28
By layer 28, clusters visible at layer 16 appear to have separated further. We also see the emergence of further structure that does not correspond to any of the classes we distinguish by colour.
Figure 17: PCA plot of population mean ablations, layer 28
Figure 18: UMAP plot of population mean ablations, layer 28
Probe shift
The probe shift method preserves the structure visible in layer 4 of the mean ablation UMAP plots throughout every layer of the network. This may be because an intervention in layer 1 can ablate the "Bob" label, which would have a significant effect on the accuracy of probes throughout the network.
Layer 4
Figure 19: PCA plot of population probe shift features, layer 4
Figure 20: UMAP plot of population probe shift features, layer 4
Layer 16
Figure 21: PCA plot of population probe shift features, layer 16
Figure 22: UMAP plot of population probe shift features, layer 16
Layer 28
Figure 23: PCA plot of population probe shift features, layer 28
Figure 24: UMAP plot of population probe shift features, layer 28
Sentiment
While population is a relatively easy dataset, sentiment is more challenging. Attribution-based methods performed worse than chance on it, and activation-based detectors could not reliably distinguish normal from anomalous examples.
Activations
Layer 1
Like the activations for `population`, layer 1 activations for `sentiment` form clusters unrelated to our labelling. We do not see much separation between trusted and test examples.
Figure 25: PCA plot of activations for sentiment, layer 1
Figure 26: UMAP plot of activations for sentiment, layer 1
Layer 16
At layer 16, we see separation between `true` and `false` examples, as well as imperfect separation between examples where Alice and Bob agree and where they disagree. We do not see separation in either PCA or UMAP plots between normal and anomalous examples, nor between trusted and test examples.
Figure 27: PCA plot of activations for sentiment, layer 16
Figure 28: UMAP plot of activations for sentiment, layer 16
Layer 28
By layer 28, we see a little separation between normal and anomalous examples, particularly examples where Alice and Bob agree and the label is `true`. We see further separation between `true` and `false` examples, as well as between examples where Alice and Bob agree and disagree.
Figure 29: PCA plot of activations for sentiment, layer 28
Figure 30: UMAP plot of activations for sentiment, layer 28
Attention head mean ablations
Layer 1
For early to mid layer PCA plots of attention head mean ablations, we see little structure, though disagree anomalous `true` examples are distinguished from the others in the direction of PC1. In UMAP plots, we see the familiar distinction between `true` and `false` and between `agree` and `disagree` examples, though without visible clustering.
Figure 31: PCA plot of attention head mean ablations for sentiment, layer 1
Figure 32: UMAP plot of attention head mean ablations for sentiment, layer 1
Layer 16
Our remarks for layer 1 remain largely applicable to layer 16.
Figure 33: PCA plot of attention head mean ablations for sentiment, layer 16
Figure 34: UMAP plot of attention head mean ablations, layer 16
Layer 28
In layer 28 we see some separation of normal from anomalous examples, particularly in the UMAP plot. This corresponds to our observation that later layers were associated with better scores for the attribution-LOF detector.
Figure 35: PCA plot of attention head mean ablations for sentiment, layer 28
Figure 36: UMAP plot of attention head mean ablations for sentiment, layer 28
Probe shift
Layer 4
Like attention head mean ablation, probe shift features show separation between true and false and between agree and disagree examples, but generally no separation of test from trusted data overall, though a small cluster of anomalous data, separated from the trusted data, is present in all layers.
In the UMAP visualisation, anomalous examples are strongly clustered and separated from the trusted examples.
Figure 37: PCA plot of probe shift features for sentiment, layer 4
Figure 38: UMAP plot of probe shift features for sentiment, layer 4
Layer 16
Figure 39: PCA plot of probe shift features for sentiment, layer 16
Figure 40: UMAP plot of probe shift features for sentiment, layer 16
Layer 28
Figure 41: PCA plot of probe shift features for sentiment, layer 28
Figure 42: UMAP plot of probe shift features for sentiment, layer 28
Discovering functional elements of the network with edge attribution patching
We also experimented with edge attribution patching, a method of automated circuit discovery. In comparison with mean ablation of attention heads, this approach:
- Used "edge patching" for a finer grained set of effects to study
- Examined interventions on MLPs as well as attention heads
- Patched activations from each "Alice" prompt with those from an identical "Bob" prompt
Edge patching computes the effect of replacing the activations at the output of some unit (e.g. an attention head or MLP layer), as mediated by a downstream unit. That is, the approximate effect on a score $S$ of intervening on the edge from unit A to unit B is given by
$$\Delta S = (o_A - o_A')^T \frac{\partial S}{\partial i_B}$$
That is, we approximate the change in score by the inner product between the change in the output of A ($o_A - o_A'$) and the gradient of the score with respect to the input of B ($\frac{\partial S}{\partial i_B}$).
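A minimal sketch of this estimate; names and shapes are illustrative:

```python
import torch

def edge_effect(o_A, o_A_patched, grad_i_B):
    # First-order estimate of the effect on S of patching the edge A -> B:
    #   delta_S ~= (o_A - o_A')^T dS/d(i_B)
    # o_A: clean output of unit A; o_A_patched: output under the patch
    # (e.g. from the paired "Bob" prompt); grad_i_B: gradient of S with
    # respect to the input of B. All have shape (..., d).
    return torch.einsum("...d,...d->...", o_A - o_A_patched, grad_i_B)
```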
Patching activations by surgically swapping Alice and Bob requires access to known examples of anomalous behaviour, so it is not suitable for an anomaly detector, but it does serve as a test of our hypothesis that we can find a part of the network implicated in switching between Alice and Bob behaviour. In fact, from testing on a few datasets, we find an edge from MLP 30 to MLP 31 that appears to play a large role in behaviour switching for multiple datasets (Figure 43). Because this edge is between two MLP layers, we could not have discovered it using attention head patching.
Figure 43: Top 10 edges for nli, population and hemisphere datasets
We are planning to experiment with sparse autoencoders
We have mentioned that one of the key challenges is finding a good activation or effect basis in which to compute anomaly scores. We have experimented with the eigenbasis of the covariance matrix of the activations in each layer, and are looking to try "edge interventions" as defined in the previous section. We have no strong reason to expect that either of these is a great basis for anomaly detection.
Sparse autoencoders (SAEs) are another approach to defining a basis for investigating model behaviour. A sparse autoencoder learns a sparse codebook of features which can be used to reconstruct the activations. Given an example, we expect the active features in the codebook to be the most important features for explaining the model's behaviour. It is plausible that Alice and Bob behaviour is distinguished by differing distributions of codebook features, corresponding to different concepts invoked by the different modes of operation. Alternatively, perhaps the model computes both Alice and Bob answers for each prompt and then chooses one at the end. In this case, we may or may not see different feature activations, but we would expect to see different effects of intervening on Alice features depending on whether the prompt contains "Alice" or "Bob".
Outlook
We have strong MAD results for many datasets, but we would ideally like AUROCs uniformly very close to 1, which we have not achieved. For some datasets (nli, sentiment and sciq), detector performance appears weaker and more layer-dependent even for the strongest MAD approaches. We have not yet found a compelling way to use attribution patching to improve our results. We are looking to experiment with sparse autoencoders for MAD soon, as we think it is plausible that MAD may be easier in a basis where features are disentangled.
Appendix: Tables of Results
For reference, here are detailed AUROC scores for a large collection of detectors for each dataset. The "best" layer is chosen as the layer that maximises the average AUROC across all datasets.
Online methods
Addition results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.544 | 0.544 | 0.544 | 0.356 | 0.356 | 0.356 | 0.524 | 0.524 | 0.524 | aggregate |
misconception contrast | | 0.447 | 0.408 | 0.408 | 0.445 | 0.401 | 0.401 | 0.437 | 0.383 | 0.383 | aggregate |
isoforest | activations | 0.796 | nan | 0.996 | 0.792 | nan | 0.995 | 0.803 | nan | 0.997 | 13 |
isoforest mean | attribution | 0.866 | 0.866 | 0.866 | 0.995 | 0.995 | 0.995 | 0.942 | 0.942 | 0.942 | aggregate |
isoforest pcs | attribution | 0.765 | nan | 0.972 | 0.851 | nan | 0.999 | 0.874 | nan | 0.984 | 28 |
lof | activations | 0.855 | 0.999 | 0.999 | 0.853 | 0.999 | 0.999 | 0.857 | 0.999 | 0.999 | aggregate |
lof concat mean | probe | 0.999 | 0.999 | 0.999 | 0.997 | 0.997 | 0.997 | 1.000 | 1.000 | 1.000 | aggregate |
lof concat pcs | probe | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
lof mean | attribution | 0.946 | 0.997 | 0.997 | 0.977 | 1.000 | 1.000 | 0.979 | 1.000 | 1.000 | aggregate |
lof mean | probe | 0.947 | 1.000 | 1.000 | 0.947 | 1.000 | 1.000 | 0.965 | 1.000 | 1.000 | aggregate |
lof pcs | attribution | 0.974 | 0.995 | 0.997 | 0.987 | 1.000 | 0.999 | 0.998 | 1.000 | 1.000 | 25 |
lof pcs | probe | 0.930 | nan | 1.000 | 0.946 | nan | 1.000 | 0.962 | nan | 1.000 | 28 |
mahalanobis | activations | 0.900 | 1.000 | 1.000 | 0.898 | 1.000 | 1.000 | 0.902 | 1.000 | 1.000 | aggregate |
mahalanobis concat mean | attribution | 0.853 | 0.853 | 0.853 | 0.950 | 0.950 | 0.950 | 0.992 | 0.992 | 0.992 | aggregate |
mahalanobis concat mean | probe | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
mahalanobis grad norm | attribution | 0.509 | nan | 0.939 | 0.410 | nan | 0.980 | 0.652 | nan | 0.944 | 31 |
mahalanobis grad norm | attribution activations | 0.799 | nan | 0.999 | 0.871 | nan | 1.000 | 0.853 | nan | 1.000 | 31 |
mahalanobis mean | attribution | 0.856 | 0.993 | 0.993 | 0.898 | 1.000 | 1.000 | 0.923 | 1.000 | 1.000 | aggregate |
mahalanobis mean | attribution activations | 0.859 | 1.000 | 1.000 | 0.899 | 1.000 | 1.000 | 0.923 | 1.000 | 1.000 | aggregate |
mahalanobis mean | probe | 0.940 | 1.000 | 1.000 | 0.943 | 1.000 | 1.000 | 0.952 | 1.000 | 1.000 | 28 |
mahalanobis mean | probe activations | 0.243 | nan | 0.488 | 0.243 | nan | 0.493 | 0.244 | nan | 0.481 | 1 |
mahalanobis mean ensemble | attribution activations | 0.993 | 0.993 | 0.993 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
mahalanobis mean ensemble | probe activations | 0.903 | 1.000 | 1.000 | 0.905 | 1.000 | 1.000 | 0.914 | 1.000 | 1.000 | aggregate |
mahalanobis pcs | attribution | 0.830 | 0.763 | 0.998 | 0.928 | 0.961 | 1.000 | 0.949 | 0.888 | 1.000 | 28 |
mahalanobis pcs | probe | 0.665 | nan | 0.973 | 0.591 | nan | 0.963 | 0.778 | nan | 0.984 | 28 |
pca mahalanobis | activations | 0.990 | 0.990 | 0.990 | 0.985 | 0.985 | 0.985 | 0.996 | 0.996 | 0.996 | aggregate |
Hemisphere results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.426 | 0.426 | 0.426 | 0.519 | 0.519 | 0.519 | 0.294 | 0.294 | 0.294 | aggregate |
misconception contrast | | 0.743 | 0.842 | 0.842 | 0.748 | 0.856 | 0.856 | 0.780 | 0.921 | 0.921 | aggregate |
isoforest | activations | 0.666 | nan | 0.739 | 0.627 | nan | 0.674 | 0.719 | nan | 0.830 | 13 |
isoforest mean | attribution | 0.846 | 0.846 | 0.846 | 0.838 | 0.838 | 0.838 | 0.950 | 0.950 | 0.950 | aggregate |
isoforest pcs | attribution | 0.722 | nan | 0.765 | 0.696 | nan | 0.782 | 0.795 | nan | 0.783 | 28 |
lof | activations | 0.732 | 0.846 | 0.846 | 0.719 | 0.828 | 0.828 | 0.751 | 0.874 | 0.874 | aggregate |
lof concat mean | probe | 0.904 | 0.904 | 0.904 | 0.899 | 0.899 | 0.899 | 0.908 | 0.908 | 0.908 | aggregate |
lof concat pcs | probe | 0.733 | 0.733 | 0.733 | 0.718 | 0.718 | 0.718 | 0.756 | 0.756 | 0.756 | aggregate |
lof mean | attribution | 0.789 | 0.853 | 0.853 | 0.780 | 0.848 | 0.848 | 0.852 | 0.940 | 0.940 | aggregate |
lof mean | probe | 0.884 | 0.941 | 0.941 | 0.864 | 0.934 | 0.934 | 0.930 | 0.961 | 0.961 | aggregate |
lof pcs | attribution | 0.775 | 0.825 | 0.804 | 0.776 | 0.816 | 0.806 | 0.808 | 0.899 | 0.831 | 25 |
mahalanobis | activations | 0.821 | 0.988 | 0.988 | 0.801 | 0.988 | 0.988 | 0.855 | 0.991 | 0.991 | aggregate |
mahalanobis concat mean | attribution | 0.850 | 0.850 | 0.850 | 0.857 | 0.857 | 0.857 | 0.953 | 0.953 | 0.953 | aggregate |
mahalanobis concat mean | probe | 0.920 | 0.920 | 0.920 | 0.914 | 0.914 | 0.914 | 0.926 | 0.926 | 0.926 | aggregate |
mahalanobis grad norm | attribution | 0.694 | nan | 0.831 | 0.661 | nan | 0.808 | 0.764 | nan | 0.870 | 31 |
mahalanobis grad norm | attribution activations | 0.709 | nan | 0.584 | 0.693 | nan | 0.559 | 0.782 | nan | 0.666 | 31 |
mahalanobis mean | attribution | 0.759 | 0.666 | 0.666 | 0.736 | 0.624 | 0.624 | 0.827 | 0.728 | 0.728 | aggregate |
mahalanobis mean | attribution activations | 0.801 | 0.997 | 0.997 | 0.792 | 0.999 | 0.999 | 0.861 | 0.999 | 0.999 | aggregate |
mahalanobis mean | probe | 0.907 | 0.952 | 0.910 | 0.884 | 0.941 | 0.876 | 0.954 | 0.976 | 0.958 | 28 |
mahalanobis mean | probe activations | 0.479 | nan | 0.466 | 0.469 | nan | 0.451 | 0.495 | nan | 0.486 | 1 |
mahalanobis mean ensemble | attribution activations | 0.851 | 0.851 | 0.851 | 0.855 | 0.855 | 0.855 | 0.951 | 0.951 | 0.951 | aggregate |
mahalanobis mean ensemble | probe activations | 0.870 | 0.954 | 0.954 | 0.853 | 0.944 | 0.944 | 0.906 | 0.976 | 0.976 | aggregate |
mahalanobis pcs | attribution | 0.800 | 0.825 | 0.803 | 0.794 | 0.811 | 0.844 | 0.856 | 0.905 | 0.823 | 28 |
mahalanobis pcs | probe | 0.828 | nan | 0.822 | 0.795 | nan | 0.781 | 0.888 | nan | 0.892 | 28 |
pca mahalanobis | activations | 0.754 | 0.754 | 0.754 | 0.718 | 0.718 | 0.718 | 0.818 | 0.818 | 0.818 | aggregate |
Modular addition results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.591 | 0.591 | 0.591 | 0.594 | 0.594 | 0.594 | 0.587 | 0.587 | 0.587 | aggregate |
misconception contrast | | 0.499 | nan | nan | 0.499 | nan | nan | 0.499 | nan | nan | aggregate |
isoforest | activations | 0.735 | nan | 0.997 | 0.726 | nan | 0.996 | 0.744 | nan | 0.998 | 13 |
isoforest pcs | attribution | 0.586 | nan | 0.487 | 0.586 | nan | 0.498 | 0.585 | nan | 0.473 | 28 |
lof | activations | 0.864 | 0.999 | 0.999 | 0.863 | 0.999 | 0.999 | 0.865 | 0.999 | 0.999 | aggregate |
lof concat mean | probe | 0.915 | 0.915 | 0.915 | 0.909 | 0.909 | 0.909 | 0.922 | 0.922 | 0.922 | aggregate |
lof concat pcs | probe | 0.951 | 0.951 | 0.951 | 0.949 | 0.949 | 0.949 | 0.953 | 0.953 | 0.953 | aggregate |
lof mean | attribution | 0.791 | 0.885 | 0.885 | 0.785 | 0.872 | 0.872 | 0.796 | 0.900 | 0.900 | aggregate |
lof mean | probe | 0.888 | 0.960 | 0.960 | 0.877 | 0.942 | 0.942 | 0.900 | 0.979 | 0.979 | aggregate |
lof pcs | attribution | 0.792 | 0.869 | 0.744 | 0.793 | 0.872 | 0.743 | 0.791 | 0.866 | 0.744 | 25 |
lof pcs | probe | 0.897 | nan | 0.905 | 0.893 | nan | 0.895 | 0.900 | nan | 0.916 | 28 |
mahalanobis | activations | 0.885 | 1.000 | 1.000 | 0.881 | 1.000 | 1.000 | 0.889 | 1.000 | 1.000 | aggregate |
mahalanobis concat mean | attribution | 0.483 | 0.483 | 0.483 | 0.497 | 0.497 | 0.497 | 0.469 | 0.469 | 0.469 | aggregate |
mahalanobis concat mean | probe | 0.779 | 0.779 | 0.779 | 0.765 | 0.765 | 0.765 | 0.794 | 0.794 | 0.794 | aggregate |
mahalanobis grad norm | attribution | 0.580 | nan | 0.552 | 0.572 | nan | 0.559 | 0.590 | nan | 0.542 | 31 |
mahalanobis grad norm | attribution activations | 0.803 | nan | 0.676 | 0.797 | nan | 0.685 | 0.810 | nan | 0.667 | 31 |
mahalanobis mean | attribution | 0.677 | 0.815 | 0.815 | 0.675 | 0.807 | 0.807 | 0.679 | 0.823 | 0.823 | aggregate |
mahalanobis mean | attribution activations | 0.700 | 1.000 | 1.000 | 0.699 | 1.000 | 1.000 | 0.703 | 1.000 | 1.000 | aggregate |
mahalanobis mean | probe | 0.878 | 0.960 | 0.946 | 0.860 | 0.936 | 0.922 | 0.896 | 0.987 | 0.973 | 28 |
mahalanobis mean | probe activations | 0.385 | nan | 0.493 | 0.395 | nan | 0.500 | 0.375 | nan | 0.484 | 1 |
mahalanobis mean ensemble | attribution activations | 0.918 | 0.918 | 0.918 | 0.897 | 0.897 | 0.897 | 0.941 | 0.941 | 0.941 | aggregate |
mahalanobis mean ensemble | probe activations | 0.795 | 0.961 | 0.961 | 0.783 | 0.937 | 0.937 | 0.809 | 0.988 | 0.988 | aggregate |
mahalanobis pcs | attribution | 0.684 | 0.607 | 0.587 | 0.681 | 0.612 | 0.593 | 0.689 | 0.602 | 0.583 | 28 |
mahalanobis pcs | probe | 0.683 | nan | 0.760 | 0.677 | nan | 0.749 | 0.690 | nan | 0.772 | 28 |
pca mahalanobis | activations | 0.934 | 0.934 | 0.934 | 0.920 | 0.920 | 0.920 | 0.949 | 0.949 | 0.949 | aggregate |
Multiplication results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.576 | 0.576 | 0.576 | 0.630 | 0.630 | 0.630 | 0.529 | 0.529 | 0.529 | aggregate |
misconception contrast | | 0.497 | nan | nan | 0.492 | nan | nan | 0.500 | nan | nan | aggregate |
isoforest | activations | 0.821 | nan | 0.976 | 0.814 | nan | 0.977 | 0.829 | nan | 0.976 | 13 |
isoforest mean | attribution | 0.806 | 0.806 | 0.806 | 0.799 | 0.799 | 0.799 | 0.850 | 0.850 | 0.850 | aggregate |
isoforest pcs | attribution | 0.603 | nan | 0.758 | 0.590 | nan | 0.749 | 0.644 | nan | 0.810 | 28 |
lof | activations | 0.848 | 0.993 | 0.993 | 0.846 | 0.992 | 0.992 | 0.849 | 0.994 | 0.994 | aggregate |
lof concat mean | probe | 0.966 | 0.966 | 0.966 | 0.962 | 0.962 | 0.962 | 0.971 | 0.971 | 0.971 | aggregate |
lof concat pcs | probe | 0.912 | 0.912 | 0.912 | 0.887 | 0.887 | 0.887 | 0.940 | 0.940 | 0.940 | aggregate |
lof mean | attribution | 0.818 | 0.900 | 0.900 | 0.797 | 0.887 | 0.887 | 0.850 | 0.922 | 0.922 | aggregate |
lof mean | probe | 0.876 | 0.932 | 0.932 | 0.859 | 0.923 | 0.923 | 0.900 | 0.945 | 0.945 | aggregate |
lof pcs | attribution | 0.875 | 0.903 | 0.897 | 0.838 | 0.871 | 0.856 | 0.916 | 0.937 | 0.940 | 25 |
lof pcs | probe | 0.878 | nan | 0.919 | 0.845 | nan | 0.911 | 0.917 | nan | 0.931 | 28 |
mahalanobis | activations | 0.869 | 1.000 | 1.000 | 0.870 | 1.000 | 1.000 | 0.868 | 1.000 | 1.000 | aggregate |
mahalanobis concat mean | attribution | 0.766 | 0.766 | 0.766 | 0.735 | 0.735 | 0.735 | 0.885 | 0.885 | 0.885 | aggregate |
mahalanobis concat mean | probe | 0.959 | 0.959 | 0.959 | 0.954 | 0.954 | 0.954 | 0.966 | 0.966 | 0.966 | aggregate |
mahalanobis grad norm | attribution | 0.644 | nan | 0.727 | 0.616 | nan | 0.681 | 0.684 | nan | 0.826 | 31 |
mahalanobis grad norm | attribution activations | 0.566 | nan | 0.861 | 0.531 | nan | 0.842 | 0.626 | nan | 0.976 | 31 |
mahalanobis mean | attribution | 0.746 | 0.876 | 0.876 | 0.722 | 0.865 | 0.865 | 0.795 | 0.919 | 0.919 | aggregate |
mahalanobis mean | attribution activations | 0.756 | 1.000 | 1.000 | 0.734 | 1.000 | 1.000 | 0.801 | 1.000 | 1.000 | aggregate |
mahalanobis mean | probe | 0.777 | 0.877 | 0.930 | 0.761 | 0.873 | 0.930 | 0.810 | 0.898 | 0.937 | 28 |
mahalanobis mean | probe activations | 0.286 | nan | 0.520 | 0.283 | nan | 0.496 | 0.289 | nan | 0.546 | 1 |
mahalanobis mean ensemble | attribution activations | 0.896 | 0.896 | 0.896 | 0.888 | 0.888 | 0.888 | 0.929 | 0.929 | 0.929 | aggregate |
mahalanobis mean ensemble | probe activations | 0.764 | 0.897 | 0.897 | 0.750 | 0.895 | 0.895 | 0.791 | 0.913 | 0.913 | aggregate |
mahalanobis pcs | attribution | 0.745 | 0.502 | 0.831 | 0.728 | 0.444 | 0.820 | 0.807 | 0.574 | 0.892 | 28 |
mahalanobis pcs | probe | 0.671 | nan | 0.763 | 0.661 | nan | 0.795 | 0.713 | nan | 0.759 | 28 |
pca mahalanobis | activations | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
NLI results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.532 | 0.532 | 0.532 | 0.382 | 0.382 | 0.382 | 0.899 | 0.899 | 0.899 | aggregate |
misconception contrast | | 0.526 | 0.562 | 0.562 | 0.538 | 0.595 | 0.595 | 0.504 | 0.532 | 0.532 | aggregate |
isoforest | activations | 0.504 | nan | 0.523 | 0.508 | nan | 0.514 | 0.495 | nan | 0.549 | 13 |
isoforest mean | attribution | 0.880 | 0.880 | 0.880 | 0.879 | 0.879 | 0.879 | 0.971 | 0.971 | 0.971 | aggregate |
isoforest pcs | attribution | 0.815 | nan | 0.784 | 0.806 | nan | 0.781 | 0.923 | nan | 0.854 | 28 |
lof | activations | 0.522 | 0.513 | 0.513 | 0.531 | 0.525 | 0.525 | 0.505 | 0.494 | 0.494 | aggregate |
lof concat mean | probe | 0.880 | 0.880 | 0.880 | 0.870 | 0.870 | 0.870 | 0.978 | 0.978 | 0.978 | aggregate |
lof concat pcs | probe | 0.901 | 0.901 | 0.901 | 0.904 | 0.904 | 0.904 | 0.988 | 0.988 | 0.988 | aggregate |
lof mean | attribution | 0.844 | 0.925 | 0.925 | 0.836 | 0.930 | 0.930 | 0.941 | 0.989 | 0.989 | aggregate |
lof mean | probe | 0.834 | 0.889 | 0.889 | 0.818 | 0.887 | 0.887 | 0.946 | 0.974 | 0.974 | aggregate |
lof pcs | attribution | 0.890 | 0.939 | 0.978 | 0.893 | 0.947 | 0.977 | 0.966 | 0.992 | 0.990 | 25 |
lof pcs | probe | 0.876 | nan | 0.923 | 0.877 | nan | 0.940 | 0.975 | nan | 0.983 | 28 |
mahalanobis | activations | 0.546 | 0.568 | 0.568 | 0.548 | 0.565 | 0.565 | 0.545 | 0.575 | 0.575 | aggregate |
mahalanobis concat mean | attribution | 0.880 | 0.880 | 0.880 | 0.888 | 0.888 | 0.888 | 0.958 | 0.958 | 0.958 | aggregate |
mahalanobis concat mean | probe | 0.871 | 0.871 | 0.871 | 0.847 | 0.847 | 0.847 | 0.972 | 0.972 | 0.972 | aggregate |
mahalanobis grad norm | attribution | 0.819 | nan | 0.936 | 0.800 | nan | 0.938 | 0.949 | nan | 0.950 | 31 |
mahalanobis grad norm | attribution activations | 0.793 | nan | 0.847 | 0.779 | nan | 0.821 | 0.922 | nan | 0.923 | 31 |
mahalanobis mean | attribution | 0.828 | 0.908 | 0.908 | 0.821 | 0.916 | 0.916 | 0.930 | 0.975 | 0.975 | aggregate |
mahalanobis mean | attribution activations | 0.794 | 0.468 | 0.468 | 0.785 | 0.451 | 0.451 | 0.892 | 0.520 | 0.520 | aggregate |
mahalanobis mean | probe | 0.838 | 0.893 | 0.888 | 0.818 | 0.882 | 0.877 | 0.950 | 0.974 | 0.953 | 28 |
mahalanobis mean | probe activations | 0.487 | nan | 0.507 | 0.489 | nan | 0.509 | 0.484 | nan | 0.501 | 1 |
mahalanobis mean ensemble | attribution activations | 0.909 | 0.909 | 0.909 | 0.916 | 0.916 | 0.916 | 0.975 | 0.975 | 0.975 | aggregate |
mahalanobis mean ensemble | probe activations | 0.809 | 0.893 | 0.893 | 0.790 | 0.883 | 0.883 | 0.913 | 0.974 | 0.974 | aggregate |
mahalanobis pcs | attribution | 0.925 | 0.930 | 0.931 | 0.928 | 0.933 | 0.930 | 0.979 | 0.985 | 0.980 | 28 |
mahalanobis pcs | probe | 0.892 | nan | 0.909 | 0.890 | nan | 0.904 | 0.972 | nan | 0.970 | 28 |
pca mahalanobis | activations | 0.673 | 0.673 | 0.673 | 0.652 | 0.652 | 0.652 | 0.737 | 0.737 | 0.737 | aggregate |
Population results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.681 | 0.681 | 0.681 | 0.784 | 0.784 | 0.784 | 0.513 | 0.513 | 0.513 | aggregate |
misconception contrast | | 0.751 | 0.912 | 0.912 | 0.738 | 0.899 | 0.899 | 0.793 | 0.949 | 0.949 | aggregate |
isoforest | activations | 0.697 | nan | 0.971 | 0.702 | nan | 0.957 | 0.690 | nan | 0.990 | 13 |
isoforest mean | attribution | 0.912 | 0.912 | 0.912 | 0.877 | 0.877 | 0.877 | 0.964 | 0.964 | 0.964 | aggregate |
isoforest pcs | attribution | 0.670 | nan | 0.614 | 0.636 | nan | 0.535 | 0.725 | nan | 0.752 | 28 |
lof | activations | 0.864 | 0.987 | 0.987 | 0.845 | 0.980 | 0.980 | 0.889 | 0.997 | 0.997 | aggregate |
lof concat mean | probe | 0.879 | 0.879 | 0.879 | 0.835 | 0.835 | 0.835 | 0.940 | 0.940 | 0.940 | aggregate |
lof concat pcs | probe | 0.769 | 0.769 | 0.769 | 0.681 | 0.681 | 0.681 | 0.881 | 0.881 | 0.881 | aggregate |
lof mean | attribution | 0.776 | 0.909 | 0.909 | 0.724 | 0.859 | 0.859 | 0.856 | 0.970 | 0.970 | aggregate |
lof mean | probe | 0.885 | 0.930 | 0.930 | 0.841 | 0.895 | 0.895 | 0.944 | 0.974 | 0.974 | aggregate |
lof pcs | attribution | 0.737 | 0.797 | 0.853 | 0.681 | 0.728 | 0.823 | 0.826 | 0.891 | 0.903 | 25 |
mahalanobis | activations | 0.635 | 1.000 | 1.000 | 0.637 | 1.000 | 1.000 | 0.630 | 1.000 | 1.000 | aggregate |
mahalanobis concat mean | attribution | 0.878 | 0.878 | 0.878 | 0.831 | 0.831 | 0.831 | 0.941 | 0.941 | 0.941 | aggregate |
mahalanobis concat mean | probe | 0.923 | 0.923 | 0.923 | 0.888 | 0.888 | 0.888 | 0.968 | 0.968 | 0.968 | aggregate |
mahalanobis grad norm | attribution | 0.727 | nan | 0.789 | 0.669 | nan | 0.727 | 0.818 | nan | 0.877 | 31 |
mahalanobis grad norm | attribution activations | 0.710 | nan | 0.957 | 0.656 | nan | 0.948 | 0.790 | nan | 0.970 | 31 |
mahalanobis mean | attribution | 0.786 | 0.927 | 0.927 | 0.741 | 0.892 | 0.892 | 0.857 | 0.971 | 0.971 | aggregate |
mahalanobis mean | attribution activations | 0.794 | 1.000 | 1.000 | 0.751 | 1.000 | 1.000 | 0.862 | 1.000 | 1.000 | aggregate |
mahalanobis mean | probe | 0.932 | 0.948 | 0.942 | 0.901 | 0.921 | 0.916 | 0.971 | 0.983 | 0.978 | 28 |
mahalanobis mean | probe activations | 0.561 | nan | 0.530 | 0.527 | nan | 0.521 | 0.615 | nan | 0.546 | 1 |
mahalanobis mean ensemble | attribution activations | 0.939 | 0.939 | 0.939 | 0.906 | 0.906 | 0.906 | 0.979 | 0.979 | 0.979 | aggregate |
mahalanobis mean ensemble | probe activations | 0.877 | 0.953 | 0.953 | 0.845 | 0.927 | 0.927 | 0.919 | 0.986 | 0.986 | aggregate |
mahalanobis pcs | attribution | 0.824 | 0.839 | 0.864 | 0.782 | 0.789 | 0.833 | 0.896 | 0.917 | 0.908 | 28 |
mahalanobis pcs | probe | 0.793 | nan | 0.795 | 0.715 | nan | 0.719 | 0.899 | nan | 0.900 | 28 |
pca mahalanobis | activations | 0.978 | 0.978 | 0.978 | 0.970 | 0.970 | 0.970 | 0.989 | 0.989 | 0.989 | aggregate |
Sciq results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.405 | 0.405 | 0.405 | 0.300 | 0.300 | 0.300 | 0.199 | 0.199 | 0.199 | aggregate |
misconception contrast | | 0.571 | 0.605 | 0.605 | 0.510 | 0.492 | 0.492 | 0.576 | 0.543 | 0.543 | aggregate |
isoforest | activations | 0.437 | nan | 0.487 | 0.502 | nan | 0.526 | 0.441 | nan | 0.504 | 13 |
isoforest mean | attribution | 0.469 | 0.469 | 0.469 | 0.410 | 0.410 | 0.410 | 0.348 | 0.348 | 0.348 | aggregate |
isoforest pcs | attribution | 0.506 | nan | 0.622 | 0.497 | nan | 0.532 | 0.413 | nan | 0.446 | 28 |
lof | activations | 0.547 | 0.619 | 0.619 | 0.514 | 0.520 | 0.520 | 0.398 | 0.296 | 0.296 | aggregate |
lof concat mean | probe | 0.444 | 0.444 | 0.444 | 0.551 | 0.551 | 0.551 | 0.123 | 0.123 | 0.123 | aggregate |
lof concat pcs | probe | 0.441 | 0.441 | 0.441 | 0.587 | 0.587 | 0.587 | 0.079 | 0.079 | 0.079 | aggregate |
lof mean | attribution | 0.575 | 0.642 | 0.642 | 0.520 | 0.552 | 0.552 | 0.500 | 0.381 | 0.381 | aggregate |
lof mean | probe | 0.474 | 0.493 | 0.493 | 0.509 | 0.511 | 0.511 | 0.281 | 0.340 | 0.340 | aggregate |
lof pcs | attribution | 0.664 | 0.733 | 0.702 | 0.590 | 0.681 | 0.645 | 0.462 | 0.374 | 0.437 | 25 |
lof pcs | probe | 0.506 | nan | 0.468 | 0.526 | nan | 0.504 | 0.263 | nan | 0.240 | 28 |
mahalanobis | activations | 0.477 | 0.595 | 0.595 | 0.514 | 0.519 | 0.519 | 0.424 | 0.419 | 0.419 | aggregate |
mahalanobis concat mean | attribution | 0.548 | 0.548 | 0.548 | 0.431 | 0.431 | 0.431 | 0.346 | 0.346 | 0.346 | aggregate |
mahalanobis concat mean | probe | 0.476 | 0.476 | 0.476 | 0.504 | 0.504 | 0.504 | 0.295 | 0.295 | 0.295 | aggregate |
mahalanobis grad norm | attribution | 0.446 | nan | 0.368 | 0.442 | nan | 0.488 | 0.292 | nan | 0.265 | 31 |
mahalanobis grad norm | attribution activations | 0.528 | nan | 0.698 | 0.508 | nan | 0.615 | 0.379 | nan | 0.931 | 31 |
mahalanobis mean | attribution | 0.513 | 0.548 | 0.548 | 0.469 | 0.427 | 0.427 | 0.407 | 0.348 | 0.348 | aggregate |
mahalanobis mean | attribution activations | 0.518 | 0.599 | 0.599 | 0.477 | 0.490 | 0.490 | 0.438 | 0.757 | 0.757 | aggregate |
mahalanobis mean | probe | 0.494 | 0.494 | 0.548 | 0.490 | 0.494 | 0.458 | 0.381 | 0.333 | 0.447 | 28 |
mahalanobis mean | probe activations | 0.454 | nan | 0.456 | 0.486 | nan | 0.504 | 0.256 | nan | 0.144 | 1 |
mahalanobis mean ensemble | attribution activations | 0.549 | 0.549 | 0.549 | 0.427 | 0.427 | 0.427 | 0.348 | 0.348 | 0.348 | aggregate |
mahalanobis mean ensemble | probe activations | 0.478 | 0.490 | 0.490 | 0.492 | 0.492 | 0.492 | 0.372 | 0.331 | 0.331 | aggregate |
mahalanobis pcs | attribution | 0.624 | 0.589 | 0.698 | 0.531 | 0.496 | 0.630 | 0.371 | 0.343 | 0.435 | 28 |
mahalanobis pcs | probe | 0.561 | nan | 0.557 | 0.505 | nan | 0.464 | 0.483 | nan | 0.433 | 28 |
pca mahalanobis | activations | 0.367 | 0.367 | 0.367 | 0.496 | 0.496 | 0.496 | 0.195 | 0.195 | 0.195 | aggregate |
Sentiment results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | | 0.298 | 0.298 | 0.298 | 0.299 | 0.299 | 0.299 | 0.246 | 0.246 | 0.246 | aggregate |
misconception contrast | | 0.526 | 0.523 | 0.523 | 0.531 | 0.532 | 0.532 | 0.532 | 0.506 | 0.506 | aggregate |
isoforest | activations | 0.521 | nan | 0.496 | 0.522 | nan | 0.505 | 0.526 | nan | 0.484 | 13 |
isoforest mean | attribution | 0.377 | 0.377 | 0.377 | 0.508 | 0.508 | 0.508 | 0.238 | 0.238 | 0.238 | aggregate |
isoforest pcs | attribution | 0.411 | nan | 0.438 | 0.483 | nan | 0.543 | 0.296 | nan | 0.205 | 28 |
lof | activations | 0.619 | 0.658 | 0.658 | 0.659 | 0.725 | 0.725 | 0.581 | 0.614 | 0.614 | aggregate |
lof concat mean | probe | 0.426 | 0.426 | 0.426 | 0.566 | 0.566 | 0.566 | 0.266 | 0.266 | 0.266 | aggregate |
lof concat pcs | probe | 0.412 | 0.412 | 0.412 | 0.518 | 0.518 | 0.518 | 0.294 | 0.294 | 0.294 | aggregate |
lof mean | attribution | 0.448 | 0.434 | 0.434 | 0.535 | 0.534 | 0.534 | 0.351 | 0.318 | 0.318 | aggregate |
lof mean | probe | 0.419 | 0.409 | 0.409 | 0.506 | 0.495 | 0.495 | 0.301 | 0.287 | 0.287 | aggregate |
lof pcs | attribution | 0.531 | 0.552 | 0.817 | 0.603 | 0.636 | 0.755 | 0.455 | 0.446 | 0.936 | 25 |
lof pcs | probe | 0.420 | nan | 0.464 | 0.518 | nan | 0.535 | 0.309 | nan | 0.378 | 28 |
mahalanobis | activations | 0.607 | 0.687 | 0.687 | 0.633 | 0.721 | 0.721 | 0.604 | 0.721 | 0.721 | aggregate |
mahalanobis concat mean | attribution | 0.424 | 0.424 | 0.424 | 0.573 | 0.573 | 0.573 | 0.243 | 0.243 | 0.243 | aggregate |
mahalanobis concat mean | probe | 0.450 | 0.450 | 0.450 | 0.590 | 0.590 | 0.590 | 0.292 | 0.292 | 0.292 | aggregate |
mahalanobis grad norm | attribution | 0.491 | nan | 0.539 | 0.488 | nan | 0.577 | 0.488 | nan | 0.499 | 31 |
mahalanobis grad norm | attribution activations | 0.406 | nan | 0.713 | 0.479 | nan | 0.859 | 0.305 | nan | 0.768 | 31 |
mahalanobis mean | attribution | 0.424 | 0.404 | 0.404 | 0.518 | 0.549 | 0.549 | 0.288 | 0.240 | 0.240 | aggregate |
mahalanobis mean | attribution activations | 0.439 | 0.586 | 0.586 | 0.533 | 0.736 | 0.736 | 0.313 | 0.545 | 0.545 | aggregate |
mahalanobis mean | probe | 0.404 | 0.397 | 0.442 | 0.477 | 0.478 | 0.520 | 0.301 | 0.283 | 0.339 | 28 |
mahalanobis mean | probe activations | 0.460 | nan | 0.512 | 0.457 | nan | 0.512 | 0.465 | nan | 0.515 | 1 |
mahalanobis mean ensemble | attribution activations | 0.435 | 0.435 | 0.435 | 0.574 | 0.574 | 0.574 | 0.264 | 0.264 | 0.264 | aggregate |
mahalanobis mean ensemble | probe activations | 0.414 | 0.426 | 0.426 | 0.482 | 0.507 | 0.507 | 0.317 | 0.319 | 0.319 | aggregate |
mahalanobis pcs | attribution | 0.386 | 0.372 | 0.528 | 0.449 | 0.442 | 0.513 | 0.272 | 0.246 | 0.562 | 28 |
mahalanobis pcs | probe | 0.372 | nan | 0.388 | 0.424 | nan | 0.434 | 0.280 | nan | 0.308 | 28 |
pca mahalanobis | activations | 0.760 | 0.760 | 0.760 | 0.738 | 0.738 | 0.738 | 0.827 | 0.827 | 0.827 | aggregate |
Squaring results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
isoforest pcs | attribution | 0.691 | nan | 0.735 | 0.620 | nan | 0.662 | 0.777 | nan | 0.829 | 28 |
lof concat mean | probe | 0.990 | 0.990 | 0.990 | 0.986 | 0.986 | 0.986 | 0.997 | 0.997 | 0.997 | aggregate |
lof concat pcs | probe | 0.969 | 0.969 | 0.969 | 0.954 | 0.954 | 0.954 | 0.986 | 0.986 | 0.986 | aggregate |
lof mean | attribution | 0.931 | 0.966 | 0.966 | 0.905 | 0.947 | 0.947 | 0.966 | 0.990 | 0.990 | aggregate |
lof mean | probe | 0.953 | 0.976 | 0.976 | 0.937 | 0.967 | 0.967 | 0.975 | 0.988 | 0.988 | aggregate |
lof pcs | attribution | 0.956 | 0.969 | 0.951 | 0.933 | 0.950 | 0.922 | 0.981 | 0.991 | 0.980 | 25 |
lof pcs | probe | 0.953 | nan | 0.953 | 0.928 | nan | 0.943 | 0.981 | nan | 0.964 | 28 |
mahalanobis concat mean | attribution | 0.935 | 0.935 | 0.935 | 0.901 | 0.901 | 0.901 | 0.987 | 0.987 | 0.987 | aggregate |
mahalanobis concat mean | probe | 0.990 | 0.990 | 0.990 | 0.985 | 0.985 | 0.985 | 0.996 | 0.996 | 0.996 | aggregate |
mahalanobis grad norm | attribution | 0.680 | nan | 0.842 | 0.590 | nan | 0.742 | 0.786 | nan | 0.996 | 31 |
mahalanobis mean | attribution | 0.891 | 0.942 | 0.942 | 0.851 | 0.915 | 0.915 | 0.946 | 0.984 | 0.984 | aggregate |
mahalanobis mean | probe | 0.905 | 0.965 | 0.994 | 0.874 | 0.951 | 0.990 | 0.949 | 0.984 | 0.998 | 28 |
mahalanobis mean | probe activations | 0.369 | nan | 0.478 | 0.357 | nan | 0.470 | 0.381 | nan | 0.487 | 1 |
mahalanobis mean ensemble | attribution activations | 0.961 | 0.961 | 0.961 | 0.945 | 0.945 | 0.945 | 0.986 | 0.986 | 0.986 | aggregate |
mahalanobis mean ensemble | probe activations | 0.876 | 0.972 | 0.972 | 0.848 | 0.961 | 0.961 | 0.915 | 0.986 | 0.986 | aggregate |
mahalanobis pcs | attribution | 0.799 | 0.673 | 0.815 | 0.728 | 0.571 | 0.741 | 0.900 | 0.784 | 0.922 | 28 |
mahalanobis pcs | probe | 0.699 | nan | 0.779 | 0.616 | nan | 0.731 | 0.797 | nan | 0.845 | 28 |
Subtraction results: online methods
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
iterative rephrase | n/a | 0.496 | 0.496 | 0.496 | 0.477 | 0.477 | 0.477 | 0.512 | 0.512 | 0.512 | aggregate |
misconception contrast | n/a | 0.411 | nan | nan | 0.392 | nan | nan | 0.412 | nan | nan | aggregate |
isoforest | activations | 0.872 | nan | 1.000 | 0.871 | nan | 1.000 | 0.873 | nan | 1.000 | 13 |
isoforest mean | attribution | 0.896 | 0.896 | 0.896 | 0.949 | 0.949 | 0.949 | 0.978 | 0.978 | 0.978 | aggregate |
isoforest pcs | attribution | 0.626 | nan | 0.908 | 0.656 | nan | 0.939 | 0.682 | nan | 0.949 | 28 |
lof | activations | 0.889 | 0.999 | 0.999 | 0.888 | 0.999 | 0.999 | 0.891 | 1.000 | 1.000 | aggregate |
lof concat mean | probe | 0.996 | 0.996 | 0.996 | 0.993 | 0.993 | 0.993 | 1.000 | 1.000 | 1.000 | aggregate |
lof concat pcs | probe | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
lof mean | attribution | 0.893 | 0.975 | 0.975 | 0.893 | 0.981 | 0.981 | 0.936 | 0.998 | 0.998 | aggregate |
lof mean | probe | 0.928 | 0.998 | 0.998 | 0.924 | 0.995 | 0.995 | 0.954 | 1.000 | 1.000 | aggregate |
lof pcs | attribution | 0.880 | 0.937 | 0.852 | 0.909 | 0.962 | 0.895 | 0.955 | 0.991 | 0.958 | 25 |
lof pcs | probe | 0.893 | nan | 0.973 | 0.940 | nan | 0.978 | 0.922 | nan | 0.980 | 28 |
mahalanobis | activations | 0.908 | 1.000 | 1.000 | 0.906 | 1.000 | 1.000 | 0.909 | 1.000 | 1.000 | aggregate |
mahalanobis concat mean | attribution | 0.934 | 0.934 | 0.934 | 0.959 | 0.959 | 0.959 | 0.998 | 0.998 | 0.998 | aggregate |
mahalanobis concat mean | probe | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
mahalanobis grad norm | attribution | 0.543 | nan | 0.947 | 0.515 | nan | 0.964 | 0.602 | nan | 0.992 | 31 |
mahalanobis grad norm | attribution activations | 0.663 | nan | 1.000 | 0.688 | nan | 0.999 | 0.670 | nan | 1.000 | 31 |
mahalanobis mean | attribution | 0.821 | 0.969 | 0.969 | 0.807 | 0.976 | 0.976 | 0.872 | 0.997 | 0.997 | aggregate |
mahalanobis mean | attribution activations | 0.822 | 1.000 | 1.000 | 0.807 | 1.000 | 1.000 | 0.872 | 1.000 | 1.000 | aggregate |
mahalanobis mean | probe | 0.886 | 0.998 | 0.993 | 0.866 | 0.996 | 0.992 | 0.903 | 0.999 | 0.998 | 28 |
mahalanobis mean | probe activations | 0.298 | nan | 0.495 | 0.295 | nan | 0.493 | 0.300 | nan | 0.496 | 1 |
mahalanobis mean ensemble | attribution activations | 0.976 | 0.976 | 0.976 | 0.978 | 0.978 | 0.978 | 0.997 | 0.997 | 0.997 | aggregate |
mahalanobis mean ensemble | probe activations | 0.855 | 0.998 | 0.998 | 0.835 | 0.996 | 0.996 | 0.871 | 1.000 | 1.000 | aggregate |
mahalanobis pcs | attribution | 0.824 | 0.505 | 0.991 | 0.864 | 0.395 | 0.991 | 0.923 | 0.622 | 0.997 | 28 |
mahalanobis pcs | probe | 0.830 | nan | 0.933 | 0.872 | nan | 0.963 | 0.880 | nan | 0.949 | 28 |
pca mahalanobis | activations | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
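For concreteness, here is a minimal sketch of the simplest detector named in these tables, `mahalanobis | activations`: fit a Gaussian to trusted-data activations at one layer, then score test examples by squared Mahalanobis distance. This is a sketch rather than our exact implementation, and all variable names are illustrative.

```python
# Minimal sketch of a Mahalanobis-distance detector over activations.
# Not our exact implementation; names are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def fit_gaussian(trusted_acts):
    """trusted_acts: (n_trusted, d_model) activations at one layer."""
    mu = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False)
    # Regularise: activation covariances are often ill-conditioned.
    prec = np.linalg.pinv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mu, prec

def anomaly_scores(test_acts, mu, prec):
    """Squared Mahalanobis distance; higher = more anomalous."""
    centred = test_acts - mu
    return np.einsum("ij,jk,ik->i", centred, prec, centred)

# labels: 1 for the anomalous (Bob-like) test examples.
# auroc = roc_auc_score(labels, anomaly_scores(test_acts, *fit_gaussian(trusted_acts)))
```

The `lof` rows keep the same fit-on-trusted, score-on-test structure but swap the Gaussian for a local-outlier-factor score (e.g. scikit-learn's `LocalOutlierFactor` with `novelty=True`).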
Offline results
Addition: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
em | activations | 0.874 | nan | 1.000 | 0.878 | nan | 1.000 | 0.870 | nan | 1.000 | 19 |
likelihood | activations | 0.879 | nan | 1.000 | 0.882 | nan | 1.000 | 0.875 | nan | 1.000 | 19 |
que | activations | 0.919 | 1.000 | 1.000 | 0.918 | 1.000 | 1.000 | 0.919 | 1.000 | 1.000 | aggregate |
que concat mean | attribution | 0.850 | 0.850 | 0.850 | 0.862 | 0.862 | 0.862 | 0.974 | 0.974 | 0.974 | aggregate |
que concat mean | probe | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
que grad norm | attribution | 0.573 | 0.874 | 0.973 | 0.452 | 0.984 | 0.987 | 0.694 | 0.866 | 0.974 | 31 |
que mean | attribution | 0.856 | nan | 0.937 | 0.883 | nan | 0.953 | 0.907 | nan | 0.953 | 25 |
que mean | probe | 0.951 | 1.000 | 1.000 | 0.954 | 1.000 | 1.000 | 0.960 | 1.000 | 1.000 | 28 |
que pcs | attribution | 0.873 | 0.949 | 0.943 | 0.904 | 0.998 | 0.970 | 0.938 | 0.988 | 0.988 | 25 |
que pcs | probe | 0.905 | 1.000 | 1.000 | 0.903 | 1.000 | 1.000 | 0.938 | 1.000 | 1.000 | 28 |
Hemisphere: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
em | activations | 0.838 | nan | 0.980 | 0.841 | nan | 0.983 | 0.837 | nan | 0.976 | 19 |
likelihood | activations | 0.838 | nan | 0.980 | 0.841 | nan | 0.983 | 0.838 | nan | 0.976 | 19 |
que | activations | 0.892 | 1.000 | 1.000 | 0.885 | 1.000 | 1.000 | 0.911 | 1.000 | 1.000 | aggregate |
que concat mean | attribution | 0.912 | 0.912 | 0.912 | 0.911 | 0.911 | 0.911 | 0.991 | 0.991 | 0.991 | aggregate |
que concat mean | probe | 0.989 | 0.989 | 0.989 | 0.986 | 0.986 | 0.986 | 0.996 | 0.996 | 0.996 | aggregate |
que grad norm | attribution | 0.710 | 0.817 | 0.862 | 0.643 | 0.805 | 0.811 | 0.833 | 0.909 | 0.927 | 31 |
que mean | attribution | 0.634 | nan | 0.769 | 0.628 | nan | 0.782 | 0.643 | nan | 0.752 | 25 |
que mean | probe | 0.941 | 0.989 | 0.958 | 0.919 | 0.985 | 0.932 | 0.979 | 0.998 | 0.988 | 28 |
que pcs | attribution | 0.893 | 0.943 | 0.894 | 0.885 | 0.951 | 0.898 | 0.958 | 0.988 | 0.959 | 25 |
que pcs | probe | 0.930 | 0.978 | 0.950 | 0.906 | 0.974 | 0.923 | 0.978 | 0.992 | 0.987 | 28 |
Modular addition: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
likelihood | activations | 0.887 | nan | 1.000 | 0.884 | nan | 1.000 | 0.891 | nan | 1.000 | 19 |
que | activations | 0.915 | 1.000 | 1.000 | 0.912 | 1.000 | 1.000 | 0.918 | 1.000 | 1.000 | aggregate |
que concat mean | attribution | 0.464 | 0.464 | 0.464 | 0.480 | 0.480 | 0.480 | 0.448 | 0.448 | 0.448 | aggregate |
que concat mean | probe | 0.815 | 0.815 | 0.815 | 0.797 | 0.797 | 0.797 | 0.834 | 0.834 | 0.834 | aggregate |
que grad norm | attribution | 0.520 | 0.501 | 0.483 | 0.528 | 0.508 | 0.486 | 0.513 | 0.496 | 0.480 | 31 |
que mean | attribution | 0.638 | nan | 0.616 | 0.640 | nan | 0.623 | 0.637 | nan | 0.610 | 25 |
que mean | probe | 0.837 | 0.996 | 0.927 | 0.826 | 0.993 | 0.905 | 0.849 | 1.000 | 0.951 | 28 |
que pcs | attribution | 0.616 | 0.639 | 0.644 | 0.622 | 0.643 | 0.645 | 0.611 | 0.635 | 0.644 | 25 |
que pcs | probe | 0.793 | 0.975 | 0.911 | 0.788 | 0.954 | 0.891 | 0.798 | 0.997 | 0.934 | 28 |
Multiplication: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
likelihood | activations | 0.861 | nan | 1.000 | 0.861 | nan | 1.000 | 0.860 | nan | 1.000 | 19 |
que | activations | 0.883 | 1.000 | 1.000 | 0.885 | 1.000 | 1.000 | 0.882 | 1.000 | 1.000 | aggregate |
que concat mean | attribution | 0.850 | 0.850 | 0.850 | 0.805 | 0.805 | 0.805 | 0.961 | 0.961 | 0.961 | aggregate |
que concat mean | probe | 0.991 | 0.991 | 0.991 | 0.992 | 0.992 | 0.992 | 0.990 | 0.990 | 0.990 | aggregate |
que grad norm | attribution | 0.713 | 0.826 | 0.823 | 0.678 | 0.816 | 0.790 | 0.754 | 0.875 | 0.899 | 31 |
que mean | attribution | 0.776 | nan | 0.816 | 0.747 | nan | 0.774 | 0.825 | nan | 0.896 | 25 |
que mean | probe | 0.844 | 0.963 | 0.981 | 0.829 | 0.960 | 0.983 | 0.870 | 0.970 | 0.981 | 28 |
que pcs | attribution | 0.859 | 0.894 | 0.912 | 0.838 | 0.873 | 0.887 | 0.912 | 0.952 | 0.970 | 25 |
que pcs | probe | 0.909 | 0.969 | 0.996 | 0.900 | 0.967 | 0.996 | 0.932 | 0.973 | 0.996 | 28 |
NLI: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
em | activations | 0.534 | nan | 0.500 | 0.548 | nan | 0.507 | 0.511 | nan | 0.488 | 19 |
likelihood | activations | 0.534 | nan | 0.501 | 0.548 | nan | 0.508 | 0.511 | nan | 0.488 | 19 |
que | activations | 0.630 | 0.814 | 0.814 | 0.637 | 0.818 | 0.818 | 0.611 | 0.805 | 0.805 | aggregate |
que concat mean | attribution | 0.905 | 0.905 | 0.905 | 0.906 | 0.906 | 0.906 | 0.968 | 0.968 | 0.968 | aggregate |
que concat mean | probe | 0.916 | 0.916 | 0.916 | 0.894 | 0.894 | 0.894 | 0.970 | 0.970 | 0.970 | aggregate |
que grad norm | attribution | 0.727 | 0.889 | 0.918 | 0.657 | 0.867 | 0.886 | 0.932 | 0.985 | 0.999 | 31 |
que mean | attribution | 0.822 | nan | 0.954 | 0.805 | nan | 0.949 | 0.928 | nan | 0.978 | 25 |
que mean | probe | 0.848 | 0.920 | 0.917 | 0.823 | 0.900 | 0.904 | 0.944 | 0.982 | 0.958 | 28 |
que pcs | attribution | 0.926 | 0.960 | 0.988 | 0.924 | 0.955 | 0.986 | 0.969 | 0.993 | 0.994 | 25 |
que pcs | probe | 0.938 | 0.961 | 0.959 | 0.933 | 0.956 | 0.951 | 0.984 | 0.990 | 0.984 | 28 |
Population: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
em | activations | 0.746 | nan | 0.956 | 0.764 | nan | 0.951 | 0.713 | nan | 0.968 | 19 |
likelihood | activations | 0.746 | nan | 0.955 | 0.764 | nan | 0.950 | 0.712 | nan | 0.968 | 19 |
que | activations | 0.875 | 1.000 | 1.000 | 0.877 | 1.000 | 1.000 | 0.876 | 1.000 | 1.000 | aggregate |
que concat mean | attribution | 0.969 | 0.969 | 0.969 | 0.952 | 0.952 | 0.952 | 0.990 | 0.990 | 0.990 | aggregate |
que concat mean | probe | 0.980 | 0.980 | 0.980 | 0.966 | 0.966 | 0.966 | 0.994 | 0.994 | 0.994 | aggregate |
que grad norm | attribution | 0.569 | 0.798 | 0.694 | 0.550 | 0.742 | 0.572 | 0.603 | 0.896 | 0.847 | 31 |
que mean | attribution | 0.789 | nan | 0.778 | 0.748 | nan | 0.723 | 0.851 | nan | 0.860 | 25 |
que mean | probe | 0.967 | 0.988 | 0.974 | 0.950 | 0.979 | 0.958 | 0.987 | 0.997 | 0.992 | 28 |
que pcs | attribution | 0.869 | 0.955 | 0.884 | 0.825 | 0.943 | 0.854 | 0.935 | 0.982 | 0.938 | 25 |
que pcs | probe | 0.939 | 0.948 | 0.936 | 0.901 | 0.915 | 0.897 | 0.981 | 0.984 | 0.981 | 28 |
SciQ: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
em | activations | 0.463 | nan | 0.396 | 0.527 | nan | 0.539 | 0.412 | nan | 0.135 | 19 |
likelihood | activations | 0.465 | nan | 0.396 | 0.531 | nan | 0.538 | 0.400 | nan | 0.110 | 19 |
que | activations | 0.481 | 0.637 | 0.637 | 0.530 | 0.539 | 0.539 | 0.472 | 0.881 | 0.881 | aggregate |
que concat mean | attribution | 0.500 | 0.500 | 0.500 | 0.365 | 0.365 | 0.365 | 0.550 | 0.550 | 0.550 | aggregate |
que concat mean | probe | 0.406 | 0.406 | 0.406 | 0.514 | 0.514 | 0.514 | 0.459 | 0.459 | 0.459 | aggregate |
que grad norm | attribution | 0.436 | 0.341 | 0.346 | 0.452 | 0.437 | 0.456 | 0.271 | 0.270 | 0.232 | 31 |
que mean | attribution | 0.496 | nan | 0.622 | 0.479 | nan | 0.537 | 0.444 | nan | 0.670 | 25 |
que mean | probe | 0.409 | 0.369 | 0.444 | 0.482 | 0.473 | 0.473 | 0.391 | 0.327 | 0.695 | 28 |
que pcs | attribution | 0.548 | 0.572 | 0.623 | 0.474 | 0.463 | 0.551 | 0.527 | 0.450 | 0.551 | 25 |
que pcs | probe | 0.453 | 0.388 | 0.435 | 0.477 | 0.468 | 0.475 | 0.488 | 0.481 | 0.760 | 28 |
Sentiment: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
em | activations | 0.636 | nan | 0.790 | 0.674 | nan | 0.825 | 0.609 | nan | 0.741 | 19 |
likelihood | activations | 0.642 | nan | 0.806 | 0.684 | nan | 0.847 | 0.611 | nan | 0.750 | 19 |
que | activations | 0.746 | 0.975 | 0.975 | 0.771 | 0.974 | 0.974 | 0.736 | 0.993 | 0.993 | aggregate |
que concat mean | attribution | 0.586 | 0.586 | 0.586 | 0.706 | 0.706 | 0.706 | 0.496 | 0.496 | 0.496 | aggregate |
que concat mean | probe | 0.547 | 0.547 | 0.547 | 0.718 | 0.718 | 0.718 | 0.340 | 0.340 | 0.340 | aggregate |
que grad norm | attribution | 0.548 | 0.486 | 0.800 | 0.538 | 0.556 | 0.814 | 0.568 | 0.391 | 0.783 | 31 |
que mean | attribution | 0.448 | nan | 0.417 | 0.540 | nan | 0.625 | 0.323 | nan | 0.191 | 25 |
que mean | probe | 0.414 | 0.382 | 0.455 | 0.486 | 0.496 | 0.526 | 0.312 | 0.264 | 0.361 | 28 |
que pcs | attribution | 0.547 | 0.645 | 0.851 | 0.605 | 0.687 | 0.793 | 0.467 | 0.591 | 0.926 | 25 |
que pcs | probe | 0.410 | 0.408 | 0.454 | 0.471 | 0.470 | 0.498 | 0.316 | 0.307 | 0.389 | 28 |
Squaring: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
que concat mean | attribution | 0.953 | 0.953 | 0.953 | 0.926 | 0.926 | 0.926 | 0.995 | 0.995 | 0.995 | aggregate |
que concat mean | probe | 0.998 | 0.998 | 0.998 | 0.997 | 0.997 | 0.997 | 1.000 | 1.000 | 1.000 | aggregate |
que grad norm | attribution | 0.705 | 0.761 | 0.715 | 0.678 | 0.719 | 0.637 | 0.730 | 0.805 | 0.791 | 31 |
que mean | attribution | 0.910 | nan | 0.989 | 0.874 | nan | 0.984 | 0.956 | nan | 0.996 | 25 |
que mean | probe | 0.937 | 0.995 | 1.000 | 0.915 | 0.991 | 0.999 | 0.967 | 0.999 | 1.000 | 28 |
que pcs | attribution | 0.908 | 0.926 | 0.859 | 0.856 | 0.883 | 0.790 | 0.981 | 0.990 | 0.961 | 25 |
que pcs | probe | 0.931 | 0.973 | 0.995 | 0.894 | 0.959 | 0.992 | 0.984 | 0.991 | 0.998 | 28 |
Subtraction: offline results
score | features | mean_auc_roc | aggregated_auc_roc | best_auc_roc | mean_auc_roc_agree | aggregated_auc_roc_agree | best_auc_roc_agree | mean_auc_roc_disagree | aggregated_auc_roc_disagree | best_auc_roc_disagree | best_layer |
---|---|---|---|---|---|---|---|---|---|---|---|
em | activations | 0.898 | nan | 1.000 | 0.896 | nan | 1.000 | 0.899 | nan | 1.000 | 19 |
likelihood | activations | 0.902 | nan | 1.000 | 0.901 | nan | 1.000 | 0.904 | nan | 1.000 | 19 |
que | activations | 0.915 | 1.000 | 1.000 | 0.914 | 1.000 | 1.000 | 0.917 | 1.000 | 1.000 | aggregate |
que concat mean | attribution | 0.924 | 0.924 | 0.924 | 0.938 | 0.938 | 0.938 | 0.999 | 0.999 | 0.999 | aggregate |
que concat mean | probe | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | aggregate |
que grad norm | attribution | 0.534 | 0.592 | 0.663 | 0.572 | 0.695 | 0.640 | 0.536 | 0.646 | 0.744 | 31 |
que mean | attribution | 0.816 | nan | 0.982 | 0.792 | nan | 0.980 | 0.876 | nan | 0.994 | 25 |
que mean | probe | 0.905 | 1.000 | 1.000 | 0.887 | 1.000 | 1.000 | 0.923 | 1.000 | 1.000 | 28 |
que pcs | attribution | 0.866 | 0.922 | 0.956 | 0.916 | 0.971 | 0.973 | 0.973 | 0.994 | 0.983 | 25 |
que pcs | probe | 0.963 | 1.000 | 0.999 | 0.978 | 1.000 | 1.000 | 0.996 | 1.000 | 0.998 | 28 |
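A note on reading the three recurring AUROC columns. Per-layer detectors produce one score per layer: `mean_auc_roc` averages the per-layer AUROCs, `best_auc_roc` reports the AUROC at the single `best_layer` (selected separately, which is why it can occasionally fall below the mean), and `aggregated_auc_roc` is the AUROC of one score combining all layers (`nan` where no aggregate was computed; for single-score detectors such as `concat mean`, all three columns coincide). A hedged sketch follows; the combination rule shown (summing standardised per-layer scores) is an assumption for illustration, not necessarily the rule used to produce the tables.

```python
# Sketch of the per-layer AUROC summaries reported above. The
# aggregation rule (summing standardised per-layer scores) is an
# assumption made for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def summarise(scores_by_layer, labels, best_layer):
    """scores_by_layer: (n_layers, n_test); labels: 1 = anomalous."""
    per_layer = np.array([roc_auc_score(labels, s) for s in scores_by_layer])
    # Standardise each layer so no single layer dominates the combined score.
    z = (scores_by_layer - scores_by_layer.mean(axis=1, keepdims=True)) / (
        scores_by_layer.std(axis=1, keepdims=True) + 1e-9
    )
    return {
        "mean_auc_roc": per_layer.mean(),
        "aggregated_auc_roc": roc_auc_score(labels, z.sum(axis=0)),
        # best_layer is chosen elsewhere (e.g. on held-out data), so this
        # value can fall below the mean on the test split.
        "best_auc_roc": per_layer[best_layer],
    }
```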