When Your AI Explains Itself (And Gets It Wrong): Rethinking Trust in Language Models for Food Safety

Imagine you’re feeling unwell. You visit an online clinic and describe your symptoms: stomach pain, diarrhea, and a slight fever right after eating some questionable chicken. The system listens and says, “High likelihood of Salmonella poisoning.” Makes sense. But what really reassures you is what comes next: the AI explains itself.

“If you hadn’t mentioned a fever, I would have said it’s unlikely to be Salmonella.”

It’s a clean, logical explanation. You feel better, not because of the diagnosis, but because the AI seems to reason like a doctor would.

Now picture this: someone removes the word “fever” from your report and resubmits it. The AI still says “high likelihood of Salmonella.” Nothing changed. So, was that explanation… a lie?

Welcome to the world of plausible but unfaithful or invalid self-explanations: a critical blind spot in how we evaluate and trust AI systems, especially in high-stakes areas like food safety.

The Problem: When AI Reasoning Is Just Window Dressing

With the rapid rise of large language models (LLMs), we’re seeing these systems embedded in everything from chatbots to healthcare assistants. But alongside their fluency comes a pressing challenge: interpretability. If a model suggests quarantining a food product line or initiating an outbreak investigation, we need to know why.

That’s where self-explanations come in. These are pieces of text that LLMs generate, sometimes even by default, to justify their outputs. LLMs have learned to justify their conclusions much as humans would (remember that LLMs are trained on human-written texts and therefore mimic human behaviour). In our paper Mind the Gap: From Plausible to Valid Self-Explanations in Large Language Models, we explore the issue of plausible but inaccurate self-explanations, focusing on two types:

  • Extractive self-explanations: Highlight parts of the input that were important (e.g., “The phrase ‘mild fever’ influenced my decision.”).
  • Counterfactual self-explanations: Hypothesize what would happen if the input were different (e.g., “If the patient had not reported a fever, I would have predicted a low likelihood of Salmonella.”).
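
To make the distinction concrete, here is roughly what prompting a model for each type can look like. These prompt strings are illustrative assumptions, not the prompts used in the paper.

```python
# Illustrative prompts for eliciting the two types of self-explanations.
# The wording is hypothetical; these are not the paper's prompts.

EXTRACTIVE_PROMPT = (
    "You classified the report as 'high likelihood of Salmonella'. "
    "Quote the exact phrases from the report that most influenced this decision."
)

COUNTERFACTUAL_PROMPT = (
    "You classified the report as 'high likelihood of Salmonella'. "
    "Rewrite the report with the smallest possible change that would make you "
    "predict 'low likelihood of Salmonella' instead."
)
```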

But before we dive in, let’s clarify some nomenclature. In the scientific community, the accuracy of explanations is measured in different terms:

  • Faithfulness: Does the explanation correctly describe why the model made its decision? This applies broadly to all explanations.
  • Validity: Is a counterfactual explanation actually true? That is, if the model says “Without X, I’d predict Y,” does removing X really change the output to Y?

Faithfulness asks: Does the explanation reflect the model’s internal workings? Validity asks: Is the claimed “what if” scenario actually true?

Faithfulness is a general property; it applies whether the explanation is a sentence, a heatmap, or an attention score. But validity is specific to counterfactuals.
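
In code, validity boils down to a round-trip check: feed the edited input back to the model and see whether the prediction really flips. Here is a minimal sketch, assuming a hypothetical `classify(text)` helper that returns the model’s predicted label (it stands in for whatever inference call you use and is not an API from the paper).

```python
# Validity check for a counterfactual self-explanation.
# `classify(text)` is a hypothetical helper that returns the model's
# predicted label for the given input text.

def is_valid_counterfactual(classify, counterfactual_text, claimed_label):
    """The claim "Without X, I'd predict Y" is valid only if the edited
    input actually makes the model predict Y."""
    return classify(counterfactual_text) == claimed_label
```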

And here’s the crux: plausibility does not guarantee faithfulness or validity. As this paper and others (e.g. “Comparing zero-shot self-explanations with human rationales in text classification” by Brandl and Eberle, and “Evaluating the Reliability of Self-Explanations in Large Language Models” by us) show, the opposite is the case: LLM-generated self-explanations are generally plausible but not trustworthy.

The Solution: Searching for Valid Counterfactuals

So, how do we get LLMs to produce valid counterfactual self-explanations?

We propose a simple but effective strategy: sample multiple counterfactual candidates and keep the one that checks out. Think of it as a “brute-force” search for truth.

Here’s a simple breakdown of how it works:

1. Determine the next most probable classification outcome of the LLM. This can be done by inspecting the LLM’s internal variables and is described in detail in our paper. In our example, this could be an allergic reaction instead of Salmonella.
2. Prompt for a counterfactual by asking the LLM to provide a version of its previous input, minimally altered so that it leads to the prediction of the alternative outcome determined in the first step.
3. Check if the produced counterfactual is valid (i.e., does it actually change the model’s output when the input is edited accordingly?). If so, congratulations: you’re done.
4. Otherwise, try again, either with the same outcome or with the next most probable one.
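
Here is a minimal sketch of that loop, assuming two hypothetical helpers: `classify_with_probs(text)`, which returns the predicted label plus a label-probability dictionary (step 1), and `generate_counterfactual(text, target_label)`, which wraps the prompting in step 2. Neither helper nor the retry budget comes from the paper’s code; this is just one way the procedure could be wired up.

```python
# Sample-until-valid search for a counterfactual (steps 1-4 above).
# `classify_with_probs` and `generate_counterfactual` are hypothetical
# helpers standing in for your own inference and prompting code.

def find_valid_counterfactual(text, classify_with_probs, generate_counterfactual,
                              max_attempts=5):
    original_label, probs = classify_with_probs(text)

    # Step 1: rank the alternative outcomes by the model's own probabilities.
    alternatives = sorted(
        (label for label in probs if label != original_label),
        key=lambda label: probs[label],
        reverse=True,
    )
    if not alternatives:
        return None  # nothing to flip to

    for attempt in range(max_attempts):
        # Step 4 (on retries): move to the next most probable outcome,
        # or keep retrying the last one once the list is exhausted.
        target = alternatives[min(attempt, len(alternatives) - 1)]

        # Step 2: ask the model for a minimally edited input aimed at `target`.
        candidate = generate_counterfactual(text, target)

        # Step 3: valid only if the edited input actually flips the prediction.
        predicted_label, _ = classify_with_probs(candidate)
        if predicted_label == target:
            return candidate

    return None  # no valid counterfactual found within the attempt budget
```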

Mathematically, this corresponds to sampling with replacement until you hit a valid candidate. Based on the paper’s results, sampling as few as 2–5 candidates is often enough to achieve high validity. Importantly, this method scales well even for more complex tasks, especially with larger models like LLaMA 3.1–70B.
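
As a quick sanity check on the “2–5 candidates” figure: if each attempt independently produces a valid counterfactual with probability p, the number of attempts until the first success is geometric, so you expect about 1/p attempts and a success rate of 1 − (1 − p)^k within k tries. The per-attempt rates below are illustrative, not the paper’s measured numbers.

```python
# Back-of-the-envelope: expected attempts and success rate within 5 tries,
# assuming each attempt is valid with probability p (illustrative values).
for p in (0.5, 0.7, 0.9):
    print(f"p={p:.1f}: expected attempts = {1 / p:.1f}, "
          f"P(valid within 5) = {1 - (1 - p) ** 5:.3f}")
```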

However, model performance varied:

  • All models except Gemma 2.0 and LLaMA 3.2 reached at least 90% valid samples on the first try for simpler tasks.
  • Harder tasks (like subjective classification) saw lower first-try success but retained high textual similarity, indicating models understood the semantics even if they missed the classification nuance.

What the Results Say: Conclusions from the Study

To wrap it all up, the paper provides several important takeaways:

  • Yes, extractive self-explanations are plausible and correlate well with human judgments.
  • But no, they are not usually faithful, especially in subjective tasks that require inference or ambiguity resolution.
  • Yes, valid counterfactuals can be generated automatically, but success depends on model size, task complexity, and clever prompting.

Perhaps most importantly:
Valid counterfactuals act like semantic mirrors: not windows into the black box, but reflections of how well the model interprets what you’re saying.

Final Bite

In an era where AI systems are making decisions that affect our health, safety, and trust, understanding how they explain themselves is not just a technical detail, but a matter of accountability. Self-explanations can be powerful tools for building trust. But only if they are more than just plausible stories. They must be valid or faithful! Because when it comes to AI, a “reasonable-sounding” explanation might just be the most dangerous thing on the menu.

Article originally posted on Medium.
