Who Gets to Study AI (And Why That Matters)
Why the people who explain AI are often the same ones who profit from it
April 6, 2026
In AI, "interpretability" names the effort to describe what happens inside a model when it reaches a decision. The term, however, carries a larger promise as well, which is that the system can, with enough expertise, be made readable. But who gets to make that promise? And what happens when it is made by the very companies that own the models and sell access to them?
On Anthropic's recent post about "emotion concepts" in Claude Sonnet 4.5, a button at the top of the page reads, "Try Claude." This is not especially subtle, nor is it meant to be. There is no attempt to preserve the old fiction that inquiry and promotion occupy distinct moral or institutional spaces. The explanation and the sale now share a surface; they appear together, almost flush, as though this arrangement required no comment, and of course within the industry it does not. It is regarded as normal.
Once you notice that, something larger comes into view. Interpretability is not produced in a laboratory insulated from motive or market pressure. It is produced inside organizations that build and control AI systems and increasingly shape the public language used to describe them. Even newer firms that present themselves not as model builders but as neutral interpreters still operate within the same economy of persuasion.1 They may not sell the model itself so much as a way of seeing it, but that, too, is a product. They are still selling intelligibility—the idea that the black box can be made legible by those with the right access and expertise.
The bias introduced by these conditions does not take the form of fabrication, overt distortion, or some theatrical act of bad faith. It enters with subtlety, through emphasis and tone, and through the way findings are framed for an audience already disposed to want reassurance. The people with the deepest access to these systems are often the same people with the strongest reason to show that they can be understood, governed, and trusted. Under such conditions, interpretability becomes more than a scientific practice. It becomes a form of institutional speech, telling the public not only what the machine is doing, but who is authorized to explain it, who may be trusted to diagnose its risks, and who gets to decide the terms on which it becomes readable.
The term itself is pliant enough to make this drift seem orderly, even inevitable. In policy language, interpretability is usually said to explain why a system produced a given result and what that result means in context. NIST distinguishes it from explainability, which concerns how a result was produced, and from transparency, which concerns what information about the system is available in the first place.2 These distinctions now sit beneath a great deal of the official architecture of AI trust: documentation regimes, audit frameworks, and the assurances offered whenever an institution needs to tell regulators, customers, or the public that the system can be managed, supervised, and, above all, explained. The language sounds administrative, which is part of its usefulness; it makes the matter appear settled before it has been settled.
Yet the idea has never really stabilized. Nearly a decade ago, Zachary Lipton described interpretability as "underspecified," and what he meant was not merely that the term was vague (although it was), but that people were already using it as though its meaning were obvious while disagreeing about what kind of explanation counted, what counted as evidence of understanding, who the explanation was for, and what problem the explanation was supposed to solve.3 Finale Doshi-Velez and Been Kim refined the point by arguing that interpretability is not one thing but several things at once, a cluster of aims that travel together for convenience even when they diverge in practice: debugging, trust calibration, fairness, safety, and scientific understanding.4 These are not interchangeable ambitions. What counts as a good explanation in one setting may fail entirely in another. The term has survived not because it achieved clarity but because its ambiguity proved useful.
That ambiguity has allowed two quite different enterprises to move under the same banner, and much of the present confusion comes from speaking as though they were versions of the same thing when in fact they serve different purposes, address different audiences, and confer different forms of power.
The first is post-hoc explanation: feature attributions, rationales, counterfactuals, and the retrospective narratives generated after the model has already acted. These explanations translate outputs into forms a human being can consume, and they are often described as making the system more "intelligible" because they provide some local account of why an answer appeared, why a classification was made, or why a recommendation was surfaced. Cynthia Rudin has argued, particularly in high-stakes settings, that these after-the-fact accounts can produce the illusion of understanding where no genuine transparency exists, because what the user receives is not the model laid bare but a gloss laid over an unseen process, a plausible-seeming story offered in the place of actual visibility.5 The black box remains as opaque as ever; what changes is the observer's confidence.
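To make the difference concrete, here is a minimal sketch of one common post-hoc technique, gradient-times-input attribution, applied to a toy classifier. Everything in it (the model, the features, the data) is invented for illustration; real attribution methods are more elaborate, but the epistemic situation is the same: the explanation is computed after the prediction, as a gloss on a single output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in "black box": a small MLP over 8 input features.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
model.eval()

x = torch.randn(1, 8, requires_grad=True)  # the input we want "explained"
logits = model(x)
predicted = logits.argmax(dim=-1).item()

# Gradient of the winning logit with respect to the input features,
# then gradient x input: a local, after-the-fact account of which
# features "mattered" for this one prediction.
logits[0, predicted].backward()
attribution = (x.grad * x).detach().squeeze()

for i, score in enumerate(attribution.tolist()):
    print(f"feature {i}: {score:+.3f}")
```

The scores that come out are a local story about one prediction. Nothing in them exposes the mechanism that produced the answer, which is precisely Rudin's worry.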
The second enterprise is mechanistic interpretability, and here the ambition is more exacting and more consequential. The goal is not to narrate behavior after the fact but to identify the internal structures that produce it, to recover something like the system's own operative logic by locating patterns (circuits, features, activations, pathways) that can plausibly be treated as causal. Chris Olah once likened this work to reverse-engineering compiled code, as though the neural network were a program written in an unfamiliar language and the task were to recover, from the inside, the grammar of its operations.6 Whether or not the analogy fully holds, the aspiration is clear enough. One is no longer explaining outputs to users; one is attempting to inspect the machine at the level where intervention becomes imaginable.
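The mechanistic ambition looks different even in miniature. The sketch below is again a toy, with a two-layer stand-in where real work targets transformer internals, and it does not explain an output at all: it reaches into the forward pass, silences a single hidden unit, and measures the causal effect on the logits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
x = torch.randn(1, 8)

with torch.no_grad():
    baseline = model(x)

# Intervene on the interior: zero one hidden unit during the forward
# pass and measure the effect on the output. This is a toy version of
# the ablation logic behind causal claims about circuits.
def ablate_unit(module, inputs, output):
    patched = output.clone()
    patched[:, 3] = 0.0  # unit 3 is an arbitrary choice
    return patched       # returning a tensor replaces the layer's output

handle = model[1].register_forward_hook(ablate_unit)
with torch.no_grad():
    ablated = model(x)
handle.remove()

print("baseline logits:", baseline.squeeze().tolist())
print("ablated logits: ", ablated.squeeze().tolist())
```

Trivial at this scale, but the same gesture performed inside a frontier model is what grounds claims about circuits and features, and it presupposes a depth of access to weights and activations that most outsiders never have.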
These two projects do different kinds of work, and the distinction matters. One produces explanations that can be consumed—accounts that may reassure a user, satisfy a compliance requirement, or create the appearance of accountability. The other produces knowledge that can be wielded, promising not just interpretation but leverage. One translates. The other intervenes. And it is the second, increasingly, that carries prestige, because to claim that one can see inside the model is not simply to make a scientific argument; it is to claim a form of authority reserved for those with access to what others cannot see.
This is where interpretability ceases to be merely technical and becomes unmistakably political, although the industry often prefers to treat this as an unfortunate misunderstanding rather than as a constitutive feature of the field.
When people ask for explanations of AI systems, they are not usually expressing a sudden amateur interest in model internals. They want to know whether some humanly intelligible reason stands behind an output with real consequences. Legal scholars such as Selbst and Barocas have made versions of this point before: the problem is not only that machine-learning systems are opaque, but that they often operate in ways that resist ordinary forms of justification.7 An account of how a decision was made does not tell us whether it should have been made, whether the categories it used were legitimate, or whether the system should have been used in the first place.
Interpretability therefore occupies an uneasy border between description and justification, and much of its public power comes from the ease with which those two categories slide into one another. To translate machine behavior into human terms is already, in some measure, to invite acceptance of that behavior as intelligible. A system that can be explained is not thereby a system that deserves trust, but the presence of an explanation can make it feel that way, blurring the distinction between understanding a system and approving it. This is not always intentional, but it is nonetheless how the rhetoric functions.
Nor is this merely a philosophical concern. It is part of the practical role interpretability plays in the present regime of AI governance. NIST explicitly links interpretability and explainability to governance, because systems that can be explained are assumed to be easier to monitor, easier to audit, and easier to deploy.8 At the same time, a growing body of research raises the less comfortable possibility that explanation systems may be optimized not for truth but for user satisfaction, and that the explanation judged best may be the one that increases confidence rather than the one that most honestly conveys uncertainty, fragility, or harm.9 One gets, in effect, a kind of ethical theater: a performance of legibility that stabilizes trust without materially increasing accountability. The system appears readable, and the institution appears responsible; the observer is invited, if not explicitly then by tone and structure and framing, to relax.
Beneath all this lies a structural fact that is rarely foregrounded because to foreground it would be to admit that interpretability is, in practice, unequally distributed. The most ambitious forms of interpretability require access that outsiders simply do not possess. Mechanistic interpretability, above all, depends on the ability to inspect and manipulate internals: weights, activations, hidden states, and attention patterns. This level of access is not generally public, and in most consequential cases it belongs to the organizations that built the systems. A recent report on structured access for third-party AI research makes clear what follows from this. Limited access does not merely slow research. It changes its shape. Researchers abandon questions that require internal visibility, settling for surface probing where direct inspection would be needed. Their conclusions are constrained from the outset by what cannot be seen. Interpretability becomes shallow for outsiders and deep for insiders, which is another way of saying that authority begins to accumulate wherever visibility is monopolized.
At the same time, the frontier itself is becoming more expensive in ways that intensify this concentration. Training costs for leading models have risen sharply and are widely projected into the billions. Private investment has surged. Compute, infrastructure, talent, and access to large-scale experimental systems are increasingly concentrated within a relatively small number of firms. These firms therefore occupy a peculiar and powerful position, because they are not only building the most consequential systems but also producing the most authoritative accounts of how those systems work. Capability and narrative power converge. The institutions best positioned to interpret AI become the institutions best positioned to shape what the public understands interpretability to mean.
This is not unique to AI, and there is no need to pretend otherwise. Other industries have long shown that funding shapes inquiry not only through its conclusions, but through its framing: through prior decisions about which questions are worth asking, which methods count as legitimate, which results merit emphasis, and which ambiguities are left to linger at the margins. Bias does not always arrive at the end of the process, where one might hope to catch it; more often, it enters much earlier—at the level of what is made to seem worth knowing in the first place.
In AI interpretability the stakes are unusually high because the research itself contributes to legitimacy. Public-facing safety discourse already functions this way. One recent analysis of corporate AI safety narratives argues that such narratives do more than describe risk; they distribute authority, identifying who is competent to diagnose danger, who is responsible for governing it, and who ought to be trusted. Interpretability now operates in this same discursive space. It does not merely illuminate the machine; it helps decide who gets to speak for the machine, whose claims about opacity and legibility will be taken seriously, and whose access to the interior becomes politically consequential.
Anthropic offers a particularly clear example of how these dynamics work, not because it is uniquely compromised but because it is unusually articulate about the problem it says it is trying to solve.
The company frequently frames its interpretability work as a response to a central and unsettling fact, namely that large language models remain, in an important sense, poorly understood even by those who build them. This is certainly a scientific observation. It is also, just as certainly, a positioning statement. It identifies a gap and places the company among the actors best equipped to close it. Its research on "emotion concepts" in Claude Sonnet 4.5 proceeds in precisely this register. The paper identifies internal representations associated with emotions such as fear and desperation and argues that these representations exert causal effects on behavior. This is not merely a claim about the outputs a model produces. It is a claim about the existence of internal states (or at least internal structures that behave enough like states to sustain intervention and explanation). The implication is that these representations can be monitored and perhaps even modulated.
This may be scientifically meaningful, and there is no reason to deny that it is. It is also institutionally useful, because it signals that the company possesses a level of insight unavailable from the surface alone, that it can diagnose the internal dynamics of its own products and perhaps regulate them.
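It is worth being concrete about what "monitored and perhaps even modulated" can mean mechanically. The sketch below is not Anthropic's method; it is a generic illustration, with an invented "concept" direction in a toy network, of the kind of operation such claims describe: project a hidden state onto a direction to read off how strongly a representation is expressed, then shift the state along that direction to change downstream behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
x = torch.randn(1, 8)

# A stand-in "concept" direction in the hidden space. Real work would
# derive such a direction from the model itself (for example, from
# contrasting activations); here it is random, purely to show the mechanics.
concept = torch.randn(16)
concept /= concept.norm()

def monitor_and_steer(module, inputs, output):
    # Monitoring: project the hidden state onto the concept direction.
    strength = (output @ concept).item()
    print(f"concept expression before steering: {strength:+.3f}")
    # Modulation: push the hidden state along the concept direction.
    return output + 2.0 * concept  # the coefficient is arbitrary

handle = model[1].register_forward_hook(monitor_and_steer)
with torch.no_grad():
    steered = model(x)
handle.remove()

print("steered logits:", steered.squeeze().tolist())
```

In published work the direction is derived from the model rather than drawn at random, and the intervention happens in a transformer's residual stream, but the grammar of the claim (locate, measure, modulate) is the same.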
Earlier work on sleeper agents follows a similar pattern. By showing that models can sustain hidden, context-dependent behaviors that survive ordinary safety training, the research demonstrates the limits of external evaluation and thereby raises a serious scientific and governance problem. Yet it also does something else. If the most consequential risks are latent, if they reside beneath observable behavior and cannot be reliably detected from the interface alone, then those who can inspect the interior become indispensable. The risk is identified, but so too is the class of actors qualified to address it.
The subsequent work on probes that detect such behaviors completes the arc in a way that is by now familiar. First the hidden problem. Then the method of diagnosis. Then the tool of control. On one level this is simply how a research program develops. On another level it is a narrative sequence that steadily reinforces the perception that the institution not only understands the danger but is already building the instruments needed to contain it. None of this disproves the science (that is not the point). The point is that once the findings leave the lab and enter the world in which policy, product, trust, and capital converge, they extend knowledge and produce legitimacy.
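Stripped of scale, the probe step is mechanically simple, which is part of why the arc is so persuasive. The sketch below is an invented toy, not the published method: a planted "trigger" stands in for the hidden, context-dependent condition, and a linear classifier trained on hidden activations stands in for the probe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A stand-in encoder; we only care about its hidden representations.
encoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU())

# Inputs: half contain a planted "trigger" (a shift in feature 0),
# standing in for the hidden condition the probe is meant to detect.
n = 512
x = torch.randn(n, 8)
is_triggered = torch.zeros(n)
is_triggered[: n // 2] = 1.0
x[: n // 2, 0] += 3.0

with torch.no_grad():
    hidden = encoder(x)  # the internal activations the probe reads

# The probe itself: logistic regression on hidden activations.
probe = nn.Linear(16, 1)
opt = torch.optim.Adam(probe.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(hidden).squeeze(-1), is_triggered)
    loss.backward()
    opt.step()

with torch.no_grad():  # evaluated on the training set, as a real study would not
    preds = (probe(hidden).squeeze(-1) > 0).float()
    accuracy = (preds == is_triggered).float().mean().item()
print(f"probe accuracy at detecting the trigger: {accuracy:.2%}")
```

Everything here depends on the line that reads the hidden activations. Without that access the probe cannot exist at all, which is the asymmetry at issue.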
That double yield of knowledge and legitimacy is the fact a serious reader of interpretability has to keep in view, not because every result is compromised and not because cynicism is a substitute for analysis, but because the conditions of production matter, and they matter especially in a field where explanation so easily becomes authorization.
The task, then, is not to reject corporate research out of hand, nor to indulge the simpler and less interesting suspicion that every finding is merely marketing by other means. The task is to read interpretability research with its conditions of production intact, to see each paper as both scientific output and institutional artifact, and to notice that almost every major result carries two arguments at once. The first is empirical: here is a mechanism, a feature, or a causal pattern. The second is institutional: therefore this system can be known, therefore its opacity is tractable, therefore the people presenting these findings are the people you should trust when they tell you the model is governable. These arguments are often braided together so tightly that readers cease to distinguish them. They are not, however, the same argument.
A more exacting criticism would begin there. It would ask what a given result actually demonstrates, what remains unresolved, what kind of access made the finding possible, what constraints shaped the experiment, what incentives shaped the presentation, and whether the explanation offered increases accountability or merely increases confidence. It would distinguish between post-hoc explanation and mechanistic analysis, between description and control, and between understanding a system and justifying its place in the world. It would keep returning to the question the field has every reason to avoid stating too plainly, which is: Who gets to read the machine?
Interpretability, we are told, is about making AI legible. But legible to whom, on whose authority, and to what end? The point is not only that the machine be made readable. It is that the claims made on its behalf be examined as well, including their blind spots. One wants to know who is speaking, what they are actually in a position to see, and what is being sold under that name. For once the machine has been declared intelligible, however provisionally, a great many futures begin to seem more reasonable than they otherwise might.