The AI detector delusion: when a percentage score replaces judgement
An AI detector is not a breathalyser for content. It does not prove who wrote the work or how responsibly it was produced. At best, it raises a question. Here is why businesses should stop treating AI scores as verdicts and start asking better questions about judgement, accountability and results.
A marketing director receives a finished article from an external writer.
It is clear and structured. It answers the brief and explains a complicated subject without making the reader work too hard. There is a point to it, a purpose, and a plausible place in the wider marketing plan.
Then someone runs it through an AI detector.
The result comes back:
64% AI-generated.
Panic follows.
Has the writer cheated? Has the agency passed off machine-written copy as paid human work? Should the article be rejected, rewritten or sent back with a frosty note about authenticity?
Possibly.
But there is another, more awkward explanation.
The article may simply be doing what most corporate content has been trained to do. It may be clear, orderly, grammatical, evenly paced and written in a familiar business style.
In other words, it may not be fraudulent at all.
It may just be predictable.
The process matters. But so does judgement.
Let us clear away one misunderstanding first.
The way content is produced does matter.
A business is entitled to care whether a supplier has used AI, protected confidential information, checked sources, respected copyright and done the thinking they were paid to do.
That is not paranoia.
It is sensible management.
The worry is not really about keystrokes. Nobody serious believes a piece of content becomes morally superior because every word was typed slowly by hand. The real concern is whether the expertise, judgement and accountability genuinely exist, or whether they have been quietly outsourced to a system that has none of those things.
That is a legitimate concern.
But it is not answered by a percentage score.
A detector result may raise a question or justify a conversation. It may be one signal among several.
It should not be mistaken for editorial judgement.
AI detectors are not forensic tests
The first mistake is treating an AI detector like a drugs test.
It is not taking a sample, finding a prohibited substance and proving the presence of machine-written copy. It is making a statistical judgement about whether the language resembles patterns associated with AI-generated writing.
That distinction matters.
A detector does not know who wrote the piece or what happened before the draft arrived. It cannot tell whether a writer used AI for research, structure, summarising, editing, phrasing or not at all.
Nor can it see whether a human draft has been heavily edited, whether a corporate style guide has stripped out all sense of joy or originality, or whether three egos on some committee have quietly removed every interesting sentence.
It sees the text.
Then it produces a score.
That score may look satisfyingly objective. It gives managers something to point at. It turns a difficult editorial judgement into a simple compliance decision.
But the simplicity is false.
OpenAI’s own AI classifier was withdrawn in 2023 because of its low accuracy. Even before that, OpenAI warned that the classifier was not fully reliable, should not be used as a primary decision-making tool, could incorrectly label human-written text as AI-written, and struggled with predictable text.
That should give any serious business pause.
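A rough sketch shows why false positives matter so much. Every figure below is an illustrative assumption, not a measurement of OpenAI's classifier or any other real detector, and the function name is made up for this example. It simply counts how many flagged articles would be human-written under those assumptions.

```python
def flagged_breakdown(total_docs, ai_share, catch_rate, false_positive_rate):
    """Split a detector's flags into correct flags and false alarms.

    All inputs are hypothetical:
      total_docs          - number of articles checked
      ai_share            - fraction that really are AI-written
      catch_rate          - fraction of AI-written articles the tool flags
      false_positive_rate - fraction of human-written articles the tool flags
    """
    ai_docs = total_docs * ai_share
    human_docs = total_docs - ai_docs

    correct_flags = ai_docs * catch_rate
    false_flags = human_docs * false_positive_rate

    total_flags = correct_flags + false_flags
    # Share of all flags that land on human-written work.
    wrong_share = false_flags / total_flags if total_flags else 0.0
    return correct_flags, false_flags, wrong_share


# Assumed scenario: 1,000 articles, 10% AI-written,
# a tool that catches 80% of them and wrongly flags 5% of human work.
correct_flags, false_flags, wrong_share = flagged_breakdown(1000, 0.10, 0.80, 0.05)
print(f"Correct flags: {correct_flags:.0f}")
print(f"False flags:   {false_flags:.0f}")
print(f"Share of flags that are wrong: {wrong_share:.0%}")
```

On these assumed figures, 45 of the 125 flagged articles, about 36%, would be human-written. Change the assumptions and the proportion changes, which is the point: the score alone does not tell you which case you are in.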
What the score is really noticing
AI detectors are suspicious of tidiness.
Smooth sentences. Familiar transitions. Arguments that proceed in the expected order. Prose that says the next sensible thing, in the next sensible place.
That may indicate machine-written text.
It may also indicate corporate writing that has been briefed, edited, approved, softened and sanded down until every splinter has gone.
The detector is not necessarily finding a cheat.
It may just be finding the house style.
Much corporate content is not written to astonish anyone. Its job is to reassure, explain and move a reader towards a decision. It avoids unnecessary eccentricity, keeps paragraphs short, uses headings and chooses clarity over verbal performance.
Those are not crimes.
They are often the point.
So when a detector flags a clear, useful, well-structured piece of B2B content, the question is not simply, “Did the writer cheat?”
It is, “Are we asking a probability score to make a judgement that belongs to an editor, marketer or commercial lead?”
The sanitisation trap
There is a neat irony here.
For years, businesses have asked writers to make copy smoother, cleaner and more consistent.
They have imposed brand guidelines and tone-of-voice rules. They have asked for shorter sentences, plainer English, fewer quirks, fewer risks and fewer phrases that might upset legal, compliance, product, sales or the senior partner who does not like contractions.
And they have also encouraged the use of automated tools to tidy grammar, smooth syntax and remove rough edges.
Then, having trained the prose into uniformity, they run it through another automated tool and complain that it looks too uniform.
This is the sanitisation trap.
A human writer is asked to produce clean, frictionless, on-brand copy. The copy is edited and polished until it resembles every other competent piece of business communication in the category. Then an AI detector looks at the result and says, in effect, “This is suspiciously smooth.”
Well, yes.
That is what the business asked for.
What if detection gets better?
There is a fair objection here.
AI detection will improve.
Watermarking systems such as Google’s SynthID are not the same as the crude percentage-score tools that try to infer AI use from style alone. SynthID Text works by embedding an invisible statistical signal into AI-generated text so that it can later be detected. Google says the watermark is robust to some changes, including mild paraphrasing, but also says the method has limits. It is less effective on factual responses, and detector confidence can be greatly reduced when AI-generated text is thoroughly rewritten or translated.
Soon, a business may be able to say with more confidence: this paragraph, image, video, voiceover or draft began in an AI system.
Fine.
Then what?
That fact may matter a great deal. It may matter in regulated work, academic integrity, journalism, evidence, political content, copyright, confidentiality, disclosure or any situation where origin is central.
It may matter in marketing too.
If a supplier promised not to use AI and used it anyway, that is a trust problem. Pasting a confidential brief into the wrong system is a confidentiality problem. Sending generated filler under a person’s name is a professional problem.
But AI involvement is not, by itself, the whole judgement.
A watermark may tell you that AI was involved. It cannot tell you whether that involvement was lazy or intelligent. It cannot tell you whether the writer abdicated responsibility or used a tool under proper editorial control.
That distinction matters.
A junior marketer pasting a vague prompt into a chatbot and publishing the output is not doing the same thing as a senior consultant using AI to test structure, interrogate objections, summarise source material or accelerate an early draft before rewriting, checking and owning the finished piece.
Both may leave some trace of AI involvement.
They are not the same professional act.
The detector has skin in the game
There is another question businesses should ask before turning an AI detector score into policy.
Who benefits when the score is treated as a verdict?
AI detection companies are not disinterested public utilities. They are businesses. Their market grows when institutions become anxious about AI authorship. It grows when universities, publishers, agencies, marketing teams and procurement departments decide they need a tool to protect themselves from synthetic work.
That does not make the tools useless.
It does mean their incentives deserve scrutiny.
Pangram is an interesting example because it does not look like one of the old, flimsy detectors throwing percentages at any pasted text. The company claims 99.9%+ accuracy, third-party verification, detection of AI assistance and a false positive rate of one in 10,000. It also says AI-generated text is not inherently bad, while arguing that society benefits from being able to distinguish between human and machine-authored work.
That is the reasonable version of the argument.
There is a lot of synthetic filler online. Publishers, educators and businesses do need better ways to understand where material has come from. Some detector companies may well build useful tools.
But the public performance around detection is messier.
Pangram’s CEO, Max Spero, has described himself as a “slop janitor”. The Atlantic’s account of him is tellingly double-edged: Pangram may be quite good, but it is not perfect, and detection gets harder as models improve and the internet itself becomes more synthetic.
Detection is also a market. The company selling the mop benefits when everyone agrees there is a spill.
That does not make the tool useless or the score wrong. It means the score should be handled as evidence, not as a verdict – especially when it moves from private concern to public accusation.
The recent controversy around the Commonwealth Short Story Prize shows how quickly that can happen. Jamir Nazir’s winning story, “The Serpent in the Grove”, was publicly suspected of being AI-generated. Pangram reportedly returned a 100% AI-generated result, while the Commonwealth Foundation said shortlisted writers had stated that no AI was used and that, until a reliable process exists for unpublished fiction, the prize has to operate on trust.
That is not a small thing.
To the detector company, this may look like cleaning up the internet. To the writer on the receiving end, it may feel like something else – a reputational allegation shaped by a commercial product with its own interest in making detection matter.
None of that proves Pangram was wrong.
It means “the tool said so” is not enough.
The same scrutiny cuts both ways. I sell judgement, editing and content strategy, so I have skin in this game too. Of course, I would rather businesses valued those things over a detector score.
So do not take my argument on trust either.
Paying experts to game the wrong test
Say a senior content consultant delivers a strong piece of thought leadership.
The brief is understood. So is the commercial angle. The claims have been checked, the argument supports a real buyer need, and the piece is clear enough for a busy director and substantial enough to justify its place on the website.
Then it fails a detector.
Not because anyone has found an error, or because the argument is thin, or because the piece is derivative, inaccurate or useless.
Because the score says 34% AI-generated.
If the business treats that score as a verdict, the response is not improvement. It is evasion.
The consultant now has to make the work look less suspicious to software. Smooth sentences are roughed up. Clean transitions are disturbed. Sensible word choices are swapped for stranger ones. Paragraphs are broken where they do not need breaking.
The work may pass the detector.
It may also be worse.
And the client has paid for that.
Not for sharper thinking, better evidence or a more useful argument. For a lower number on a tool that was never capable of judging the work in the first place.
That is not quality control.
It is money spent on compliance theatre.
The better questions are not complicated.
Is it true and useful? Are the claims properly sourced? Has the brief been protected? Does the writer understand the buyer? Can they explain the decisions behind the piece and take responsibility for the final version?
That is content management.
A detector score is not.
Strong results do not excuse bad practice, but a detector score is no substitute for either responsibility or results.
“Get the AI score below 10%” is not a serious content standard.
Stop mistaking authorship signals for judgement
The panic around AI-written content is understandable.
Nobody wants to pay expert rates for lazy work or generic filler dressed up as insight. Nor does any sensible buyer want to discover that a supplier has ignored the brief, pasted in a prompt and sent over the result with an invoice attached.
But flawed AI detector scores are a poor answer to a real problem.
They give businesses a feeling of control while often pushing attention away from the things that matter. In practice, they can penalise clarity, encourage evasive editing and make people ask how to satisfy the tool rather than how to improve the work.
When better provenance tools arrive, they will not remove the need for judgement. They may show that AI was involved. They will not show whether the work was lazy, careful, expert-led, commercially useful or properly owned.
The companies that sell detection do not sit outside that argument. They have their own market to build. That does not make them wrong. It does mean businesses should be careful about treating their scores as verdicts.
If you are paying a skilled writer or consultant, you should not be paying them to make strong content look less suspicious to software.
You should be paying them to make the content more useful, more accurate, more persuasive and more commercially effective.
Stop mistaking an authorship signal for editorial judgement.