Last winter, a senior nurse in our psychiatric unit told me, “The dashboard says we’re low-risk. But on night shifts, I don’t even feel safe walking to the bathroom.”
The monthly quality report on her desk had said the same thing for nearly a year: “Violence incidents: no significant difference among the three wards (p > .05).”
On paper, her ward looked normal. At the bedside, it was anything but.
Her unit cared for more high-acuity patients, had much higher turnover, and used restraints more frequently. Staff were not the problem. Patients were not the problem. The statistics were.
The error: treating event counts as if they were average scores
The reassuring report rested on a very common statistical error. The analyst used ANOVA, a method designed to compare averages, to compare counts of violent incidents.
In hospitals, there are two very different kinds of numbers:
- Counts: how many times something happened (20 violent incidents, 7 falls, 6 code blues).
- Means: how large something is on average (average documentation hours, average pain scores, average blood pressure).
Counts answer “how many.” Means answer “how much.” They are not interchangeable.
In our hospital, the three wards reported:
- Ward A (psychiatric): 20 violence incidents
- Ward B (medical): 7 incidents
- Ward C (surgical): 6 incidents
To any clinician, the difference is obvious. But ANOVA does not see “20 vs. 7 vs. 6” the way we see it. It converts them into averages per patient. If each ward cared for about 100 patients, the numbers become:
- 0.20 incidents per patient
- 0.07 incidents per patient
- 0.06 incidents per patient
Once converted, the dramatic difference collapses into three small decimals. Because the event counts are low and because ANOVA is not designed for yes-or-no events, it simply concludes that the difference might be random. The official report then states: no significant difference.
It is like using a ruler to decide how many cats you have. The wrong tool makes very different groups look the same. A chi-square test, which is designed for categorical counts, would almost certainly have flagged Ward A as genuinely higher risk.
But the wrong method produced the wrong message: all wards are the same.
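A minimal sketch of the right tool in Python with SciPy, using the illustrative figures above. The census of roughly 100 patients per ward is this article’s simplification, and the table treats each incident as involving a different patient, which real data would need to handle more carefully:

```python
# Compare event counts across wards with a chi-square test on the
# contingency table, rather than ANOVA on per-patient averages.
# Figures are the illustrative ones from the text, not real hospital data.
from scipy.stats import chi2_contingency

wards = ["A (psychiatric)", "B (medical)", "C (surgical)"]
incidents = [20, 7, 6]       # violent incidents per ward
patients = [100, 100, 100]   # assumed census for the illustration

# Rows: wards. Columns: patients with an incident vs. without.
table = [[inc, n - inc] for inc, n in zip(incidents, patients)]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")

# Report the plain counts and rates alongside the p-value.
for ward, inc, n in zip(wards, incidents, patients):
    print(f"Ward {ward}: {inc} incidents / {n} patients "
          f"({100 * inc / n:.0f} per 100 patients)")
```

On these made-up figures the chi-square test returns a p-value of roughly .002: Ward A stands out clearly, which is exactly the signal the original per-patient averaging buried.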
The human consequences of “no significant difference”
Once the report was distributed, the consequences were immediate and painful.
- Requests for additional staff from the psychiatric unit were denied. Leadership believed the ward’s risk was not statistically higher.
- Concerns from frontline nurses were reframed as emotional rather than evidence-based.
- Administrators felt confident in the p-value, believing they were being fair.
Meanwhile, the gap between the data and reality grew wider.
Nurses learned a demoralizing lesson: the numbers on the slide deck do not describe the world they work in. Some left. Those who stayed carried the workload and the emotional weight.
Then the AI system arrived, trained on the same flawed numbers
Three months later, the hospital launched an AI tool to predict agitation and violence. The idea was simple: train the model on past incidents, then flag high-risk patients.
But the AI learned from the same statistical misunderstanding that claimed all three wards carried the same risk. To the algorithm, every ward looked alike.
The psychiatric ward was soon flooded with alerts. Medium-risk patients were labeled high-risk, while genuinely unstable patients were often missed. A junior nurse told me, “When everyone is high-risk, no one is high-risk.”
Alert fatigue set in. A tool designed to increase safety was now undermining trust.
When AI overrules clinical instincts
During one busy evening, our 62-year-old attending physician checked the AI overlay for a newly admitted patient. The display showed a calm green label: low risk of agitation.
The charge nurse disagreed. She had noticed the patient’s pacing, facial tension, and escalating voice. “I have a bad feeling about this,” she said.
Pressed for time and reassured by the AI’s confident label, the attending sided with the model. Ten minutes later, the patient punched a resident in the face.
Afterward, the attending said quietly, “Maybe I’m getting old. Maybe the AI sees things I don’t.”
But the AI was not seeing more. It was repeating the flawed statistics it had been trained on. The harm was not only the physical injury. It was the self-doubt planted in a clinician with decades of experience.
A second problem: stopping at ANOVA and skipping post-hoc tests
Another mistake came from a different kind of analysis.
When the hospital compared average documentation time across three departments, ANOVA was the right choice. The p-value was less than 0.01, indicating a real difference. But the analysis stopped there. No one asked the next question: exactly which departments differ from one another?
Post-hoc tests, such as Tukey’s test, answer that question; a brief sketch follows the list below. They can reveal findings such as:
- Department Z documents significantly more than Departments X and Y.
- Departments X and Y are not significantly different from each other.
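A minimal sketch of that two-step workflow in Python, using SciPy and statsmodels. The department labels and documentation times here are hypothetical, invented only to show the mechanics:

```python
# One-way ANOVA on average documentation time, followed by Tukey's HSD
# to identify exactly which departments differ. All numbers are hypothetical.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
dept_x = rng.normal(90, 15, 40)    # minutes per shift, simulated
dept_y = rng.normal(92, 15, 40)
dept_z = rng.normal(120, 15, 40)   # the department drowning in paperwork

# Step 1: ANOVA answers "is there any difference at all?"
f_stat, p_value = f_oneway(dept_x, dept_y, dept_z)
print(f"ANOVA: F = {f_stat:.1f}, p = {p_value:.4f}")

# Step 2: Tukey's HSD answers "which departments differ from which?"
minutes = np.concatenate([dept_x, dept_y, dept_z])
groups = ["X"] * 40 + ["Y"] * 40 + ["Z"] * 40
print(pairwise_tukeyhsd(minutes, groups, alpha=0.05))
```

With simulated data like this, the Tukey table flags the Z vs. X and Z vs. Y comparisons while leaving X vs. Y unflagged, which is precisely the detail a blanket policy ignores.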
Without that step, leadership responded with a blanket policy: “Everyone must reduce documentation time by 20 minutes.”
The department drowning in paperwork got no targeted help. The other two were forced to cut time they did not have, just to hit a number.
When results like this feed AI models meant to identify “inefficient” units, the algorithm quietly learns the same vague message: everyone is part of the problem.
How these statistical choices affect clinicians
These errors do not stay inside spreadsheets. They show up as:
- False reassurance
- False alarms
- Automation bias
- Erosion of clinical judgment
- Loss of trust in data and AI
- Frontline fatigue
This is how bad statistics hurt good clinicians.
The solution is basic, not high-tech
Protecting clinicians in the age of AI begins long before the algorithm. It begins with the data.
- Use chi-square tests for event counts.
- Use ANOVA for averages.
- Follow ANOVA with post-hoc tests when appropriate.
- Pair p-values with simple counts and percentages.
- Recognize that “not significant” does not always mean “no difference.”
- Teach clinicians just enough statistics to ask, “What exactly are we comparing?”
- Make sure AI systems learn from correctly analyzed data.
This is not about turning clinicians into statisticians. It is about giving them trustworthy numbers.
AI does not erode clinical judgment; bad data does
When our statistics are wrong, our AI will be wrong. When the AI is wrong, clinicians doubt themselves.
AI did not tell the psychiatric nurse her ward was safe. The misused ANOVA did. AI did not weaken the attending’s instincts. A long chain of statistical shortcuts did.
Protecting clinical judgment in the age of AI does not start with the algorithm. It starts with the numbers we feed into it, and with listening to the clinicians who knew something was wrong long before the p-value did.
Gerald Kuo, a doctoral student in the Graduate Institute of Business Administration at Fu Jen Catholic University in Taiwan, focuses on health care administration, long-term care systems, AI governance in clinical and social care settings, and elder care policy. He is affiliated with the Home Health Care Charity Association and maintains a professional presence on Facebook, where he shares updates on research and community work. Kuo helps operate a day-care center for older adults, working closely with families, nurses, and community physicians. His research and practical work focus on reducing administrative strain on clinicians, strengthening continuity and quality of elder care, and developing sustainable service models through data, technology, and cross-disciplinary collaboration. He is particularly interested in how emerging AI tools can support aging clinical workforces, improve care delivery, and build greater trust between health systems and the public.