Two recently published studies have revealed that generative artificial intelligence (AI) tools, including large language models (LLMs) ChatGPT and Gemini, produce misinformation and bias when used for medical information and healthcare decision-making.
In the United States, researchers from the Icahn School of Medicine at Mount Sinai published a study on August 2 showing that LLMs were highly vulnerable to repeating and elaborating on “false facts” and medical misinformation.
Meanwhile, across the Atlantic, the London School of Economics and Political Science (LSE) published a study shortly afterward that found AI tools used by more than half of England’s councils are downplaying women’s physical and mental health issues, creating a risk of gender bias in care decisions.
Medical AI
LLMs, such as OpenAI’s ChatGPT, are AI-based computer programs that generate text using large datasets of information on which they are trained.
The power and performance of such technology have increased exponentially over the past few years, with billions of dollars being spent on research and development in the area. LLMs and AI tools are now being deployed across almost every industry, to different extents, not least in the medical and healthcare sector.
In the medical space, AI is already being used for various functions, such as reducing the administrative burden by automatically generating and summarizing case notes, assisting in diagnostics, and enhancing patient education.
However, LLMs are prone to the “garbage in, garbage out” problem: they depend on accurate, factual training data, and when that data contains errors or bias, they can reproduce it in their outputs. This contributes to what are often known as “hallucinations,” meaning generated content that is irrelevant, made up, or inconsistent with the input data.
In a medical context, these hallucinations can include fabricated information and case details, invented research citations, or made-up disease details.
US study shows chatbots spreading false medical information
Earlier this month, researchers from the Icahn School of Medicine at Mount Sinai published a paper titled “Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support.”
The study aimed to test a subset of AI hallucinations that arise from “adversarial attacks,” in which made-up details embedded in prompts lead the model to reproduce or elaborate on the false information.
“Hallucinations pose risks, potentially misleading clinicians, misinforming patients, and harming public health,” said the paper. “One source of these errors arises from deliberate or inadvertent fabrications embedded in user prompts—an issue compounded by many LLMs’ tendency to be overly confirmatory, sometimes prioritizing a persuasive or confident style over factual accuracy.”
To explore this issue, the researchers tested six LLMs: DeepSeek Distilled, GPT-4o, llama-3.3-70B, Phi-4, Qwen-2.5-72B, and gemma-2-27b-it. Each model was given 300 pieces of text resembling clinician-written case notes, each containing a single fabricated laboratory test, physical or radiological sign, or medical condition. The models were tested under “default” (standard settings) as well as with “mitigating prompts” designed to reduce hallucinations, generating 5,400 outputs in total. If a model elaborated on the fabricated detail, the case was classified as a “hallucination.”
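To give a concrete sense of how such a test can be structured, the sketch below mimics the basic evaluation loop. It is not the Mount Sinai team’s code: query_model, is_hallucination, MITIGATING_PROMPT, and the example case are hypothetical placeholders for illustration only.

```python
# Hypothetical sketch of an adversarial-hallucination check, loosely modeled on the
# study's setup. All names and data here are illustrative stand-ins, not the authors' code.

MITIGATING_PROMPT = (
    "The note may contain inaccurate information. Only comment on details you can "
    "verify from the note itself, and flag anything that looks fabricated."
)

def query_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a vendor SDK or a local runtime)."""
    return "Summary: ..."  # replace with the model's actual response

def is_hallucination(output: str, fabricated_detail: str) -> bool:
    """Naive proxy check: did the model repeat the planted detail without flagging it?
    The actual study's classification of 'elaboration' is far more nuanced."""
    text = output.lower()
    return fabricated_detail.lower() in text and "fabricat" not in text

def evaluate(model_name: str, cases: list[dict], mitigate: bool) -> float:
    """Return the fraction of cases where the model treats the planted detail as real."""
    hits = 0
    for case in cases:
        prompt = (MITIGATING_PROMPT + "\n\n" if mitigate else "") + case["note"]
        output = query_model(model_name, prompt)
        if is_hallucination(output, case["fabricated_detail"]):
            hits += 1
    return hits / len(cases)

# Example case: one invented laboratory test planted in an otherwise plausible note.
cases = [{
    "note": "72-year-old male with chest pain. Serum pseudocholesterin panel elevated.",
    "fabricated_detail": "serum pseudocholesterin panel",
}]

for mitigate in (False, True):
    rate = evaluate("example-model", cases, mitigate)
    print(f"mitigating prompt={mitigate}: hallucination rate={rate:.0%}")
```

The substring check stands in for the far more careful human and model-based review a real study would require; the point is only to show the shape of the comparison between default and mitigated prompting.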
The results showed that hallucination rates ranged from 50% to 82% across models and prompting methods. Mitigating prompts lowered the average hallucination rate, but only from 66% under default prompting to 44%.
“We find that the LLM models repeat or elaborate on the planted error in up to 83% of cases,” reported the researchers. “Adopting strategies to prevent the impact of inappropriate instructions can halve the rate but does not eliminate the risk of errors remaining.”
They added that “our results highlight that caution should be taken when using LLM to interpret clinical notes.”
According to the paper, the best-performing model was GPT-4o, whose hallucination rates declined from 53% to 23% when mitigating prompts were used.
However, with even the best-performing model producing potentially harmful hallucinations in almost a quarter of cases—even with mitigating prompts—the researchers concluded that AI models cannot yet be trusted to provide accurate and trustworthy medical data.
“LLMs are highly susceptible to adversarial hallucination attacks, frequently generating false clinical details that pose risks when used without safeguards,” said the paper. “While prompt engineering reduces errors, it does not eliminate them… Adversarial hallucination is a serious threat for real‑world use, warranting careful safeguards.”
The Mount Sinai study isn’t the only recent paper in the U.S. medical space to bring the use of AI into question. In another damaging example, on August 5, the Annals of Internal Medicine journal reported the case of a 60-year-old man who developed bromism, also known as bromide toxicity, after consulting ChatGPT on how to remove salt from his diet. Following the LLM’s advice, the man swapped sodium chloride (table salt) for sodium bromide, a compound used as a sedative in the early 20th century, resulting in the rare condition.
But it’s not just stateside that AI advice is taking a PR hit.
UK study finds gender bias in LLMs
While U.S. researchers were finding less-than-comforting results when testing whether LLMs reproduce false medical information, across the pond a United Kingdom study was turning up equally troubling results related to AI bias.
On August 11, a research team from LSE, led by Dr Sam Rickman, published their paper, “Evaluating gender bias in large language models in long-term care,” which assessed gender bias in summaries of long-term care records generated with two open-source LLMs, Meta’s (NASDAQ: META) Llama 3 and Google’s (NASDAQ: GOOGL) Gemma.
To test this, the researchers created gender-swapped versions of long-term care records for 617 older people from a London local authority and asked each LLM to generate summaries of both the male and female versions of the records.
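As a rough illustration of this kind of audit, the sketch below shows one way gender-swapped records and the resulting summaries could be compared. It is not the LSE team’s code: swap_gender, summarize, and the term list are simplified, hypothetical stand-ins.

```python
# Illustrative sketch of a gender-swap bias check; not the published study's code.
import re
from collections import Counter

# Crude pronoun/title swap table; a real study would handle names, grammar, and ambiguity.
SWAPS = {"he": "she", "she": "he", "him": "her", "his": "her", "her": "his",
         "mr": "mrs", "mrs": "mr", "male": "female", "female": "male"}

def swap_gender(text: str) -> str:
    """Token-level gender swap, preserving capitalization of the first letter."""
    def repl(match):
        word = match.group(0)
        swapped = SWAPS.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b\w+\b", repl, text)

def summarize(record: str) -> str:
    """Placeholder for an LLM summarization call (e.g., Llama 3 or Gemma)."""
    return record  # replace with the model's generated summary

# Terms the article highlights as appearing more often in descriptions of men.
SEVERITY_TERMS = ["disabled", "unable", "complex"]

def term_counts(summaries: list[str]) -> Counter:
    """Count how often each severity-related term appears across a set of summaries."""
    counts = Counter()
    for summary in summaries:
        for term in SEVERITY_TERMS:
            counts[term] += summary.lower().count(term)
    return counts

records = ["Mr Smith is unable to wash himself and his needs are complex."]
male_summaries = [summarize(r) for r in records]
female_summaries = [summarize(swap_gender(r)) for r in records]

print("male:  ", term_counts(male_summaries))
print("female:", term_counts(female_summaries))
```

Counting a handful of terms is only a crude proxy for the multiple metrics the researchers compared, but it captures the core idea: identical needs, differently gendered text, and measurably different language in the output.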
While Llama 3 showed no gender-based differences across any metrics, Gemma displayed significant differences.
Specifically, summaries of male records focused more on physical and mental health issues. Language used for men was also more direct, while women’s needs were “downplayed” more often than men’s. For example, when Google’s Gemma was used to summarize the same case notes for men and for women, language such as “disabled,” “unable,” and “complex” appeared significantly more often in descriptions of men than of women.
In other words, the study found that similar care needs in women were more likely to be omitted or described in less severe terms by specific AI tools, and that this downplaying of women’s physical and mental health issues risked creating gender bias in care decisions.
“Care services are allocated on the basis of need. If women’s health issues are underemphasized, this may lead to gender-based disparities in service receipt,” said the paper. “LLMs may offer substantial benefits in easing administrative burden. However, the findings highlight the variation in state-of-the-art LLMs, and the need for evaluation of bias.”
Despite the concerns raised by the study, the researchers also highlighted the benefits AI can provide to the healthcare sector.
“By automatically generating or summarizing records, LLMs have the potential to reduce costs without cutting services, improve access to relevant information, and free up time spent on documentation,” said the paper.
It went on to note that “there is political will to expand such technologies in health and care.”
Despite flaws, UK’s all-in on AI
British Prime Minister Keir Starmer recently pledged £2 billion ($2.7 billion) to expand Britain’s AI infrastructure, with the funding targeting data center development and digital skills training. This included committing £1 billion ($1.3 billion) of funding to scale up the U.K.’s compute power by a factor of 20.
“We’re going to bring about great change in so many aspects of our lives,” said Starmer, speaking to London Tech Week on June 9. He went on to highlight health as an area “where I’ve seen for myself the incredible contribution that tech and AI can make.”
“I was in a hospital up in the Midlands, talking to consultants who deal with strokes. They showed me the equipment and techniques that they are using – using AI to isolate where the clot is in the brain in a micro-second of the time it would have taken otherwise. Brilliantly saving people’s lives,” said the Prime Minister. “Shortly after that, I had an incident where I was being shown AI and stethoscopes working together to predict any problems someone might have. So whether it’s health or other sectors, it’s hugely transformative what can be done here.”
It’s unclear how, or if, the LSE study and its equally AI-critical U.S. counterparts may affect such commitments from the government, but for now, the U.K. at least seems set on pursuing the advantages that AI tools such as LLMs can provide across the public and private sectors.
In order for artificial intelligence (AI) to work right within the law and thrive in the face of growing challenges, it needs to integrate an enterprise blockchain system that ensures data input quality and ownership, allowing it to keep data safe while also guaranteeing the immutability of data. Check out CoinGeek’s coverage on this emerging tech to learn more about why enterprise blockchain will be the backbone of AI.
Watch: Demonstrating the potential of blockchain’s fusion with AI