According to a study by MIT researchers, a large language model (LLM) used to make treatment recommendations can be tripped up by nonclinical information in patient messages, such as typos, extra white space, missing gender markers, or the use of uncertain, dramatic, and informal language.
They found that making stylistic or grammatical changes to messages increased the likelihood that an LLM would recommend a patient self-manage their reported health condition rather than come in for an appointment, even when that patient should seek medical care.
Their analysis also revealed that these nonclinical variations in text, which mimic how people really communicate, are more likely to change a model's treatment recommendations for female patients, resulting in a higher percentage of women who were wrongly advised not to seek medical care, according to human physicians.
The work "is strong evidence that models must be audited before they are used in health care, a setting in which they are already being used," said senior author Marzyeh Ghassemi.
These findings suggest that LLMs take nonclinical information into account in clinical decision-making in previously unknown ways, which highlights the need for more rigorous evaluation of LLMs before they are deployed for high-stakes applications such as making treatment recommendations, the researchers say.
"These models are often trained and tested on medical exam questions and then used in tasks that are quite far from that, such as assessing the severity of a clinical case. There is still so much about LLMs that we don't know," added Abinitha Gourabathina, an EECS graduate student and lead author of the study.
They are joined on the paper, which will be presented at the ACM Conference on Fairness, Accountability, and Transparency, by graduate student Eileen Pan and postdoc Walter Gerych.
Mixed messages
Large language models such as OpenAI's GPT-4 are being used to draft clinical notes and triage patient messages in health care facilities around the globe, in an effort to streamline certain tasks and help overburdened clinicians.
A growing body of work has explored the clinical reasoning capabilities of LLMs, especially from a fairness standpoint, but few studies have evaluated how nonclinical information affects a model's judgment.
Gourabathina, interested in how gender affects LLM reasoning, ran experiments in which she swapped the gender cues in patient notes. She was surprised to find that formatting errors in the prompts, such as extra white space, caused meaningful changes in the LLM responses.
To explore this problem, the researchers designed a study in which they altered the model's input data by swapping or removing gender markers, adding colorful or uncertain language, or inserting extra spaces and typos into patient messages.
Each perturbation was designed to mimic text that might be written by someone in a vulnerable patient population, based on psychosocial research into how people communicate with clinicians.
For example, extra spaces and typos simulate the writing of patients with limited English proficiency or less technological aptitude, while uncertain language represents patients with health anxiety.
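As a rough illustration of the kinds of perturbations described above, the sketch below applies similar edits to a patient note. The helper functions and their exact behavior are assumptions for illustration only; the study's actual perturbation code is not described in this article.

```python
import random

# Illustrative only: hypothetical helpers mimicking the perturbation types named in the
# article (gender-marker removal, extra white space, typos, uncertain language).

GENDER_MARKERS = {"he": "they", "she": "they", "him": "them", "her": "them",
                  "his": "their", "hers": "theirs"}

def swap_gender_markers(text: str) -> str:
    """Replace gendered pronouns with gender-neutral ones (rough: ignores punctuation)."""
    return " ".join(GENDER_MARKERS.get(w.lower(), w) for w in text.split(" "))

def add_extra_whitespace(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Insert extra spaces after some words, simulating lower technological aptitude."""
    rng = random.Random(seed)
    return " ".join(w + ("  " if rng.random() < rate else "") for w in text.split(" "))

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent characters in some words to simulate typos."""
    rng = random.Random(seed)
    words = []
    for w in text.split(" "):
        if len(w) > 3 and rng.random() < rate:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        words.append(w)
    return " ".join(words)

def add_uncertain_language(text: str) -> str:
    """Prepend hedging phrasing, mimicking a patient with health anxiety."""
    return "I'm not totally sure, but " + text[0].lower() + text[1:]

note = "She reports chest tightness since yesterday and asks whether she should come in."
print(add_uncertain_language(add_typos(swap_gender_markers(note), rate=0.2)))
```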
"The medical datasets these models are trained on are usually cleaned and structured, and not a very realistic reflection of the patient population. We wanted to see how these very realistic changes in text could affect downstream use cases," Gourabathina said.
They used an LLM to create perturbed copies of thousands of patient notes, while ensuring the text changes were minimal and preserved all clinical data, such as medications and previous diagnoses. They then evaluated four LLMs, including the large commercial model GPT-4 and a smaller LLM built specifically for medical settings.
They prompted each LLM with three questions based on the patient note: whether the patient should manage the condition at home, whether the patient should come in for a clinic visit, and whether a medical resource, such as a lab test, should be allocated to the patient.
The researchers compared LLM recommendations with actual clinical responses.
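A minimal sketch of how such an evaluation loop might look is shown below. The query_llm function, the exact question wording, and the yes/no answer format are assumptions for illustration, not the study's published setup.

```python
# Hypothetical evaluation loop: `query_llm` is a stand-in for whatever model API is used.
TRIAGE_QUESTIONS = [
    "Should this patient manage their condition at home?",
    "Should this patient come in for a clinic visit?",
    "Should a medical resource, such as a lab test, be allocated to this patient?",
]

def evaluate(notes, reference_answers, query_llm):
    """Compare the model's yes/no answers against real clinical responses per question."""
    agreement = {q: 0 for q in TRIAGE_QUESTIONS}
    for note, reference in zip(notes, reference_answers):
        for question in TRIAGE_QUESTIONS:
            prompt = (f"Patient note:\n{note}\n\n"
                      f"Question: {question}\nAnswer yes or no.")
            model_says_yes = query_llm(prompt).strip().lower().startswith("yes")
            if model_says_yes == reference[question]:
                agreement[question] += 1
    return {q: count / len(notes) for q, count in agreement.items()}
```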
Inconsistent suggestions
They saw inconsistencies in treatment recommendations and significant disagreement among the LLMs when they were fed perturbed data. Across the board, the LLMs' self-management suggestions increased by 7 to 9 percent for all nine types of altered patient messages.
This means the LLMs were more likely to recommend that patients not seek medical care when messages contained typos or gender-neutral pronouns, for instance. The use of colorful language, such as slang or dramatic expressions, had the biggest impact.
They also found that the models made about 7 percent more errors for female patients and were more likely to recommend that female patients self-manage at home, even when the researchers removed all gender cues from the clinical context.
Many of the worst outcomes, such as patients being told to self-manage when they have a serious medical condition, would likely not be captured by tests that focus on a model's overall clinical accuracy.
"In research, we tend to look at aggregated statistics, but there is a lot that gets lost in translation. We need to look at the direction in which these errors occur: not recommending a visit when you should is much more harmful than doing the opposite," Gourabathina said.
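As a hedged illustration of that point, using invented counts rather than figures from the study, the snippet below shows how two models can have identical overall accuracy while one makes twice as many errors in the harmful direction.

```python
# Invented example counts, not data from the study.
# Each pair is (model recommendation, clinically correct recommendation).
model_a = [("self-manage", "visit")] * 5 + [("visit", "self-manage")] * 5 + [("visit", "visit")] * 90
model_b = [("self-manage", "visit")] * 10 + [("visit", "visit")] * 90

def accuracy(pairs):
    """Overall fraction of recommendations that match the clinically correct answer."""
    return sum(pred == truth for pred, truth in pairs) / len(pairs)

def harmful_error_rate(pairs):
    """Fraction of patients who needed a visit but were told to self-manage."""
    needed_visit = [pair for pair in pairs if pair[1] == "visit"]
    return sum(pred == "self-manage" for pred, _ in needed_visit) / len(needed_visit)

print(accuracy(model_a), harmful_error_rate(model_a))  # 0.90, ~0.053
print(accuracy(model_b), harmful_error_rate(model_b))  # 0.90, 0.10
```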
The inconsistencies caused by nonclinical language become even more pronounced in conversational settings where an LLM interacts with a patient, which is a common use case for patient-facing chatbots.
But in follow-up work, the researchers found that these same changes in patient messages did not affect the accuracy of human clinicians.
"In follow-up work that is under review, we further found that large language models are fragile to changes that human clinicians are not," Ghassemi said. "This is perhaps unsurprising, since LLMs were not designed to prioritize patient care. LLMs are, on average, flexible and performant enough that we might think this is a good use case. But we don't want to optimize a health care system that only works well for patients in specific groups."
The researchers hope to extend this work by designing natural language perturbations that capture other vulnerable populations and better mimic real messages. They also want to explore how LLMs infer gender from clinical text.