Due to the inherent ambiguity in medical images such as X-rays, radiologists often use words like “may” or “likely” when describing the presence of a certain pathology, such as pneumonia.
But do the words radiologists use to express their confidence accurately reflect how often a particular pathology actually occurs in patients? A new study shows that when radiologists express high confidence about a certain pathology using a phrase like “very likely,” they tend to be overconfident, and vice versa when they express less confidence using a word like “possibly.”
Using clinical data, a team of MIT researchers, working with researchers and clinicians at hospitals affiliated with Harvard Medical School, created a framework to quantify how reliable radiologists are when they express certainty using natural language terms.
They used this approach to provide clear suggestions that help radiologists choose certainty phrases that would improve the reliability of their clinical reporting. They also showed that the same technique can effectively measure and improve the calibration of large language models by better aligning the words a model uses to express confidence with the accuracy of its predictions.
By helping radiologists more accurately describe the likelihood of certain pathologies in medical images, this new framework could improve the reliability of critical clinical information.
“The words radiologists use are important. They affect how doctors intervene, in terms of their decision making for the patient. If these practitioners can be more reliable in their reporting, patients will be the ultimate beneficiaries,” says Peiqi Wang, an MIT graduate student and lead author of a paper on this research.
Wang is joined on the paper by senior author Polina Golland, the Sunlin and Priscilla Chou Professor of Electrical Engineering and Computer Science (EECS), a principal investigator in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and the leader of the Medical Vision Group; as well as Barbara D. Lam, a clinical fellow at Beth Israel Deaconess Medical Center; Yingcheng Liu, an MIT graduate student; Ameneh Asgari-Targhi, a research fellow at Mass General Brigham (MGB); Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab; William M. Wells, a professor of radiology at MGB and a research scientist at CSAIL; and Tina Kapur, an assistant professor of radiology at MGB. The research will be presented at the International Conference on Learning Representations.
Decoding uncertainty in words
A radiologist writing a report about a chest X-ray might say the image shows a “possible” pneumonia, an infection that inflames the air sacs in the lungs. In that case, a doctor could order a follow-up scan to confirm the diagnosis.
However, if the radiologist writes that the X-ray shows a “likely” pneumonia, the doctor might begin treatment immediately, such as by prescribing antibiotics, while still ordering additional tests to assess severity.
Trying to measure the calibration, or reliability, of ambiguous natural language terms like “possibly” and “likely” presents many challenges, Wang says.
Existing calibration methods typically rely on the confidence score provided by an AI model, which represents the model’s estimate of the likelihood that its prediction is correct.
For example, a weather app might predict an 83 percent chance of rain tomorrow. That model is well-calibrated if, across all instances in which it predicts an 83 percent chance of rain, it rains approximately 83 percent of the time.
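In code, that classical check looks roughly like the following minimal sketch, which bins predictions by confidence and compares each bin’s average confidence to the observed frequency of the event; the data and function name here are made up for illustration.

```python
# Minimal sketch of classical calibration measurement: bin predictions by
# confidence and compare each bin's average confidence to the observed
# frequency. All names and data are hypothetical.
import numpy as np

def calibration_by_bin(confidences, outcomes, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            predicted = confidences[mask].mean()  # e.g., ~0.83 for "83% rain"
            observed = outcomes[mask].mean()      # fraction of days it rained
            print(f"[{lo:.1f}, {hi:.1f}): predicted {predicted:.2f}, observed {observed:.2f}")

# Toy usage: a perfectly calibrated forecaster, so the two columns should agree.
rng = np.random.default_rng(0)
conf = rng.uniform(size=10_000)
rained = (rng.uniform(size=10_000) < conf).astype(float)
calibration_by_bin(conf, rained)
```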
“But humans use natural language, and if we map these phrases to a single number, it is not an accurate description of the real world,” Wang says. “If a person says an event is ‘likely,’ they aren’t necessarily thinking of an exact probability, such as 75 percent.”
Rather than trying to map certainty phrases to a single percentage, the researchers’ approach treats them as probability distributions. A distribution describes the range of possible values and their likelihoods; think of the classic bell curve in statistics.
“This captures more nuances of what each word means,” Wang adds.
Evaluating and improving calibration
The researchers leveraged prior work that surveyed radiologists to obtain probability distributions corresponding to each diagnostic certainty phrase, ranging from “very likely” to “consistent with.”
For instance, because more radiologists believe the phrase “consistent with” means a pathology is present in a medical image, its probability distribution climbs sharply to a high peak, with most values clustered around the 90 to 100 percent range.
In contrast, the phrase “may represent” conveys greater uncertainty, leading to a broader, bell-shaped distribution centered around 50 percent.
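As a rough illustration of the idea, each certainty phrase could be modeled as a Beta distribution over the probability that a pathology is present. The parameters below are illustrative guesses, not the survey-derived distributions from the study.

```python
# Sketch: certainty phrases as probability distributions (here, Beta
# distributions). Parameters are illustrative, not the study's fitted values.
from scipy import stats

phrase_dists = {
    "consistent with": stats.beta(18, 2),  # sharp peak, mass near 0.9-1.0
    "likely":          stats.beta(8, 3),
    "may represent":   stats.beta(5, 5),   # broad bell centered near 0.5
}

for phrase, dist in phrase_dists.items():
    lo, hi = dist.ppf([0.1, 0.9])  # central 80 percent of the implied probability
    print(f"{phrase!r}: mean={dist.mean():.2f}, central range=({lo:.2f}, {hi:.2f})")
```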
Typical methods evaluate calibration by comparing how well a model’s predicted probability scores align with the actual frequency of positive results.
The researchers’ approach follows the same general framework but extends it to account for the fact that certainty phrases represent probability distributions rather than single probabilities.
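A toy version of that extension, reusing the illustrative Beta distributions from the sketch above, might check whether each phrase’s observed rate of confirmed findings falls within the bulk of the phrase’s implied distribution; the observed rates below are invented.

```python
# Sketch: distribution-aware calibration check. For each phrase, compare the
# observed rate of confirmed findings against the central mass of the phrase's
# implied distribution. Distributions and rates are hypothetical.
from scipy import stats

phrase_dists = {
    "consistent with": stats.beta(18, 2),
    "likely":          stats.beta(8, 3),
    "may represent":   stats.beta(5, 5),
}
observed = {"consistent with": 0.97, "likely": 0.62, "may represent": 0.55}

for phrase, rate in observed.items():
    lo, hi = phrase_dists[phrase].ppf([0.1, 0.9])
    verdict = "within implied range" if lo <= rate <= hi else "miscalibrated"
    print(f"{phrase!r}: observed {rate:.2f} vs ({lo:.2f}, {hi:.2f}) -> {verdict}")
```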
To improve calibration, the researchers formulated and solved an optimization problem that adjusts how often certain phrases are used, to better align confidence with reality.
From this, they derived a calibration map that suggests the certainty terms a radiologist should use to make reports about a specific pathology more accurate.
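A greatly simplified stand-in for such a calibration map, again under the toy distributions above, would suggest, for each phrase in current use, the vocabulary item whose implied distribution places the most density at the observed rate. The study’s actual optimization is more involved than this greedy sketch.

```python
# Sketch of a calibration map: for each phrase in current use, suggest the
# replacement whose implied distribution best matches the observed rate of
# confirmed findings. A greedy stand-in for the paper's optimization.
from scipy import stats

phrase_dists = {
    "consistent with": stats.beta(18, 2),
    "likely":          stats.beta(8, 3),
    "may represent":   stats.beta(5, 5),
}
observed = {"likely": 0.90, "may represent": 0.40}  # hypothetical rates

for phrase, rate in observed.items():
    best = max(phrase_dists, key=lambda p: phrase_dists[p].pdf(rate))
    print(f"{phrase!r} is right {rate:.0%} of the time; consider {best!r} instead")
```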
“Perhaps, for this dataset, every time the radiologists said pneumonia was present, if they changed the phrase to ‘may be present,’ they would become better calibrated,” Wang says.
When the researchers used their framework to evaluate clinical reports, they found that radiologists were generally underconfident when diagnosing common conditions such as atelectasis, but overconfident with more ambiguous conditions such as infection.
In addition, the researchers used their method to evaluate the reliability of language models, providing a more nuanced representation of confidence than classical methods that rely on confidence scores.
“A lot of times, these models use phrases like ‘certainly.’ But because they are so confident in their answers, it does not encourage people to verify the correctness of the statements themselves,” Wang adds.
In the future, the researchers plan to continue collaborating with clinicians in the hopes of improving diagnoses and treatment. They are working to expand their study to include data from abdominal CT scans.
They are also interested in studying how receptive radiologists are to calibration-improving suggestions, and whether they can effectively adjust their use of certainty phrases.
“The expression of diagnostic certainty is a crucial aspect of the radiology report, as it influences significant management decisions. This study takes a novel approach to analyzing and calibrating how radiologists express diagnostic certainty in chest X-ray reports, offering feedback on term usage and associated outcomes,” says Atul B. Shinagare, associate professor of radiology at Harvard Medical School, who was not involved with this work. “This approach has the potential to improve radiologists’ accuracy and communication, which will help improve patient care.”
This work was funded, in part, by a Takeda Fellowship, the MIT-IBM Watson AI Lab, the MIT CSAIL Wistron Program, and the MIT Jameel Clinic.