Alex: Welcome to another episode of ResearchPod. Sam, what study are we diving into today?
Sam: This paper, by Sanjay Basu, looks at how AI tools designed for medical advice, like ChatGPT adapted for doctors, handle patients' personal preferences in treatment choices. The key puzzle is that these AIs claim to consider what patients value most, but they don't always change their advice much when patients speak up. It's called a value sensitivity gap.
Alex: So this study is basically asking whether these medical AIs have built-in leanings toward certain kinds of treatments, even before hearing from the patient?
Sam: Yes, exactly. Doctors and patients are supposed to make decisions together, weighing medical facts against what matters to the person, like wanting a longer life or fewer exhausting side effects. But the paper shows these AIs carry hidden starting points: GPT, for instance, tends toward bolder, more aggressive care at about 3.5 out of 5, while Claude leans conservative at 2 out of 5.
Alex: Right, but why does that matter in practice? Like, for an everyday patient?
Sam: Take a Medicaid patient managing heart issues while caregiving for family; they might prefer milder treatments to avoid extra burden. If the AI defaults to pushing specialists and intense care, it could lead to unneeded referrals that clash with their life. The study tests this using short, real-world scenarios from anonymized patient notes, adding patient statements like "I prioritize quality time over risky procedures."
Alex: Huh. So the AI says it's listening, but doesn't steer much?
Sam: Precisely. They all acknowledge the values, reporting 100% consideration, but the actual shifts in advice are modest, from about 13% to 27% of the full possible range. This reveals uneven responsiveness across models and health areas like cancer or heart care. And it gives data for labels disclosing these built-in biases, like nutrition facts on food.
Alex: Okay, so they claim full consideration but shifts are small, like 13 to 27% of what they could be. How exactly did they size up those changes?
Sam: They first noted where each AI started without patient input; that's its default lean toward careful or bold care. Then, for each patient preference added, they measured how far the advice moved on a 1-to-5 scale of boldness, compared to the biggest possible move of 4 points. That gives a score from 0 for no change to 1 for maximum adjustment; researchers label it the Value Sensitivity Index, or VSI. DeepSeek showed the biggest average response at 0.27, while Gemini was lowest at 0.13.
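[Show notes: a minimal sketch of the Value Sensitivity Index as Sam describes it. The function name compute_vsi and the example numbers are illustrative, not taken from the paper; the only assumptions carried over are the 1-to-5 boldness scale and the maximum possible move of 4 points.]

```python
def compute_vsi(default_rating: float, adjusted_rating: float,
                scale_min: float = 1.0, scale_max: float = 5.0) -> float:
    """Size of the advice shift after a patient states a preference,
    normalized by the largest possible move on the rating scale
    (4 points on a 1-to-5 scale). 0 = no change, 1 = maximum adjustment."""
    max_shift = scale_max - scale_min              # 4 points on a 1-to-5 scale
    shift = abs(adjusted_rating - default_rating)  # how far the advice actually moved
    return shift / max_shift

# Illustrative only: a model defaulting to 3.5 that moves to 2.5 after
# "I prioritize quality time over risky procedures" scores 1.0 / 4 = 0.25,
# in the same ballpark as the 0.13-to-0.27 averages reported in the episode.
print(compute_vsi(default_rating=3.5, adjusted_rating=2.5))  # 0.25
```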
Alex: Right, so some preferences made bigger moves, like ones about risk or life quality. But even then, it's not shifting the full amount. Why the gap between saying they listened and actually changing course?
Sam: All models reported considering values in every test, a 100% acknowledgment rate. Yet the actual shifts averaged just 0.5 to 1.1 points, which on a 4-point maximum works out to roughly that 0.13 to 0.27 VSI range. This split shows the AI's words don't always match its output: it nods to preferences in reasoning but holds close to its starting point, like a friend who says they'll meet you halfway but barely budges. Prompts forcing it to list values first even slightly weakened shifts in one test.
Alex: Huh, and these defaults weren't the same across health areas, like heart versus cancer cases?
Sam: Correct. The study found differences within models too: GPT's default was bolder at 4 out of 5 for heart care but 3 for cancer, suggesting built-in patterns vary by topic. Labels for these defaults would need separate scores per area, not one overall number.
Alex: Makes sense for fairness, especially for patients wanting milder options. Did they try fixes at the prompt level to boost those shifts?
Sam: Yes, in a follow-up, they tested six prompt tweaks on GPT, like ones making the AI weigh treatment trade-offs explicitly against values before advising. Two of them, a decision grid and a self-check on its own values, each nudged the match rate up by 0.125 and the shifts by about 0.06. Both worked by demanding step-by-step thought on multiple factors, unlike simpler lists that changed little. Still, the paper notes prompts alone likely won't close the gap fully; deeper training changes may be needed.
Alex: So prompts help a bit, but not enough on their own. What about those labels you mentioned earlier, like putting the AI's built-in leanings right out there for everyone to see?
Sam: The researchers propose something called Values In the Model, or VIM labels. Imagine checking a food package to see sugar or fat content before buying; these would list an AI's default leanings on care styles, like how bold or careful it starts out, so doctors know upfront. This study gives the first real numbers for those labels, pulling from tests across models and health topics.
Alex: That sounds useful for policy. But as a starting point, what limits did they note?
Sam: It's a pilot with just two scenarios pulled by a simple computer method, not checked by doctors yet, and only in English with US medical styles. They plan bigger tests with 20 to 30 cases, repeats, and doctor reviews to firm it up.
Alex: Huh, so solid first measurements for those VIM labels, but smart to scale it carefully. Leaves room for real-world tweaks across languages and systems.
Sam: Exactly. It highlights a key area that model reports overlook: how AIs weigh patient values by default, and how they respond when those values are stated. That data could shape rules for fairer clinical tools.
Alex: And thanks to you, Sam, for breaking it down. That's our look at measuring values in medical AI on ResearchPod.