My MD Should Be a Bot

David Cowles

Jul 10, 2025

“AI is 4 times better at diagnosing complex medical conditions than MDs. This…could be the death knell for the traditional practice of medicine.”

In an earlier post on this site, My PCP Should be a Bot, we argued that AI could take over many of the mundane tasks associated with Primary Care and assist physicians with their other, higher-level functions. It is clear now that we undershot the mark…by a wide margin.


Yes, AI has a place in Primary Care, but it has a much larger role to play in Specialty Care, especially in advanced diagnostics.


Microsoft reports that AI consumer products like Bing and Copilot see over 50 million health-related inquiries every day. Search engines and AI companions are quickly becoming the new front line in healthcare.

But this is only scratching the surface of what’s possible. AI can do more, much more; it can be transformational. Along with CRISPR and Bespoke Medicine (N = 1), AI can be part of a huge advance in population health.


Toward the end of 2024, Microsoft launched a dedicated consumer health effort, led by clinicians, designers, engineers, and AI scientists. But for AI to make a real difference, clinicians and patients alike must be able to trust its performance.


No surprise here. New technologies always entail an adoption curve. Is it safe to drive one of them newfangled horseless carriages? Or fly in one of those aeroplanes? (I mean, if God wanted us to fly, wouldn’t he have given us wings?) But 100 years later, we have Carvana and Travelocity.


To practice medicine in the United States, physicians need to pass the United States Medical Licensing Examination (USMLE), a rigorous and standardized assessment of clinical knowledge and decision making. USMLE questions were among the earliest benchmarks used to evaluate AI’s competence.


Not surprisingly, generative AI has now advanced to the point of earning near-perfect scores on the USMLE and similar exams. But these tests rely primarily on multiple-choice questions, which favor memorization over deep understanding. By reducing medicine to one-shot answers on multiple-choice questions, such benchmarks undoubtedly overstate the apparent competence of AI systems and obscure their limitations. No one’s going to trust a Bot to manage a loved one’s healthcare purely because it performed well on some standardized test.


Algorithmic medicine might rely on discrete answers to specific questions, but ars medica, the art of medicine, requires clinical reasoning capability. How does AI perform on ‘sequential diagnosis’, a cornerstone of real-world medical decision making?


In this process, a clinician begins with an initial patient presentation and then iteratively selects questions and diagnostic tests to arrive at a final diagnosis. For example, a patient presenting with cough and fever may lead the clinician to order and review blood tests and a chest X-ray before they feel confident enough to diagnose pneumonia.


Each week, the New England Journal of Medicine (NEJM) – one of the world’s leading medical journals – publishes a Case Record of the Massachusetts General Hospital, presenting a patient’s care journey in a detailed, narrative format. These cases are among the most diagnostically complex and intellectually demanding in clinical medicine, often requiring multiple specialists and diagnostic tests to reach a definitive diagnosis.


How does AI perform in this arena? Surely here we need the empathy and imagination of a well-trained and highly experienced human agent.


To answer these questions, Microsoft created a series of interactive diagnostic challenges based on 304 recent NEJM case studies. Clinicians, be they AI Bots or human physicians, were able to ask iterative questions and order appropriate tests. As information became available, the clinicians could update their reasoning, gradually leading to a final diagnosis.


This diagnosis can then be compared to the gold-standard outcome published in the NEJM. Each requested investigation also incurs a (virtual) cost, reflecting real-world healthcare expenditures, so we can evaluate performance across two key dimensions: diagnostic accuracy and resource expenditure.
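
To make this two-dimensional scoring concrete, here is a minimal Python sketch of how accuracy and cost might be tallied across simulated cases. It is an illustration only, not Microsoft's code: the Episode class, the TEST_COSTS price list, and the exact-match check are all assumptions made for this example. A real benchmark would need a clinically informed judge to decide whether two diagnoses really match.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One simulated case: the published answer, the clinician's answer, tests ordered."""
    gold_diagnosis: str
    final_diagnosis: str
    ordered_tests: list = field(default_factory=list)


# Hypothetical price list in virtual dollars; NOT taken from the Microsoft study.
TEST_COSTS = {"blood panel": 50, "chest x-ray": 120, "ct scan": 900}


def episode_cost(episode):
    """Add up the virtual cost of every test ordered during the case."""
    return sum(TEST_COSTS[test] for test in episode.ordered_tests)


def is_correct(episode):
    """Naive stand-in for a judge: case-insensitive exact match."""
    return episode.final_diagnosis.strip().lower() == episode.gold_diagnosis.strip().lower()


def summarize(episodes):
    """Return (diagnostic accuracy, mean cost per case) for a list of episodes."""
    n = len(episodes)
    accuracy = sum(is_correct(e) for e in episodes) / n
    mean_cost = sum(episode_cost(e) for e in episodes) / n
    return accuracy, mean_cost


# Two made-up cases: one solved cheaply, one missed after an expensive scan.
cases = [
    Episode("pneumonia", "Pneumonia", ["blood panel", "chest x-ray"]),
    Episode("lupus", "sarcoidosis", ["ct scan"]),
]
print(summarize(cases))  # (0.5, 535.0)
```

Reporting the two numbers side by side, rather than folding them into one score, keeps the trade-off visible: a clinician, human or Bot, who orders every test can look accurate while being ruinously expensive.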


The Microsoft AI Diagnostic Orchestrator (MAI-DxO) is a system designed to emulate a virtual panel of physicians with diverse diagnostic approaches collaborating to solve cases. The orchestrator can integrate diverse data sources. It can turn any LLM into a ‘panel’ of virtual clinicians that can ask follow-up questions, order tests, or deliver a diagnosis, running a cost check before deciding to proceed.
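
The general shape of such a loop (propose a next step, check its cost, then either proceed or commit to a diagnosis) can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of how MAI-DxO is built: the run_case function, the hand-written cough_and_fever_policy standing in for the LLM panel, the findings, the prices, and the budget are all hypothetical.

```python
def run_case(findings, propose, test_costs, budget):
    """Toy sequential-diagnosis loop with a cost check before every test.

    `propose(known, force_diagnosis)` stands in for the panel and returns
    ("test", name) or ("diagnose", name). All names here are hypothetical.
    """
    known = {}   # test name -> result revealed so far
    spent = 0
    while True:
        kind, value = propose(known, force_diagnosis=False)
        if kind == "test" and spent + test_costs[value] > budget:
            # Cost check failed: ask the panel to commit with what it already knows.
            kind, value = propose(known, force_diagnosis=True)
            if kind != "diagnose":
                raise ValueError("policy must diagnose when forced")
        if kind == "diagnose":
            return value, spent
        spent += test_costs[value]
        known[value] = findings.get(value, "unremarkable")


def cough_and_fever_policy(known, force_diagnosis):
    """Hand-written stand-in for the panel, for a patient with cough and fever."""
    if "chest x-ray" in known:
        if known["chest x-ray"] == "lobar consolidation":
            return ("diagnose", "pneumonia")
        return ("diagnose", "viral upper respiratory infection")
    if force_diagnosis:
        return ("diagnose", "undetermined")
    if "blood panel" not in known:
        return ("test", "blood panel")
    return ("test", "chest x-ray")


# Made-up case data and prices, for illustration only.
findings = {"blood panel": "elevated white cell count", "chest x-ray": "lobar consolidation"}
costs = {"blood panel": 50, "chest x-ray": 120}

print(run_case(findings, cough_and_fever_policy, costs, budget=200))  # ('pneumonia', 170)
print(run_case(findings, cough_and_fever_policy, costs, budget=100))  # ('undetermined', 50)
```

The second call shows the trade-off in miniature: with a tighter budget the X-ray is never ordered, and the diagnosis suffers for it.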

 

MAI-DxO paired with OpenAI’s o3 correctly solved 85.5% of the NEJM benchmark cases. Pretty good! But how does this compare with the recommendations of real-life, flesh-and-blood specialists?

 

For comparison, Microsoft created a panel of 21 practicing physicians from the US and UK, each with 5-20 years of clinical experience. On the same tasks, these experts achieved a mean accuracy of…wait for it…just 20%.

 

Astonishing! AI is 4 times better at diagnosing complex medical conditions than MDs. This is a shot in the arm for AI-assisted medicine but, more importantly and quite unexpectedly, this could be the death knell for the traditional practice of medicine.
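
For readers who want to check the headline arithmetic, the short Python sketch below works it out from the two percentages reported above. The variable names are ours; the only inputs are the 85.5% and 20% figures.

```python
# Accuracy figures as reported above, in percent.
ai_accuracy = 85.5         # MAI-DxO paired with o3
physician_accuracy = 20.0  # mean of the 21-physician panel

ratio = ai_accuracy / physician_accuracy  # how many times more accurate
gap = ai_accuracy - physician_accuracy    # difference in percentage points

print(f"ratio: {ratio:.1f}x, gap: {gap:.1f} percentage points")
# ratio: 4.3x, gap: 65.5 percentage points
```

So "4 times" is, if anything, a slight understatement of the reported gap.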

 

What are we missing? Is our AI system ordering every possible test – regardless of cost, patient discomfort, or delays in care? Nope! MAI-DxO is configured to operate within defined cost constraints. This allows for explicit exploration of the cost-value trade-offs inherent in diagnostic decision making.

 

In fact, Microsoft found that MAI-DxO delivered both higher diagnostic accuracy and lower overall testing costs than physicians. Microsoft’s findings also suggest that AI reduces unnecessary healthcare costs – not so surprising given estimates that as much as 25% of U.S. healthcare spending is wasted on interventions that are inappropriate for the patient’s actual condition.

 

So how can we account for this amazing result? On the plus side, no single physician, no team of physicians, can span the full complexity of the NEJM case series. There is always one more voice to be heard. AI, on the other hand, doesn’t face this limitation; it can blend breadth and depth of expertise.

 

But the discrepancy is so glaring that we cannot escape the conjecture that something ‘not so good’ must be happening at the physicians’ end of the see-saw. Follow-up studies should focus on why flesh and blood ‘agents’ perform so poorly.

 

Are they overworked? Are their information systems outdated? Do they have a harder time moving past their initial ‘blink’ diagnoses, even when evidence is pointing in another direction? Or are they limited by inappropriate sociological biases based on things like the patient’s gender, race, age, occupation, education level, income, or zip code?

 

But there is an overwhelming sense that such follow-up studies, important though they are, will amount to just rearranging deck chairs on the Titanic. If Microsoft’s findings are confirmed, traditional medicine is dead…period. The big question now is how long it will take society ‘to stop life support and pronounce’.

 

There is a meme of unknown origin that states that it was not until 1900 that medicine did more good than harm. The Microsoft study suggests that that assessment might be wildly overoptimistic. A better meme might be, “When will medicine finally begin to do more good than harm?”

 

Will the healthcare profession hasten the achievement of that benchmark…or retard it? Brutal…I know; but consider what’s at stake!

Rembrandt. The Anatomy Lesson of Dr. Nicolaes Tulp. 1632, oil on canvas, 216.5 × 169.5 cm. Mauritshuis, The Hague. Baroque, Dutch Golden Age.

Rembrandt’s The Anatomy Lesson of Dr. Nicolaes Tulp captures a pivotal moment in the history of medicine—when science, observation, and education began to challenge traditional, experience-based practice. In much the same way, today’s AI diagnostic systems, like Microsoft’s MAI-DxO, are disrupting the modern medical profession. The painting shows medicine as a spectacle of human intellect dissecting the unknown; now, AI replaces the demonstrator’s scalpel with data-driven precision, automating what was once a learned, manual, and interpretive act. The cadaver becomes a metaphor for how patient care risks being depersonalized in the era of algorithmic medicine.

- the official blog of Aletheia Today Magazine. 
