Harvard Study Finds OpenAI's o1 Model Outperformed Physicians on Emergency Room Diagnoses

A study published this week in the journal Science found that OpenAI's o1 large language model reached the correct or near-correct diagnosis more frequently than two internal medicine attending physicians across a set of real emergency room cases, raising questions about the role AI could play in clinical decision-making.

The research was led by physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center. The team drew on 76 patients seen in the Beth Israel emergency room, presenting the same electronic medical record data to both the human physicians and to OpenAI's o1 and 4o models.

Two separate attending physicians then evaluated the diagnoses without knowing which came from a human and which came from an AI.

On initial triage — the point at which the least information is available and the urgency is highest — the o1 model reached "the exact or very close diagnosis" in 67% of cases. One of the two physicians hit that mark 55% of the time, while the other did so 50% of the time.

"We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines," said Arjun Manrai, who leads an AI lab at Harvard Medical School and is one of the study's lead authors, in a press release.

The researchers stressed that the AI models received no pre-processed data — they worked from the same records available to clinicians at each diagnostic moment.

The study stopped well short of arguing that AI is prepared to make autonomous treatment decisions. Its authors instead called the findings evidence of "an urgent need for prospective trials to evaluate these technologies in real-world patient care settings."

The researchers also acknowledged a significant constraint: the study examined only text-based inputs, and existing research suggests current AI models perform less reliably when reasoning over non-text data such as imaging.

Adam Rodman, a Beth Israel physician and co-lead author, said there is "no formal framework right now for accountability" around AI diagnoses and that patients still "want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions."

The findings also drew scrutiny from practicing clinicians. Kristen Panthagani, an emergency physician, described the study as "an interesting AI study that has led to some very overhyped headlines," pointing out that the AI was compared to internal medicine physicians rather than emergency room specialists.

"If we're going to compare AI tools to physicians' clinical ability, we should start by comparing to physicians who actually practice that specialty," Panthagani said. She also challenged the framing of the diagnostic task itself: "As an ER doctor seeing a patient for the first time, my primary goal is not to guess your ultimate diagnosis. My primary goal is to determine if you have a condition that could kill you."

The study adds to a growing body of research examining where large language models may augment or, in some contexts, match clinical judgment — while underscoring that regulatory and accountability infrastructure has yet to catch up with the technology's demonstrated capabilities.