
There is a scenario that medical educators have been quietly dreading for several years, and it arrived with very little fanfare. In 2023 and 2024, a series of peer-reviewed studies began to report performance data for ChatGPT on the Applied Knowledge Test section of the UK Medical Licensing Assessment. The results were, depending on your perspective, either impressive or alarming. Across multiple studies using the publicly available practice materials published by the Medical Schools Council, GPT-4 consistently answered between 73% and 86% of questions correctly. By 2024, the updated ChatGPT-4o model was scoring as high as 89.9% on AKT-style questions, including on entirely novel questions written specifically to test whether the model was reasoning or simply recalling content from its training data. The AKT pass mark has never been officially published, but studies of the PLAB examination - the predecessor assessment - suggest the passing threshold sits somewhere between 59% and 69%. On that basis, a widely available consumer AI tool appears capable not merely of passing the examination that will define whether a UK medical student can graduate, but of passing it comfortably.
Before exploring what this means for medical education, it is worth pausing on a genuinely contested question within the research: is ChatGPT actually demonstrating medical reasoning, or is it doing something far less impressive? Critics of the headline figures have argued that models like ChatGPT are, in effect, pattern-matching against the vast quantities of medical text they have been trained on - including, potentially, the exact questions used to assess them. When one 2025 study published in Medical Science Educator addressed this concern directly - rewriting AKT-style questions into novel formats and commissioning entirely new items from an experienced assessment author - ChatGPT-4o still scored 89.9%. The rewritten versions of existing questions showed only marginal performance differences from the originals. This suggests the model is doing something more than simple recall, even if the nature of that reasoning is not identical to human clinical thinking.
The performance is not uniform across specialties, and this heterogeneity is itself instructive. Studies have found that ChatGPT performs best on questions relating to mental health (around 89%), oncology (around 79%), and cardiovascular medicine (around 77%). It performs considerably worse on perioperative medicine and anaesthesia (around 33%), clinical haematology (around 29%), and endocrine and metabolic conditions (around 40%). There is a plausible explanation for this pattern: the areas of strong performance tend to correspond to topics with large quantities of structured, well-organised online literature, while the areas of weakness may involve more nuanced, protocol-driven or context-dependent clinical decision-making that is harder to learn from text alone. A model trained on internet data will naturally reflect the distribution and quality of medical information available there.
To understand why these findings are so provocative, it helps to be clear about what the MLA AKT is supposed to do. The GMC introduced the UKMLA as a standardised national exit examination in part to address a long-recognised problem: that graduating standards varied significantly between the UK's forty-five medical schools, and that there was no robust way to guarantee equivalence of clinical competence across institutions. The AKT is designed to assess whether a final-year student possesses the knowledge required to practise safely as a foundation doctor in the UK. Its questions are clinical vignettes - scenario-based single best answer items mapped to the MLA Content Map - rather than simple factual recall questions. The intention is to test applied knowledge: not "what is the mechanism of metformin?" but "which drug is most appropriate for this specific patient in this specific context?"
The implicit assumption underlying this design is that applied knowledge of this kind is a meaningful proxy for clinical competence - that a student who can reliably select the best management option for a clinical scenario across two hundred questions has demonstrated something important about their preparedness to practise. It is precisely this assumption that AI performance calls into question. If a language model that has never examined a patient, never made a clinical decision under uncertainty, and never borne responsibility for an outcome can score nearly 90% on the same test, two uncomfortable possibilities arise. Either the test is not measuring what we think it is measuring, or the knowledge it measures is more separable from clinical competence than medical educators have traditionally assumed.
There is a version of this debate that quickly becomes reassuring. One response to the ChatGPT performance data is to argue that the model is essentially doing a very sophisticated form of retrieval - drawing on its training data in ways that superficially mimic reasoning but lack genuine clinical understanding. GPT-4 scored significantly lower when the multiple-choice options were removed and it was required to generate answers from scratch, dropping to between 62% and 75% depending on the study. This is a meaningful gap. It suggests that the multiple-choice format itself provides a scaffolding that aids the model's performance, and that its ability to reason without that scaffold is more limited. Real clinicians, of course, do not have the luxury of five curated options.
A 2025 study from Cardiff University, published in Scientific Reports, found that GPT-4 also performed noticeably better on single-step questions - those requiring one inferential move from presentation to diagnosis, or from diagnosis to management - than on multi-step questions requiring a chain of reasoning. Clinical practice in the real world is almost entirely composed of multi-step reasoning under conditions of genuine uncertainty, incomplete information, and patient-specific variables that do not appear in any training dataset. On this view, the AKT may be easier for AI than clinical practice is, and the gap between AI performance on the examination and AI performance in an actual clinical encounter may be large enough to be reassuring.
But this reassurance has a short shelf-life. The models are improving rapidly. GPT-3 performed considerably worse than GPT-4, which in turn performed worse than GPT-4o, itself released only in 2024. The trajectory of improvement is steep and shows no obvious plateau. Whatever limitations currently prevent AI from matching human performance on multi-step reasoning tasks are likely to erode over the next five to ten years. Medical educators who take comfort in current AI weaknesses are, in effect, arguing that their assessment systems are safe because today's AI is not quite good enough - a position that seems unlikely to hold for long.
The more productive question is not whether AI "cheats" the AKT but what the AKT's vulnerability to AI reveals about the nature of the knowledge it assesses. Medical education has long debated the relationship between knowledge and competence. The acquisition of factual and applied knowledge is necessary but not sufficient for clinical excellence, and a written examination cannot measure the ability to elicit a history from an anxious patient, to notice something unexpected on examination, to navigate the uncertainty of a diagnosis that does not fit a textbook pattern, or to have a difficult conversation with a family. These are precisely the capacities that the MLA's Clinical and Professional Skills Assessment (CPSA) is designed to evaluate - and they are, for now, capacities that large language models cannot replicate.
Some commentators have drawn the opposite conclusion from the AI data: not that the AKT is flawed, but that the ability to access medical knowledge from an AI tool is now a normal feature of clinical practice, and that assessments should evolve to reflect this. If doctors in 2030 will routinely use AI to support clinical decision-making, then perhaps the question is less "can candidates answer AKT questions without AI?" and more "can candidates use AI intelligently, critically, and safely?" This position has been articulated by several medical education researchers writing in response to the ChatGPT performance literature. It suggests a possible future in which examinations test not knowledge retrieval but knowledge governance: the ability to interrogate, verify, and act on AI-generated recommendations rather than to generate those recommendations independently.
This would represent a fundamental reimagining of what medical licensing means. For much of the history of medicine, a doctor's clinical value lay substantially in their possession of knowledge that patients could not easily access themselves. The internet began eroding that asymmetry, and AI tools threaten to erode it further. If medical knowledge - or at least the organised, textbook variety of it - becomes freely and accurately available to anyone with a smartphone, the question of what distinguishes a trained physician from an intelligent layperson becomes sharper and more urgent.
For students currently preparing for the AKT, there is something both humbling and clarifying about these findings. The humbling part is that the examination you are studying for can be passed, in aggregate, by a machine. The clarifying part is what that tells you about the ceiling of what examination preparation alone can deliver. Doing well on the AKT matters - it is a genuine high-stakes assessment, and the knowledge it tests is foundational to safe clinical practice. But a high score does not make you a good doctor, any more than GPT-4 scoring 86% makes it one. The knowledge is a necessary foundation. The clinical reasoning, the communication, the ethical judgement, and the capacity to function under pressure are what you build on top of it - and those are capacities that no question bank, however good, can fully develop on its own.
The broader debate about AI and the AKT is, in many respects, a proxy debate about something medical education has always struggled with: the gap between knowing and doing. The MLA represents the most serious attempt in UK medical history to standardise the knowing. The doing remains, appropriately, the harder and less tractable challenge - and for now, at least, the more distinctly human one.