Introduction: A System Built on Good Intentions
Every doctor in UK postgraduate training is familiar with the rhythm of Supervised Learning Events. The mini-CEX completed after a challenging consultation. The case-based discussion logged after a complex admission. The DOPS signed off following a procedural skill. Collectively these tools sit at the heart of a competency-based training framework that has reshaped UK medical education over the past two decades, and they are now embedded firmly in both undergraduate and postgraduate curricula. The assumption underlying all of them is straightforward: direct observation of clinical performance, followed by structured feedback, accelerates the development of competence and provides valid, reliable evidence that trainees are safe and progressing.
That assumption is, it turns out, substantially contested in the academic literature. For candidates preparing for the AKT and MLA, understanding the theoretical and empirical foundations of the assessment tools used throughout their training is not merely of academic interest — it speaks directly to how they interpret feedback, how they learn, and what the MLA itself is attempting to measure.
The Policy Architecture: How WPBAs Became Central to UK Training
The formal embedding of workplace-based assessment into UK postgraduate medical education can be traced to the introduction of the Foundation Programme in 2005 and the subsequent roll-out of competency frameworks across specialty training programmes managed by the relevant royal colleges. The theoretical underpinning drew heavily on North American work, particularly the competency-based medical education movement led by the Royal College of Physicians and Surgeons of Canada through the CanMEDS framework, and on assessment scholarship from researchers including Eric Holmboe at the Accreditation Council for Graduate Medical Education in the United States.
The mini-CEX - the Mini Clinical Evaluation Exercise - was originally developed and validated by Norcini and colleagues at the American Board of Internal Medicine, with foundational papers published in the Annals of Internal Medicine in the 1990s and early 2000s. It was designed as a structured, brief observation of a clinical encounter, rated across domains including history-taking, physical examination, clinical judgement, and communication. Case-based discussion, procedural assessments, and multi-source feedback tools were added to produce the portfolio of tools that now constitutes what the GMC refers to as Supervised Learning Events.
The 2013 Shape of Training review, led by Professor David Greenaway, reinforced competency-based progression as a cornerstone of UK medical training and endorsed workplace assessment as the appropriate mechanism for demonstrating it. The subsequent Curriculum Review and the introduction of Generic Professional Capabilities frameworks by the GMC have further institutionalised WPBAs as the primary evidence base for Annual Review of Competence Progression decisions.
The Validity Problem: What Are We Actually Measuring?
The central academic question - and the most uncomfortable one for the policy consensus - is whether WPBAs measure what they purport to measure. Validity in assessment theory requires that an instrument captures the intended construct accurately, and that scores reflect genuine differences in the underlying attribute rather than extraneous variables.
The evidence on this point is troubling. A systematic review published in Medical Education in 2011 by Kogan and colleagues examined the validity evidence for direct observation tools including the mini-CEX, and found that while the tool had reasonable face validity and was broadly accepted by trainees and assessors, the evidence for construct validity - that ratings genuinely differentiated competent from less competent trainees - was weak. Assessor stringency varied substantially, and training in how to use the scale made only modest differences to rating behaviour. Studies consistently found a ceiling effect, with the majority of assessments being rated at the higher end of the scale regardless of trainee seniority, a phenomenon that fundamentally undermines the discriminative purpose of the tool.
Reliability presents an equally challenging picture. Because each clinical encounter is unique, reproducible measurement requires aggregation across multiple observations by multiple assessors - a requirement formalised in generalisability theory, which draws on the work of Brennan and colleagues. Research by Crossley and colleagues, published in Medical Education in 2011, demonstrated that achieving acceptable reliability from mini-CEX ratings required far more observations than most training programmes mandated; the minimum numbers logged in most portfolios were determined by administrative feasibility rather than psychometric adequacy.
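To see why so many encounters are needed, the sketch below works through the standard single-facet generalisability formula, in which the reliability of a mean score is the ratio of trainee-to-trainee ("true") variance to that variance plus the per-observation error variance divided by the number of observations. The variance components, the 0.8 target, and the helper functions g_coefficient and observations_needed are illustrative assumptions for this article, not figures drawn from Crossley's study or any published dataset.

```python
# Illustrative sketch of the generalisability-theory argument above.
# The variance components are invented for illustration; they are not
# taken from Crossley and colleagues or any published dataset.

def g_coefficient(true_variance: float, error_variance: float, n_obs: int) -> float:
    """Reliability (G coefficient) of a mean score across n_obs observations,
    using the single-facet formula: sigma^2_p / (sigma^2_p + sigma^2_e / n)."""
    return true_variance / (true_variance + error_variance / n_obs)

def observations_needed(true_variance: float, error_variance: float, target: float) -> int:
    """Smallest number of observations whose mean reaches the target reliability."""
    n = 1
    while g_coefficient(true_variance, error_variance, n) < target:
        n += 1
    return n

if __name__ == "__main__":
    # Assume trainee-to-trainee ("true") variance is small relative to the noise
    # contributed by case difficulty and assessor stringency - the general pattern
    # reported for mini-CEX ratings. These specific values are hypothetical.
    sigma2_trainee, sigma2_error = 0.2, 1.8

    for n in (1, 4, 8, 12, 20):
        print(f"{n:>2} observations -> G = {g_coefficient(sigma2_trainee, sigma2_error, n):.2f}")

    print("Observations needed for G >= 0.8:",
          observations_needed(sigma2_trainee, sigma2_error, 0.8))
```

With these invented numbers, the mean of the handful of encounters typically logged in a training year falls well short of the conventional 0.8 threshold, and roughly 36 observations would be needed to reach it - the same order-of-magnitude gap between administrative feasibility and psychometric adequacy described above.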
The Feedback Paradox
If validity and reliability are contested, the feedback function of WPBAs is perhaps even more problematic. The theoretical model holds that structured post-encounter feedback drives learning - that the assessor's comments scaffold the trainee's reflective process and identify specific developmental needs. The empirical reality is considerably more mixed.
Research by Watling and colleagues, published in a series of papers in Academic Medicine and Medical Education between 2008 and 2014, explored how trainees actually receive and use feedback in clinical settings. Their qualitative work found that feedback embedded in formal assessment encounters was often perceived by trainees as less credible and less useful than informal, relationship-based feedback from trusted supervisors. Trainees distinguished clearly between feedback designed to satisfy portfolio requirements and genuine mentorship, and often regarded the former with scepticism. The presence of a formal assessment form, paradoxically, appeared to reduce the quality of the educational conversation.
A quantitative dimension to this was added by Overeem and colleagues in a study of multi-source feedback published in Medical Education in 2010, which found that physicians who received negative MSF ratings showed only modest improvements on subsequent assessment, and that the feedback alone - without coaching or facilitated reflection - was insufficient to drive behavioural change. The implication is that WPBAs, as currently implemented, may generate data about performance without reliably translating that data into improved practice.
The Assessor Problem: Reluctance, Bias, and Time
Underpinning many of these validity and reliability concerns is a structural problem that no psychometric refinement can fully solve: the assessors are busy clinicians operating in an under-resourced NHS, and completing WPBAs is not their primary professional role. A qualitative study by Touchie and Choudhury, alongside broader UK-specific research by Bindal and colleagues published in Medical Education in 2011, found that many supervisors experienced WPBAs as administratively burdensome and clinically disruptive. There was evidence of assessors completing forms retrospectively, providing non-specific written feedback, and avoiding the lower rating categories to prevent conflict with trainees.
This is sometimes called the problem of "grade inflation" in WPBA systems, though the term understates the issue. It is not merely that assessors are generous - it is that the social dynamics of the clinical environment, where assessors and trainees work together over extended periods, fundamentally compromise the independence that valid assessment requires. The phenomenon of leniency error is well documented across assessment contexts, but it is particularly acute in settings where assessor and assessee have an ongoing professional relationship and where a poor rating has visible consequences for the trainee's career progression.
The differential attainment literature adds a further dimension here. Research by the GMC, and by Woolf and colleagues published in the BMJ in 2011 and in subsequent studies, has consistently found that doctors from minority ethnic backgrounds perform less well on structured clinical assessments and multiple-choice examinations. Whether the same differentials apply to WPBA ratings - and whether assessor bias contributes to any differential - is a live research question that has not been adequately resolved, with some studies finding no differential and others suggesting more complex patterns depending on assessor-trainee demographic pairing.
The Portfolio as Evidence: ARCP and the Downstream Consequences
The Annual Review of Competence Progression process relies on portfolios populated primarily by WPBA data to make consequential decisions about progression, remediation, and ultimately fitness to practise. If the underlying WPBAs have uncertain validity, inconsistent reliability, and are prone to leniency bias, the consequences for ARCP decision-making are significant. Reviews of ARCP processes carried out by the GMC's predecessor body, the Postgraduate Medical Education and Training Board, together with more recent GMC oversight reports, have noted substantial variation in how panels interpret portfolio evidence and in how consistently outcomes are applied across deaneries and specialties.
The 2019 Briffa Review of Foundation Programme assessment, commissioned following concerns about consistency and rigour, acknowledged several of these tensions and recommended clearer guidance on minimum evidence standards. The GMC's own surveys of trainees, published annually, have consistently found that only a minority of trainees feel their WPBA feedback has been genuinely useful to their development, suggesting a gap between the aspirations of competency-based frameworks and the lived experience of those subject to them.
The Reform Agenda: Entrustable Professional Activities
In response to these limitations, there has been growing interest in Entrustable Professional Activities as an alternative or complementary framework. EPAs, conceptualised originally by ten Cate and Scheele in an influential paper in Academic Medicine in 2007, reconceptualise assessment around holistic clinical tasks - such as admitting an acutely unwell patient or managing anticoagulation - rather than atomised competency domains. The argument is that EPAs better reflect actual clinical work, that supervision decisions are the natural endpoint of competency judgements, and that the framework reduces the tendency towards reductive tick-box assessment.
The shift towards EPA-based frameworks has been adopted in several international contexts, including postgraduate training in the Netherlands and Canada, and has been recommended in a number of UK policy documents. However, the evidence base for EPAs is, at this point, considerably thinner than their enthusiasts acknowledge. A critical review by Rekman and colleagues in Academic Medicine in 2016 noted that while EPAs had strong conceptual appeal, empirical data on their reliability, feasibility, and impact on training outcomes remained limited. The risk of replicating the assessor-dependency problems of mini-CEX in an EPA framework has not been fully addressed.
Relevance to the AKT and MLA
For candidates sitting the MLA, these debates are not merely background academic noise. The MLA itself - and the AKT component specifically - was designed partly as a response to perceived variability in the quality of workplace-based evidence, providing a standardised, psychometrically robust, centrally administered assessment that supplements the portfolio evidence generated by WPBAs. Understanding the epistemological argument that underpins that design - that reliable, valid, standardised written assessment serves a different but complementary function to supervisor-based observation - helps candidates appreciate what the AKT is trying to do and why its scoring methodology differs from a portfolio grade.
Furthermore, as future trainers and supervisors, doctors who have engaged critically with the evidence on WPBA effectiveness are better positioned to deliver genuinely useful feedback rather than completing forms as a bureaucratic exercise.
Conclusion
Workplace-based assessments were introduced into UK medical training with genuinely important educational aims: to move beyond high-stakes end-of-training examinations, to embed formative feedback within clinical practice, and to generate longitudinal evidence of professional development. The evidence suggests that these aims have been partially but not fully realised. Problems of validity, reliability, assessor behaviour, and feedback utility are well documented in the academic literature, and the structural conditions of modern NHS clinical training - time pressure, workforce shortfalls, complex supervisor-trainee relationships - create ongoing barriers to implementation fidelity. This does not mean WPBAs should be abandoned. It means they should be understood critically, reformed iteratively, and combined with assessment tools whose psychometric properties are more rigorously established.
