What is AI-Enhanced Comparative Judgement (CJ)?

AI-Enhanced CJ is a specific method for assessing student writing that combines the strengths of human judgment with the speed of Artificial Intelligence, using the established technique of Comparative Judgement. Here's how it works:

Core Method - Comparative Judgement: Instead of assigning an absolute mark (like 7/10 or Grade B), judges (both human and AI) are shown two pieces of student writing side-by-side and simply decide which one is "better." This process is repeated many times.
AI Integration: An AI is trained to perform these comparative judgments, mimicking the decisions human experts would make.
Human Oversight (Hybrid Model): The system doesn't rely solely on AI. Human teachers also perform a portion of the judgments (e.g., we suggest a 10% human / 90% AI split). This allows for validation and keeps teachers involved.
Powerful Statistical Model: All judgements (human and AI) are fed into a sophisticated statistical model refined over years of human-only CJ. This model aggregates the pairwise comparisons to produce a reliable, scaled score for each piece of writing (a simplified sketch of this aggregation step follows this list).
Validation & Disagreement Analysis: The system allows direct comparison between human and AI decisions. Disagreements can be flagged and reviewed. Crucially, our analysis shows disagreements are often small or, in cases of large disagreement, frequently attributable to human error (like handwriting bias or clicking mistakes), not AI "howlers."
Maintaining Standards: Scores are linked to established national standards through comparison with previously assessed and graded work.
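
To make the aggregation step concrete, here is a minimal, hypothetical sketch in Python. It assumes the scaled scores come from a Bradley-Terry-style pairwise-comparison model (the production CJ model is more sophisticated), and the script names, quality values, and judgement counts are invented purely for illustration.

    import math
    import random

    # Toy illustration: aggregate pairwise "A beat B" judgements into scaled
    # scores with a simple Bradley-Terry model fitted by gradient ascent.
    # (Hypothetical sketch; the production CJ model is more sophisticated.)

    random.seed(0)

    scripts = ["script_A", "script_B", "script_C", "script_D"]
    true_quality = {"script_A": 2.0, "script_B": 1.0, "script_C": 0.0, "script_D": -1.0}

    def simulate_judgement(a, b):
        """Return the winner of one comparison; better scripts win more often."""
        p_a_wins = 1.0 / (1.0 + math.exp(-(true_quality[a] - true_quality[b])))
        return a if random.random() < p_a_wins else b

    # Collect many judgements over random pairs (human or AI judges alike:
    # the model only sees "winner" and "loser").
    judgements = []
    for _ in range(400):
        a, b = random.sample(scripts, 2)
        winner = simulate_judgement(a, b)
        loser = b if winner == a else a
        judgements.append((winner, loser))

    # Fit Bradley-Terry strengths (theta) by gradient ascent on the log-likelihood.
    theta = {s: 0.0 for s in scripts}
    learning_rate = 0.01
    for _ in range(2000):
        grad = {s: 0.0 for s in scripts}
        for winner, loser in judgements:
            p_win = 1.0 / (1.0 + math.exp(-(theta[winner] - theta[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for s in scripts:
            theta[s] += learning_rate * grad[s]

    # The fitted thetas act as the scaled scores: higher means judged better overall.
    for s, score in sorted(theta.items(), key=lambda kv: -kv[1]):
        print(f"{s}: {score:+.2f}")

Because every comparison feeds the same model, adding more judgements (from humans or the AI) simply tightens the estimated scores; no judge ever has to assign an absolute mark.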

Why is AI-Enhanced CJ Better Than Using LLMs Directly for Marking?

Comparative vs. Absolute Judgement: Humans, and seemingly AI as well, are better at making relative (comparative) judgements ("Is A better than B?") than absolute judgements ("What mark does A deserve?"). AI-Enhanced CJ leverages this strength, while direct LLM marking forces an absolute judgement, which may be less reliable.
Robust Validation & Trust: The hybrid CJ model has built-in validation. You can directly see how often the AI agrees with humans (81% in our most recent study, close to the 87% human-human agreement). With direct LLM marking, it's hard to know if the assigned marks are accurate or meaningful without manually re-marking, defeating the purpose.
Avoids Superficiality & Gaming: Historically, AI essay markers could achieve high surface correlations with human marks by focusing on superficial features (like length), making them easy to cheat. The CJ approach, validated against human holistic judgment, is less susceptible to this.
Psychometric Validity & Standardisation: The CJ statistical model provides psychometrically valid scores that are reliable and maintain standards over time by linking to a large dataset of graded work. Direct LLM marks lack this inherent grounding in established standards and proven statistical modelling unless specifically built and validated, which isn't guaranteed with off-the-shelf chatbots.
Meaningful Disagreement Analysis: The CJ model allows for analysis of where and why disagreements occur (see the sketch after this list). Finding that large disagreements often stemmed from human error builds confidence in the AI's judgement within this specific framework. Direct LLM marking doesn't inherently offer this level of error analysis.
Human-in-the-Loop Control: AI-Enhanced CJ keeps humans involved, allowing teachers to validate the AI, ensure alignment with pedagogical values, and stay engaged with student work. It offers a controlled way to leverage AI ("human behind the steering wheel") rather than simply outsourcing marking entirely.
Minimises Bias and Hallucinations: Feeding AI judgements into the robust statistical model, alongside human judgements, helps minimise potential AI bias and the tendency of LLMs to "hallucinate" or produce nonsensical outputs, giving more reliable results than isolated LLM judgements.
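
As a concrete illustration of the validation and disagreement analysis described above, the hypothetical Python sketch below computes the human-AI agreement rate on pairs judged by both, and flags disagreements where the gap between the scripts' fitted CJ scores is large. The function names, threshold, and data are illustrative assumptions, not our production pipeline.

    from typing import Dict, List, Tuple

    def agreement_rate(pairs: List[Tuple[str, str]],
                       human_winners: List[str],
                       ai_winners: List[str]) -> float:
        """Fraction of double-judged pairs where human and AI picked the same winner."""
        same = sum(1 for h, a in zip(human_winners, ai_winners) if h == a)
        return same / len(pairs)

    def flag_disagreements(pairs, human_winners, ai_winners,
                           scores: Dict[str, float], gap_threshold: float = 1.0):
        """Return disagreements where the CJ score gap is large enough to review."""
        flagged = []
        for (a, b), h, ai in zip(pairs, human_winners, ai_winners):
            if h != ai and abs(scores[a] - scores[b]) >= gap_threshold:
                flagged.append({"pair": (a, b), "human": h, "ai": ai,
                                "score_gap": abs(scores[a] - scores[b])})
        return flagged

    # Example with made-up data: three double-judged pairs and fitted CJ scores.
    pairs = [("script_A", "script_B"), ("script_C", "script_D"), ("script_A", "script_D")]
    human = ["script_A", "script_D", "script_A"]
    ai = ["script_A", "script_C", "script_A"]
    scores = {"script_A": 1.8, "script_B": 0.9, "script_C": 0.1, "script_D": -1.2}

    print(f"Human-AI agreement: {agreement_rate(pairs, human, ai):.0%}")
    for item in flag_disagreements(pairs, human, ai, scores):
        print("Review:", item)

Flagged pairs would then be reviewed by a person, which is how causes such as handwriting bias or clicking mistakes can be identified.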

In essence, AI-Enhanced CJ does not simply use AI to mark: it uses AI within a proven, statistically robust assessment framework (Comparative Judgement) that includes human oversight and validation, producing more trustworthy, reliable, and valid results than simply asking a general-purpose LLM for a grade.

Updated on: 22/04/2025
