How long does it take to assess one class’s essays using Comparative Judgement?

We are often asked how long it takes to assess a set of essays using Comparative Judgement.

The short answer is as follows.

Traditional assessment with a rubric: 5 hours to assess 30 essays
Comparative Judgement without AI: 2.5 hours to assess 30 essays
Comparative Judgement with AI: 15 minutes to assess 30 essays

Read on for the longer answer.

Traditional assessment – 5 hours
Marking one essay with a traditional mark scheme may take 5-15 minutes.
Let’s assume 10 minutes as an average.
A teacher with a class of 30 students will therefore take 5 hours (30 essays × 10 minutes) to mark one class’s set of essays.
This type of traditional assessment involves each essay being seen once by one teacher. There may be some moderation of a small proportion of scripts after the event. The typical reliability for this kind of assessment is +/- 5 marks on a 40 mark scale.

Comparative Judgement without AI – 2.5 hours
We have been running human Comparative Judgement assessments for several years. Here is how it works.

You need to do 10 Comparative Judgements per essay.
So a class of 30 students will need 300 judgements.
Because each judgement involves 2 essays, each essay will be seen 20 times (300 judgements × 2 essays ÷ 30 essays).
Each judgement takes between 20 and 30 seconds. Let’s assume 30 seconds: 300 judgements × 30 seconds = 150 minutes, or 2.5 hours. That is half the time it takes to assess a class set traditionally.
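The arithmetic above can be sketched in a few lines (a minimal illustration using the article’s own assumptions):

```python
# Workload arithmetic for human-only Comparative Judgement of one class set.
essays = 30
judgements_per_essay = 10     # recommended number of judgements per essay
seconds_per_judgement = 30    # upper end of the 20-30 second range

total_judgements = essays * judgements_per_essay
# Each judgement shows 2 essays, so each essay appears this many times:
views_per_essay = total_judgements * 2 // essays
total_hours = total_judgements * seconds_per_judgement / 3600

print(f"{total_judgements} judgements, each essay seen {views_per_essay} times, "
      f"{total_hours} hours of judging")
# → 300 judgements, each essay seen 20 times, 2.5 hours of judging
```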

This approach is also more reliable than traditional marking: we achieve typical reliability of +/- 2 marks on a 40-mark scale. For a truly fair time comparison, we could ask how long Comparative Judgement takes to reach the reliability of traditional marking, +/- 5 marks. In 2018, an independent study by Ofqual, the English exam regulator, found that you could achieve this with just 4 judgements per essay – only one hour of judging (see page 27 here). However, we do not recommend this, because we think the reliability of traditional marking is not good enough!

How is it possible for Comparative Judgement to achieve such good results? It comes down to two main features. One, humans are good at comparative judgement but poor at absolute judgement, which is the format used in traditional marking. You can read more about this here. Two, Comparative Judgement aggregates many different judgements, so even if some individual judgements are errors, the weight of all the judgements cancels the errors out.

So, in conclusion, even without AI, Comparative Judgement can halve your workload and improve your reliability.

But if we add in AI, it gets even better…

Comparative Judgement with AI – 15 minutes
When you set up a judging task, you can now choose to add AI judges.
You can choose what proportion of judgements you want the AI judges to do.
Our recommended approach, which we have used in our trials, is to do 90% AI and 10% human judgements.
Follow this approach, and you will cut your teacher workload by 90% (and that’s 90% of the CJ human workload, which is already only 50% of the traditional approach!)

Let’s see what this means in practice for a typical class.

You need to do 10 judgements per essay.
So a class of 30 students will need 300 judgements.
The AI can do 90% of those judgements. The human teacher therefore only has to do the remaining 30 (10% of 300).
Each judgement takes between 20 and 30 seconds. Let’s assume 30 seconds: 30 judgements × 30 seconds = 15 minutes.
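Sketched the same way (again just illustrating the article’s figures; the 90% AI share is the recommended setting):

```python
# Teacher workload when AI judges take on 90% of the judgements.
essays = 30
judgements_per_essay = 10
seconds_per_judgement = 30
ai_share = 0.90               # recommended proportion of judgements done by AI

total_judgements = essays * judgements_per_essay
human_judgements = round(total_judgements * (1 - ai_share))
human_minutes = human_judgements * seconds_per_judgement / 60

print(f"Teacher does {human_judgements} of {total_judgements} judgements: "
      f"{human_minutes:.0f} minutes")
# → Teacher does 30 of 300 judgements: 15 minutes
```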

We have therefore reduced the time it takes to mark a class set of essays from 5 hours to 15 minutes.

What about the reliability and error rate? Can we trust the AI? Our initial findings are that the human and AI judges agree about 80-85% of the time – about the same as human-human agreement. Where the humans and AI disagree, we find the AI is often right and the human has been misled by superficial features. After every assessment, you will get a report listing all the human-AI disagreements so you can decide whether you trust the AI.

So, in conclusion, Comparative Judgement combined with AI can reduce your workload by 95% and eliminate obvious human errors.

Frequently Asked Questions

Isn’t it important for students to know their work is being read by a human?
Yes – and with our recommended approach, every essay will be read twice by a human – which is more than you get with traditional marking. Read more on this here.

What about feedback?
You can choose to leave audio comments on each piece of writing as you judge. Our AI system combines the audio comments from all the different teachers and turns them into a polished written comment for each student and a whole-class feedback report for the teacher. If you use audio comments, you may need to do slightly more than 10% of the judgements yourself to ensure each student gets enough detail, and each judgement may take a little longer, so budget for a bit more than 15 minutes per teacher.
Our national projects also provide two other forms of feedback. One is direct AI feedback, which we make available in teacher- and pupil-friendly formats. The other is sets of multiple-choice questions designed by us and allocated to students depending on their score. Neither requires any teacher time.

Can the AI do 100% of the judging?
Yes. We don’t recommend this as the default, but it may be appropriate at certain points in the year, e.g. for a redrafting assessment or an assessment that needs a very quick turnaround.

All the above calculations are based on just one essay. What about a mock exam which consists of a series of essays and short questions?
We are working with a number of schools on applying the above approach to mock English exams, and so far we have some promising results. We will have further updates soon.

I have other questions!
Ask us here or sign up for one of our intro webinars here.

Updated on: 31/05/2025
