Double Marker: Validation Guide: AI & Human Marking Alignment

To ensure our marking is robust and reliable, we don’t just look for "exact matches"—we look for Expert Alignment.

In subjective subjects like English Language, even two highly experienced human examiners will rarely give the exact same mark. Our system uses a research-backed Tolerance Model to validate that our AI and human markers are operating within the "Professional Zone of Agreement."

The Science: The Ofqual Consistency Study

Our tolerances aren't arbitrary. They are rooted in the Ofqual Marking Consistency Study, which analysed thousands of marks from major exam boards (AQA, OCR, Pearson, WJEC).

Key Research Insight: As the "Item Tariff" (total marks available) increases, the natural variation between expert markers also increases. A 1-mark difference on a short question is common; a 4-mark difference on a long essay is often still considered "accurate" marking within the industry.

Our Methodology: The "Linear + Offset" Formula

To provide a fair and robust comparison between AI and human markers, we use a formula that accounts for the "Subjectivity Floor"—the baseline level of disagreement that exists in English Language assessment.

The Formula

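We calculate tolerance by taking 10% of the total marks available, adding a 1-mark baseline for subjectivity, and rounding the result up to the next whole mark:

Tolerance = ceiling(0.10 × total marks + 1)

For example, a 24-mark question gives 2.4 + 1 = 3.4, which rounds up to a tolerance of ± 4. We always round up (never down) to ensure we respect professional judgment.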
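The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: the function name `tolerance` and the rejection of non-positive totals are assumptions made for the example. It uses integer arithmetic, which gives the same result as rounding up 10% of the marks plus 1 while avoiding floating-point rounding surprises.

```python
def tolerance(total_marks: int) -> int:
    """Return the tolerance in marks: ceiling(10% of total marks + 1).

    Hypothetical helper for illustration only.
    """
    if total_marks <= 0:
        # Assumption: a question must carry at least one mark.
        raise ValueError("total_marks must be a positive integer")
    # Ceiling of total_marks / 10 using integer arithmetic.
    ten_percent_rounded_up = -(-total_marks // 10)
    # Adding the 1-mark subjectivity baseline after rounding is equivalent
    # to rounding up (10% + 1), because the baseline is a whole number.
    return ten_percent_rounded_up + 1


for marks in (8, 20, 24):
    print(f"{marks} marks: ± {tolerance(marks)}")
```

For 8, 20 and 24 marks this gives tolerances of 2, 3 and 4 respectively.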

How this builds confidence:

Question Marks	Tolerance (±)	Why this is robust
8 Marks	± 2	A 1-mark difference is statistically "noise." A 2-mark limit ensures the AI and human are on the same page.
20 Marks	± 3	Aligns with the standard deviation of marks awarded by senior examiners for mid-length responses.
24 Marks	± 4	For extended writing, this allows for different (but valid) interpretations of style and flair.
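Read as a check, the ± column compares the gap between the AI mark and the human mark against the tolerance for that question. The sketch below assumes the boundary is inclusive (a gap exactly equal to the tolerance still counts as within tolerance), which this article does not state explicitly. The function name `is_within_tolerance` and its parameters are hypothetical.

```python
def is_within_tolerance(ai_mark: int, human_mark: int, total_marks: int) -> bool:
    """Return True if the AI and human marks differ by no more than the tolerance.

    Hypothetical helper; assumes an inclusive boundary.
    """
    # Tolerance = ceiling(10% of total marks + 1), using integer arithmetic.
    allowed_difference = -(-total_marks // 10) + 1
    return abs(ai_mark - human_mark) <= allowed_difference


# 20-mark question (tolerance ± 3): a gap of 3 passes, a gap of 4 does not.
print(is_within_tolerance(ai_mark=14, human_mark=11, total_marks=20))
print(is_within_tolerance(ai_mark=15, human_mark=11, total_marks=20))
```

The first call prints `True` and the second prints `False`.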

What "Within Tolerance" Means for You

When you see that a mark is Within Tolerance, it serves as a guarantee of two things:

Calibration: The AI has interpreted the mark scheme in a way that aligns with expert human standards.
Robustness: The final grade is not dependent on the "luck of the draw" of a single marker, but is verified against a data-driven standard.

Updated on: 18/05/2026
