Open access publication in the spotlight - 'The price of automated case law annotation: comparing the cost and performance of GPT-4o and student annotators'

Each month, the open access team of the University of Groningen Library (UB) puts a recent open access article by UG authors in the spotlight. This publication is highlighted via social media and the library’s newsletter and website.
The article in the spotlight for the month of January 2026 is titled 'The price of automated case law annotation: comparing the cost and performance of GPT-4o and student annotators', written by Iris Schepers (Faculty of Law), Michelle Bruijn (Faculty of Law), Martijn Wieling (Faculty of Arts) and Michel Vols (Faculty of Law).
Abstract:
This paper evaluates the performance of GPT-4o in annotating decisions of the United Nations Committee on Economic, Social, and Cultural Rights and compares these results with manual annotations by trained law students and senior (legal) scholars. GPT-4o achieves human-level accuracy in basic annotations, but struggles with recall in citation extraction, particularly for complex legal references. Human annotators, while more reliable in citation extraction, introduce formatting inconsistencies and occasional errors due to sloppiness. In contrast, GPT-4o maintains high precision, but suffers from variability across repeated prompts, raising concerns about reproducibility. Beyond accuracy, this study highlights cost-effectiveness as a key advantage of GPT-4o. The model significantly reduces annotation time and expenses compared to human annotators, who require post-processing and expert supervision. While GPT-4o produces structured output with fewer formatting inconsistencies, its omissions and inconsistencies require human oversight. These findings highlight trade-offs in expertise, cost, and reliability between human and AI-driven annotation. Although GPT-4o is a viable tool for basic legal annotations, improvements in recall and consistency are needed for more complex tasks.
We asked corresponding author Iris Schepers a few questions about the article:
You conclude that human oversight remains necessary. What would an ideal workflow look like where humans and AI complement each other?
Every project is different, but I think that an ideal workflow between humans and AI keeps humans in control at the beginning and the end of the process, with AI supporting specific tasks in between. Humans should be responsible for the creative start of a project: defining the research questions, tasks, and approach, setting goals, and deciding on ethical boundaries, since these require understanding, judgment, and responsibility. Creativity is inherently a human trait that we should cherish. AI can then be used for repetitive or large-scale tasks such as collecting data, sorting it, or (pre-)annotating it. This stage comes with risks, such as biased labels, hidden errors, or researchers relying too much on AI output. Because of this, human oversight remains necessary after AI has done its work. Humans need to check results, interpret findings, and make final decisions, especially since (generative) AI can produce outputs that sound confident but are incorrect. A similar approach can be seen in the medical field, where robots assist with surgeries that require speed or precision, while doctors remain responsible for diagnosis and treatment decisions. This kind of workflow allows AI to support human work without replacing human responsibility.
You note that researchers must weigh which types of annotation errors are acceptable. Should there be standardized reporting of AI performance limitations in published research, similar to how we report statistical limitations?
Yes, because without clear reporting, any system can easily be presented as more reliable or more widely applicable than it actually is, which increases the risk of misuse and of overconfidence in its results. In the early days of generative AI, this was often easier to recognise, for example when a model confidently gave a wrong answer to a simple question like “1+1” or when generated images were obviously distorted. This made it clear that these systems were not general problem-solving authorities. As they have improved, their mistakes have become less obvious, which can make their limitations harder to detect.
This lack of reporting can result in a dislike or distrust of generative AI, which can also spill over to other types of AI, causing people to be sceptical of systems that may actually be more transparent or reliable. Clear and standardised reporting would help prevent both blind trust and blanket distrust. By clearly stating what kind of AI was used, what it can and cannot do, and where it is likely to fail, we can avoid the idea that a single model is a solution to every problem and encourage more careful and responsible use of AI, in research or anywhere else. A positive development in this direction is that journals, and sometimes educators, increasingly ask for a disclosure or logbook of any AI tools used for a submission.
Do you think a course in prompt engineering should be mandatory for law students?
I am not sure that a mandatory course in prompt engineering should be the first step for law students, or for students in general for that matter. Students are still students, meaning they are in the process of developing the skills needed to critically assess information. Many do not yet have the experience and background knowledge (which they gain through the courses they follow) required to reliably distinguish between accurate output and confident but incorrect or misleading text produced by large language models. This is also reflected in our own work, where neither the output of student researchers nor that of GPT-4o was treated as ground truth. Instead, only answers produced and reviewed by multiple legal and computational experts were considered reliable, ensuring that authority and responsibility remained with human experts. There are clear risks involved in students becoming over-reliant on AI systems. While research shows that large language models can be very useful in educational settings, uncritical acceptance of their output can negatively affect independent thinking and judgment, even when the tools appear helpful (Shi et al., 2026).
Rather than focusing narrowly on prompt engineering, education could emphasise responsible and critical use of AI. Completely banning these tools is likely unrealistic, given their growing presence, but students should be taught to question AI outputs, verify claims, understand model limitations, and remain aware of broader concerns such as environmental impact. I think a helpful comparison can be made to how I was taught source research in high school. Students were expected to find and evaluate their own sources, and pages like Wikipedia or Reddit were not accepted as final references. Large language model output could be treated in a similar way: as a starting point or draft, but never as the final product. One possible exercise could be to ask students to analyse AI-generated text and identify errors using reliable legal sources.
Do you use AI yourself in your work as a scientist? If so, how?
Apart from the research for this article, I personally use AI only for low-risk tasks, and I find it particularly useful when it comes to language. English is not my native language, and sometimes my writing ends up sounding “too Dutch”: a sentence might be grammatically correct but still sound awkward or unnatural. In those cases, ChatGPT helps me turn these rough sentences into more natural English. It lets me write down imperfect sentences without worrying about getting everything right immediately; I always check the meaning afterwards. This takes away some of the pressure, helps me get past writer’s block, and lets me focus first on getting my ideas onto the page. For me, this is where large language models work best: as tools that support writing and problem-solving, without replacing my own thinking or scientific judgment.
A practical use case is troubleshooting errors in LaTeX, which I use to write my articles. When an error message that makes compilation fail is unclear or difficult to interpret, ChatGPT can suggest possible fixes or point me in the right direction.
Could you reflect on your experiences with open access and open science in general?
This is my second experience with publishing open access, both times with the journal Artificial Intelligence and Law as part of the EVICT project. The project is funded by the ERC, which requires open access publishing because it aims to make publicly funded research openly available to a broad audience. Overall, I am very positive about open access publishing. I think it lowers barriers to accessing knowledge and helps research reach a much broader audience, including practitioners and researchers at institutions with fewer resources. I benefit from this myself, as open access makes background research more accessible and allows my own work to be read more widely. Since my PhD research focuses on developing new ideas for legal research methodologies, I think reaching as many readers as possible is especially important.
In practice, my experiences have varied somewhat between the two articles. My first article went through a relatively fast review process of about four months and was published in June. The open access publication costs were fully covered by the agreement between Springer Nature and the Universities of the Netherlands, which made the process quick, easy, and straightforward. My second article, however, had a longer review process: although it was submitted in March, it was not accepted until December. By that time, the open access quota under the agreement had been reached, and we were required to pay the publication fee ourselves. This experience highlighted how access to open access publishing can still depend on timing, funding structures, and institutional agreements.
Citation:
Schepers, I., Bruijn, M., Wieling, M. et al. The price of automated case law annotation: comparing the cost and performance of GPT-4o and student annotators. Artif Intell Law (2025). https://doi.org/10.1007/s10506-025-09495-1

