Can Artificial Intelligence Chatbots Assess Educational Data as Well as Humans?
Accessing accurate information in vast educational databases remains a challenge for many users. Answers to simple questions about academic performance or statistical standards are often scattered across reports, tables, and technical documents. Generative artificial intelligence tools such as chatbots are being developed to ease this access, but their reliability and accuracy remain open questions, especially when they handle sensitive, frequently updated data.
A promising solution is a technique called retrieval-augmented generation. Unlike a conventional language model, which relies solely on the knowledge fixed at training time, a retrieval-augmented system looks up official, verified sources at query time and uses the retrieved passages to ground its answers. This reduces the risk of errors and outdated information, a common weakness of traditional tools.
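To make the mechanism concrete, here is a minimal sketch of the retrieval-and-grounding step, assuming a toy keyword-overlap retriever over a made-up corpus; the corpus contents, the scoring rule, and the prompt wording are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal retrieval-augmented generation sketch (illustrative only).
# The toy corpus and naive keyword scoring are assumptions, not the
# study's real retrieval system.

CORPUS = [
    "Average scores for grade 4 reading are reported on a 0-500 scale.",
    "Achievement levels are defined as Basic, Proficient, and Advanced.",
    "State-level results are released every two years.",
]

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Instruct the model to answer only from the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the sources below.\n"
            f"Sources:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

question = "What scale are grade 4 reading scores reported on?"
print(build_prompt(question, retrieve(question, CORPUS)))
```

The prompt built this way would then be passed to whatever language model the real system uses, so the answer stays tied to the retrieved sources rather than to the model's internal memory.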
Researchers tested a chatbot specialized in education, designed to answer complex questions about school standards and large-scale assessment data. To gauge its performance, human judges rated its responses on three main criteria: accuracy of the information, completeness, and clarity of communication. By these expert ratings, the chatbot's answers proved reliable.
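Comparing two sets of ratings typically relies on a chance-corrected agreement statistic. The study's exact metric is not stated here, so the sketch below uses Cohen's kappa on invented ratings purely as an illustration of how such agreement can be measured.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c]
                     for c in freq_a.keys() & freq_b.keys()) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Toy example: human vs. automated ratings of answer clarity.
human = ["good", "good", "fair", "poor", "good", "fair"]
model = ["good", "good", "fair", "fair", "good", "fair"]
print(f"kappa = {cohens_kappa(human, model):.2f}")
```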
The study's major innovation lies in using another artificial intelligence model to automate part of the evaluation. This approach, known as LLM-as-a-judge, saves time and resources while maintaining a high level of quality. The analyses show that the automated ratings are comparable to those of human evaluators, and on the clarity of responses they prove even more consistent.
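A minimal sketch of what such a judging step can look like follows, assuming a rubric prompt and a JSON score format; the rubric wording, the criteria labels, and the faked model reply are invented for illustration and are not the study's instrument.

```python
import json

# Hypothetical grading rubric; the real study's instructions are not shown here.
RUBRIC = """You are grading a chatbot's answer about assessment data.
Rate each criterion from 1 (poor) to 5 (excellent) and reply as JSON:
{"accuracy": int, "completeness": int, "clarity": int}"""

def build_judge_prompt(question: str, answer: str, reference: str) -> str:
    """Assemble the rubric, the reference facts, and the answer to grade."""
    return (f"{RUBRIC}\n\nQuestion: {question}\n"
            f"Reference information: {reference}\n"
            f"Answer to grade: {answer}")

def parse_scores(raw_reply: str) -> dict[str, int]:
    """Extract the three criterion scores from the judge's JSON reply."""
    scores = json.loads(raw_reply)
    return {k: int(scores[k]) for k in ("accuracy", "completeness", "clarity")}

# A real system would send the prompt to a judge model; we fake its reply here.
fake_reply = '{"accuracy": 5, "completeness": 4, "clarity": 5}'
print(build_judge_prompt("What scale is used?",
                         "A 0-500 scale.",
                         "Scores use a 0-500 scale."))
print(parse_scores(fake_reply))
```

Scored in bulk this way, a judge model can rate every chatbot response against the same rubric, which is one plausible reason automated ratings can be more consistent than those of multiple human raters.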
This advance paves the way for broader use of artificial intelligence to analyze complex educational data and make it accessible. It could especially help teachers, parents, and decision-makers obtain accurate information quickly, without requiring technical expertise. Partial automation of the evaluation also reduces costs and speeds up the process while retaining human oversight to ensure the accuracy of results.
The study emphasizes, however, that these tools should not completely replace human expertise. Rather, they act as assistants, easing access to information while still requiring occasional verification. In the future, this approach could be extended to other fields where the precision and currency of data are crucial.
Bibliography
Zhang, T., Patterson, L., Webb, B., Jin, Z., & Beiting-Parrish, M. Evaluating generative AI chatbots for large-scale assessment data: comparing LLM-as-a-judge and human ratings. Large-scale Assessments in Education. Springer Science and Business Media LLC. https://doi.org/10.1186/s40536-026-00287-w