Reliability of Large Language Models for Identifying and Classifying Content in Research Articles
DOI: https://doi.org/10.18552/joaw.v15iS2.1129

Keywords: GenAI, large language model, GPT-3.5, GPT-4o, literature review process, content classification

Abstract
GenAI has demonstrated functionality that seems, uncannily, to parallel reading and writing by identifying and reformulating information from source texts and generating novel content and argumentation. These skills are essential yet challenging for many students tasked with producing literature reviews. This study takes the first steps toward investigating the feasibility of a GenAI-facilitated literature review. The investigation starts from the ‘human-in-the-loop’ position that complex processes can be deconstructed and compartmentalized, and that the component functions these processes require can be delegated to machines while humans contribute to, or control, the overall process. We explore the hypothesis that certain functions of the literature review process, such as information extraction and content classification, could be automated. Prompts modeled on recommended practices for research synthesis were designed to identify and classify particular types of content in research articles. Outputs produced by two GenAI models, GPT-3.5 and GPT-4o, were assessed for reliability against a human coder. Overall, the results raise concerns about the models’ performance on this task, cautioning against direct use of GenAI output as learning scaffolding for students developing literature review skills.