
Hallucination, Fake References: Cautionary Tale About AI-Generated Abstracts

Tripped up by nuanced decision-making, chatbots still produced "average-quality" abstracts
MedpageToday

Artificial intelligence (AI) chatbots generated "average-quality" ophthalmic scientific abstracts but produced an "alarming" rate of fake references, a study of two chatbot programs showed.

About 30% of the references generated by both versions of the ChatGPT chatbot were either fake or nonverifiable, but many appeared similar to real information in the medical literature. Moreover, two different AI chatbot detectors achieved disparate scores for identifying chatbot-generated abstracts.

When given challenge questions for seven ophthalmology subspecialties, the chatbots sometimes stumbled on topics involving nuanced decision-making. For example, the bot could not distinguish between the effectiveness of oral and IV corticosteroids for optic neuritis. In response to a question about antiangiogenic agents for age-related macular degeneration (AMD), one version of the bot proclaimed -- without qualifications -- the superiority of one medication.

"This report calls attention to the pitfalls of using AI chatbots for academic research," concluded Danny A. Mammo, MD, of the Cole Eye Institute at the Cleveland Clinic, and coauthors in JAMA Ophthalmology. "Chatbots may provide a good skeleton or framework for academic endeavors, but generated content must be vetted and verified."

"The ethics of using AI LLMs [large language models] to assist with academic writing will surely be spelled out in future editorials and debates. At the very least, we suggest that scientific journals incorporate questions about the use of AI LLM chatbots in submission managers and remind authors regarding their duty to verify all citations if AI LLM chatbots were used."

The problematic fake references notwithstanding, "both versions of the chatbot generated average-quality abstracts," they concluded.

The study highlights the "serious limitations" of current AI chatbot technology, asserted the authors of an accompanying article in JAMA Ophthalmology. As an example, the faulty response to the optic neuritis question "demonstrates that chatbots are incapable of nuanced decision-making."

"AI chatbots offer great advantages in terms of reduced time and increased efficiency for authors and clearly generate well-written, grammatically correct sentences," wrote Nicholas J. Volpe, MD, and Rukhsana G. Mirza, MD, of the Feinberg School of Medicine at Northwestern University in Chicago. "Much like many other tools that have been introduced into science and scientific writing ... chatbots should be added to the list of technologies and other forms of assistance that scientists and authors have as they prepare to analyze and disseminate information."

"This new and disruptive technology will facilitate efficient creation of text based on author query and information available. However, it is critical to recognize that this information may be dated, biased, or wrong and that the ultimate conclusions generated lack critical thinking and understanding the nuances of scientific decision-making."

Revolutionizing Interaction With Technology

Since its launch in late 2022, the ChatGPT 3.5 AI chatbot has revolutionized the way people interact with technology and opened new possibilities for AI applications, including in healthcare, Mammo and coauthors noted in their introduction. Unique among chatbots, ChatGPT learns from human feedback to improve its responses over time. Version 4.0, released earlier this year, incorporated enhanced capabilities in reasoning, complex instructions, and creativity.

A major pitfall of chatbots is the tendency to generate factual errors from "hallucinations," AI output that deviates from training data, they continued. Current evidence suggests a hallucination rate of 20-25%. Mammo and colleagues evaluated both versions of ChatGPT for generating ophthalmic scientific abstracts. They sought to determine the hallucination rate of the chatbot program and examine the accuracy of two different AI output detectors for evaluating the abstracts.

Investigators developed a challenge question for each of seven ophthalmology subspecialties: comprehensive, retina, glaucoma, cornea, oculoplastics, pediatrics, and neuro-ophthalmology. The questions were input into both versions of ChatGPT to compare the accuracy and quality of abstracts produced by the initial and updated versions of the program. Each abstract was to include 10 references.

Abstract quality was evaluated by means of DISCERN criteria modified for AI (AI-DISCERN), including added criteria specific to AI LLMs, based on published assessments of LLMs. Criteria consisted of clear aims, achieving aims, relevance, clear sources, balance and nonbias, reference to uncertainty, and overall rating. The added criteria consisted of helpfulness, truthfulness, and harmlessness, graded by means of a 5-point Likert scale.
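The scoring arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming each of the 10 criteria (the seven AI-DISCERN items plus the three added ones) is rated from 1 to 5 and the ratings are simply summed to a 50-point total. The criterion keys, the `ai_discern_total` helper, and the example ratings are hypothetical and are not taken from the study.

```python
# Hypothetical criterion keys, based on the criteria listed in the article.
CRITERIA = (
    "clear_aims", "achieving_aims", "relevance", "clear_sources",
    "balance_and_nonbias", "reference_to_uncertainty", "overall_rating",
    "helpfulness", "truthfulness", "harmlessness",
)

# Ten criteria on a 5-point scale give the 50-point maximum.
MAX_SCORE = 5 * len(CRITERIA)


def ai_discern_total(ratings: dict) -> int:
    """Sum 1-5 Likert ratings across all criteria (assumed scoring rule)."""
    missing = [name for name in CRITERIA if name not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for name in CRITERIA:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    return sum(ratings[name] for name in CRITERIA)


# Invented ratings for one abstract: 4 on every criterion except sources.
example = dict.fromkeys(CRITERIA, 4)
example["clear_sources"] = 2
print(ai_discern_total(example), "out of", MAX_SCORE)
```

Under this assumed rule, an abstract that names its sources poorly loses points on only one of the 10 criteria, which may help explain how abstracts with fake references could still be rated as average quality.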

Investigators used two AI output detectors (GPT-2 and Sapling) to assess the abstracts for likelihood of being fake (generated by the chatbot). Studies of GPT-2 showed an area under the receiver operating characteristic curve of 0.94 for detecting chatbot-generated text. The authors could find no comparable studies for the Sapling detector.
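For readers unfamiliar with the metric, the area under the receiver operating characteristic curve can be read as the probability that a randomly chosen AI-generated text receives a higher detector score than a randomly chosen human-written one. The sketch below computes it by pairwise comparison. The scores are invented for illustration and the `auroc` helper is a hypothetical name, not the detectors' own evaluation code.

```python
def auroc(positive_scores, negative_scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    if not positive_scores or not negative_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented detector scores (probability that a text is AI-generated).
ai_written = [0.95, 0.80, 0.65, 0.40]
human_written = [0.50, 0.30, 0.20, 0.10]
print(auroc(ai_written, human_written))
```

A value of 1.0 would mean the detector ranks every AI-generated text above every human-written one, and 0.5 would be no better than chance.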

Key Findings

Scores for abstracts generated by ChatGPT 3.5 averaged 35.9 out of a possible 50 points on AI-DISCERN, increasing slightly to 38.9 with the updated version of the chatbot (P=0.30). Scores for the three added criteria also did not differ significantly between the two versions of the chatbot: 3.36 versus 3.79 for helpfulness, 3.64 versus 3.86 for truthfulness, and 3.57 versus 3.71 for harmlessness.

The mean hallucination rate for the references was 31% with ChatGPT 3.5 and 29% with version 4.0. As an example of a fake reference, the authors cited the following: "Cleary PA, et al Optic Neuritis: a 10-year follow-up study. Arch Ophthalmol 2003; 121(1):47-52." No results were returned from a search of PubMed.
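The hallucination rate reported here is the share of generated references that could not be confirmed. A minimal sketch, assuming each reference has already been checked by hand (for example, against PubMed) and labeled. The `Reference` class, the label values, and the sample references are hypothetical and are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class Reference:
    citation: str
    status: str  # assumed labels: "verified", "fake", or "nonverifiable"


def hallucination_rate(references):
    """Fraction of references labeled fake or nonverifiable."""
    if not references:
        raise ValueError("no references to evaluate")
    flagged = sum(r.status in {"fake", "nonverifiable"} for r in references)
    return flagged / len(references)


# Invented example: 10 references for one abstract, 3 of them unconfirmed.
refs = [Reference(f"placeholder citation {i}", "verified") for i in range(7)]
refs += [
    Reference("placeholder citation 7", "fake"),
    Reference("placeholder citation 8", "fake"),
    Reference("placeholder citation 9", "nonverifiable"),
]
print(f"{hallucination_rate(refs):.0%}")
```

Averaging this per-abstract figure across the seven abstracts would give a mean rate of the kind quoted above, assuming that is how the mean was taken.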

With regard to the performance of the output detectors, a score of 100% meant the text was judged to be generated by AI. The GPT-2 detector produced mean scores of 65.4% for abstracts produced by ChatGPT 3.5 and 10.8% for those generated by version 4.0 (P=0.01). The Sapling detector came up with scores of 69.5% and 42.7% for the two chatbot versions (P=0.17).

"With AI's rapidly changing landscape, AI LLMs have the potential for increased research productivity, creativity, and efficiency," Mammo and coauthors wrote in conclusion. "Nonetheless, the scientific community at large, authors, and publishers must pay close attention to scientific study quality and publishing ethics in the uncharted territory of generative AI."


    Charles Bankhead is senior editor for oncology and also covers urology, dermatology, and ophthalmology. He joined MedPage Today in 2007.

Disclosures

Mammo disclosed relationships with Alimera Sciences and Apellis.

Volpe and Mirza reported no relevant relationships with industry.

Primary Source

JAMA Ophthalmology

Hua HU, et al "Evaluation and comparison of ophthalmic scientific abstracts and references by current artificial intelligence chatbots" JAMA Ophthalmol 2023; DOI: 10.1001/jamaophthalmol.2023.3119.

Secondary Source

JAMA Ophthalmology

Volpe NJ, Mirza RG "Chatbots, artificial intelligence, and the future of scientific reporting" JAMA Ophthalmol 2023; DOI: 10.1001/jamaophthalmol.2023.3344.