By: Enago

Getting AI Right Starts with Getting Human Oversight Right

In recent years, academic research has undergone a profound transformation, largely driven by the rapid advancement of generative artificial intelligence (AI). From drafting literature reviews to refining final manuscripts, AI-powered tools are now integrated into nearly every stage of the research workflow. Their ability to quickly generate, summarize, and edit text has made them indispensable for researchers striving to keep pace with the accelerating demands of scholarly communication.

But alongside these benefits comes a subtle, growing risk: the temptation to accept AI-generated content without rigorous human review.

Fluent Language, Hidden Flaws

AI language models like ChatGPT and its successors excel at producing grammatically flawless and stylistically polished text. For many researchers—especially those working in a second language—these tools offer valuable assistance throughout the scientific writing process. They reduce the burden of repetitive tasks and allow more focus on scientific discovery.

However, that same fluency can be deceptive. Because AI-generated text often looks professional, it can create a false sense of reliability. When the writing is smooth and confident, researchers may overlook factual inaccuracies or subtle shifts in meaning.

The risks aren’t always obvious. AI mistakes tend to be nuanced: a slightly inaccurate technical term, a modifier that changes a sentence’s meaning, or a confident claim that’s misleading. For instance, an AI might state that a drug “prevents” a disease, when the more accurate phrasing would be “reduces the risk of.” While this might seem minor, such distinctions carry real weight in scientific literature.

Worse still, AI tools are trained on large datasets that may include outdated, biased, or incorrect information. This can lead to "hallucinations": confidently presented but entirely fabricated facts, misinterpreted concepts, or even non-existent references. Because these outputs are so polished, they often slip through unnoticed—even by experienced researchers.

How AI Errors Undermine Good Science

The consequences of unverified AI-generated content are becoming increasingly visible. Journals and preprint platforms are beginning to flag, reject, or retract papers that contain errors introduced by AI tools. These aren’t cases of fraud or data manipulation—they’re the unintended results of misplaced trust.

A recent peer-reviewed study published in Scientific Reports systematically evaluated the prevalence of fabricated bibliographic citations produced by ChatGPT-3.5 and ChatGPT-4. The researchers found that 55% of citations generated by ChatGPT-3.5 and 18% of those from ChatGPT-4 were fabricated—that is, they did not correspond to any real scholarly work. The study also identified a significant rate of substantive errors in the citations that did refer to real papers.
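One practical safeguard against fabricated references is to check every AI-suggested citation against a bibliographic database before submission. The minimal Python sketch below illustrates the idea using Crossref’s public REST API; the helper name doi_exists and the sample DOIs are illustrative rather than taken from the study above, and a failed lookup only means a reference needs manual verification.

import requests

def doi_exists(doi: str) -> bool:
    """Return True if the DOI resolves to a record in Crossref."""
    url = f"https://api.crossref.org/works/{doi}"
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        # Treat network failures as "unverified" rather than fabricated.
        return False
    return response.status_code == 200

# Illustrative DOIs standing in for an AI-drafted reference list.
candidate_dois = [
    "10.1038/s41586-020-2649-2",     # a real, registered DOI
    "10.9999/made-up-ref.2023.001",  # a made-up DOI that should not resolve
]

for doi in candidate_dois:
    verdict = "found in Crossref" if doi_exists(doi) else "NOT FOUND - verify manually"
    print(f"{doi}: {verdict}")

A script like this cannot judge whether a real reference actually supports the claim attached to it; it only filters out citations that do not exist at all, which is precisely the kind of error the study describes.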

Another well-documented example of how subtle shifts to technical terminology—introduced by AI tools—can undermine perceived author expertise is the case of the phrase “vegetative electron microscopy.” This nonsensical term appeared in nearly two dozen published scientific papers after an AI mistranslation jumbled the established phrase “electron microscopy of vegetative structures.” The error, unnoticed by peer reviewers, raised serious concerns about the authors’ subject-matter knowledge and the credibility of the research, despite otherwise solid underlying work. This incident illustrates how even minor shifts in technical language can signal a lack of expertise and lead to questions or outright rejection in academic publishing.

Such incidents are becoming more common. They damage not only individual reputations but also the credibility of scientific communication as a whole. Perhaps most troubling is that these errors are often made in good faith. Researchers may simply not realize that the AI has introduced inaccuracies or misrepresented their intended meaning.

Understanding the Limits of AI: Context and Judgment Matter

At the heart of the issue is a fundamental limitation of current AI models: they don’t understand science—they only predict language. These systems operate by identifying patterns in their training data, not by comprehending the meaning, context, or implications of what they generate.

This means they can’t reliably distinguish between a proven fact and a speculative claim unless explicitly told. They don’t grasp the rationale behind a methodological decision or the weight of a cautiously worded limitation. These are the subtleties that demand human judgment—something no AI can replicate.

Domain expertise, critical thinking, and contextual understanding remain irreplaceable. Only a researcher familiar with the field can ensure that AI-generated language accurately reflects the science it’s meant to support.

As AI becomes increasingly embedded in the academic writing process, it's tempting to view it as a complete solution. However, the example below underscores a critical truth: while AI tools can enhance grammar and polish, they often fall short in detecting logical inconsistencies or contextual ambiguities—areas where human judgment remains irreplaceable.

Consider the following original text, to be copy edited by LLMs and a human editor: “Therefore, it could be possible that the feasibility of a throwing estimation considering individual differences. However, since our evaluations were limited to a specific age group, we have studied the widening of age groups to obtain new insights. As one of our future works, we deal with these limitations.”

Copy-edited text output using DeepSeek

"Therefore, developing throwing performance estimations that account for individual differences may be feasible. However, since our current evaluations were limited to a specific age group, future studies should expand the age range to gain new insights. Addressing these limitations will be a focus of our future research."

Key Improvements:

  • Restructured the first sentence for grammatical correctness and clarity.
  • Changed "could be possible" to "may be feasible" for more precise academic language.
  • Replaced "we have studied" with "future studies should" to better reflect research intentions.
  • Improved the flow between sentences with better transitions.
  • Strengthened the final statement about future work.
  • Maintained consistent academic tone throughout.

This version keeps your original meaning while making the text more polished and suitable for academic publication.

Copy-edited text output using ChatGPT

"Therefore, it may be possible to estimate throwing feasibility while accounting for individual differences. However, since our evaluations were limited to a specific age group, we have explored expanding the age range to gain new insights. Addressing these limitations is one of our future research directions."

How AI Helped—But Only Partly:

Despite its grammatical issues, the intended meaning of the original text seemed to center on evaluating the possibility of a throwing estimation while accounting for individual differences. However, the phrasing “it could be possible that the feasibility…” was logically and stylistically unclear.

Two different AI-generated revisions helped improve grammar and tone, but they also carried over some logical flaws. Phrases like “these limitations” were left ambiguous, and the relationship between the limited age group and the need to widen the age range remained underdeveloped. AI smoothed the surface—but didn’t fix the structure.

Human Review:

Remark 1: The intended meaning of “it could be possible that the feasibility of a throwing estimation” was slightly unclear. We infer that you are referring here to the throwing estimation being possible, and we have revised the text accordingly. 

Remark 2: The meaning of this text is unclear; in what way are the widening of age groups related to the limitation of specific age groups?

"Therefore, throwing estimation considering individual differences may be feasible. However, since our evaluations are limited to a specific age group, we have studied the widening of age groups to obtain new insights. In a future work, we plan to address this limitation."

Where Human Editing Added Real Value:

The human editor stepped in to do what AI couldn’t:

  • Clarified vague phrasing by rewriting “it could be possible that the feasibility…” as “throwing estimation considering individual differences may be feasible.”
  • Resolved a key logical flaw by recognizing that only one limitation (age group restriction) was mentioned and revised “these limitations” to “this limitation.”
  • Inferred contextual meaning by referencing ideas from elsewhere in the manuscript to strengthen argument flow and anticipate reviewer concerns.

This version not only reads better but is also logically sound, contextually appropriate, and publication-ready—thanks to human editorial insight.

Safeguarding Research Integrity: A Human-AI Partnership

The solution isn’t to abandon AI, but to use it responsibly. AI should be treated as an assistant, not an authority. Every output—from individual sentences to full manuscripts—needs careful human review. That includes checking for shifts in meaning, confirming facts and citations, and ensuring that the text faithfully represents the researcher’s intent.
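One simple way to support that review is to compare the researcher’s draft against the AI-edited version word by word, so that every substitution is visible rather than buried in fluent prose. The short Python sketch below is one possible approach using the standard difflib module; the sample sentences are illustrative and echo the “prevents” versus “reduces the risk of” distinction discussed earlier.

import difflib

# A sentence as drafted by the researcher and as rewritten by an AI tool.
original = "The drug reduces the risk of disease progression in most patients."
ai_edited = "The drug prevents disease progression in most patients."

# Word-level diff so subtle changes in technical wording stand out for review.
for token in difflib.ndiff(original.split(), ai_edited.split()):
    if token.startswith(("- ", "+ ")):
        print(token)  # '-' marks wording the AI removed, '+' marks wording it introduced

Here the diff flags that “reduces the risk of” became “prevents”, exactly the kind of meaning shift a subject-matter expert needs to catch before submission.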

Universities, academic journals, and funding organizations are starting to establish policies and guidelines to address the use of AI tools. Many now require authors to disclose AI use and to certify that all content has been reviewed by a human expert. These policies aren’t red tape—they’re essential safeguards for research quality and credibility.

Generative AI is changing the landscape of academic research, offering tools that boost productivity and streamline communication. But these benefits come with responsibilities. Relying on AI without human oversight invites subtle errors that can undermine even the best research—not through carelessness or dishonesty, but through unchecked automation.

The future of science depends on a balanced partnership between AI and human expertise. Researchers must be trained in AI literacy, and institutions should implement AI review checkpoints to critically evaluate AI-assisted work before publication. To protect the integrity of research, human judgment must remain the final checkpoint—ensuring that every claim, every citation, and every conclusion is accurate, intentional, and trustworthy.