Trusting ChatGPT? When a Subtle Variation in the Prompt Can Significantly Alter the Results

Research output: Contribution to journal › Article › peer-review

Abstract

How much can we trust highly complex predictive models like ChatGPT? This study tests the hypothesis that subtle changes in prompt structure do not produce significant variations in the sentiment polarity classifications generated by the LLM GPT-4o mini. The model classified 100,000 Spanish-language comments about four Latin American presidents as positive, negative, or neutral 10 times, with a different prompt each time. The experimental methodology included exploratory and confirmatory analyses to identify significant discrepancies among the classifications.
The results reveal that minor modifications to prompts, whether lexical, syntactic, or modal, or even a lack of structure, affect the classifications. At times, the model produced undecided responses that mixed categories, provided unsolicited explanations, or used languages other than Spanish. Chi-square tests confirmed significant differences in most comparisons between prompts; the one exception was a case in which the linguistic structures were similar.
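As an illustration of the kind of Chi-square comparison described above, the following minimal sketch tests whether two prompts yield the same distribution of positive, negative, and neutral labels. The counts and the function names (`chi_square_homogeneity`, `p_value_dof2`) are hypothetical and are not taken from the study, whose exact test setup is not given here. The p-value helper relies on the closed form of the Chi-square upper tail for 2 degrees of freedom, which applies when two prompts are compared across three categories.

```python
import math


def chi_square_homogeneity(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a count table.

    `table` is a list of rows, one per prompt, each holding the label counts
    in the same category order. Assumes every category total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if both prompts shared one label distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_dof2(statistic):
    """Upper-tail p-value for a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# Hypothetical label counts (positive, negative, neutral) for two prompt variants.
counts_prompt_a = [300, 500, 200]
counts_prompt_b = [340, 450, 210]

statistic, dof = chi_square_homogeneity([counts_prompt_a, counts_prompt_b])
assert dof == 2  # the closed-form p-value below is only valid for 2 degrees of freedom
p_value = p_value_dof2(statistic)
print(f"chi2 = {statistic:.3f}, dof = {dof}, p = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) would suggest that the two prompts produce different label distributions. Responses that fall outside the three categories, such as mixed or explained answers, would need to be handled separately before counting.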
These findings challenge the robustness and trustworthiness of large language models (LLMs) for classification tasks, highlighting their vulnerability to variations in instructions. Moreover, a lack of structured grammar in prompts increased the frequency of hallucinations. The discussion underscores that trust in LLMs rests not only on technical performance but also on the social and institutional relationships underpinning their use.
Original language: English
Journal: Journal of Artificial Intelligence and Technology
State: Published - 23 Feb 2026

Keywords

  • ChatGPT
  • Large language models (LLMs)
  • Trust
  • Robustness
  • Sentiment analysis
  • Spanish
