Transformational Leadership Training

What AI's Disagreements Reveal About How It Works

What Nine AI Answers to One Obscure History Question Reveal About Language Models, Hallucination and How to Stay in Control

The same prompt sent to nine AI systems produced nine different historical theories not because any model was broken but because each was trained on different data and tuned to present confidence differently.
Fluency and accuracy are independent variables in large language models: a polished, well-structured answer is not evidence that the answer is correct.
AI is reliably good at expanding a search space, not closing it. Use its outputs as leads to verify, not conclusions to trust.

A small school on a remote Japanese island -- 青海島小学校, now preserved as a cultural site called 青海島共和国 on Omi Island in Yamaguchi Prefecture -- holds a striking mystery in its archive. Every year's sixth-grade graduation photo includes both boys and girls. Every year except one: 1946. That cohort graduated with only girls in the picture.

The same question about that photo was submitted to nine AI systems in August 2025 -- Claude, ChatGPT, Perplexity, Grok, Gemini, Copilot, DeepSeek, Qwen and Kimi. What came back was not nine versions of the same answer. It was nine distinct historical theories, weighted differently, landing on different conclusions, with different tones and different levels of confidence. Some were right. Some were plausible. Some were confidently wrong.

That divergence is not a bug. It is the most direct window into how large language models are actually built.

What a Language Model Is Doing When It Answers

No AI system ‘looked up’ the answer to the Omi Island question. There is no database of rural Japanese school photos from 1946. What each system did was generate the most statistically likely continuation of the prompt, given everything it had ever been trained on.

A large language model is trained on an enormous corpus of text: digitized books, websites, academic papers, news archives, government records and countless other documents scraped from the public internet. During training, the model adjusts billions of numerical parameters to predict, at each step, which word or word fragment is most likely to follow the text before it. The model does not store facts the way a database does. It learns relationships, patterns and contextual associations between concepts.

When you ask it a question, it produces tokens -- words or fragments of words -- one at a time, each drawn from the probability distribution the model has learned for what comes next. The answer it gives is one high-probability path through the space of possible continuations, not a record retrieved from storage.
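To make the token-by-token step concrete, here is a minimal sketch of how one next token might be drawn from a probability distribution. The candidate words and their scores are invented for illustration, and the temperature parameter is a common sampling control rather than something the nine systems are known to share; a real model scores tens of thousands of tokens with a neural network.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(candidates, scores, temperature=1.0, rng=random):
    """Pick one candidate at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical scores for continuing "The boys were absent because of ..."
candidates = ["conscription", "evacuation", "repatriation", "illness"]
scores = [2.0, 1.6, 1.1, 0.2]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
for _ in range(5):
    print(sample_next_token(candidates, scores, temperature=1.0, rng=rng))
```

Because the choice is a weighted draw, the same prompt can yield different continuations on different runs, and a less likely word will sometimes be picked over the most likely one.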

This is why the nine answers about the 1946 photo are all plausible. They are all well-formed historical reasoning. But they are not identical because each model was trained on a different corpus, with different weightings, different fine-tuning processes and different instruction-following techniques applied after the initial training was complete.
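A toy example can show why different corpora lead to different answers. The sketch below swaps in a simple word-pair counter for a real neural network, and both mini-corpora are invented; it illustrates only the principle that the most likely continuation depends on what the model was trained on.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Two invented mini-corpora standing in for two different training sets.
corpus_a = [
    "the boys were conscripted for labor",
    "the boys were conscripted in 1945",
    "the boys were evacuated",
]
corpus_b = [
    "the boys were stranded abroad",
    "the boys were stranded in manchuria",
    "the boys were conscripted",
]

model_a = train_bigram(corpus_a)
model_b = train_bigram(corpus_b)

print(most_likely_next(model_a, "were"))
print(most_likely_next(model_b, "were"))
print(most_likely_next(model_a, "photo"))  # a word the corpus never contained
```

Both toy models answer the same prompt fluently, and neither has any way to know which corpus better reflects what happened on the island.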

How the Answers Diverged

Grok gave the most specific answer: the boys were conscripted specifically to build local defense facilities including an air defense monitoring post on a nearby mountain. That is a concrete, localized claim. It may be accurate or it may be a confident confabulation of real wartime patterns applied to a specific place without verified evidence.

DeepSeek offered a different frame entirely: delayed repatriation. The boys of that cohort, it argued, were likely not on the island at all in March 1946. They were stranded abroad with their families as colonial settlers in Manchuria or elsewhere in Japan's empire, waiting for transport ships. This is an interesting theory but impossible to confirm for this specific school without local archival records.

Claude and ChatGPT produced nearly identical responses. Both cited the same Wikipedia sources about wartime mobilization of children and arrived at the same multi-factor explanation (evacuation programs, family displacement, educational disruption, war casualties). The structural similarity suggests these two models may have drawn on a similar distribution of training data and web search results. It also illustrates the concept of model convergence: when multiple systems are trained on similar corpora and optimized toward similar instruction-following objectives, they can produce outputs that are hard to distinguish from each other.

Qwen added a detail that no other model included: it named the school ‘Aoshima Elementary’ -- a different rendering of the Japanese -- and stated the school now uses this history as part of peace education. That claim about the school's current educational mission is either specific local knowledge absorbed from a source the other models did not have or a plausible inference presented as fact.

Perplexity introduced the gender dimension most explicitly: girls were less likely to be pulled into labor-intensive roles, so they remained in school while boys did not. This is consistent with documented gender dynamics in wartime Japan but the framing reflects a particular analytical lens that other models applied less prominently.

What This Tells You About How Models Are Built

Three structural facts about LLMs explain the divergence above.

Training data shapes everything. A model trained on more Japanese-language historical sources will generate different responses about Japanese wartime history than one trained predominantly on English-language secondary sources. A model with access to regional newspapers from Yamaguchi Prefecture in its training corpus could theoretically retrieve specifics about Omi Island. A model without that will generate plausible generalizations instead. You cannot tell from the output alone which is happening.

Models do not cite sources the way a scholar does. When Grok mentioned an air defense post on a mountain near Omi Island, it did not retrieve that fact from a file. It generated text that is consistent with the training patterns it has absorbed. The confidence of the language does not indicate the reliability of the claim. This is one of the most important things to understand about LLM outputs: fluency and accuracy are independent variables.

Instruction tuning and reinforcement learning from human feedback (RLHF) shape the style of certainty. After a model is pre-trained on raw text, developers fine-tune it using human raters who score responses for helpfulness, accuracy and safety. This process teaches the model when to hedge, when to speculate and how to frame uncertainty. The difference between Copilot's emoji-dotted bullet-point format and DeepSeek's long-form narrative is not a difference in underlying knowledge. It is a difference in how each model was trained to present information. Some models are tuned to sound authoritative. Some are tuned to offer caveats. Neither disposition reliably tracks accuracy.

The Risks That Surface Here

The Omi Island question works as a stress test because it is answerable but obscure. The real answer likely lives in local archives, oral histories and possibly a handful of Japanese-language documents that no AI system has reliable access to. That means every confident answer any model gave is, to some degree, an extrapolation.

Hallucination is the term for when a model generates text that is fluent and confident but factually wrong. It is not a malfunction. It is the model doing exactly what it is designed to do -- produce the most probable continuation -- in a domain where the training data is thin or absent. The Grok answer about the air defense post is either accurate local history or a hallucination. A reader without independent access to Yamaguchi historical archives cannot tell which.

Cascading misinformation is the risk that compounds hallucination. If a model generates a plausible but false historical claim and that claim is published, cited or shared, it can enter the information environment and be scraped into future training corpora. Models trained on that data can then reproduce the false claim, now apparently backed by published sources. The broader degradation that comes from training models on machine-generated text is sometimes called ‘model collapse’. It resembles data poisoning, except that no one has to intend it.

Anchoring on fluency is the human cognitive risk. Reading nine confident, well-structured historical analyses creates an impression of authoritative knowledge. The polished prose obscures the fact that none of the nine systems can verify their claims against primary sources. A student doing research, a journalist fact-checking a story or a policy analyst consulting an AI for context is at risk of treating the confidence of the language as evidence of the accuracy of the content.

Model homogeneity is a subtler risk. The near-identical Claude and ChatGPT responses show how this works: when multiple widely used AI systems converge on the same answer, that convergence can feel like independent corroboration when it may actually be the same training distribution producing the same output twice. This is particularly dangerous in domains where one perspective dominates the training corpus.
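A short calculation shows why correlated agreement is worth less than it feels. The sketch below applies Bayes' rule with invented numbers: a 50% prior that a claim is true, an 80% chance a model asserts the claim when it is true and a 40% chance it asserts it anyway when it is false. None of these figures are measurements of any real system.

```python
def posterior_true(prior, p_assert_if_true, p_assert_if_false, independent_sources):
    """Bayes' rule: probability a claim is true given that n independent sources assert it."""
    weight_true = prior * p_assert_if_true ** independent_sources
    weight_false = (1 - prior) * p_assert_if_false ** independent_sources
    return weight_true / (weight_true + weight_false)

# Two models that really are independent count as two sources.
independent = posterior_true(0.5, 0.8, 0.4, independent_sources=2)

# Two models echoing the same training distribution count, at worst, as one.
echoed = posterior_true(0.5, 0.8, 0.4, independent_sources=1)

print(f"two independent models agree: {independent:.2f}")
print(f"two homogeneous models agree: {echoed:.2f}")
```

Under these assumptions, agreement between independent models raises confidence noticeably more than agreement between models that share a training distribution. Real systems sit somewhere between the two extremes, and the reader usually cannot tell where.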

Concrete Guidance for Using AI Responsibly

None of the risks above mean AI systems are useless for research, writing or analysis. They mean the tools require active, critical use rather than passive consumption.

Treat every factual claim as a starting point, not an endpoint. When an AI answer includes a specific claim -- a date, a location, a policy, a statistic -- that claim needs independent verification before it is used consequentially. This is true whether the claim sounds plausible or not.

Use AI answers to generate search terms, not conclusions. The DeepSeek theory about repatriation delays might be a lead for a researcher. It suggests specific archives to consult: post-war repatriation records, colonial settler databases and local school enrollment data from Yamaguchi Prefecture. That is what AI is reliably good at -- expanding the search space. It is not reliably good at closing it.

Compare outputs across models for contested or obscure questions. The divergence across the nine answers is itself informative. When multiple models agree on the broad frame but differ on specifics, the specifics warrant skepticism. When one model introduces a claim no other model has, that claim requires particular scrutiny -- it may be genuinely novel knowledge, or it may be a confident confabulation.
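One way to run that comparison is to tag each model's main claims by hand and then list the claims that only one model made. The sketch below is an illustration, not a tool used for this article: the claim tags are a hand-made simplification of the summaries above, and only the six models whose answers are described there are included.

```python
from collections import defaultdict

def group_claims(answers):
    """Map each claim tag to the sorted list of models that made it."""
    supporters = defaultdict(list)
    for model, claims in answers.items():
        for claim in claims:
            supporters[claim].append(model)
    return {claim: sorted(models) for claim, models in supporters.items()}

def singleton_claims(answers):
    """Claims made by exactly one model: the ones to scrutinize first."""
    return {c: m[0] for c, m in group_claims(answers).items() if len(m) == 1}

# Hand-assigned tags (an assumption of this sketch, not model output).
shared = {"evacuation", "family displacement", "educational disruption", "war casualties"}
answers = {
    "Grok": {"labor on local defense facilities"},
    "DeepSeek": {"delayed repatriation"},
    "Claude": set(shared),
    "ChatGPT": set(shared),
    "Qwen": {"school uses this history for peace education"},
    "Perplexity": {"girls less often pulled into labor"},
}

for claim, model in sorted(singleton_claims(answers).items()):
    print(f"{model}: {claim}")
```

The output is a reading list for verification, not a verdict. A claim that two models share is not thereby confirmed, since both may have drawn on the same sources.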

Notice the formatting signals. Copilot's use of emoji and flagged sections signals a model tuned for casual consumer use. DeepSeek's dense narrative signals a model tuned for analytical depth. Neither format is evidence of accuracy. Recognizing these tuning signatures helps calibrate how to read the output.

Ask the model to explain its uncertainty. Most current models will, if asked directly, identify which parts of their answer are speculative versus well-documented. Prompting for this explicitly -- ‘which parts of your answer are you less confident about?’ -- often surfaces useful caveats that the default response buries.

What the 1946 Photo Actually Teaches

The graduation photo from Omi Island's elementary school is a small, local document from a catastrophic period. The boys who should have been in it were absent for reasons that a handful of elderly residents, local community organizations, government records and a thorough archival search could probably clarify. No AI system has that access.

What AI does have is an enormous amount of general knowledge about what happened to Japanese boys and their families between 1942 and 1947: wartime labor mobilization, conscription, evacuation, overseas colonial displacement and postwar repatriation delays. That knowledge is real and it is useful context. It is why all nine answers, despite their differences, are historically coherent.

The gap between coherent and accurate is where AI literacy lives. A large language model is not a historian. It is a pattern engine trained on the outputs of historians, journalists, teachers and everyone else who has ever written in a language. It produces text that resembles good historical reasoning because it has processed enormous amounts of, hopefully, good historical reasoning. The resemblance is not the same as the thing itself.

Understanding that distinction is the foundation of using these tools well.

#AILiteracy #HowLLMsWork #AIHallucination #CriticalAIThinking #GenerativeAI

P.S. If you are interested in learning more about this history, you may read this collection, or visit the school-turned-museum in Yamaguchi and talk to the locals.
