The Prompt That Worked by Accident
Why a duplicated email changed an AI outcome more than the instructions did
Time to Complete: 30 minutes
Format: Single-session lesson with a 5-minute warm-up activity
Who This Is For: This lesson is for anyone whose work depends on an AI system producing a consistent answer from one run to the next. That includes product managers who greenlight AI features without knowing how fragile the underlying prompt is, clinicians and mental health researchers exploring AI-assisted screening tools, customer support leads who deploy chatbots built on a single tested prompt, data scientists running AI benchmarks who assume their results will replicate, and educators teaching prompt engineering who need a concrete example of why small wording choices carry real weight. These roles share one problem: people treat a prompt as a fixed instruction when it behaves more like an unstable variable. This lesson uses one documented case to make that instability visible and memorable.
Real-World Applications
Healthcare-adjacent AI teams building screening tools for crisis indicators face this exact fragility in practice. A prompt engineer working on a suicide risk detection task tested a prompt that accidentally contained a duplicated email message, and that accident raised the F1 score more than any deliberate change to the instructions did. Removing the duplicate later dropped performance again. This is not a hypothetical warning about prompt design. It is a documented result from a real attempt to flag a clinically significant pattern called entrapment in posts written by people in crisis, and it shows why teams cannot assume a prompt's wording is a minor implementation detail.
Lesson Goal
You will examine a documented prompt engineering case study to understand why language models respond unpredictably to small wording changes. You will identify the categories researchers use to describe this instability and apply that understanding to a short comparison exercise. You will leave with a working sense of why prompt testing, not prompt writing alone, determines whether an AI system performs reliably.
The Problem and Its Relevance
Two statements anchor this lesson, and both come directly from documented findings.
First, a change as small as extra spacing, different capitalization or a swapped synonym can shift a large language model's accuracy on a task from near zero to above 80 percent. This was measured directly on LLaMA2-7B and shows that prompt wording is not cosmetic. It is a performance variable in its own right.
Second, an experienced prompt engineer spent 47 documented development steps and roughly 20 hours building a single prompt for one task, only to discover that an unrelated automated optimization method beat his best result without using any of the contextual material he relied on. Expertise narrowed the search space, but it did not guarantee the best outcome. The system that won did not need the human reasoning the expert assumed was essential.
Core Concepts: How Wording Becomes a Variable
Prompt sensitivity describes how a model's output can shift dramatically based on details that look irrelevant to a human reader. Researchers group these shifts into a few recognizable categories. Small surface changes such as spacing or capitalization fall into one category. Task format, meaning whether a question is phrased as a classification choice or a yes-or-no question, falls into another, and this category alone has been shown to swing GPT-3 accuracy by up to 30 percent on the same underlying task.
A third category is prompt drift, where the same prompt produces different results simply because the model behind an API changed over time. None of these categories requires the task itself to change. In the first two, the instructions stay logically equivalent while the wording shifts, and the output shifts with it. In the third, the prompt does not change at all, and the output shifts anyway.
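The first two categories can be sketched in a few lines of Python. This is only an illustration: the function names, the instruction text and the specific perturbations are assumptions made for this lesson, not the variants used in the research described above.

```python
def surface_variants(instruction: str) -> dict[str, str]:
    """Return versions that change only surface form, not meaning."""
    return {
        "original": instruction,
        "extra_spacing": instruction.replace(" ", "  "),  # double every space
        "upper_case": instruction.upper(),
        "trailing_newline": instruction + "\n",
    }


def format_variants(post: str) -> dict[str, str]:
    """Phrase the same judgment as a label choice or a yes-or-no question."""
    return {
        "classification": (
            f"Post: {post}\n"
            "Label this post as 'entrapment' or 'not entrapment'."
        ),
        "yes_no": (
            f"Post: {post}\n"
            "Does this post express entrapment? Answer yes or no."
        ),
    }


# Hypothetical instruction text, used only to show the variants side by side.
instruction = "Decide whether the post expresses entrapment."
for name, text in surface_variants(instruction).items():
    print(f"{name}: {text!r}")
```

Every variant asks for the same judgment. Prompt drift needs no generator at all, because the prompt stays fixed and only the model behind it changes.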
The entrapment case study makes this concrete. A prompt engineer pasted in a background email twice by accident while building a prompt to detect entrapment in posts from a suicide support forum. That duplication became the single highest-leverage change in 47 attempts, raising the F1 score during one stage of development. When he deliberately removed the duplicate later to clean up the prompt, performance dropped again. Neither he nor the documentation could explain why repetition mattered more than content.
This connects directly to the two statements above. If small formatting choices can swing accuracy by roughly 80 percentage points, and if a careful expert can spend 20 hours without finding the strongest configuration, then prompt engineering cannot be treated as a one-time writing task. It has to be treated as an empirical process that requires testing, measurement and humility about why a result occurred.
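A minimal sketch of what treating a prompt as an empirical process can look like: score every wording on the same labeled examples and compare. Everything here is illustrative. The posts and labels are invented, and toy_classify is a hypothetical stand-in for a model call, written to be deliberately brittle so the effect is visible. Its scores say nothing about any real model.

```python
from typing import Callable


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the positive class (1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_variants(
    templates: dict[str, str],
    posts: list[str],
    labels: list[int],
    classify: Callable[[str], int],
) -> dict[str, float]:
    """Score each prompt template on the same labeled posts."""
    scores = {}
    for name, template in templates.items():
        preds = [classify(template.format(post=post)) for post in posts]
        scores[name] = f1_score(labels, preds)
    return scores


def toy_classify(prompt: str) -> int:
    """Hypothetical stand-in for a model call; swap in a real client to test.

    Brittle on purpose: it only responds when the task word appears in
    lowercase, mimicking sensitivity to capitalization.
    """
    understands_task = "entrapment" in prompt
    looks_trapped = "no way out" in prompt or "no escape" in prompt
    return int(understands_task and looks_trapped)


# Invented examples for illustration only.
posts = [
    "I feel stuck with no way out",
    "Had a nice walk today",
    "There is no escape from this",
    "Work was busy",
]
labels = [1, 0, 1, 0]
templates = {
    "plain": "Does this post express entrapment? {post}",
    "upper": "DOES THIS POST EXPRESS ENTRAPMENT? {post}",
}
print(evaluate_variants(templates, posts, labels, toy_classify))
```

The design choice that matters is holding the posts and labels fixed while only the template changes, so any difference in score can be attributed to wording.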
In-Session Activity (15 minutes)
Form pairs or small groups. Each group receives the same short classification task description used in the case study. The task asks whether a written post expresses entrapment, defined as the feeling of being trapped with no way out. Each group writes two versions of a one-line instruction asking an AI model to make that judgment. The first version should be as plain and literal as possible. The second version should add one structural change only, such as reordering the instruction and the text, changing the question into a yes-or-no format, or including one piece of context twice.
Groups then compare their two versions side by side and predict, without testing on a live model, which version they expect to perform less consistently and why. The goal is not to find a correct answer. The goal is to practice noticing the kind of small structural choice that the case study showed can swing results far more than intended.
Discussion Questions (5 minutes)
Why might repeating the same piece of context twice change a model's output more than changing the wording of an instruction?
If a 20-hour manual prompt engineering effort can be outperformed by a 16-iteration automated method with no domain context, what does that suggest about how teams should allocate time between human refinement and automated testing?
Where in your own work might a prompt be running in production without anyone having tested an alternate phrasing of the same instruction?
The Bottom Line
Two closing statements carry this lesson forward.
First, an automated tool that ran 16 iterations without any human-written context outperformed a human expert who spent 20 hours and had access to a professor's email, a clinical definition and direct domain guidance. This does not mean automation replaces expertise. It means the value of human context in a prompt is not guaranteed just because it feels relevant to the person writing it.
Second, the same paper that produced these findings also warns that F1 scores changed by as much as 0.04 between identical runs at zero temperature. Some of the differences researchers chase are smaller than the noise in their own measurement process. Anyone treating a single benchmark number as proof of a prompt's quality is building confidence on a foundation narrower than they realize.
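One way to act on this warning is to rerun each prompt several times and check whether the gap between two prompts is larger than the run-to-run spread. The sketch below is a crude heuristic, not a statistical test. The function names are assumptions, and the scores are made-up numbers chosen to show both outcomes.

```python
from statistics import mean


def run_spread(scores: list[float]) -> float:
    """Largest difference between repeated runs of the same prompt."""
    return max(scores) - min(scores)


def gap_exceeds_noise(scores_a: list[float], scores_b: list[float]) -> bool:
    """True when the mean gap between two prompts is larger than the
    worst run-to-run spread seen for either prompt."""
    gap = abs(mean(scores_a) - mean(scores_b))
    noise = max(run_spread(scores_a), run_spread(scores_b))
    return gap > noise


# Hypothetical F1 scores from three repeated runs of each prompt.
prompt_a = [0.50, 0.53, 0.51]
prompt_b = [0.52, 0.54, 0.50]
prompt_c = [0.60, 0.62, 0.61]

print(gap_exceeds_noise(prompt_a, prompt_b))  # gap is smaller than the spread
print(gap_exceeds_noise(prompt_a, prompt_c))  # gap is larger than the spread
```

With these invented numbers, the difference between prompt_a and prompt_b sits inside the noise, while prompt_c is separated from prompt_a by more than either prompt varies on its own.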
#PromptEngineering #PromptSensitivity #LLMReliability #AIBenchmarking #PromptDrift