Prompts (1): Why They Sometimes Work (and Sometimes Don't) with LLMs

JeariCk

Ever had this happen? Your prompt was crushing it yesterday, you feed it the exact same input today, and the model starts rambling nonsense. It’s not Mercury in retrograde. You just haven’t figured out the thin layer between prompts and LLMs yet.

No fluff here, just prompt engineering that’s been battle-tested through 2026. Why does a prompt affect model output? What’s the underlying logic? And how do you write prompts that are less likely to fall apart?

Prompts Are Not Incantations

Let’s put one common misconception to rest: a prompt is not some mystical phrase that unlocks hidden superpowers when you say it the right way.

A large language model is a big probability prediction system. You throw text at it, and it guesses the next most likely token, one by one, based on patterns it saw during training. So-called prompt engineering isn’t magic — it’s translating your intent into a format the model can parse more easily.

This explains a lot of weird stuff:

– Why does “please be polite” work great on one model but get ignored on another? Because each model’s training data has a different density of politeness instructions.

– Why does adding “think step by step” often improve answers on reasoning tasks? Because it nudges the model toward the step-by-step, chain-of-thought style solutions it saw in training data, so the intermediate steps get written out instead of skipped.

– Why does a prompt that works on one model fail on another? Because models differ wildly in data composition, RLHF preference alignment, and instruction-following capability.

A study from Wharton Generative AI Labs had a sobering conclusion: with the same model, the same question, and the same prompt, results can vary significantly across runs. Aggregate metrics look fine, but individual questions can fail spectacularly.
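One way to see that gap for yourself is to rerun every question several times and compare the headline number with the per-question picture. A minimal sketch, assuming you already have graded pass/fail outcomes; the function name `reliability_report` and the `results` data are made up for illustration and are not taken from the Wharton report.

```python
def reliability_report(results: dict[str, list[bool]]) -> dict[str, float]:
    """Summarize repeated runs. `results` maps a question id to pass/fail per run."""
    # Pass rate of each question across its repeated runs.
    pass_rates = {q: sum(runs) / len(runs) for q, runs in results.items()}
    rates = list(pass_rates.values())
    return {
        # The headline number most dashboards show.
        "aggregate_accuracy": sum(rates) / len(rates),
        # Share of questions answered correctly on every single run.
        "always_correct_share": sum(r == 1.0 for r in rates) / len(rates),
        # Pass rate of the least reliable question.
        "worst_question_rate": min(rates),
    }


# Hypothetical outcomes: 4 questions, 5 runs each.
results = {
    "q1": [True, True, True, True, True],
    "q2": [True, True, True, True, True],
    "q3": [True, True, True, True, False],
    "q4": [False, True, False, False, True],
}
report = reliability_report(results)
print(report)
```

With this made-up data the aggregate lands at roughly 0.8, yet only half of the questions pass on every run. That is the pattern to watch for before trusting an average.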

Prompt Engineering Has Split Into Two Worlds

If you’ve been following this space, the biggest change isn’t some shiny new technique. It’s that “prompt engineering” as a concept has fractured into two completely different paths.

Path One: Everyday Casual Use

This is what most people do — open ChatGPT or Claude, type a sentence, ask what you want to ask. By 2026, frontier models are already pretty strong at understanding everyday conversational language. You don’t need to overthink “will the model get this?” — it mostly will.

In this direction, technique matters less and less every year. Models’ intent understanding improves visibly with each generation. You don’t really need to learn “the right way to ask.”

Path Two: Production-Grade Context Engineering

This is where it’s actually worth spending your time.

If you’re building AI-powered products — an automated ticket system, a code review agent, a document extraction pipeline — what you need is context engineering, not “how to phrase things prettier.”

Andrej Karpathy said it back in mid-2025: the term “prompt engineering” is already misleading. It dramatically undersells what the work actually involves. Real production-grade prompt design includes:

– Role definition: Not “you are an expert with 30 years of experience” boilerplate, but precise tone, level of detail, and behavioral guardrails

– Context management: How much info to give, where to place it, what facts the model can trust

– Output constraints: Length, style, format, and landmines to avoid

– Acceptance criteria: What counts as “done right” and when a retry is needed

– Versioning and testing: Prompts should be version-controlled, A/B tested, and regression-checked like code

If you’re still maintaining prompts in Word or Google Docs, that’s the first thing you need to change.
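Here is roughly what “prompts as code” can look like. This is a sketch under assumptions: `PROMPTS`, `fingerprint`, `run_regression`, and `fake_model` are hypothetical names, the ticket labels are invented, and in a real project the templates would live in files under version control with `call_model` wrapping your actual API client.

```python
import hashlib
from typing import Callable

# Hypothetical templates keyed by name and version.
PROMPTS = {
    "ticket_triage@v1": (
        "Classify the ticket as one of: billing, bug, other.\n"
        "Ticket: {ticket}"
    ),
    "ticket_triage@v2": (
        "Classify the ticket as exactly one label from: billing, bug, other.\n"
        "Reply with the label only.\n"
        "Ticket: {ticket}"
    ),
}


def fingerprint(template: str) -> str:
    """Short content hash, handy for logging which prompt text produced an output."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def run_regression(
    prompt_id: str,
    cases: list[tuple[str, str]],
    call_model: Callable[[str], str],
) -> list[str]:
    """Return the inputs whose output no longer matches the expected label."""
    template = PROMPTS[prompt_id]
    failures = []
    for ticket, expected in cases:
        output = call_model(template.format(ticket=ticket)).strip().lower()
        if output != expected:
            failures.append(ticket)
    return failures


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    ticket_text = prompt.rsplit("Ticket:", 1)[1].lower()
    return "billing" if "invoice" in ticket_text else "bug"


cases = [("My invoice is wrong", "billing"), ("App crashes on login", "bug")]
failures = run_regression("ticket_triage@v2", cases, fake_model)
print(fingerprint(PROMPTS["ticket_triage@v2"]), failures)
```

Run the same cases against every prompt version before shipping it, the way you would run unit tests before merging.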

Battle-Tested Techniques That Actually Work

After countless devs have hammered on these models, here are the techniques that hold up in production environments in 2026:

1. Write Prompts Like Spec Docs

The most reliable approach is to write your prompt as a spec — tell the model directly:

– What to do

– What information it can use

– What it cannot do

– What the output looks like

– How to judge right from wrong

This “contract-style” writing has the best stability.
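The five parts of that contract map neatly onto a small data structure. A minimal sketch, assuming a hypothetical `PromptSpec` class and made-up section titles; the point is only that every prompt is forced to fill in all five slots.

```python
from dataclasses import dataclass


def _bullets(items: list[str]) -> str:
    """Render a list of strings as plain-text bullets."""
    return "\n".join(f"- {item}" for item in items)


@dataclass(frozen=True)
class PromptSpec:
    task: str                        # what to do
    allowed_information: list[str]   # what information it can use
    prohibited: list[str]            # what it cannot do
    output_format: str               # what the output looks like
    acceptance_criteria: list[str]   # how to judge right from wrong

    def render(self) -> str:
        sections = [
            ("Task", self.task),
            ("Information you may use", _bullets(self.allowed_information)),
            ("What you must not do", _bullets(self.prohibited)),
            ("Output format", self.output_format),
            ("Acceptance criteria", _bullets(self.acceptance_criteria)),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Hypothetical example spec.
spec = PromptSpec(
    task="Summarize the attached incident report for an executive audience.",
    allowed_information=["Only the text of the incident report"],
    prohibited=["Do not speculate about root cause", "Do not name individuals"],
    output_format="Three bullet points, each under 25 words.",
    acceptance_criteria=["Every claim is traceable to the report"],
)
print(spec.render())
```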

2. Use Structural Tags

Not Markdown. Not numbered lists. Anthropic officially recommends XML tags for structuring prompts, and Google’s prompt engineering whitepaper explicitly advises the same. The reason is simple: tags help the model more accurately distinguish between instructions and data.

Side-by-side comparison:

```
# Bad
Please summarize the following document, extract key points, then output as a table

# Good
<task>Summarize the following document and extract key points</task>
<format>Table</format>
<document>
[document content]
</document>
```
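If you assemble prompts in code, a tiny helper keeps the tags consistent. A minimal sketch: `tag` and `build_prompt` are hypothetical helper names, and the guard against a stray closing tag is one possible design choice, not something any vendor prescribes.

```python
def tag(name: str, content: str) -> str:
    """Wrap content in <name>...</name>, refusing content that would close the tag early."""
    closing = f"</{name}>"
    if closing in content:
        # Untrusted data containing the closing tag could blur instructions and data.
        raise ValueError(f"content already contains {closing}")
    return f"<{name}>\n{content.strip()}\n{closing}"


def build_prompt(task: str, output_format: str, document: str) -> str:
    """Keep the instruction, the format, and the raw data in separate tagged blocks."""
    return "\n\n".join([
        tag("task", task),
        tag("format", output_format),
        tag("document", document),
    ])


prompt = build_prompt(
    task="Summarize the following document and extract key points",
    output_format="Table",
    document="Q3 revenue rose.",
)
print(prompt)
```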

3. Few-Shot Examples Still Matter

Google’s whitepaper states it plainly: always include few-shot examples; zero-shot is not recommended. Models keep getting better at zero-shot, but a few examples still reduce the chance of output drift.
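Few-shot prompts are easy to build mechanically. A minimal sketch, assuming an “Input:” / “Output:” layout and a hypothetical `few_shot_prompt` helper; the sentiment examples are invented.

```python
def few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Instruction first, then worked examples, then the real input left open for the model."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts += [f"Input: {text}", f"Output: {label}", ""]
    # The final "Output:" is left empty so the model completes it in the same format.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


examples = [
    ("The checkout page is great", "positive"),
    ("The app keeps logging me out", "negative"),
]
prompt = few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "Support answered within a minute",
)
print(prompt)
```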

4. Put the Specific Question After the Data

Give context data first, then the specific question. This ordering adjustment works better than agonizing over phrasing.
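In code, that ordering is just a matter of what you append last. A minimal sketch with a hypothetical `data_then_question` helper; the `index` attribute and tag names are illustrative assumptions.

```python
def data_then_question(documents: list[str], question: str) -> str:
    """Place all context documents first and the specific question at the very end."""
    blocks = [
        f'<document index="{i}">\n{doc.strip()}\n</document>'
        for i, doc in enumerate(documents, start=1)
    ]
    blocks.append(f"<question>\n{question.strip()}\n</question>")
    return "\n\n".join(blocks)


prompt = data_then_question(
    ["Contract A: payment due in 30 days.", "Contract B: payment due in 60 days."],
    "Which contract has the longer payment term?",
)
print(prompt)
```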

5. Split Tasks

If a task requires multiple steps (say, review, translate, then reformat), split it into separate prompts. When each prompt does one thing, failures are easier to isolate and error rates tend to drop.
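A chain like that is a short loop. This is a sketch, not a framework: `run_chain` is a hypothetical helper, `call_model` stands for whatever client you use, and `echo_model` is a fake so the example runs offline.

```python
from typing import Callable


def run_chain(text: str, steps: list[str], call_model: Callable[[str], str]) -> str:
    """Run one single-purpose prompt per step, feeding each output into the next step."""
    current = text
    for instruction in steps:
        current = call_model(f"{instruction}\n\n<input>\n{current}\n</input>")
    return current


calls: list[str] = []


def echo_model(prompt: str) -> str:
    """Fake model: records the prompt it received and returns a placeholder output."""
    calls.append(prompt)
    return f"step{len(calls)} output"


steps = [
    "Review the text for factual errors and return a corrected version.",
    "Translate the text into French.",
    "Reformat the text as a bulleted list.",
]
result = run_chain("draft text", steps, echo_model)
print(result)
```

Because each call sees only one instruction, you can log and test every intermediate output on its own.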

6. Let the Model Help You Improve the Prompt

A handy trick: first ask the model to explain what it thinks it got wrong, then use that explanation to revise the prompt. That’s way more efficient than blindly tweaking your wording over and over.
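The trick can be packaged as a reusable “diagnosis” prompt. A minimal sketch: `diagnosis_prompt`, the tag names, and the wording are all assumptions, so adapt them to your own setup.

```python
def diagnosis_prompt(original_prompt: str, model_output: str, expected_behavior: str) -> str:
    """Ask the model to explain the gap before anyone rewrites the prompt."""
    return "\n\n".join([
        "The prompt below produced an output that missed the mark.",
        f"<prompt>\n{original_prompt}\n</prompt>",
        f"<output>\n{model_output}\n</output>",
        f"<expected>\n{expected_behavior}\n</expected>",
        "Explain which parts of the prompt most likely caused the gap, "
        "then propose a revised prompt.",
    ])


prompt = diagnosis_prompt(
    original_prompt="Summarize this report.",
    model_output="A 900-word essay.",
    expected_behavior="Three bullet points.",
)
print(prompt)
```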

More Prompt Is Not Better Prompt

There’s a trap a lot of people fall into: cramming in context and examples to make the model “understand comprehensively.” The result? Output quality goes *down*.

Prompt bloat is a real problem: irrelevant, redundant information pulls the model’s attention away from the key instructions. Research on long, noisy prompts points the same way: the more junk info in your prompt, the lower the model’s accuracy tends to be.

The rule is clear: keep it lean. Delete everything that’s not relevant to the task.

Long Context Is Eating RAG’s Lunch

This is one of the most underrated but impactful shifts of 2026.

For the last couple of years, RAG (Retrieval-Augmented Generation) was the default play for knowledge tasks. But when effective context windows break past a million tokens, and models actually *use* those tokens well, things change.

For most scenarios — analyzing a company’s earnings report, reviewing a codebase, going through a year’s worth of emails — the approach now is brutally simple: dump the document right into the prompt and let the model figure out what’s important.

That’s way less work than building a vector database plus retrieval pipeline, and the results aren’t worse. Only if you’re dealing with truly massive corpora (like the full SEC EDGAR historical archive) is RAG the right call.

But if your use case is “process a few dozen PDFs,” ask yourself whether you really need RAG.
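A quick back-of-the-envelope check helps with that question. This sketch assumes roughly 4 characters per token, which is only a rough rule of thumb for English text; the window size, the output reserve, and the helper names `estimate_tokens` and `fits_in_context` are hypothetical too. Use your provider’s tokenizer for real numbers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers vary by model and language."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(
    documents: list[str],
    context_window: int = 1_000_000,
    reserve_for_output: int = 20_000,
) -> bool:
    """True if all documents plausibly fit in one prompt with room left for the answer."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= context_window - reserve_for_output


# Hypothetical corpora: 30 short PDFs versus 30 very long ones (as extracted text).
small_corpus = ["x" * 40_000 for _ in range(30)]
large_corpus = ["x" * 200_000 for _ in range(30)]
print(fits_in_context(small_corpus), fits_in_context(large_corpus))
```

If the check passes comfortably, try the dump-it-in approach first and reach for retrieval only when it stops fitting.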

Different Models Have Different Personalities

One lesson we learned in 2026: don’t expect the same prompt to work everywhere.

– GPT-5.5 series: Responds well to structured prompts with explicit output format definitions; most reliable when constraints are clear

– Claude series: XML tags are a natural fit, long context is its strong suit

– Gemini series: Stands out on agent tasks and coding scenarios, but be careful with Flash variants for reasoning tasks

Even different variants of the same model differ significantly. Gemini 3.5 Flash crushes 3.1 Pro on speed, but its reasoning actually regressed — which is exactly why Gemini 3.5 Pro was rushed out in June, to fix that reasoning gap.

Bottom Line

The real relationship between prompts and LLMs, when you cut through everything, is an alignment problem — how to translate your intent into the format the model can execute most effectively. It’s not mysticism. It’s the combined result of information density, training data distribution, and attention mechanisms.

Three things to remember from 2026:

1. Don’t sweat casual use — models are already good enough. Just talk normally.

2. Production is engineering — treat prompts like code. Version control, testing, iteration, none of it’s optional.

3. Long context is the trend — for small to medium knowledge tasks, dump it in. Way simpler than building a RAG pipeline.
