Prompts(3): Stop Hand-Writing Prompts: Welcome to the DSPy Era

by JeariCk 8 min read
prompt engineering(3)

If you’re still hand-crafting prompts in 2026, you’re like someone still writing SQL joins by hand in 2022 — yeah, it works, but you can’t compete with people using frameworks.

Not saying hand-writing prompts is wrong. But when your project scales from “ask ChatGPT a question” to “manage 20 prompts across different models while keeping output quality stable,” pure manual work just doesn’t cut it anymore.

That’s where DSPy comes in. It’s not yet another prompt template tool. It turns prompt engineering into a compilation problem.


Three Generations of Prompt Engineering

Let’s draw a timeline so you see where DSPy fits.

Gen 1: The Hand-Craft Era (2022–2024)

Open ChatGPT, type a sentence, check the output. Doesn’t work? Reword it, try again. Repeat until you’re happy.

Back then, this was already called “prompt engineering.” In reality, those hand-written prompts were pure intuition plus trial and error.

Gen 2: The Structured Framework Era (2024–2025)

Frameworks like COSTAR, RTF, CRISPE, and RACE popped up. They told you: structure your prompt by role, task, format, constraints.

Better than raw hand-crafting — at least you had a skeleton. But you were still hand-crafting prompts, relying on personal experience to judge what works.

Gen 3: The Programmatic Optimization Era (2025–2026)

DSPy is the poster child here. The core shift: you don’t write prompts anymore — you write program logic, and the compiler auto-generates and optimizes prompts.

If Gen 1 was the Stone Age and Gen 2 was the Bronze Age, Gen 3 is the Industrial Revolution — from handicraft to assembly line.


What Is DSPy — In One Sentence

DSPy (Declarative Self-improving Python) is a framework from Stanford NLP. The core idea is brutally simple:

You define what to do. DSPy figures out how to prompt for it most effectively.

Traditional prompt workflow looks like this:

```
I need a classifier -> I write a prompt -> test it -> inaccurate -> I tweak the prompt -> test again …
```

Entirely dependent on your intuition about the model. Model changes? Rewrite your prompts.

DSPy workflow looks like this:

```
I need a classifier -> define input/output signature -> give a few labeled examples -> run the compiler -> DSPy auto-finds the optimal prompt
```

Your job shifts from “writing prompts” to “defining task boundaries.” Model swap? Just recompile.

This mindset turns prompt engineering from copywriting into configuration.


Four Core Concepts

DSPy’s architecture isn’t complicated. Four things and you’re ready to go.

Signatures

Instead of hand-writing full prompts, a signature defines “what goes in, what comes out.” Simpler than Python type annotations, but same idea.

```python
import dspy

# Define: input is a question and context, output is an answer
class AnswerQuestion(dspy.Signature):
    """Answer questions based on given context"""

    context: str = dspy.InputField(desc="relevant information sources")
    question: str = dspy.InputField(desc="user's question")
    answer: str = dspy.OutputField(desc="concise and accurate answer")
```

You don’t need to write “Please answer the question based on the following context. Be concise and accurate. No extra information.” DSPy uses these signature descriptions to generate and optimize the prompt itself.
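To make the idea concrete, here is a minimal sketch of how a declarative signature could be turned into prompt text. It is not DSPy’s actual rendering logic: `ToySignature` and `render_prompt` are hypothetical names invented for this illustration, and the real framework lays out its prompts differently.

```python
from dataclasses import dataclass


@dataclass
class ToySignature:
    """Hypothetical stand-in for a signature: a task plus named fields."""
    instructions: str
    inputs: dict   # field name -> description
    outputs: dict  # field name -> description


def render_prompt(sig: ToySignature, **values: str) -> str:
    """Build prompt text from the declaration instead of hand-written prose."""
    missing = [name for name in sig.inputs if name not in values]
    if missing:
        raise ValueError(f"missing input fields: {missing}")
    lines = [sig.instructions, ""]
    for name, desc in sig.inputs.items():
        lines.append(f"{name} ({desc}): {values[name]}")
    for name, desc in sig.outputs.items():
        lines.append(f"{name} ({desc}):")
    return "\n".join(lines)


answer_question = ToySignature(
    instructions="Answer questions based on given context",
    inputs={"context": "relevant information sources", "question": "user's question"},
    outputs={"answer": "concise and accurate answer"},
)

prompt = render_prompt(
    answer_question,
    context="DSPy is a framework from Stanford NLP.",
    question="Who created DSPy?",
)
print(prompt)
```

Because the prompt is derived from the declaration, an optimizer is free to change the wording or add examples without you touching the task definition.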

Modules

Modules are composable prompt components that replace manually written prompts. DSPy ships with several built-in ones:

– `dspy.Predict`: The simplest prediction module

– `dspy.ChainOfThought`: Prediction with chain-of-thought reasoning

– `dspy.ReAct`: Reasoning + action loop (the foundation of agents)

– `dspy.ProgramOfThought`: Reasoning through code

You combine them like building blocks:

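```python
class QAProgram(dspy.Module):
    def __init__(self):
        super().__init__()  # let dspy.Module set up its own bookkeeping
        # Chain-of-thought predictor built from the AnswerQuestion signature
        self.qa = dspy.ChainOfThought(AnswerQuestion)

    def forward(self, context, question):
        return self.qa(context=context, question=question)
```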

Metrics

How do you judge output quality? Define a metric. DSPy doesn’t care how: score with an LLM, use exact match, or write a custom function.

```python
def qa_metric(gold, pred, trace=None):
    # gold is the labeled example, pred is the model output;
    # both need an `answer` field for this built-in check.
    # Returns True when the normalized answers match, otherwise False.
    return dspy.evaluate.answer_exact_match(gold, pred)
```
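Exact match is strict, so a graded score is sometimes more useful. The following is a minimal sketch of a custom metric with the same `(gold, pred, trace=None)` shape, computing token-level F1 in plain Python. The names `normalize` and `f1_metric` are made up for this example, and `SimpleNamespace` objects stand in for real examples and predictions.

```python
import re
import string
from collections import Counter
from types import SimpleNamespace


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_metric(gold, pred, trace=None) -> float:
    """Token-level F1 between gold.answer and pred.answer, in [0, 1]."""
    gold_tokens = normalize(gold.answer)
    pred_tokens = normalize(pred.answer)
    if not gold_tokens or not pred_tokens:
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


gold = SimpleNamespace(answer="Stanford NLP")
exact = SimpleNamespace(answer="stanford nlp.")
partial = SimpleNamespace(answer="The Stanford NLP group")
print(f1_metric(gold, exact))
print(f1_metric(gold, partial))
```

A partially correct answer now earns partial credit instead of a flat zero, which gives an optimizer a smoother signal to climb.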

Optimizers (Compilers)

This is the juiciest part of DSPy. You define signatures, modules, and metrics — the optimizer searches for the optimal prompt strategy automatically, no hand-tuning prompts required. DSPy offers several optimizers:

LabeledFewShot: Picks few-shot examples from labeled data

BootstrapFewShot: Has the model generate its own few-shot examples, then keeps the ones that pass your metric

BootstrapFewShotWithRandomSearch: Runs BootstrapFewShot multiple times, picks the best

MIPROv2: Optimizes both instruction text and few-shot examples simultaneously

GEPA: Has the LLM reflect on which prompt strategies worked and which didn’t, then improves the prompt accordingly

COPRO: Instruction optimization via coordinate ascent (hill climbing on your metric)
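The bootstrapping idea behind several of these optimizers can be sketched in a few lines. This is a toy illustration, not DSPy’s implementation: `toy_teacher` is a hypothetical stand-in for an LLM call, and `bootstrap_demos` is a name invented here.

```python
from types import SimpleNamespace


def toy_teacher(question: str) -> str:
    """Hypothetical stand-in for an LLM: only gets single-digit sums right."""
    a, b = (int(x) for x in question.split("+"))
    return str(a + b) if a < 10 and b < 10 else "unknown"


def exact_match(gold, pred, trace=None) -> bool:
    return gold.answer == pred.answer


def bootstrap_demos(trainset, teacher, metric, max_demos=4):
    """Run the teacher on training inputs and keep outputs the metric accepts."""
    demos = []
    for example in trainset:
        pred = SimpleNamespace(answer=teacher(example.question))
        if metric(example, pred):
            demos.append((example.question, pred.answer))
        if len(demos) >= max_demos:
            break
    return demos


trainset = [
    SimpleNamespace(question="2+3", answer="5"),
    SimpleNamespace(question="40+2", answer="42"),
    SimpleNamespace(question="7+1", answer="8"),
]
demos = bootstrap_demos(trainset, toy_teacher, exact_match, max_demos=2)
print(demos)
```

Only the runs the metric accepts survive as demonstrations, which is why the quality of your metric matters as much as the quality of your data.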

Compilation looks like this:

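```python
from dspy.teleprompt import MIPROv2

teleprompter = MIPROv2(
    metric=qa_metric,
    # Assumption: recent DSPy releases require auto=None before the counts
    # below can be set by hand; omit this line on older versions.
    auto=None,
    num_candidates=10,  # number of candidate instructions / demo sets to propose
    init_temperature=1.0,
)

optimized_program = teleprompter.compile(
    QAProgram(),
    trainset=training_examples,  # your labeled data
    num_trials=30,               # 30 optimization trials
    max_bootstrapped_demos=4,    # max 4 model-generated (bootstrapped) examples
    max_labeled_demos=4,         # max 4 examples taken straight from the trainset
)
```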

After this, `optimized_program` runs with the best prompt configuration the optimizer found under the same signature, replacing the default behavior.


An Example You Can Follow: Classification

Let’s make this concrete with a classification task so you can see what DSPy actually saves you.

Traditional hand-written prompt:

```text
You are a text classification assistant. Your task is to classify the given user feedback into one of the following categories:
[bug, feature_request, question, other]

Output ONLY a single English word as the classification result. No extra explanation.

User feedback: [specific content]
Classification:
```

Switch models or change the task? You start over. Even worse — sometimes changing a single punctuation mark messes with output quality. Zero engineering discipline.

DSPy version:

```python
class FeedbackClassifier(dspy.Signature):
    """Classify user feedback"""

    feedback = dspy.InputField(desc="user feedback text")
    category = dspy.OutputField(desc="classification: bug/feature_request/question/other")

classifier = dspy.ChainOfThought(FeedbackClassifier)

# Compile: `teleprompter` is an optimizer like the MIPROv2 instance shown earlier,
# but built with a metric that compares `category`; `train_data` is your labeled set
optimized = teleprompter.compile(classifier, trainset=train_data)

# Use
result = optimized(feedback="App crashes when I click submit")
print(result.category)
```

No prompts to write. No phrasing to test. Just define it clearly.
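The one thing you still owe the compiler is a metric for the `category` field. Here is a minimal sketch of one, plus a small accuracy loop. `keyword_classifier` is a hypothetical stand-in for the compiled program so the example runs offline, and the dev set is invented for illustration.

```python
from types import SimpleNamespace

LABELS = {"bug", "feature_request", "question", "other"}


def category_metric(gold, pred, trace=None) -> bool:
    """Accept a prediction only if it is a known label and matches gold."""
    predicted = pred.category.strip().lower()
    return predicted in LABELS and predicted == gold.category


def accuracy(classify, devset, metric) -> float:
    """Fraction of dev examples on which the metric accepts the prediction."""
    hits = sum(bool(metric(ex, classify(feedback=ex.feedback))) for ex in devset)
    return hits / len(devset)


def keyword_classifier(feedback: str):
    """Hypothetical stand-in for the compiled program."""
    text = feedback.lower()
    if "crash" in text or "error" in text:
        label = "Bug"  # inconsistent casing on purpose; the metric normalizes it
    elif text.endswith("?"):
        label = "question"
    else:
        label = "other"
    return SimpleNamespace(category=label)


devset = [
    SimpleNamespace(feedback="App crashes when I click submit", category="bug"),
    SimpleNamespace(feedback="How do I export my data?", category="question"),
    SimpleNamespace(feedback="Please add dark mode", category="feature_request"),
    SimpleNamespace(feedback="Love the new design", category="other"),
]
print(accuracy(keyword_classifier, devset, category_metric))
```

Checking membership in `LABELS` also catches the classic failure where the model replies with a full sentence instead of a label.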


Why You Can’t Ignore DSPy in 2026

Cross-Model Migration

This is DSPy’s most practical superpower. You have a system running beautifully on GPT-4o today. Tomorrow you switch to Claude Sonnet. Traditional approach? Rewrite every single prompt.

DSPy’s answer: recompile once. The signatures and data stay — the optimizer auto-discovers the optimal prompt strategy for the new model. No need to rewrite all your prompts manually.
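A toy version of “recompile once” looks like the sketch below. Everything in it is an assumption for illustration: `model_a` and `model_b` are fake models that respond to different phrasing, and `compile_prompt` is a brute-force search over candidate instructions, far simpler than what a real optimizer does.

```python
def compile_prompt(model, candidates, trainset):
    """Score every candidate instruction on the trainset and return the best."""
    def score(instruction):
        return sum(model(instruction, x) == y for x, y in trainset) / len(trainset)

    best = max(candidates, key=score)
    return best, score(best)


# Hypothetical stand-ins for two LLMs that react to different wording.
def model_a(instruction, text):
    return text.upper() if "uppercase" in instruction else text


def model_b(instruction, text):
    return text.upper() if "CAPS" in instruction else text


candidates = [
    "Rewrite the text.",
    "Return the text in uppercase.",
    "Answer in ALL CAPS.",
]
trainset = [("hello", "HELLO"), ("dspy", "DSPY")]

for name, model in [("model_a", model_a), ("model_b", model_b)]:
    best, acc = compile_prompt(model, candidates, trainset)
    print(name, "->", best, acc)
```

The task definition and the data never change between the two runs; only the winning instruction does.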

Automatic Adaptation to Model Updates

Model providers ship updates every month. A hand-crafted prompt that worked yesterday might break today — not because the prompt changed, but because the model’s internal behavior drifted. Your prompts stay brittle without a compiler backing them.

DSPy’s optimizers can be re-run periodically to auto-adapt to drift. It’s like adding adaptive cruise control to your AI system.
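One simple way to decide when to re-run the optimizer is to track your metric against a baseline. The sketch below assumes a fixed dev set and a tolerance you pick yourself. `old_model` and `drifted_model` are hypothetical stand-ins for the same model before and after a provider update.

```python
def evaluate(program, devset, metric) -> float:
    """Average metric score of `program` over (input, gold) pairs."""
    return sum(float(metric(gold, program(x))) for x, gold in devset) / len(devset)


def needs_recompile(baseline: float, current: float, tolerance: float = 0.05) -> bool:
    """True when the score has dropped more than `tolerance` below the baseline."""
    return baseline - current > tolerance


def exact(gold: str, pred: str) -> bool:
    return gold == pred


def old_model(question: str) -> str:
    return str(sum(int(part) for part in question.split("+")))


def drifted_model(question: str) -> str:
    answer = old_model(question)
    # After the update, longer answers come back wrapped in extra prose.
    return answer if len(answer) == 1 else f"The answer is {answer}"


devset = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("1+1", "2")]
baseline = evaluate(old_model, devset, exact)
current = evaluate(drifted_model, devset, exact)
if needs_recompile(baseline, current):
    print(f"score fell from {baseline:.2f} to {current:.2f}: re-run the optimizer")
```

Run a check like this on a schedule or in CI, and recompilation becomes a response to measured drift rather than a hunch.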

Less Grunt Work

A moderately complex AI feature can easily take 20–30 prompt iterations to stabilize by hand. With DSPy, most of your time goes into defining good signatures, preparing labeled data, and choosing metrics. The compiler handles the rest.

Reproducibility and Testability

The biggest pain with hand-written prompts is that you can’t really test them — “good prompt” is subjective. DSPy forces you to define metrics in code, which is essentially adding a test suite to your prompt.


When DSPy Makes Sense

In plain language: not every scenario needs DSPy.

Good fits for DSPy:

– Multi-step reasoning tasks (QA, code generation, document analysis)

– Production systems that need stable output formats

– The same pipeline running across multiple models

– More than 5 prompts — you’re losing your mind maintaining them by hand

Skip DSPy when:

– You’re just chatting with ChatGPT, no automation needed

– One-off task, use and forget

– You just want Claude to polish some copy


DSPy Is Not a Silver Bullet

DSPy isn’t magic. A few things to keep in mind:

You need labeled data: Optimizers need some labeled input-output pairs. No data, the compiler can’t compile.

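Compilation costs money: 30 trials with MIPROv2 can cost anywhere from a few dollars to tens of dollars in API fees, depending on the model, the size of your training set, and the length of your prompts. Compared to the time you’d spend tuning prompts manually, that is usually noise.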

Not for trivial tasks: Something like “translate this to English” — DSPy won’t beat a hand-written prompt by much.

Learning curve: Writing Python classes instead of prompts isn’t trivial for non-programmers.
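For the compilation-cost point above, a back-of-the-envelope estimate is easy to script. Every number here is an illustrative assumption rather than measured DSPy usage, and the formula ignores the extra calls an optimizer spends proposing instructions and bootstrapping examples.

```python
def estimate_cost(trials, examples_per_trial, tokens_per_call, usd_per_million_tokens):
    """Rough spend for an optimizer run: calls x tokens x price."""
    calls = trials * examples_per_trial
    total_tokens = calls * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed figures: 30 trials, 100 examples scored per trial,
# 1,500 tokens per call, $2.50 per million tokens.
cost = estimate_cost(
    trials=30,
    examples_per_trial=100,
    tokens_per_call=1_500,
    usd_per_million_tokens=2.50,
)
print(f"${cost:.2f}")  # prints $11.25
```

Plug in your own model pricing and dataset size before committing to a long optimization run.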


Comparison With Other Frameworks

| Feature | Hand-written Prompt | LangChain | DSPy |
| --- | --- | --- | --- |
| What you write | Natural language | Chained calls | Program signatures |
| Optimization | Manual trial & error | Manual trial & error | Auto-search |
| Cross-model migration | Rewrite | Fine-tune | Recompile |
| Testability | Hard | Okay | Built-in |
| Learning cost | Low | Medium | Medium-High |

Another framework often compared to DSPy is TextGrad, which treats natural-language feedback from an LLM as a “textual gradient” and propagates it back through the pipeline, loosely analogous to backpropagation. DSPy, however, has the more mature ecosystem, documentation, and community support.


Getting Started

You don’t need much time to get into DSPy:

1. `pip install dspy` (not `dsp` on PyPI — that was deprecated long ago)

2. Check the official docs at `dspy.ai`

3. Start with the simplest signature definition

4. Try it on a classification or extraction task you have lying around

5. Experience the joy of *not writing prompts*

In 2026, AI development is no longer about who writes the fanciest prompt. The shift from “copywriting” to “programming” decides whether you’re the one using AI — or being used by it.

If you haven’t tried DSPy yet, it’s time to add it to your toolkit.

