
Context Engineering: The Most Underrated Core Skill in AI Programming (Stanford CS146S Deep Dive)

Deep dive into Stanford CS146S Week 3: the paradigm shift from Prompt Engineering to Context Engineering, why Specs are the new source code, and four failure modes of long contexts with practical solutions.

Bruce

Context Engineering · AI Coding · Stanford CS146S · Vibe Coding · Prompt Engineering

AI Guides

2641 words

2026-02-23


This is Part 2 of the “Stanford Vibe Coding Course Deep Dive” series. See the series navigation at the end of this article.

If I had to pick just one week from the entire 10-week CS146S curriculum for a deep dive, I would choose Week 3 without hesitation.

Not because it is the flashiest – that would be Week 8’s “build an app with one sentence.” Not because it is the most hardcore – that would be Week 6’s security deep dive. It is because what Week 3 covers directly determines the ceiling of your AI programming capability.

Context Engineering – you may have just started hearing this term, but it is rapidly replacing Prompt Engineering as the core skill of AI programming. The reason is simple: optimizing a single prompt has already hit its ceiling. What truly determines AI output quality is the overall context you provide.

Writing a good prompt is addition. Doing context engineering well is multiplication.

What Is Context Engineering

Prompt Engineering focuses on “how to ask a question.” Context Engineering focuses on “what kind of world to present to the AI.”

The difference is like:

  • Prompt Engineering = asking a great question in an interview
  • Context Engineering = preparing a comprehensive, well-organized briefing packet for the interviewer, so they already understand the full situation before answering your question

Specifically, context engineering encompasses these dimensions:

  • Information Selection – what to show the AI and what to hide. Example: only load relevant source files, not the entire codebase.
  • Information Organization – how to structure the information. Example: layered docs: design docs -> implementation plans -> specific code.
  • Information Quality – ensure no errors or contradictions in the context. Example: clean up outdated comments and documentation.
  • Information Timing – when to provide what information. Example: give the architecture overview first, then the specific implementation.
  • Tool Configuration – extend the AI's perception through MCP/tools. Example: connect the database schema, API docs, and project management tools.

The StockApp team distilled a brilliant formula from their practice: Good code is a byproduct of good context.

They treated their code repository as a shared workspace between humans and AI, building a layered documentation structure:

docs/designs/    -> Product requirements & high-level goals
docs/plans/      -> Detailed implementation plans
docs/guides/     -> API tutorials
schema.sql       -> Data structure specifications
CLAUDE.md        -> AI-specific guidance
README.md        -> Project overview

Each layer in this structure has a clear audience and purpose: designs are for decision-makers, plans are for executors (including AI), guides are for consumers. And CLAUDE.md is the “user manual” specifically for AI Agents – telling it the project’s conventions, taboos, and preferences.
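The "layered docs" idea can be made concrete with a small sketch. Everything here is hypothetical (the `assemble_context` helper and its fixed ordering are not from the StockApp team); only the layer paths come from the repo layout above. The point is that context is assembled intent-first, and missing layers are simply skipped rather than padded:

```python
# Fixed reading order: broad intent first, concrete detail last
LAYER_ORDER = ["docs/designs", "docs/plans", "docs/guides", "schema.sql", "CLAUDE.md"]

def assemble_context(sources: dict[str, str]) -> str:
    """Concatenate whichever context sources are available, in a stable
    intent-first order; layers that are missing are simply skipped."""
    parts = []
    for layer in LAYER_ORDER:
        if layer in sources:
            parts.append(f"## {layer}\n{sources[layer].strip()}")
    return "\n\n".join(parts)

# Example: only two layers are relevant for this particular task
ctx = assemble_context({
    "CLAUDE.md": "Use structlog, never print.",
    "docs/plans": "Step 1: add the /register endpoint.",
})
```

Keeping the order fixed is a deliberate choice: the model always sees the plan before the project conventions, no matter which subset of layers a given task loads.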

Specs Are the New Source Code

The most important reading material in Week 3 is Specs Are the New Source Code. The core argument is:

In the AI programming era, we have our priorities backwards – meticulously maintaining generated code while treating the specs that guide generation carelessly. It is like “shredding the source code and version-controlling the binaries.”

Think about how precise this analogy is.

In traditional development, source code is the core asset – it precisely defines system behavior. PRDs and design docs are written and forgotten; code is what gets version-controlled, code-reviewed, and tested.

But in the AI programming era, the relationship is inverted:

  • Specs contain the complete intent and values
  • Code is merely a “lossy projection” of the spec – AI generates code from your spec, but this conversion inevitably loses information
  • If the spec is vague, AI fills in the blanks with its own “guesses” – and these guesses may be wildly off from your intent

What does this mean?

1. Specs Need Version Control

Just like code, specs should be managed with Git – with diffs, history, and reviews. Because the spec is your only pathway to making AI understand your intent. If the spec is lost or outdated, you cannot reliably reproduce AI’s output.

2. Spec Quality Directly Determines Code Quality

This is not a linear relationship of “slightly better prompt, slightly better results” – it is exponential. A vague spec produces vague code, vague code produces vague bugs, vague bugs produce even vaguer patches – this is a degradation spiral. Conversely, a precise spec can get AI to generate near-production-quality code on the first try.

3. The PM Role Becomes Unprecedentedly Important

Andrew Ng pointed out an unprecedented trend: some organizations now need twice as many PMs as engineers. When AI accelerates engineering output, the bottleneck shifts from “writing code” to “making decisions and writing clear requirements.”

Sean Grove (OpenAI) put it more directly:

In the near future, the most effective communicator will be the most valuable programmer.

4. The Workflow Is Inverted

Old workflow: Vague idea -> Wireframes -> Design -> MVP -> Feedback -> Revise spec -> Rebuild

New workflow: Vague idea -> Quick prototype -> Feedback -> Clear spec -> AI implementation

Notice the difference: in the new workflow, the prototype is not for delivery – it is for getting rapid feedback to refine the spec. The prototype is a draft of the spec, not a draft of the product.

Four Failure Modes of Long Contexts

“Why not just dump everything into the AI?” – this is the most common context management misconception.

How Long Contexts Fail reveals a counterintuitive truth: longer contexts do not lead to better results. Despite modern models supporting million-token context windows, blindly filling them leads to four types of failure.

Failure Mode 1: Context Poisoning

Once incorrect information enters the context, the AI will repeatedly reference and amplify it.

Google’s Gemini encountered this problem when attempting to play Pokemon: after a piece of false information entered the context, the Agent began “fixating on an unachievable goal,” repeating ineffective actions endlessly.

In AI programming, this means: if your CLAUDE.md contains an outdated rule (like “use React 16 class components”), the AI will faithfully follow this incorrect instruction, even if other context suggests using hooks.

Countermeasure: Regularly audit your context files to ensure no outdated or contradictory information. Maintain context files like you maintain code.
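Part of that audit can be automated. As a minimal sketch (the rule format, the regex, and the `find_stale_version_pins` helper are all assumptions for illustration), a script can cross-check version pins mentioned in CLAUDE.md against what the project actually installs:

```python
import json
import re

def find_stale_version_pins(claude_md: str, package_json: str) -> list[str]:
    """Flag 'use <lib> <major>'-style rules in CLAUDE.md that contradict
    the major version actually declared in package.json."""
    deps = json.loads(package_json).get("dependencies", {})
    warnings = []
    for lib, pinned in re.findall(r"use (\w+) (\d+)", claude_md, re.IGNORECASE):
        installed = deps.get(lib.lower(), "")
        major = re.sub(r"[^\d.]", "", installed).split(".")[0]
        if major and major != pinned:
            warnings.append(f"{lib}: CLAUDE.md says {pinned}, package.json has {major}")
    return warnings

# The exact scenario from the text: a stale "React 16" rule vs. React 18
warnings = find_stale_version_pins(
    "## Rules\n- Use React 16 class components",
    '{"dependencies": {"react": "^18.2.0"}}',
)
```

A check like this will not catch every contradiction, but running it in CI turns "maintain context files like code" from advice into an enforced habit.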

Failure Mode 2: Context Distraction

As context length grows, AI tends to “repeat historical behaviors rather than synthesize new strategies.”

Databricks research shows that model correctness begins to decline significantly when processing contexts exceeding 32K tokens. The model does not “forget” earlier information – it gets distracted by recent information, causing decision quality to degrade.

In AI programming, this manifests as: when you keep appending requirements in a long conversation, the AI may “forget” your early constraints and generate contradictory code.

Countermeasure: Keep conversations focused. One conversation, one task. When you need to switch context, start a new session.

Failure Mode 3: Context Confusion

Too many tool definitions or irrelevant information interfere with the model’s judgment.

Berkeley’s function-calling leaderboard shows: every model’s performance degrades when given more tools. Llama 3.1 8B works fine with 19 tools but starts failing at 46. Even GPT-4 class models are not immune.

This is why Anthropic emphasizes in Writing Effective Tools for Agents: “Fewer, well-designed tools outperform comprehensive API wrappers.” Do not expose every capability to the AI – only give it the tools needed for the current task. For practical experience with tool design, see the MCP Server development tutorial.

Countermeasure: Dynamically load tools and context based on the task. For example, in Claude Code, do not load all available MCP Servers at once.
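One simple way to implement this countermeasure is an explicit task-to-tools mapping, so the agent never sees the full registry. This is a sketch with invented tool and task names, not any particular framework's API:

```python
# Map task categories to the minimal tool set each one needs (names are hypothetical)
TOOL_SETS = {
    "backend": ["db_query", "api_test"],
    "frontend": ["figma_inspect", "browser_preview"],
    "docs": ["markdown_lint"],
}

def tools_for_task(task_kind: str) -> list[str]:
    """Expose only the tools relevant to the current task kind,
    instead of registering every available tool at once."""
    return TOOL_SETS.get(task_kind, [])

backend_tools = tools_for_task("backend")
```

The design choice mirrors the Berkeley finding above: keeping the active tool count small is itself a context optimization, independent of how good each tool is.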

Failure Mode 4: Context Conflict

When information from multiple sources contradicts each other, model performance drops sharply.

A joint study by Microsoft and Salesforce found a striking statistic: providing the same information in stages (first giving partially incorrect answers, then the complete correct information) caused performance to drop by an average of 39%. The early incorrect answers remained in the context, interfering with the final judgment.

In AI programming, this means: if your project’s README says PostgreSQL but docker-compose.yml configures MySQL, the AI may generate inconsistent code or flip-flop between the two databases.

Countermeasure: Ensure consistency across all context sources. When you find contradictions, fix them immediately – do not expect the AI to “figure out which is right.”
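The README-vs-docker-compose example lends itself to a mechanical check. As a rough sketch (the helper names and the tiny database vocabulary are assumptions, and a real check would parse YAML properly), you can scan both sources for database mentions and flag disagreement:

```python
import re

KNOWN_DBS = {"postgresql", "postgres", "mysql", "sqlite", "mariadb"}

def databases_mentioned(text: str) -> set[str]:
    """Extract known database names from a doc or config,
    normalizing 'postgres' to 'postgresql'."""
    found = {w.lower() for w in re.findall(r"[A-Za-z]+", text)} & KNOWN_DBS
    return {"postgresql" if db == "postgres" else db for db in found}

def find_db_conflict(readme: str, compose: str) -> bool:
    """True when the two sources name different databases."""
    a, b = databases_mentioned(readme), databases_mentioned(compose)
    return bool(a and b and a != b)

conflict = find_db_conflict("We use PostgreSQL 15.", "image: mysql:8.0")
```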

Context Rot: Context Decays Too

Chroma’s Context Rot research complements the context failure picture from another angle.

They tested 18 mainstream models (including Claude, GPT, Gemini, Qwen, Llama) and discovered a universal pattern: as input length increases, model performance degrades significantly and consistently, even for extremely simple tasks.

Even more surprising findings:

  1. Structured context actually performs worse than chaotic context. After scrambling the order of context text, all models performed better. This suggests systematic weaknesses in how attention mechanisms process logically coherent long texts.

  2. Lower question-answer relevance leads to faster performance degradation. When questions and answers have low surface similarity (requiring more reasoning), models degrade more severely in long contexts.

  3. A single distractor can significantly reduce accuracy. Even adding just one piece of irrelevant information to the context affects the model.

  4. Claude models tend to refuse answering when uncertain, while GPT models tend to generate confident but incorrect answers. This is an interesting behavioral difference with important practical implications.

The practical takeaway: more context is not better – more precise context is better. Every piece of information you give the AI has a cost – not just in tokens, but in attention.

Anthropic’s Five Principles of Tool Design

Another important reading in Week 3 is Anthropic’s Writing Effective Tools for Agents. Tool design directly affects context quality – poorly designed tools introduce unnecessary confusion and noise for the AI.

Principle 1: Curate, Don’t Dump

Do not wrap every API endpoint as a tool. Identify the core operations the Agent truly needs and consolidate related functions. For example, instead of providing four separate tools – list_events, create_event, invite_attendees, book_room – provide a single schedule_event tool that orchestrates the multi-step operation internally.
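Here is one way that consolidation could look in code. The in-memory `Calendar` class and its behavior are invented for illustration (Anthropic's article describes the principle, not this implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Calendar:
    """A toy in-memory backend standing in for real calendar APIs."""
    events: list[dict] = field(default_factory=list)

    def schedule_event(self, title: str, attendees: list[str], room: str) -> dict:
        """One agent-facing tool orchestrating what would otherwise be four
        separate tools: list_events, create_event, invite_attendees, book_room."""
        # (1) clash check replaces exposing list_events to the agent
        if any(e["title"] == title for e in self.events):
            return {"ok": False, "reason": "duplicate title"}
        # (2)-(4) create, invite, and book happen in a single step
        event = {"title": title, "attendees": attendees, "room": room}
        self.events.append(event)
        return {"ok": True, "event": event}

cal = Calendar()
result = cal.schedule_event("Standup", ["ana@example.com"], "Room 4")
```

The agent now makes one decision ("schedule this event") instead of planning a four-call sequence, which removes three opportunities for it to get the orchestration wrong.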

Principle 2: Consistent Namespacing

When you have many tools, use prefixes to group them: asana_projects_search, asana_users_search, slack_channels_list. This helps AI quickly locate relevant tools and reduces confusion.

Principle 3: Return Semantic Data

Do not return bare IDs – return meaningful names and descriptions. {"user": "John Doe", "role": "admin"} is far more AI-friendly than {"user_id": "a1b2c3", "role_id": 1}. AI needs to understand data, not look up tables.
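In practice this means resolving IDs server-side before the tool returns. A minimal sketch, with made-up lookup tables and helper name:

```python
# Hypothetical lookup tables a tool server would hold internally
USERS = {"a1b2c3": "John Doe"}
ROLES = {1: "admin"}

def get_membership(user_id: str, role_id: int) -> dict:
    """Resolve opaque IDs to names inside the tool, so the model receives
    semantic data instead of keys it cannot interpret."""
    return {
        "user": USERS.get(user_id, user_id),
        "role": ROLES.get(role_id, str(role_id)),
    }

record = get_membership("a1b2c3", 1)
```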

Principle 4: Token Efficiency

Implement pagination, filtering, and truncation. Claude Code limits tool responses to 25,000 tokens by default. If your tool returns 100,000 lines of data at once, the AI will be overwhelmed. Set reasonable default limits for each query.
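A simple truncation guard illustrates the idea. This sketch uses the crude ~4-characters-per-token heuristic and an invented helper; the 25,000-token default mirrors the Claude Code limit mentioned above:

```python
def truncate_to_budget(rows: list[str], max_tokens: int = 25_000) -> dict:
    """Keep appending rows until a rough token budget (~4 chars/token)
    is exhausted, and tell the model how many rows were cut."""
    budget_chars = max_tokens * 4
    kept, used = [], 0
    for row in rows:
        if used + len(row) > budget_chars:
            break
        kept.append(row)
        used += len(row)
    return {"rows": kept, "truncated": len(rows) - len(kept)}

# Three ~12.5K-token rows against a 25K-token budget: the third is dropped
out = truncate_to_budget(["x" * 50_000, "y" * 50_000, "z" * 50_000])
```

Returning the `truncated` count matters: the model learns that data was cut and can ask for the next page, instead of silently reasoning over an incomplete result.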

Principle 5: Tool Descriptions Are a Performance Lever

A tool’s description field has an enormous impact on AI behavior. The Anthropic team found that optimizing tool descriptions alone achieved state-of-the-art results on SWE-bench. This means your tool’s “documentation” is as important as your code.
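To make the lever concrete, compare two definitions of the same hypothetical tool (the schema here is loosely modeled on common function-calling formats; only the description differs):

```python
# Weak: tells the model nothing about limits, ordering, or when to use it
weak_tool = {
    "name": "search_tickets",
    "description": "Search tickets.",
}

# Strong: encodes limits, defaults, and usage guidance in the description
strong_tool = {
    "name": "search_tickets",
    "description": (
        "Search support tickets by free-text query. Returns at most 20 "
        "results, newest first. Pass status='open' to exclude resolved "
        "tickets; prefer this over fetching all tickets and filtering."
    ),
}
```

The second version costs a few dozen extra tokens but preempts entire classes of misuse, which is exactly the kind of cheap optimization the SWE-bench result points at.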

Practical Guide: Implementing Context Engineering in Your Projects

Theory covered – how do you put it into practice? Based on CS146S materials and real-world cases, here is an actionable framework.

Layer 1: Project-Level Context

This is the foundation for all development, typically implemented through configuration files and documentation.

# CLAUDE.md Example Structure

## Project Overview
This is an e-commerce API service built on FastAPI...

## Tech Stack
- Backend: FastAPI + SQLAlchemy
- Database: PostgreSQL 15
- Cache: Redis
- Message Queue: RabbitMQ

## Code Standards
- Use pydantic v2 for data validation
- All API endpoints require type annotations
- Error handling uses custom exception classes

## Do Not
- Use ORM lazy loading
- Write raw SQL in API handlers
- Use print for debugging -- use structlog

Layer 2: Task-Level Context

Provide task-specific context each time you give the AI a task.

# Task: Implement User Registration API

## Relevant Files
- src/models/user.py (User model definition)
- src/schemas/auth.py (Auth-related schemas)
- src/services/email.py (Email sending service)

## Business Rules
- Email must be validated for uniqueness
- Password: minimum 8 characters, including upper/lower case and numbers
- Send verification email upon successful registration
- Use existing EmailService -- do not create a new one

## Reference Implementation
- Similar API pattern can be found in src/api/v1/products.py

Layer 3: Context Hygiene

The most overlooked yet most important layer.

Regular cleanup: Check CLAUDE.md and other context files weekly. Remove outdated information.

Consistency checks: Ensure README, CLAUDE.md, docker-compose.yml, CI configs, and other sources do not contradict each other.

Fresh sessions: When a conversation exceeds 30 turns or 50K tokens, start a new session with clean context.

Dynamic tool loading: Do not enable all MCP Servers at once. Only enable what the current task needs. When writing backend code, you do not need Figma MCP. When doing design work, you do not need database MCP.
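The "fresh sessions" rule of thumb above (30 turns or ~50K tokens) can be enforced with a tiny guard. This is a hypothetical sketch using the rough 4-characters-per-token estimate, not a feature of any particular tool:

```python
class SessionBudget:
    """Track rough conversation size and signal when to start a fresh session."""

    def __init__(self, max_turns: int = 30, max_tokens: int = 50_000):
        self.max_turns, self.max_tokens = max_turns, max_tokens
        self.turns, self.tokens = 0, 0

    def record(self, message: str) -> None:
        self.turns += 1
        self.tokens += len(message) // 4  # crude ~4 chars/token estimate

    def should_restart(self) -> bool:
        return self.turns >= self.max_turns or self.tokens >= self.max_tokens

budget = SessionBudget()
budget.record("x" * 300_000)  # one huge paste blows the token budget immediately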

Layer 4: Feedback-Driven Context Optimization

Anthropic mentions a clever approach in their tool design article: feed AI’s evaluation conversation back to the AI itself to improve tools.

The same approach applies to context management: when AI generates unexpected code, analyze why – is context missing? Contradictory? Overloaded? Then adjust your context strategy accordingly.

Treat context management as a continuously optimizing system, not a one-time configuration.

Cognition (Devin)’s Perspective

The Week 3 guest was Silas Alberti, Head of Research at Cognition (the company behind Devin).

Devin’s Agents 101 documentation provides another perspective on context management – from the viewpoint of an Agent framework designer, what kind of context helps Agents perform best.

Core insights:

  1. Specify “how,” not just “what.” Instead of “implement user authentication,” say “implement user authentication using JWT + refresh token pattern, token expiry 15 minutes, refresh token 7 days.”

  2. Give Agents access to feedback loops. Let Agents run tests, see lint errors, access CI/CD results. This feedback itself is a form of dynamic context – telling the Agent what it got right and what it got wrong.

  3. Set different context strategies for different complexity levels:

    • Simple tasks: Direct description is sufficient
    • Medium tasks (1-6 hours of work): Provide detailed context + expect roughly 80% time savings, reserving time for human polish
    • Complex tasks: Provide context in stages + set multiple checkpoints
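The feedback-loop idea in point 2 can be sketched abstractly. Everything here is invented for illustration (the `feedback_loop` helper and the toy proposer/checker stand in for a real agent and a real test runner); the shape is what matters: failures become dynamic context for the next attempt:

```python
def feedback_loop(propose, run_checks, max_rounds: int = 3) -> tuple[str, list[str]]:
    """Iterate propose -> check -> append failures to context, the way an
    agent with test access would, until checks pass or rounds run out."""
    context: list[str] = []
    code = ""
    for _ in range(max_rounds):
        code = propose(context)
        failures = run_checks(code)
        if not failures:
            break
        context.extend(failures)  # dynamic context: what went wrong and why
    return code, context

# Toy example: the proposer fixes exactly what the checks complained about
propose = lambda ctx: "def add(a, b): return a + b" if ctx else "def add(a, b): return a - b"
run_checks = lambda code: [] if "a + b" in code else ["add(1, 2) returned -1, expected 3"]
code, ctx = feedback_loop(propose, run_checks)
```

In a real setup, `run_checks` would shell out to the test suite or linter; the loop structure stays the same.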

From “Asking a Good Question” to “Building an Information System”

The core insight from CS146S Week 3 can be distilled into one sentence:

Prompt Engineering is a craft. Context Engineering is systems engineering.

A craft relies on talent and experience to ask an exquisite question. Systems engineering relies on architecture and discipline to ensure AI delivers consistently high-quality output across any task.

How this shift impacts different roles:

  • Individual Developer – needs to invest more effort in documentation and context maintenance than in the code itself.
  • Team Lead – needs to establish team-level context management standards (CLAUDE.md templates, doc structures, tool configuration standards).
  • PM/Product Manager – role value increases dramatically; writing clear specs becomes the most critical output.
  • Architect – shifts from designing code architecture to designing information architecture: not just making code maintainable, but making it AI-comprehensible.

When Specs become the new source code and Context becomes the new programming environment, we are witnessing a fundamental restructuring of software engineering. Code generation can be delegated to AI, but context management – deciding what the AI sees, how it sees it, and when – is an irreplaceable human core competency.

At least for now.

Series Navigation

This is Part 2 of the “Stanford Vibe Coding Course Deep Dive” series:

  1. Stanford CS146S Deep Dive (Part 1): How Vibe Coding Became an Academic Discipline
  2. This article: Stanford CS146S Deep Dive (Part 2): Context Engineering (Week 3)
  3. Stanford CS146S Deep Dive (Part 3): Agent Manager (Week 4)
  4. Stanford CS146S Deep Dive (Part 4): Secure Vibe Coding (Week 6-7)
  5. Stanford CS146S Deep Dive (Part 5): From Prototype to Production (Week 8-9)
