
Context Engineering: The Most Underrated Core Skill in AI Programming (Stanford CS146S Deep Dive)

Deep dive into Stanford CS146S Week 3: the paradigm shift from Prompt Engineering to Context Engineering, why Specs are the new source code, and four failure modes of long contexts with practical solutions.

Bruce

Context Engineering · AI Coding · Stanford CS146S · Vibe Coding · Prompt Engineering

AI Guides

2641 words

2026-02-23


This is Part 2 of the “Stanford Vibe Coding Course Deep Dive” series. See the series navigation at the end of this article.

If I had to pick just one week from the entire 10-week CS146S curriculum for a deep dive, I would choose Week 3 without hesitation.

Not because it is the flashiest – that would be Week 8’s “build an app with one sentence.” Not because it is the most hardcore – that would be Week 6’s security deep dive. It is because what Week 3 covers directly determines the ceiling of your AI programming capability.

Context Engineering – you may have just started hearing this term, but it is rapidly replacing Prompt Engineering as the core skill of AI programming. The reason is simple: optimizing a single prompt has already hit its ceiling. What truly determines AI output quality is the overall context you provide.

Writing a good prompt is addition. Doing context engineering well is multiplication.

What Is Context Engineering

Prompt Engineering focuses on “how to ask a question.” Context Engineering focuses on “what kind of world to present to the AI.”

The difference is like:

  • Prompt Engineering = asking a great question in an interview
  • Context Engineering = preparing a comprehensive, well-organized briefing packet for the interviewer, so they already understand the full situation before answering your question

Specifically, context engineering encompasses these dimensions:

  • Information Selection – what to show the AI and what to hide. Example: only load relevant source files, not the entire codebase.
  • Information Organization – how to structure the information. Example: layered docs: design docs -> implementation plans -> specific code.
  • Information Quality – ensure no errors or contradictions in the context. Example: clean up outdated comments and documentation.
  • Information Timing – when to provide what information. Example: give the architecture overview first, then the specific implementation.
  • Tool Configuration – extend the AI's perception through MCP/tools. Example: connect the database schema, API docs, and project management tools.

The StockApp team distilled a brilliant formula from their practice: Good code is a byproduct of good context.

They treated their code repository as a shared workspace between humans and AI, building a layered documentation structure:

docs/designs/    -> Product requirements & high-level goals
docs/plans/      -> Detailed implementation plans
docs/guides/     -> API tutorials
schema.sql       -> Data structure specifications
CLAUDE.md        -> AI-specific guidance
README.md        -> Project overview

Each layer in this structure has a clear audience and purpose: designs are for decision-makers, plans are for executors (including AI), guides are for consumers. And CLAUDE.md is the “user manual” specifically for AI Agents – telling it the project’s conventions, taboos, and preferences.
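The "layered docs" idea can be made concrete with a small sketch. Everything here is hypothetical (the `assemble_context` helper and its fixed ordering are not from the StockApp team); only the layer paths come from the repo layout above. The point is that context is assembled intent-first, and missing layers are simply skipped rather than padded:

```python
# Fixed reading order: broad intent first, concrete detail last
LAYER_ORDER = ["docs/designs", "docs/plans", "docs/guides", "schema.sql", "CLAUDE.md"]

def assemble_context(sources: dict[str, str]) -> str:
    """Concatenate whichever context sources are available, in a stable
    intent-first order; layers that are missing are simply skipped."""
    parts = []
    for layer in LAYER_ORDER:
        if layer in sources:
            parts.append(f"## {layer}\n{sources[layer].strip()}")
    return "\n\n".join(parts)

# Example: only two layers are relevant for this particular task
ctx = assemble_context({
    "CLAUDE.md": "Use structlog, never print.",
    "docs/plans": "Step 1: add the /register endpoint.",
})
```

Keeping the order fixed is a deliberate choice: the model always sees the plan before the project conventions, no matter which subset of layers a given task loads.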

Specs Are the New Source Code

The most important reading material in Week 3 is Specs Are the New Source Code. The core argument is:

In the AI programming era, we have our priorities backwards – meticulously maintaining generated code while treating the specs that guide generation carelessly. It is like “shredding the source code and version-controlling the binaries.”

Think about how precise this analogy is.

In traditional development, source code is the core asset – it precisely defines system behavior. PRDs and design docs are written and forgotten; code is what gets version-controlled, code-reviewed, and tested.

But in the AI programming era, the relationship is inverted:

  • Specs contain the complete intent and values
  • Code is merely a “lossy projection” of the spec – AI generates code from your spec, but this conversion inevitably loses information
  • If the spec is vague, AI fills in the blanks with its own “guesses” – and these guesses may be wildly off from your intent

What does this mean?

1. Specs Need Version Control

Just like code, specs should be managed with Git – with diffs, history, and reviews. Because the spec is your only pathway to making AI understand your intent. If the spec is lost or outdated, you cannot reliably reproduce AI’s output.

2. Spec Quality Directly Determines Code Quality

This is not a linear relationship of “slightly better prompt, slightly better results” – it is exponential. A vague spec produces vague code, vague code produces vague bugs, vague bugs produce even vaguer patches – this is a degradation spiral. Conversely, a precise spec can get AI to generate near-production-quality code on the first try.

3. The PM Role Becomes Unprecedentedly Important

Andrew Ng pointed out an unprecedented trend: some organizations now need twice as many PMs as engineers. When AI accelerates engineering output, the bottleneck shifts from “writing code” to “making decisions and writing clear requirements.”

Sean Grove (OpenAI) put it more directly:

In the near future, the most effective communicator will be the most valuable programmer.

4. The Workflow Is Inverted

Old workflow: Vague idea -> Wireframes -> Design -> MVP -> Feedback -> Revise spec -> Rebuild

New workflow: Vague idea -> Quick prototype -> Feedback -> Clear spec -> AI implementation

Notice the difference: in the new workflow, the prototype is not for delivery – it is for getting rapid feedback to refine the spec. The prototype is a draft of the spec, not a draft of the product.

Four Failure Modes of Long Contexts

“Why not just dump everything into the AI?” – this is the most common context management misconception.

How Long Contexts Fail reveals a counterintuitive truth: longer contexts do not lead to better results. Despite modern models supporting million-token context windows, blindly filling them leads to four types of failure.

Failure Mode 1: Context Poisoning

Once incorrect information enters the context, the AI will repeatedly reference and amplify it.

Google’s Gemini encountered this problem when attempting to play Pokemon: after a piece of false information entered the context, the Agent began “fixating on an unachievable goal,” repeating ineffective actions endlessly.

In AI programming, this means: if your CLAUDE.md contains an outdated rule (like “use React 16 class components”), the AI will faithfully follow this incorrect instruction, even if other context suggests using hooks.

Countermeasure: Regularly audit your context files to ensure no outdated or contradictory information. Maintain context files like you maintain code.
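Part of that audit can be automated. As a minimal sketch (the rule format, the regex, and the `find_stale_version_pins` helper are all assumptions for illustration), a script can cross-check version pins mentioned in CLAUDE.md against what the project actually installs:

```python
import json
import re

def find_stale_version_pins(claude_md: str, package_json: str) -> list[str]:
    """Flag 'use <lib> <major>'-style rules in CLAUDE.md that contradict
    the major version actually declared in package.json."""
    deps = json.loads(package_json).get("dependencies", {})
    warnings = []
    for lib, pinned in re.findall(r"use (\w+) (\d+)", claude_md, re.IGNORECASE):
        installed = deps.get(lib.lower(), "")
        major = re.sub(r"[^\d.]", "", installed).split(".")[0]
        if major and major != pinned:
            warnings.append(f"{lib}: CLAUDE.md says {pinned}, package.json has {major}")
    return warnings

# The exact scenario from the text: a stale "React 16" rule vs. React 18
warnings = find_stale_version_pins(
    "## Rules\n- Use React 16 class components",
    '{"dependencies": {"react": "^18.2.0"}}',
)
```

A check like this will not catch every contradiction, but running it in CI turns "maintain context files like code" from advice into an enforced habit.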

Failure Mode 2: Context Distraction

As context length grows, AI tends to “repeat historical behaviors rather than synthesize new strategies.”

Databricks research shows that model correctness begins to decline significantly when processing contexts exceeding 32K tokens. The model does not “forget” earlier information – it gets distracted by recent information, causing decision quality to degrade.

In AI programming, this manifests as: when you keep appending requirements in a long conversation, the AI may “forget” your early constraints and generate contradictory code.

Countermeasure: Keep conversations focused. One conversation, one task. When you need to switch context, start a new session.

Failure Mode 3: Context Confusion

Too many tool definitions or irrelevant information interfere with the model’s judgment.

Berkeley’s function-calling leaderboard shows: every model’s performance degrades when given more tools. Llama 3.1 8B works fine with 19 tools but starts failing at 46. Even GPT-4 class models are not immune.

This is why Anthropic emphasizes in Writing Effective Tools for Agents: “Fewer, well-designed tools outperform comprehensive API wrappers.” Do not expose every capability to the AI – only give it the tools needed for the current task. For practical experience with tool design, see the MCP Server development tutorial.

Countermeasure: Dynamically load tools and context based on the task. For example, in Claude Code, do not load all available MCP Servers at once.
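One simple way to implement this countermeasure is an explicit task-to-tools mapping, so the agent never sees the full registry. This is a sketch with invented tool and task names, not any particular framework's API:

```python
# Map task categories to the minimal tool set each one needs (names are hypothetical)
TOOL_SETS = {
    "backend": ["db_query", "api_test"],
    "frontend": ["figma_inspect", "browser_preview"],
    "docs": ["markdown_lint"],
}

def tools_for_task(task_kind: str) -> list[str]:
    """Expose only the tools relevant to the current task kind,
    instead of registering every available tool at once."""
    return TOOL_SETS.get(task_kind, [])

backend_tools = tools_for_task("backend")
```

The design choice mirrors the Berkeley finding above: keeping the active tool count small is itself a context optimization, independent of how good each tool is.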

Failure Mode 4: Context Conflict

When information from multiple sources contradicts each other, model performance drops sharply.

A joint study by Microsoft and Salesforce found a striking statistic: providing the same information in stages (first giving partially incorrect answers, then the complete correct information) caused performance to drop by an average of 39%. The early incorrect answers remained in the context, interfering with the final judgment.

In AI programming, this means: if your project’s README says PostgreSQL but docker-compose.yml configures MySQL, the AI may generate inconsistent code or flip-flop between the two databases.

Countermeasure: Ensure consistency across all context sources. When you find contradictions, fix them immediately – do not expect the AI to “figure out which is right.”
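The README-vs-docker-compose example lends itself to a mechanical check. As a rough sketch (the helper names and the tiny database vocabulary are assumptions, and a real check would parse YAML properly), you can scan both sources for database mentions and flag disagreement:

```python
import re

KNOWN_DBS = {"postgresql", "postgres", "mysql", "sqlite", "mariadb"}

def databases_mentioned(text: str) -> set[str]:
    """Extract known database names from a doc or config,
    normalizing 'postgres' to 'postgresql'."""
    found = {w.lower() for w in re.findall(r"[A-Za-z]+", text)} & KNOWN_DBS
    return {"postgresql" if db == "postgres" else db for db in found}

def find_db_conflict(readme: str, compose: str) -> bool:
    """True when the two sources name different databases."""
    a, b = databases_mentioned(readme), databases_mentioned(compose)
    return bool(a and b and a != b)

conflict = find_db_conflict("We use PostgreSQL 15.", "image: mysql:8.0")
```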

Context Rot: Context Decays Too

Chroma’s Context Rot research complements the context failure picture from another angle.

They tested 18 mainstream models (including Claude, GPT, Gemini, Qwen, Llama) and discovered a universal pattern: as input length increases, model performance degrades significantly and consistently, even for extremely simple tasks.

Even more surprising findings:

  1. Structured context actually performs worse than chaotic context. After scrambling the order of context text, all models performed better. This suggests systematic weaknesses in how attention mechanisms process logically coherent long texts.

  2. Lower question-answer relevance leads to faster performance degradation. When questions and answers have low surface similarity (requiring more reasoning), models degrade more severely in long contexts.

  3. A single distractor can significantly reduce accuracy. Even adding just one piece of irrelevant information to the context affects the model.

  4. Claude models tend to refuse answering when uncertain, while GPT models tend to generate confident but incorrect answers. This is an interesting behavioral difference with important practical implications.

The practical takeaway: more context is not better – more precise context is better. Every piece of information you give the AI has a cost – not just in tokens, but in attention.

Anthropic’s Five Principles of Tool Design

Another important reading in Week 3 is Anthropic’s Writing Effective Tools for Agents. Tool design directly affects context quality – poorly designed tools introduce unnecessary confusion and noise for the AI.

Principle 1: Curate, Don’t Dump

Do not wrap every API endpoint as a tool. Identify the core operations the Agent truly needs and consolidate related functions. For example, instead of providing four separate tools – list_events, create_event, invite_attendees, book_room – provide a single schedule_event tool that orchestrates the multi-step operation internally.
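Here is one way that consolidation could look in code. The in-memory `Calendar` class and its behavior are invented for illustration (Anthropic's article describes the principle, not this implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Calendar:
    """A toy in-memory backend standing in for real calendar APIs."""
    events: list[dict] = field(default_factory=list)

    def schedule_event(self, title: str, attendees: list[str], room: str) -> dict:
        """One agent-facing tool orchestrating what would otherwise be four
        separate tools: list_events, create_event, invite_attendees, book_room."""
        # (1) clash check replaces exposing list_events to the agent
        if any(e["title"] == title for e in self.events):
            return {"ok": False, "reason": "duplicate title"}
        # (2)-(4) create, invite, and book happen in a single step
        event = {"title": title, "attendees": attendees, "room": room}
        self.events.append(event)
        return {"ok": True, "event": event}

cal = Calendar()
result = cal.schedule_event("Standup", ["ana@example.com"], "Room 4")
```

The agent now makes one decision ("schedule this event") instead of planning a four-call sequence, which removes three opportunities for it to get the orchestration wrong.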

Principle 2: Consistent Namespacing

When you have many tools, use prefixes to group them: asana_projects_search, asana_users_search, slack_channels_list. This helps AI quickly locate relevant tools and reduces confusion.

Principle 3: Return Semantic Data

Do not return bare IDs – return meaningful names and descriptions. {"user": "John Doe", "role": "admin"} is far more AI-friendly than {"user_id": "a1b2c3", "role_id": 1}. AI needs to understand data, not look up tables.
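In practice this means resolving IDs server-side before the tool returns. A minimal sketch, with made-up lookup tables and helper name:

```python
# Hypothetical lookup tables a tool server would hold internally
USERS = {"a1b2c3": "John Doe"}
ROLES = {1: "admin"}

def get_membership(user_id: str, role_id: int) -> dict:
    """Resolve opaque IDs to names inside the tool, so the model receives
    semantic data instead of keys it cannot interpret."""
    return {
        "user": USERS.get(user_id, user_id),
        "role": ROLES.get(role_id, str(role_id)),
    }

record = get_membership("a1b2c3", 1)
```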

Principle 4: Token Efficiency

Implement pagination, filtering, and truncation. Claude Code limits tool responses to 25,000 tokens by default. If your tool returns 100,000 lines of data at once, the AI will be overwhelmed. Set reasonable default limits for each query.
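A simple truncation guard illustrates the idea. This sketch uses the crude ~4-characters-per-token heuristic and an invented helper; the 25,000-token default mirrors the Claude Code limit mentioned above:

```python
def truncate_to_budget(rows: list[str], max_tokens: int = 25_000) -> dict:
    """Keep appending rows until a rough token budget (~4 chars/token)
    is exhausted, and tell the model how many rows were cut."""
    budget_chars = max_tokens * 4
    kept, used = [], 0
    for row in rows:
        if used + len(row) > budget_chars:
            break
        kept.append(row)
        used += len(row)
    return {"rows": kept, "truncated": len(rows) - len(kept)}

# Three ~12.5K-token rows against a 25K-token budget: the third is dropped
out = truncate_to_budget(["x" * 50_000, "y" * 50_000, "z" * 50_000])
```

Returning the `truncated` count matters: the model learns that data was cut and can ask for the next page, instead of silently reasoning over an incomplete result.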

Principle 5: Tool Descriptions Are a Performance Lever

A tool’s description field has an enormous impact on AI behavior. The Anthropic team found that optimizing tool descriptions alone achieved state-of-the-art results on SWE-bench. This means your tool’s “documentation” is as important as your code.
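To make the lever concrete, compare two definitions of the same hypothetical tool (the schema here is loosely modeled on common function-calling formats; only the description differs):

```python
# Weak: tells the model nothing about limits, ordering, or when to use it
weak_tool = {
    "name": "search_tickets",
    "description": "Search tickets.",
}

# Strong: encodes limits, defaults, and usage guidance in the description
strong_tool = {
    "name": "search_tickets",
    "description": (
        "Search support tickets by free-text query. Returns at most 20 "
        "results, newest first. Pass status='open' to exclude resolved "
        "tickets; prefer this over fetching all tickets and filtering."
    ),
}
```

The second version costs a few dozen extra tokens but preempts entire classes of misuse, which is exactly the kind of cheap optimization the SWE-bench result points at.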

Practical Guide: Implementing Context Engineering in Your Projects

Theory covered – how do you put it into practice? Based on CS146S materials and real-world cases, here is an actionable framework.

Layer 1: Project-Level Context

This is the foundation for all development, typically implemented through configuration files and documentation.

# CLAUDE.md Example Structure

## Project Overview
This is an e-commerce API service built on FastAPI...

## Tech Stack
- Backend: FastAPI + SQLAlchemy
- Database: PostgreSQL 15
- Cache: Redis
- Message Queue: RabbitMQ

## Code Standards
- Use pydantic v2 for data validation
- All API endpoints require type annotations
- Error handling uses custom exception classes

## Do Not
- Use ORM lazy loading
- Write raw SQL in API handlers
- Use print for debugging -- use structlog

Layer 2: Task-Level Context

Provide task-specific context each time you give the AI a task.

# Task: Implement User Registration API

## Relevant Files
- src/models/user.py (User model definition)
- src/schemas/auth.py (Auth-related schemas)
- src/services/email.py (Email sending service)

## Business Rules
- Email must be validated for uniqueness
- Password: minimum 8 characters, including upper/lower case and numbers
- Send verification email upon successful registration
- Use existing EmailService -- do not create a new one

## Reference Implementation
- Similar API pattern can be found in src/api/v1/products.py

Layer 3: Context Hygiene

The most overlooked yet most important layer.

Regular cleanup: Check CLAUDE.md and other context files weekly. Remove outdated information.

Consistency checks: Ensure README, CLAUDE.md, docker-compose.yml, CI configs, and other sources do not contradict each other.

Fresh sessions: When a conversation exceeds 30 turns or 50K tokens, start a new session with clean context.

Dynamic tool loading: Do not enable all MCP Servers at once. Only enable what the current task needs. When writing backend code, you do not need Figma MCP. When doing design work, you do not need database MCP.
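The "fresh sessions" rule of thumb above (30 turns or ~50K tokens) can be enforced with a tiny guard. This is a hypothetical sketch using the rough 4-characters-per-token estimate, not a feature of any particular tool:

```python
class SessionBudget:
    """Track rough conversation size and signal when to start a fresh session."""

    def __init__(self, max_turns: int = 30, max_tokens: int = 50_000):
        self.max_turns, self.max_tokens = max_turns, max_tokens
        self.turns, self.tokens = 0, 0

    def record(self, message: str) -> None:
        self.turns += 1
        self.tokens += len(message) // 4  # crude ~4 chars/token estimate

    def should_restart(self) -> bool:
        return self.turns >= self.max_turns or self.tokens >= self.max_tokens

budget = SessionBudget()
budget.record("x" * 300_000)  # one huge paste blows the token budget immediately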

Layer 4: Feedback-Driven Context Optimization

Anthropic mentions a clever approach in their tool design article: feed AI’s evaluation conversation back to the AI itself to improve tools.

The same approach applies to context management: when AI generates unexpected code, analyze why – is context missing? Contradictory? Overloaded? Then adjust your context strategy accordingly.

Treat context management as a continuously optimizing system, not a one-time configuration.

Cognition (Devin)’s Perspective

The Week 3 guest was Silas Alberti, Head of Research at Cognition (the company behind Devin).

Devin’s Agents 101 documentation provides another perspective on context management – from the viewpoint of an Agent framework designer, what kind of context helps Agents perform best.

Core insights:

  1. Specify “how,” not just “what.” Instead of “implement user authentication,” say “implement user authentication using JWT + refresh token pattern, token expiry 15 minutes, refresh token 7 days.”

  2. Give Agents access to feedback loops. Let Agents run tests, see lint errors, access CI/CD results. This feedback itself is a form of dynamic context – telling the Agent what it got right and what it got wrong.

  3. Set different context strategies for different complexity levels:

    • Simple tasks: Direct description is sufficient
    • Medium tasks (1-6 hours of work): Provide detailed context + expect roughly 80% time savings, reserving time for human polish
    • Complex tasks: Provide context in stages + set multiple checkpoints
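The feedback-loop idea in point 2 can be sketched abstractly. Everything here is invented for illustration (the `feedback_loop` helper and the toy proposer/checker stand in for a real agent and a real test runner); the shape is what matters: failures become dynamic context for the next attempt:

```python
def feedback_loop(propose, run_checks, max_rounds: int = 3) -> tuple[str, list[str]]:
    """Iterate propose -> check -> append failures to context, the way an
    agent with test access would, until checks pass or rounds run out."""
    context: list[str] = []
    code = ""
    for _ in range(max_rounds):
        code = propose(context)
        failures = run_checks(code)
        if not failures:
            break
        context.extend(failures)  # dynamic context: what went wrong and why
    return code, context

# Toy example: the proposer fixes exactly what the checks complained about
propose = lambda ctx: "def add(a, b): return a + b" if ctx else "def add(a, b): return a - b"
run_checks = lambda code: [] if "a + b" in code else ["add(1, 2) returned -1, expected 3"]
code, ctx = feedback_loop(propose, run_checks)
```

In a real setup, `run_checks` would shell out to the test suite or linter; the loop structure stays the same.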

From “Asking a Good Question” to “Building an Information System”

The core insight from CS146S Week 3 can be distilled into one sentence:

Prompt Engineering is a craft. Context Engineering is systems engineering.

A craft relies on talent and experience to ask an exquisite question. Systems engineering relies on architecture and discipline to ensure AI delivers consistently high-quality output across any task.

How this shift impacts different roles:

  • Individual Developer – needs to invest more effort in documentation and context maintenance than in the code itself.
  • Team Lead – needs to establish team-level context management standards (CLAUDE.md templates, doc structures, tool configuration standards).
  • PM/Product Manager – role value increases dramatically; writing clear specs becomes the most critical output.
  • Architect – shifts from designing code architecture to designing information architecture: not just making code maintainable, but making it AI-comprehensible.

When Specs become the new source code and Context becomes the new programming environment, we are witnessing a fundamental restructuring of software engineering. Code generation can be delegated to AI, but context management – deciding what the AI sees, how it sees it, and when – is an irreplaceable human core competency.

At least for now.

Series Navigation

This is Part 2 of the “Stanford Vibe Coding Course Deep Dive” series:

  1. Stanford CS146S Deep Dive (Part 1): How Vibe Coding Became an Academic Discipline
  2. This article: Stanford CS146S Deep Dive (Part 2): Context Engineering (Week 3)
  3. Stanford CS146S Deep Dive (Part 3): Agent Manager (Week 4)
  4. Stanford CS146S Deep Dive (Part 4): Secure Vibe Coding (Week 6-7)
  5. Stanford CS146S Deep Dive (Part 5): From Prototype to Production (Week 8-9)
