Prototype to Production: AI App Deployment Checklist

A complete checklist for deploying AI applications from prototype to production, covering testing, security, scalability, observability, and automated ops.

Bruce

AI Agent · DevOps · Production · Deployment

AI Guides

2799 Words

2026-03-09 10:00 +0000


From Prototype to Production: The Complete AI App Deployment Checklist

You shipped an AI-generated app in 20 minutes. The demo looked great. Your stakeholders were impressed.

Then reality hit.

The app crashed under 50 concurrent users. A security scan found 12 vulnerabilities. There were zero tests, zero logs, and zero alerts. When it went down at 2 AM, nobody knew until customers started complaining on Twitter.

Building the prototype was 10% of the work. The other 90% is what separates a demo from a product.

This article gives you a structured, actionable checklist for crossing that gap. Based on Stanford’s CS146S curriculum (Weeks 8-9) on modern software development and real-world deployment patterns, we will walk through six gates that every AI-generated application must pass before it is production-ready.

Whether you are using Claude Code, Cursor, v0, or any other AI coding tool, these gates apply universally.

The Reality Gap: Prototype vs. Production

What AI Prototyping Tools Can (and Cannot) Do

Tools like Vercel’s v0, Bolt, and AI coding assistants can generate impressive prototypes in minutes. Here is what they handle well:

  • Generating UI layouts with responsive design
  • Basic CRUD functionality
  • Standard navigation and routing
  • Common frontend interaction patterns

But here is what they consistently struggle with:

  • Complex multi-step business logic
  • Performance optimization (lazy loading, bundle splitting, caching)
  • Accessibility (a11y compliance)
  • Integration with existing authentication systems and legacy APIs
  • Production-grade error handling

The Work Distribution Nobody Talks About

Most developers dramatically underestimate the effort required after the prototype phase. Here is the actual breakdown:

| Phase | % of Total Work | AI Automation Rate |
|---|---|---|
| Prototype / Demo | 10% | 80%+ |
| Feature Completion | 25% | 50-70% |
| Test Coverage | 15% | 40-60% |
| Security Hardening | 10% | 20-30% |
| Performance Optimization | 10% | 20-40% |
| Deployment Configuration | 10% | 30-50% |
| Operations & Monitoring | 20% | 30-50% |

The pattern is clear: AI is most effective at the phase that represents the least amount of work. As you move toward production, AI’s contribution drops while the complexity rises.

This is not a criticism of AI tools. It is a reality check. Understanding this distribution lets you plan your project timeline accurately instead of assuming the demo means you are “almost done.”

The Six Gates from Prototype to Production

Every AI-generated application must pass through these six gates. Skip one, and your production deployment becomes a ticking time bomb.

Gate 1: From “Working” to Testable

AI-generated code almost never includes meaningful tests. When it does, the tests are often superficial — they check that a function exists, not that it behaves correctly under edge cases.

What You Need

Unit Tests for core business logic:

import pytest

# Bad: AI-generated test that tests nothing meaningful
def test_calculate_price():
    result = calculate_price(100)
    assert result is not None  # This tells us nothing

# Good: Test that validates actual business rules
def test_calculate_price_with_discount():
    # 20% discount for orders over $50
    assert calculate_price(100, discount_tier="gold") == 80.0

def test_calculate_price_rejects_negative():
    with pytest.raises(ValueError, match="Price cannot be negative"):
        calculate_price(-10)

def test_calculate_price_applies_tax():
    # Tax should be applied AFTER discount: (100 * 0.8) * 1.1
    result = calculate_price(100, discount_tier="gold", tax_rate=0.1)
    assert result == pytest.approx(88.0)  # approx avoids float rounding failures

Integration Tests for module interactions:

# Test that the API layer correctly talks to the service layer
async def test_create_order_integration():
    # Setup: seed test database
    user = await create_test_user()
    product = await create_test_product(price=50.0)

    # Act: call the actual API endpoint
    response = await client.post("/api/orders", json={
        "user_id": user.id,
        "product_id": product.id,
        "quantity": 2
    })

    # Assert: check the full chain worked
    assert response.status_code == 201
    order = response.json()
    assert order["total"] == 100.0
    assert order["status"] == "pending"

    # Verify side effects
    db_order = await get_order(order["id"])
    assert db_order is not None
    assert db_order.user_id == user.id

End-to-End Tests with tools like Playwright:

// Test the complete user flow
test('user can complete checkout', async ({ page }) => {
  await page.goto('/products');
  await page.click('[data-testid="add-to-cart-btn"]');
  await page.click('[data-testid="checkout-btn"]');
  await page.fill('#email', '[email protected]');
  await page.fill('#card-number', '4242424242424242');
  await page.click('[data-testid="pay-btn"]');
  await expect(page.locator('.order-confirmation')).toBeVisible();
});

CI Pipeline to enforce test gates:

# .github/workflows/test.yml
name: Test Suite
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: pytest tests/unit/ --cov=src --cov-fail-under=80
      - name: Run integration tests
        run: pytest tests/integration/
      - name: Run E2E tests
        run: npx playwright test

What AI Can Help With

AI is good at generating test skeletons from existing code. But the intent — what to test, why, and what the boundary conditions are — must come from you. Use AI to scaffold tests, then review every assertion to ensure it validates real behavior.

Gate 2: From “Working” to Secure

Security is where AI-generated code is most dangerous. AI tools optimize for “making it work,” not “making it safe.” For a deep dive into AI security practices, see MCP Security in 2026.

The OWASP Checklist for AI Apps

Run through these checks systematically:

Input Validation and Sanitization

# AI often generates code like this - DANGEROUS
@app.post("/api/query")
async def query(request: Request):
    body = await request.json()
    result = db.execute(f"SELECT * FROM users WHERE name = '{body['name']}'")
    return result

# Production version with proper validation
from pydantic import BaseModel, validator

class QueryRequest(BaseModel):
    name: str

    @validator('name')
    def validate_name(cls, v):
        if len(v) > 100:
            raise ValueError('Name too long')
        # Reject characters commonly abused in SQL injection rather than
        # silently stripping them (rejection is easier to reason about)
        if any(char in v for char in [';', '--', "'", '"']):
            raise ValueError('Invalid characters in name')
        return v.strip()

@app.post("/api/query")
async def query(request: QueryRequest):
    result = db.execute(
        "SELECT * FROM users WHERE name = :name",
        {"name": request.name}
    )
    return result

Authentication and Authorization

  • Verify every API endpoint has auth checks
  • Implement role-based access control (RBAC)
  • Use secure session management (HttpOnly cookies, short-lived tokens)
  • Add rate limiting to login endpoints
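
At its core, the RBAC item above is just a privilege-level comparison enforced on every request. Here is a minimal, framework-free sketch; the role names and levels are illustrative assumptions, not from any library:

```python
# Hypothetical role hierarchy -- adjust to your own domain
ROLE_LEVELS = {"viewer": 1, "editor": 2, "admin": 3}

def has_access(user_role: str, required_role: str) -> bool:
    """True if the user's role is at least as privileged as the required one.

    Unknown roles map to level 0, so they are denied by default.
    """
    return ROLE_LEVELS.get(user_role, 0) >= ROLE_LEVELS.get(required_role, 0)
```

In a real app this check belongs in middleware or a shared dependency, so no individual endpoint can forget it.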

Sensitive Data Protection

  • Encrypt data at rest and in transit
  • Never log sensitive information (passwords, tokens, PII)
  • Use environment variables for secrets, never hardcoded values
  • Implement proper key rotation
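
The "secrets in environment variables" rule pairs well with failing fast at startup: a missing key should crash the boot, not surface as a `None` token mid-request. A small sketch:

```python
import os

def get_secret(name: str) -> str:
    """Read a secret from the environment and fail fast if it is missing."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Missing required secret: {name}")
    return value
```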

Dependency Security

# Scan dependencies for known vulnerabilities
npm audit
pip-audit
snyk test

# Keep dependencies updated
npm update
pip install --upgrade -r requirements.txt

Security Checklist

| Check | Tool | Frequency |
|---|---|---|
| SAST (Static Analysis) | Semgrep, SonarQube | Every PR |
| DAST (Dynamic Analysis) | OWASP ZAP, Burp Suite | Weekly |
| Dependency Scan | Snyk, npm audit | Every build |
| Secret Detection | TruffleHog, GitLeaks | Every commit |
| Container Scan | Trivy, Grype | Every build |

Gate 3: From “Working” to Scalable

Prototype code typically handles one user at a time. Production means hundreds or thousands of concurrent users. Here are the critical issues to address.

Fix N+1 Query Problems

This is the most common performance issue in AI-generated code:

# N+1 Problem: 1 query for orders + N queries for users
orders = db.query(Order).all()
for order in orders:
    user = db.query(User).filter(User.id == order.user_id).first()
    order.user_name = user.name  # This fires a query PER order

# Fixed: Eager loading gets everything in 2 queries
orders = db.query(Order).options(joinedload(Order.user)).all()
for order in orders:
    order.user_name = order.user.name  # No extra query

Implement Caching

import json

import redis
from functools import wraps

cache = redis.Redis(host='localhost', port=6379)

def cached(ttl=300):
    """Cache function results in Redis for `ttl` seconds."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            cache_key = f"{func.__name__}:{hash(str(args) + str(kwargs))}"
            cached_result = cache.get(cache_key)
            if cached_result:
                return json.loads(cached_result)
            result = await func(*args, **kwargs)
            cache.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator

@cached(ttl=600)
async def get_product_catalog():
    """Cached for 10 minutes since catalog rarely changes."""
    return await db.query(Product).filter(Product.active == True).all()

Move Expensive Operations to Background Queues

from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379')

# Instead of sending email synchronously during request
@app.post("/api/orders")
async def create_order(order: OrderCreate):
    db_order = await save_order(order)

    # Offload email to background worker
    send_order_confirmation.delay(db_order.id, order.email)

    return {"order_id": db_order.id, "status": "created"}

@celery_app.task
def send_order_confirmation(order_id: str, email: str):
    """Runs in a background worker, not blocking the API."""
    order = get_order(order_id)
    send_email(to=email, subject="Order Confirmed", body=render_template(order))

Add Rate Limiting

from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/api/login")
@limiter.limit("5/minute")  # Prevent brute force
async def login(request: Request, credentials: LoginRequest):
    return await authenticate(credentials)

@app.get("/api/data")
@limiter.limit("100/minute")  # Prevent API abuse
async def get_data(request: Request):
    return await fetch_data()

Scalability Checklist

| Item | Question | Action |
|---|---|---|
| Database | Are there N+1 queries? | Profile with query logger, add eager loading |
| Caching | Is hot data cached? | Add Redis/Memcached layer |
| Async | Are slow tasks blocking requests? | Move to Celery/SQS/Bull queues |
| Rate Limits | Can one user overload the system? | Add per-user and global rate limits |
| Horizontal Scaling | Can you run multiple instances? | Remove local state, use shared sessions |
| Connection Pooling | Are DB connections managed? | Use connection pool with limits |
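
To make the connection-pooling row concrete, here is a toy bounded pool built on a thread-safe queue. It is an illustration of the concept only; in practice use your driver's or ORM's built-in pool rather than rolling your own:

```python
import queue

class SimplePool:
    """Toy bounded connection pool; `factory` creates one connection."""

    def __init__(self, factory, size: int = 5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout: float = 1.0):
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # surfacing saturation instead of opening unbounded connections.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

The key property is the hard upper bound: under load the pool makes callers wait (or fail visibly) instead of exhausting the database.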

Gate 4: From “Working” to Observable

Once your app is in production, you need to know what it is doing at all times. Observability has three pillars: logs, metrics, and traces. These align with Google’s SRE Four Golden Signals.

Structured Logging

import structlog

logger = structlog.get_logger()

@app.post("/api/orders")
async def create_order(order: OrderCreate, user: User = Depends(get_current_user)):
    logger.info(
        "order_created",
        user_id=user.id,
        order_total=order.total,
        product_count=len(order.items),
        payment_method=order.payment_method
    )

    try:
        result = await process_order(order)
        logger.info("order_processed", order_id=result.id, duration_ms=result.processing_time)
        return result
    except PaymentError as e:
        logger.error(
            "payment_failed",
            user_id=user.id,
            error_code=e.code,
            error_message=str(e)
        )
        raise HTTPException(status_code=402, detail="Payment failed")

Use JSON format for logs so they are searchable in tools like ELK Stack, Loki, or CloudWatch:

{
  "event": "order_created",
  "user_id": "usr_123",
  "order_total": 99.50,
  "product_count": 3,
  "timestamp": "2026-03-13T10:30:00Z",
  "level": "info"
}

The Four Golden Signals

These are the metrics every production system must track:

| Signal | What It Measures | Key Metrics | Alert Threshold |
|---|---|---|---|
| Latency | Response time | P50, P95, P99 response times | P95 > 500ms |
| Traffic | Request volume | Requests per second (RPS) | Sudden 3x spike or 50% drop |
| Errors | Failure rate | 5xx error percentage | > 1% of requests |
| Saturation | Resource usage | CPU, memory, disk, connections | > 80% utilization |
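
The latency percentiles in the table can be computed from raw samples with a simple nearest-rank calculation. This is a sketch for intuition; production systems like Prometheus approximate percentiles from histogram buckets instead:

```python
import math

def percentile(samples, p: float):
    """Nearest-rank percentile, e.g. p=95 for the P95 latency."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Illustrative latency samples in milliseconds
latencies_ms = [120, 95, 101, 480, 88, 102, 97, 530, 110, 99]
p95 = percentile(latencies_ms, 95)  # compare against the 500 ms threshold above
```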

Distributed Tracing

When your app spans multiple services, traces show you the complete path of a request:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracing
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

@app.post("/api/orders")
async def create_order(order: OrderCreate):
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.total", order.total)

        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order.items)

        with tracer.start_as_current_span("process_payment"):
            payment = await charge_card(order.payment)

        with tracer.start_as_current_span("save_order"):
            result = await save_to_database(order, payment)

        return result

With tracing, when a request takes 3 seconds, you can see exactly where the time was spent — was it the database? The payment API? The inventory check?

Observability Stack Recommendations

| Component | Open Source | Managed Service |
|---|---|---|
| Logs | Loki + Grafana | Datadog, CloudWatch |
| Metrics | Prometheus + Grafana | Datadog, New Relic |
| Traces | Jaeger, Zipkin | Datadog, Honeycomb |
| All-in-One | OpenTelemetry | Datadog, Dynatrace |

Gate 5: From “Working” to Automated Ops

Traditional incident response is manual and slow. AI-powered operations (AI-SRE) can dramatically reduce mean time to resolution (MTTR).

Traditional vs. AI-Enhanced Incident Response

Traditional flow:

Alert fires → On-call engineer wakes up → Manual investigation
→ Root cause analysis → Manual fix → Write postmortem

AI-enhanced flow:

Alert fires → AI Agent automatically collects context
           → AI analyzes probable root causes
           → AI recommends fix with confidence score
           → Human approves (or AI auto-executes low-risk fixes)
           → AI generates postmortem draft

Where AI Ops Excels

| Scenario | What AI Does |
|---|---|
| Kubernetes pod crash loops | Auto-check logs, resource quotas, image status, recent deploys |
| Database connection pool exhaustion | Analyze connection sources, identify leak patterns |
| API latency spike | Correlate with deployment history, traffic patterns, dependency status |
| Disk space running low | Identify large files, suggest cleanup strategies |
| Certificate expiration | Early warning, automated renewal |
| Memory leak detection | Trend analysis, identify offending service |

From SRE to AI-SRE

The evolution of operations roles:

  • Manual debugging becomes guiding AI investigation — you set the direction, AI gathers data
  • Writing runbooks becomes training AI agents — encode operational knowledge into agent context (see Context Engineering Guide for best practices)
  • Reactive firefighting becomes proactive prevention — AI continuously analyzes metrics and predicts issues before they impact users

Tools like Resolve AI, PagerDuty AIOps, and Datadog Watchdog are leading this shift. The key is starting with low-risk automated actions (restarting a pod, scaling up replicas) and gradually expanding the AI’s authority as trust is established.
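
That trust-building progression can be expressed as a simple policy gate: auto-execute only when the action is low-risk and the AI's confidence is high, and escalate everything else. A hypothetical sketch; the risk labels and threshold are assumptions:

```python
def decide_remediation(confidence: float, risk: str, threshold: float = 0.9) -> str:
    """Gate automated fixes: low-risk + high-confidence runs; all else escalates."""
    if risk == "low" and confidence >= threshold:
        return "auto_execute"
    return "require_human_approval"
```

Widening the set of actions labeled "low" risk (and lowering the threshold) is then an explicit, reviewable decision rather than an accident.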

Gate 6: From “Working” to Continuously Evolving

A production system is never “done.” It requires continuous iteration: new features, bug fixes, performance improvements, security patches.

CI/CD Pipeline Requirements

# Complete CI/CD pipeline
name: Deploy
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --coverage
      - run: npm run lint
      - run: npx playwright test

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm audit --audit-level=high
      - run: npx semgrep --config=auto

  deploy:
    needs: [test, security]
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: deploy --env staging
      - name: Run smoke tests
        run: npm run test:smoke -- --target=staging
      - name: Deploy to production (canary)
        run: deploy --env production --canary 10%
      - name: Monitor canary metrics
        run: check-metrics --duration 15m --threshold "error_rate<0.01"  # quote so the shell does not treat < as a redirect
      - name: Full rollout
        run: deploy --env production --canary 100%

Feedback Loop Architecture

Production data should feed back into your development process:

  1. Error tracking (Sentry, Bugsnag) captures real user errors
  2. Usage analytics shows which features are actually used
  3. Performance monitoring identifies bottlenecks in real workloads
  4. User feedback through in-app surveys and support tickets
  5. AI agent reports summarize operational patterns weekly
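
Point 5 can start as simply as aggregating raw error events into ranked counts before handing them to an agent or a human. A minimal sketch; the event shape is an assumption:

```python
from collections import Counter

def summarize_errors(events):
    """Group raw error events by type, most frequent first, for a weekly report."""
    counts = Counter(e["error_type"] for e in events)
    return [{"error_type": t, "count": c} for t, c in counts.most_common()]

# Illustrative events as they might arrive from an error tracker
events = [
    {"error_type": "PaymentError"},
    {"error_type": "TimeoutError"},
    {"error_type": "PaymentError"},
]
```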

This feedback loop is where vibe coding meets engineering discipline. You can use AI to rapidly prototype fixes and features, but the direction comes from production data, not guesswork.

The Complete AI Application Development Lifecycle

Putting all six gates together, here is the full lifecycle from idea to production and beyond:

1. Requirements Analysis
   ├── Write clear spec / PRD
   ├── Define acceptance criteria
   └── Identify security and performance requirements

2. Architecture Design
   ├── Choose tech stack (consider AI tool support)
   ├── Design system architecture
   ├── Set up project context (CLAUDE.md, design docs)
   └── Configure tool integrations (MCP servers, APIs)

3. Rapid Prototyping [AI: 80%+ automation]
   ├── Generate prototype with AI
   ├── Validate core functionality
   └── Collect feedback, refine spec

4. Feature Development [AI: 50-70% automation]
   ├── Break requirements into subtasks
   ├── Assign to AI agents for execution
   ├── Checkpoint reviews + human refinement
   └── Code review

5. Quality Assurance [AI: 40-60% automation]
   ├── Unit + integration + E2E test coverage
   ├── Security scanning (SAST + DAST)
   ├── Performance benchmarking
   └── Accessibility audit

6. Deployment [Primarily toolchain automation]
   ├── CI/CD pipeline setup
   ├── Canary / blue-green deployment
   ├── Monitoring and alerting configuration
   └── Rollback plan

7. Operations [AI: 30-50% automation]
   ├── Observability (logs + metrics + traces)
   ├── Automated incident response
   ├── Regular security audits
   └── Continuous optimization

8. Iteration [Loop back to step 1]
   ├── New requirements from production feedback
   └── Update context and documentation

The Production Readiness Checklist

Use this checklist before every production deployment. Print it out, pin it to your monitor, or add it as a PR template.

Testing

  • Unit test coverage above 80% for core business logic
  • Integration tests for all critical API endpoints
  • E2E tests for primary user flows
  • Tests run automatically in CI on every push
  • Load tests verify the system handles expected traffic

Security

  • All inputs validated and sanitized
  • Authentication on every protected endpoint
  • Authorization checks (RBAC) implemented
  • Secrets stored in environment variables or vault
  • Dependencies scanned for vulnerabilities
  • OWASP Top 10 checklist reviewed
  • HTTPS enforced everywhere

Scalability

  • No N+1 database query problems
  • Caching layer for hot data
  • Expensive operations run asynchronously
  • Rate limiting on all public endpoints
  • Application can scale horizontally
  • Database connection pooling configured

Observability

  • Structured logging with appropriate log levels
  • Four Golden Signals monitored (latency, traffic, errors, saturation)
  • Distributed tracing for multi-service flows
  • Alerting rules configured with proper thresholds
  • Dashboard for real-time system health

Deployment

  • CI/CD pipeline with automated tests and security scans
  • Canary or blue-green deployment strategy
  • Rollback procedure documented and tested
  • Health check endpoints implemented
  • Graceful shutdown handling
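
The health-check item above can be sketched as an aggregator over named dependency probes: the load balancer reads one overall status, the dashboard reads the per-dependency detail. The check names here are illustrative assumptions:

```python
def health_status(checks: dict) -> dict:
    """Run each named dependency probe; report 'ok' only if all pass."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a crashing probe counts as unhealthy
    return {
        "status": "ok" if all(results.values()) else "degraded",
        "checks": results,
    }
```

Exposed at something like `/healthz`, this gives orchestrators a single boolean while keeping the diagnostic breakdown for humans.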

Operations

  • Incident response runbook created
  • On-call rotation established
  • Backup and disaster recovery plan tested
  • AI-SRE tools configured for automated triage
  • Postmortem process defined

Key Takeaways

  1. The prototype is 10% of the work. Plan your timeline accordingly. If the prototype took 1 day, expect 9 more days of engineering work before production.

  2. AI effectiveness decreases as complexity increases. Use AI heavily for prototyping and test generation, but apply human judgment for security, architecture, and operational decisions.

  3. The six gates are sequential but not one-time. Every new feature should pass through all six gates before reaching production.

  4. Observability is not optional. If you cannot see what your application is doing in production, you cannot fix it when it breaks.

  5. Automate everything you can. From CI/CD pipelines to incident response, automation reduces human error and response time.

The gap between “it works on my machine” and “it works reliably for thousands of users” is where engineering discipline matters most. AI tools are powerful accelerators, but they do not replace the need to think carefully about testing, security, scalability, and operations.

Use this checklist. Pass through all six gates. Ship with confidence.
