Designing Systems That Scale: Lessons From Enterprise Projects

I've consulted on eight enterprise projects that grew from zero to millions of users. The same patterns fail repeatedly. Here's what actually survives the scaling curve.

The Database Patterns That Hold

Soft deletes everywhere. No exceptions.

-- Every table gets these four columns
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
deleted_at TIMESTAMP NULL,
deleted_by VARCHAR(255) NULL

I've never regretted a soft delete. I've regretted hard deletes six times. Users accidentally delete data. Compliance needs audit trails. Support needs to restore rows. deleted_at IS NULL filters out the dead. Everything stays recoverable. UUIDs for public IDs, integers for joins.

CREATE TABLE users (
  id BIGSERIAL PRIMARY KEY,        -- Internal joins only
  public_id UUID DEFAULT gen_random_uuid(), -- API exposure only
  email VARCHAR(255) NOT NULL
);

CREATE INDEX idx_users_public_id ON users(public_id);

Integers join faster. UUIDs prevent ID enumeration attacks and allow offline ID generation before database insert. Never expose your internal sequence to the client. The cursor pagination rule:

-- ❌ OFFSET breaks past 10k rows
SELECT * FROM users ORDER BY created_at LIMIT 10 OFFSET 10000;

-- ✅ Keyset pagination with cursor
SELECT * FROM users 
WHERE (created_at, id) > ($1, $2) 
ORDER BY created_at, id LIMIT 10;

OFFSET scans and discards rows. Cursor pagination uses indexes. At 100k users, the difference is 300ms vs 30ms.

API Contracts That Don't Cause Fights

Make every endpoint idempotent with idempotency keys.

POST /api/payments
Idempotency-Key: uuid-v4

{
  "amount": 9900,
  "currency": "USD"
}

Network retries won't double-charge. Store the key with the first response. If same key arrives, return cached response. Frontend generates the key, backend honors it. Saves teams from reconciliation nightmares. Version at the URL, not the header.

GET /api/v1/users
GET /api/v2/users

Headers get stripped by proxies. URLs are visible everywhere — logs, dashboards, error messages. When something breaks, you know exactly which version ran. Rate limits as API response headers:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 423
X-RateLimit-Reset: 1620000000
Retry-After: 15

Your frontend should read these and show feedback. Don't make users guess why requests stopped working.

The Team Handoff Rules

Every service gets three files.

service/
├── README.md        # What it does, how to run, where to ask
├── ARCHITECTURE.md  # Diagrams, data flow, failure modes  
└── RUNBOOK.md       # On-call: logs, metrics, common fixes

When your team rotates or someone goes on vacation, new people can triage without Slack pings. The runbook is the most important file in production. Document incidents, not just code.

# Incident 2025-06-15: Search timeouts

**What happened:** Search API p99 rose to 8s
**Why:** Wildcard queries missing index hints
**Fix:** Added pg_trgm indexes + gin indices
**How we caught:** Slow query log at 1s threshold
**Prevention:** Query planner tests in CI

The next person who breaks search will search your docs for "search slow" and find this. That's the difference between repeated failures and learning organizations.

The Scalability Checklist

Before launching any service, I verify these:

No unbounded arrays in API responses (pagination or limit)
All external calls have timeouts (connect + read)
All mutable endpoints are idempotent
No SELECT * in production queries
Database connection pool size fits RDS available memory
Alert if p95 latency doubles for 5 minutes
Someone knows how to page the on-call engineer

The Hardest Lesson

Scale problems are rarely technical first. They're organizational. Teams outgrow their communication patterns before their databases. Codebases become unmaintainable because no one owns refactoring, not because the language is slow.

What scales: Ownership boundaries. Explicit API contracts. Boring solutions everyone understands. Teams that can page each other without blame. What doesn't scale: Hero developers. Tribal knowledge. "We'll fix it later" tickets. Microservices before monoliths prove insufficient.

Start with a monolith that's modular enough to split. Use PostgreSQL until it genuinely hurts. Add caching when latency demands it, not preemptively. Most projects never reach the scale where these become bottlenecks. The ones that do will thank you for keeping the foundation boring.