Full Stack System Architecture

15 min read · April 2026 · Free playbook

Every layer, every technology, who owns what

No filler. This is what a FAANG-level application actually looks like under the hood, and why building a real product is nothing like vibe-coding a dashboard.

The illusion: most founders think a web app is a frontend, a backend, and a database. That's like thinking a city is just buildings. A production-grade application has 18 distinct layers, hundreds of moving parts, and two distinct ownership tracks — product and engineering. Here is every layer, what it does, what runs it, and who is accountable for what.

The 18 layers

Client layer — frontend

Web app · Mobile app · Desktop · Browser extensions

The interface layer that renders in the browser or on a device. It handles UI state, user interactions, routing between pages, and communicating with backend services via API calls. At FAANG scale this is a complex, independently deployable application — not a collection of HTML files. Mobile apps (iOS and Android) are separate codebases with their own release cycles, app store review processes, and platform-specific constraints.

React · Next.js · Vue · Angular · TypeScript · React Native · Swift (iOS) · Kotlin (Android) · Electron (desktop) · Tailwind CSS · Webpack / Vite

PM owns

User flows and task completion paths
Information architecture decisions
What ships behind a feature flag
Accessibility requirements
Which platforms are prioritised
Copy, microcopy, and error state language

SWE owns

Framework and rendering strategy (SSR vs CSR vs SSG)
Component architecture and design system
State management approach
Bundle size optimisation and performance
Browser compatibility
Accessibility implementation

Edge network & CDN

Content delivery · DDoS protection · SSL termination · Edge compute

Static assets — JavaScript bundles, images, fonts, HTML — are cached at data centres globally, geographically close to users, so they load fast regardless of where the user is. Edge functions allow code to run at these global nodes without hitting origin servers. DDoS protection and SSL/TLS termination also live here. This is the first layer the internet touches before your servers do.

Cloudflare · AWS CloudFront · Akamai · Fastly · Vercel Edge · Cloudflare Workers · WAF · SSL/TLS

PM owns

Regional launch decisions
Performance SLA requirements by region
Content localisation requirements

SWE owns

CDN configuration and cache rules
Edge function logic
DDoS mitigation rules
WAF rule sets
SSL certificate management and renewal

API gateway & load balancer

Traffic routing · Rate limiting · Authentication handoff · Request throttling

Every request from the frontend passes through the API gateway before reaching any backend service. It enforces rate limits, routes requests to the correct service, validates tokens, and distributes traffic across multiple server instances so no single server gets overwhelmed. This is also where API versioning is enforced — v1 requests go to old services, v2 requests go to new ones.
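Rate limiting is the easiest of these gateway jobs to picture in code. What follows is a minimal sketch of a token-bucket limiter, assuming one bucket per API key. The `TokenBucket` class, the `TIER_LIMITS` table, and its numbers are illustrative, not the behaviour of any specific gateway product.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock  # injectable so the sketch can be tested without sleeping
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        # Top up the bucket for the time elapsed since the previous call.
        refilled = self._tokens + (now - self._last) * self.refill_per_sec
        self._tokens = min(self.capacity, refilled)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Hypothetical per-tier policy: the PM sets the numbers, the gateway enforces them.
TIER_LIMITS = {"free": (10, 1.0), "pro": (100, 20.0)}  # (burst capacity, refill per second)


def bucket_for_tier(tier: str) -> TokenBucket:
    capacity, refill = TIER_LIMITS[tier]
    return TokenBucket(capacity, refill)
```

In a real deployment the bucket state would usually live in a shared store so that every gateway instance sees the same counts; this sketch keeps it in process memory for clarity.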

Kong · AWS API Gateway · Nginx · HAProxy · AWS ALB · Envoy · Traefik

PM owns

Rate limit policy by pricing tier
Which endpoints are public vs. authenticated
API versioning strategy and deprecation timeline
Partner API access policies

SWE owns

Gateway configuration and routing rules
Load balancing algorithm
Rate limit enforcement implementation
Request/response transformation logic
Health check configuration

Authentication & authorisation

OAuth 2.0 · JWT · SSO · MFA · RBAC · Session management

Two distinct problems that get conflated. Authentication is proving identity — you are who you say you are. Authorisation is enforcing permissions — you are allowed to do what you're trying to do. At FAANG scale these are separate services. Auth includes social login, enterprise single sign-on (SSO), multi-factor authentication, session token management, and token refresh and revocation flows. Role-based access control (RBAC) defines what each user type can see and do.
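The split between the two problems shows up directly in request handling. Here is a small sketch, assuming an in-memory session table and a hand-written role map; `SESSIONS`, `ROLE_PERMISSIONS`, and the permission strings are hypothetical stand-ins for a real identity provider and policy store.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class User:
    user_id: str
    roles: FrozenSet[str]


# Hypothetical session table: token -> user. A real system would verify a signed token.
SESSIONS = {
    "tok-alice": User("alice", frozenset({"admin"})),
    "tok-bob": User("bob", frozenset({"viewer"})),
}

# Hypothetical RBAC policy: role -> permissions it grants.
ROLE_PERMISSIONS = {
    "viewer": {"report:read"},
    "editor": {"report:read", "report:write"},
    "admin": {"report:read", "report:write", "user:manage"},
}


def authenticate(token: str) -> Optional[User]:
    """Authentication: who is this? Returns None for an unknown token."""
    return SESSIONS.get(token)


def authorise(user: User, permission: str) -> bool:
    """Authorisation: is this user allowed to do this?"""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


def handle_request(token: str, permission: str) -> int:
    """Return the HTTP status a protected endpoint would respond with."""
    user = authenticate(token)
    if user is None:
        return 401  # not authenticated: identity unknown
    if not authorise(user, permission):
        return 403  # authenticated, but not permitted
    return 200
```

The two failure codes are the point: 401 means the identity is unknown, 403 means the identity is known and the permission is missing.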

Auth0 · AWS Cognito · Okta · Firebase Auth · Clerk · OAuth 2.0 · JWT · SAML (SSO) · Casbin (RBAC)

PM owns

Which login methods to support
MFA requirements by user tier
Permission model — who can do what by role
Session timeout and security policy
Enterprise SSO requirements
Account recovery flow UX decisions

SWE owns

Token generation, signing, and verification
RBAC/ABAC policy enforcement logic
OAuth flow implementation
Token refresh and revocation
Secure credential storage
Auth event audit logging

Backend application services

Business logic · REST · GraphQL · gRPC · Microservices · Monolith

This is where your product's core functionality lives. At early stage this is often a monolith — one deployable application containing all business logic. At FAANG scale it is decomposed into dozens to hundreds of microservices, each owned by a separate team, independently deployable, communicating via internal APIs or a service mesh. Each microservice owns its own data store. This layer contains more code and engineering time than any other.

Node.js · Python / FastAPI · Django · Go · Java / Spring Boot · Ruby on Rails · REST APIs · GraphQL · gRPC · Istio

PM owns

What the API must do — requirements, not implementation
Business rules and edge case handling
Service boundaries — what is one product vs. another
API contract with third-party partners
Data retention and deletion requirements

SWE owns

Service architecture and decomposition
Language and framework selection
Internal API design and versioning
Service mesh configuration
Inter-service communication patterns
Data ownership per service

Async processing & message queues

Event streaming · Background jobs · Scheduled tasks · Webhooks

Not everything needs to happen synchronously in a request-response cycle. Sending an email, processing a video upload, running a nightly report, triggering a webhook — these happen asynchronously. A message queue decouples the service that produces work from the service that processes it. If a downstream service goes down, the queue holds work until it recovers. This is the difference between a resilient system and a brittle one.
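To make the decoupling concrete, here is a toy consumer loop, assuming an in-memory `deque` in place of a real broker. `Message`, `drain`, and `backoff_seconds` are illustrative names. The sketch shows three habits that real queue consumers tend to share: retry with capped exponential backoff, a dead-letter list for messages that keep failing, and duplicate suppression so that redelivery is harmless.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Set, Tuple


@dataclass
class Message:
    id: str
    payload: dict = field(default_factory=dict)
    attempts: int = 0


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retry number `attempt` (1-based): base, 2x, 4x, ... up to `cap`."""
    return min(cap, base * 2 ** (attempt - 1))


def drain(queue: Deque[Message], handler: Callable[[Message], None],
          max_attempts: int = 3) -> Tuple[Set[str], List[Message]]:
    """Process every message; return (ids processed, dead-lettered messages)."""
    processed: Set[str] = set()
    dead_letters: List[Message] = []
    while queue:
        msg = queue.popleft()
        if msg.id in processed:
            continue  # duplicate delivery: already handled, skip it
        msg.attempts += 1
        try:
            handler(msg)
            processed.add(msg.id)
        except Exception:
            if msg.attempts >= max_attempts:
                dead_letters.append(msg)  # give up and park it for inspection
            else:
                # A real broker would redeliver after backoff_seconds(msg.attempts).
                queue.append(msg)
    return processed, dead_letters
```

The producer never learns whether the handler was healthy at the moment it enqueued the work, which is exactly the resilience property described above.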

Apache Kafka · RabbitMQ · AWS SQS · AWS SNS · Google Pub/Sub · Celery · Sidekiq · Temporal · Cron jobs

PM owns

SLA for async jobs — how fast must email send?
Retry and failure behaviour policy
Which events trigger downstream workflows
Alerting requirements for failed jobs

SWE owns

Queue topology and topic design
Consumer group configuration
Retry and backoff logic
Dead-letter queue handling
Message schema and versioning
Idempotency guarantees

Caching layer

In-memory cache · Session store · Distributed cache · Query cache

Reading from a database on every request is slow and expensive. The caching layer stores the results of expensive operations — database queries, API calls, computed values — in memory so they can be retrieved in microseconds rather than milliseconds. Redis is the most widely used: it serves as a cache, session store, rate limiter, and pub/sub broker simultaneously. Cache misses fall back to the database. Cache invalidation — knowing when to expire cached data — is one of the genuinely hard problems in distributed systems.
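The read path described here is usually called cache-aside. A minimal sketch follows, assuming a dictionary in place of Redis and a caller-supplied `loader` function standing in for the database query; `TTLCache` and its method names are illustrative.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache-aside with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired: fall back to the source of truth, then populate the cache.
        self.misses += 1
        value = loader(key)
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call this when the underlying record changes, so readers do not see stale data."""
        self._store.pop(key, None)
```

The hard part is not this class. It is deciding where `invalidate` must be called and how stale a value is allowed to be, which is why the TTL is a product decision as much as an engineering one.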

Redis · Memcached · AWS ElastiCache · Dragonfly · Application-level cache

PM owns

Data freshness requirements — how stale is acceptable?
Performance targets that drive caching decisions
Cache invalidation on specific user actions

SWE owns

Cache architecture: write-through vs. cache-aside
TTL configuration per data type
Cache invalidation strategy
Cache warming on deployment
Cache hit/miss ratio monitoring

Search layer

Full-text search · Faceted filters · Autocomplete · Vector / semantic search

Relational databases are not built for search. A search layer is a dedicated system that indexes content for fast, relevance-ranked retrieval — including typo tolerance, synonyms, faceted filtering, and autocomplete. AI has changed search fundamentally: vector databases enable semantic similarity search — finding results by meaning, not exact keywords — which powers recommendation systems, RAG pipelines, and AI-native search experiences.
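Semantic search reduces to comparing vectors. Here is a brute-force sketch using cosine similarity. The three-dimensional "embeddings" in the usage note are made up by hand for illustration; a real system would get high-dimensional vectors from an embedding model and use an approximate nearest-neighbour index rather than scanning every document.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k documents whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine_similarity(query, index[doc_id]),
                    reverse=True)
    return ranked[:k]
```

With a toy index such as `{"refund policy": [0.9, 0.1, 0.0], "return an item": [0.8, 0.2, 0.1], "gpu pricing": [0.0, 0.1, 0.9]}` and the query vector `[1.0, 0.0, 0.0]`, the two refund-related documents rank ahead of the pricing one even though no keywords are compared.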

Elasticsearch · OpenSearch · Algolia · Typesense · Pinecone · Weaviate · pgvector · Qdrant

PM owns

Search ranking priorities — recency vs. relevance vs. popularity
Which fields are searchable
Filter and facet requirements
Autocomplete behaviour
Zero-results state and fallback handling

SWE owns

Index schema design
Indexing pipeline — keeping search in sync with DB
Relevance tuning and scoring logic
Embedding generation for vector search
Query parsing and synonym mapping

Primary database layer

Relational · Document · Wide-column · Time-series · Graph

Where persistent data lives — the source of truth for your application. Database type is chosen based on data model and access patterns. Relational databases handle structured, transactional data with complex relationships. NoSQL handles flexible schemas, extreme scale, or access patterns that don't fit tables. FAANG-scale apps run multiple database types across different services. Replication (primary/replica) distributes read traffic. Sharding distributes write traffic across nodes. Connection pooling manages the finite number of database connections efficiently.
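Replication and sharding are both routing decisions at heart. The sketch below assumes simple hash-modulo sharding and round-robin replica selection; `shard_for`, `ReplicaRouter`, and the node names are hypothetical, and production systems usually push this logic into a driver, proxy, or the database itself.

```python
import hashlib
import itertools
from typing import Sequence


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash, so the same key always lands on the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ReplicaRouter:
    """Send writes to the primary and spread reads across replicas."""

    def __init__(self, primary: str, replicas: Sequence[str]) -> None:
        self._primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str, require_fresh: bool = False) -> str:
        is_read = sql.lstrip().upper().startswith("SELECT")
        # Replicas can lag behind the primary, so reads that must see the
        # latest write are sent to the primary as well.
        if is_read and not require_fresh and self._replicas is not None:
            return next(self._replicas)
        return self._primary
```

One known limitation of the modulo approach: changing `num_shards` remaps most keys, which is why schemes such as consistent hashing are common when shards are added or removed often.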

PostgreSQL · MySQL · MongoDB · DynamoDB · Cassandra · CockroachDB · Google Spanner · ClickHouse · TimescaleDB · Neo4j · PgBouncer

PM owns

Data retention policy — how long to keep user data
Data deletion requirements — right to erasure
Data residency requirements — EU data in EU
Consistency requirements — is eventual consistency acceptable?
Compliance obligations that constrain data storage

SWE owns

Database selection per service
Schema design and migration strategy
Index strategy for query performance
Replication and failover configuration
Backup frequency and restore testing
Connection pooling configuration

File & object storage

Images · Videos · Documents · Exports · ML model weights

Databases store structured data. Object storage stores unstructured binary files — profile photos, uploaded PDFs, audio, video, CSV exports, trained ML model weights. At scale, raw files are processed before storage: images are resized and compressed into multiple format variants, videos are transcoded into streaming formats, documents are scanned for malicious content. All processed assets are then served through the CDN layer — never directly from the storage bucket.

AWS S3 · Google Cloud Storage · Azure Blob · Cloudflare R2 · AWS Lambda (processing) · FFmpeg · ImageMagick

PM owns

File type and size limits
Supported media formats
User storage quota policy per pricing tier
File retention and auto-deletion rules
User-facing file management UX

SWE owns

Storage bucket architecture and permissions
Upload pipeline — direct vs. server-side
Image and video processing pipeline
Virus and malware scanning integration
Lifecycle policies — auto-archiving and deletion

AI & ML layer

Model serving · Feature store · LLM APIs · Vector DB · Evaluation · A/B testing models

AI in production is not one thing — it is a stack. The full ML layer includes: data pipelines that build training datasets, a feature store that computes and serves ML features in real time, model training infrastructure, a model registry for version control, model serving infrastructure for inference, evaluation systems to measure model quality, and A/B testing frameworks to compare model versions. For LLM-based products, add prompt management, RAG pipelines, guardrails, and output evaluation. This is the most complex and fastest-changing layer in modern web applications.
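Of all the pieces in that stack, evaluation is the one most often skipped. A deliberately small sketch of an evaluation gate follows, assuming a "golden set" of prompts with expected substrings and treating the model as any function from prompt to text. `pass_rate`, `may_ship`, and the substring check are illustrative simplifications; real evaluation suites use richer scoring.

```python
from typing import Callable, Sequence, Tuple


def pass_rate(model: Callable[[str], str], cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of golden cases whose expected text appears in the model's answer."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in model(prompt).lower()
    )
    return passed / len(cases)


def may_ship(candidate_rate: float, live_rate: float, quality_bar: float) -> bool:
    """Release gate: the candidate must clear the agreed bar and not regress against live."""
    return candidate_rate >= quality_bar and candidate_rate >= live_rate
```

The quality bar passed to `may_ship` is the PM-owned number; the harness that produces the pass rates is the SWE-owned part.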

OpenAI API · Anthropic API · AWS SageMaker · Vertex AI · TorchServe · MLflow · Feast · Pinecone · LangChain · Weights & Biases · Guardrails AI · LlamaIndex

PM owns

Which problems AI is solving — build vs. no-build decision
Quality bar and acceptable failure rate
Latency requirements for inference
User feedback mechanisms for model improvement
Which model version is live — release decision
Guardrail requirements — what outputs are unacceptable
Cost-per-inference trade-off decisions

SWE owns

Model training and retraining pipelines
Feature engineering and feature store
Model serving infrastructure and latency optimisation
Prompt engineering and RAG architecture
Evaluation framework and benchmarks
A/B infrastructure for model comparison
Model monitoring and drift detection

Notification & communication layer

Email · Push notifications · SMS · In-app · Webhooks

A dedicated notification system handles all outbound communication across every channel. At scale this is not simply calling an email API from the backend. It includes channel routing, user preference management, frequency capping (preventing spam), delivery tracking, bounce and unsubscribe management, and regulatory compliance with CAN-SPAM and GDPR. Webhook delivery to third-party systems lives here with retry logic, signature verification, and delivery logs.
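Webhook signature verification is worth seeing once in code. The sketch below assumes a common pattern: the sender computes an HMAC-SHA256 over a timestamp and the raw body, and the receiver recomputes it and rejects stale deliveries. The `timestamp.body` message layout and the five-minute tolerance are assumptions for illustration, not any particular provider's scheme.

```python
import hashlib
import hmac


def sign_webhook(secret: bytes, timestamp: int, body: bytes) -> str:
    """Sender side: hex HMAC-SHA256 over '<timestamp>.<raw body>'."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: int, body: bytes, signature: str,
                   now: int, tolerance_seconds: int = 300) -> bool:
    """Receiver side: reject stale deliveries, then compare signatures in constant time."""
    if abs(now - timestamp) > tolerance_seconds:
        return False  # too old or too far in the future: possible replay
    expected = sign_webhook(secret, timestamp, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Signing the raw bytes matters: if the receiver parses and re-serialises the JSON before verifying, the signature will usually no longer match.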

SendGrid · AWS SES · Twilio (SMS) · Firebase FCM · APNs (iOS) · Courier · Customer.io · Knock

PM owns

Notification triggers — what events send what messages
Copy and content for all notification types
Frequency capping and batching rules
User preference controls — what can be disabled
Transactional vs. marketing classification
Regulatory compliance requirements

SWE owns

Notification service architecture
Channel routing logic implementation
Delivery tracking and retry
Template rendering engine
Unsubscribe and preference enforcement
Webhook delivery and signature verification

Payment & billing layer

Payment processing · Subscriptions · Invoicing · Tax calculation · Fraud detection

Payment infrastructure is far more complex than dropping in a checkout form. A production billing layer handles: card processing with 3DS authentication, subscription lifecycle management including trials and failed payment recovery, invoice generation, global tax calculation across jurisdictions, refund logic, and fraud detection. To keep PCI DSS scope as small as possible, raw card data should never touch your servers: the payment processor collects and tokenises it, and your systems only ever handle the token. A failed payment recovery system (dunning) alone can recover 20-30% of involuntary churn.

Stripe · Braintree · Adyen · Stripe Billing · Chargebee · Lago · Avalara (tax) · Stripe Radar (fraud)

PM owns

Pricing model — per-seat, usage-based, flat-rate
Trial length and conversion strategy
Dunning logic — how to recover failed payments
Refund policy
Which currencies and payment methods to support
Plan upgrade / downgrade rules and proration

SWE owns

Payment processor integration and webhook handling
Subscription state machine implementation
PCI DSS compliance scope reduction
Tax calculation integration
Idempotent payment operations
Invoice generation and delivery
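"Idempotent payment operations" deserves a concrete picture, because a double-click or a retried request must never charge a card twice. A minimal sketch follows, assuming the client sends a unique idempotency key with each logical payment. `IdempotentCharger` and `charge_fn` are hypothetical, and the in-memory dictionary stands in for a durable store.

```python
from typing import Callable, Dict, Tuple


class IdempotentCharger:
    """Wraps a charge function so repeated calls with the same key charge only once."""

    def __init__(self, charge_fn: Callable[[str, int], str]) -> None:
        self._charge_fn = charge_fn  # hypothetical processor call returning a charge id
        self._seen: Dict[str, Tuple[Tuple[str, int], str]] = {}

    def charge(self, idempotency_key: str, customer_id: str, amount_cents: int) -> str:
        request = (customer_id, amount_cents)
        if idempotency_key in self._seen:
            original_request, result = self._seen[idempotency_key]
            if original_request != request:
                # Same key, different payment: almost certainly a client bug.
                raise ValueError("idempotency key reused with different parameters")
            return result  # replay the stored outcome instead of charging again
        result = self._charge_fn(customer_id, amount_cents)
        self._seen[idempotency_key] = (request, result)
        return result
```

Amounts are kept as integer cents on purpose: floating-point currency values accumulate rounding errors.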

Feature management & experimentation

Feature flags · A/B testing · Canary releases · Gradual rollouts

At FAANG scale, features are never switched on for everyone at once. Feature flags allow code to exist in production while only activating for specific users — internal testers, a 1% canary group, a specific market, a specific pricing tier. A/B testing infrastructure runs controlled experiments to measure whether a change improves a target metric. The most sophisticated systems run hundreds of simultaneous experiments with statistical rigour built into the platform. Engineers at Meta and Google deploy to production dozens of times per day behind flags.
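Percentage rollouts depend on assigning each user to a stable bucket. A sketch of one common approach, hashing the flag key together with the user id, is shown here; `bucket`, `is_enabled`, and the 10,000-bucket granularity are illustrative choices rather than any vendor's algorithm.

```python
import hashlib
from typing import FrozenSet

BUCKETS = 10_000  # granularity of 0.01 percent


def bucket(user_id: str, flag_key: str) -> int:
    """Deterministic bucket in [0, BUCKETS) for this user and flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(user_id: str, flag_key: str, rollout_percent: float,
               allowlist: FrozenSet[str] = frozenset()) -> bool:
    """True if the flag is on for this user at the given rollout percentage."""
    if user_id in allowlist:
        return True  # internal testers see the feature regardless of rollout
    return bucket(user_id, flag_key) < rollout_percent * BUCKETS / 100
```

Two properties fall out of this design. A user who is in the 1% canary stays in as the rollout grows to 10% and 50%, and including the flag key in the hash means the same users are not always the first to receive every new feature.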

LaunchDarkly · Statsig · Optimizely · Unleash · GrowthBook · Eppo · Internal experimentation platforms

PM owns

Which features launch behind a flag
Rollout plan and targeting criteria
Experiment hypothesis and primary success metrics
Minimum detectable effect and required sample size
Launch / no-launch decision after experiment reads
Kill switch criteria if metrics degrade
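Minimum detectable effect and sample size are linked by arithmetic, and it helps to see how sharply the numbers move. The sketch below uses the standard normal-approximation formula for a two-sided test comparing two conversion rates. The function name and the defaults (5% significance, 80% power) are assumptions for illustration; an experimentation platform may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, minimum_detectable_effect: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in each arm to detect an absolute lift of `minimum_detectable_effect`."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

With a 10% baseline conversion rate, detecting an absolute lift of 2 percentage points needs a little under 4,000 users per arm under these assumptions. Halving the effect to 1 point roughly quadruples that, which is why the minimum detectable effect is a planning decision and not a detail.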

SWE owns

Flag SDK integration across the codebase
User bucketing and assignment logic
Experiment event logging and tracking
Statistical analysis infrastructure
Flag lifecycle management and code cleanup

Data, analytics & warehouse

Event tracking · ETL/ELT pipelines · Data warehouse · BI tools · Real-time analytics

The operational database is optimised for transactions, not analysis. The data layer is a separate system optimised for analytical queries across the full history of your data. Events from the application are streamed into a data warehouse via ETL/ELT pipelines. BI tools query the warehouse to produce dashboards. Data scientists run analyses that drive product decisions. At FAANG scale this is a separate engineering discipline — Data Engineering — with its own teams, tooling, and infrastructure entirely separate from product engineering.

Snowflake · BigQuery · Redshift · dbt · Fivetran · Airbyte · Segment · Amplitude · Mixpanel · Apache Kafka · Apache Flink · Looker

PM owns

What events to track and what properties to capture
Metrics definitions and north star metric
Dashboard and reporting requirements
Event taxonomy — naming conventions
Experiment analysis requirements
Which third-party analytics tools to use

SWE owns

Event instrumentation in application code
ETL/ELT pipeline architecture
Data warehouse schema design
dbt transformation models
Data quality monitoring and alerting
PII handling, masking, and access controls

Infrastructure, DevOps & CI/CD

Cloud · Kubernetes · Terraform · Docker · GitHub Actions · Secrets management

Infrastructure is the platform all other layers run on. At FAANG scale it is entirely code-defined: every server, network rule, database instance, and permission is declared in version-controlled configuration files, not clicked into existence in a web console. Kubernetes orchestrates containers across fleets of machines. CI/CD pipelines automatically test, build, and deploy code changes many times per day. Across an organisation the size of Meta or Google, that adds up to thousands of production deployments daily.

AWS / GCP / Azure · Kubernetes · Docker · Terraform · Pulumi · GitHub Actions · CircleCI · ArgoCD · Helm · HashiCorp Vault · AWS Secrets Manager

PM owns

Uptime SLA requirements — 99.9% vs. 99.99%
Regional deployment requirements — data residency
Disaster recovery objectives — RTO and RPO
Compliance certifications that constrain infrastructure
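The gap between 99.9% and 99.99% looks small until it is converted into minutes. A short worked calculation follows; the function names are illustrative, and the second function assumes components fail independently and that a request needs all of them.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Downtime budget, in minutes, that an availability target leaves over a period."""
    return period_days * 24 * 60 * (1 - availability_percent / 100)


def serial_availability_percent(*component_percents: float) -> float:
    """Availability of a request path that needs every listed component to be up."""
    result = 1.0
    for percent in component_percents:
        result *= percent / 100
    return result * 100
```

Over a year, 99.9% allows about 525.6 minutes of downtime (roughly 8.8 hours), while 99.99% allows about 52.6 minutes. Chaining three 99.9% services in series yields roughly 99.7% end to end, which is why each extra nine gets more expensive as the number of layers grows.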

SWE owns

Cloud architecture and multi-region strategy
Container orchestration and autoscaling rules
Infrastructure as Code
CI/CD pipeline design and maintenance
Secrets management and rotation
Infrastructure cost optimisation

Observability & monitoring

Logging · Metrics · Distributed tracing · Alerting · Error tracking · On-call

You cannot manage what you cannot see. Observability is the ability to understand the internal state of a system from its external outputs. The three pillars are logs — a record of what happened, metrics — measurements of system behaviour over time, and traces — following a single request across every service it touches. At FAANG scale this is a 24/7 engineering discipline: on-call rotations, automated alerting with runbooks, and blameless post-mortems for every significant outage. Without this layer you are flying blind in production.
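Logs only become useful at scale when they are structured and can be joined to a request. A minimal sketch using the standard library follows, assuming one JSON object per log line and a trace id carried in a context variable. `trace_id_var`, `JsonFormatter`, and `build_logger` are illustrative names; a production setup would more likely rely on an OpenTelemetry SDK to propagate trace context.

```python
import contextvars
import json
import logging
from typing import TextIO

# Holds the id of the request currently being handled ("-" outside any request).
trace_id_var: contextvars.ContextVar = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log tooling can filter on fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        })


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    """Create a logger that writes structured lines to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```

If every service sets `trace_id_var` from an incoming request header and forwards the same id downstream, a single search on that id reconstructs the request's path across services.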

Datadog · Prometheus · Grafana · ELK Stack · OpenTelemetry · Jaeger · Sentry · PagerDuty · OpsGenie · AWS CloudWatch

PM owns

SLA definitions that drive alert thresholds
User-facing error rate tolerances
Incident communication requirements — status page
Business-level monitoring — conversion rate drops, payment failures

SWE owns

Logging strategy and structured log format
Metrics instrumentation across all services
Distributed tracing setup
Alert thresholds and on-call rotation
Runbook creation and incident response process
Post-mortem process and follow-through

Security layer

WAF · SIEM · Vulnerability scanning · Penetration testing · Compliance · Secrets rotation

Security is a posture applied across every layer above, but it also has dedicated infrastructure. A Web Application Firewall blocks malicious traffic patterns. A SIEM (security information and event management) system aggregates security events and detects anomalies. Vulnerability scanners continuously check code dependencies and infrastructure for known vulnerabilities. Penetration testing finds weaknesses before attackers do. Compliance frameworks and regulations such as SOC 2, ISO 27001, GDPR, and HIPAA impose specific controls across the entire stack, and in B2B demonstrating compliance is often a hard requirement to close enterprise deals.

Cloudflare WAF · AWS Shield · Snyk · OWASP ZAP · Splunk (SIEM) · AWS Security Hub · Dependabot · Vanta · Drata · Burp Suite

PM owns

Compliance certifications required for target market
Data classification policy — what is PII, what is sensitive
Security incident user notification policy
Penetration test scheduling and scope sign-off
Bug bounty programme decisions

SWE owns

OWASP Top 10 remediation across the stack
Dependency vulnerability monitoring and patching
Secret scanning in CI/CD pipelines
WAF rule configuration
Encryption at rest and in transit
Security audit log implementation

The ownership summary

The PM defines what the system must do and why. The SWE decides how it's built and keeps it running. These boundaries are not rigid — great PMs have technical depth, and great engineers think about product. But accountability is clear, and confusion about it is expensive.

What PMs are accountable for — across all 18 layers

Requirements, user stories, and acceptance criteria
Policy decisions — rate limits, retention, deletion
SLA and performance requirements
Compliance and regulatory obligations
What events to track and how to define metrics
Feature flag rollout strategy and launch decisions
Experiment design — hypothesis, metrics, sample size
Launch / no-launch decisions after data reads
Pricing and billing product logic
Notification content, triggers, and frequency policy
API contract with external partners
Security incident communication
Which AI model versions ship to production
AI guardrail requirements and quality bar
Which platforms and markets to support
Copy, microcopy, and error state language

What SWEs are accountable for — across all 18 layers

Architecture and technology selection at every layer
System design — scalability, reliability, fault tolerance
Database schema, migrations, and query performance
CI/CD pipeline and deployment automation
Infrastructure as Code and cloud configuration
Security implementation — OWASP, encryption, secrets
Caching strategy and cache invalidation
Async job architecture and retry logic
Observability — logging, metrics, tracing, alerting
On-call rotation and incident response
ML model training, serving, and monitoring
Data pipeline and warehouse architecture
Cost optimisation across all infrastructure
Technical debt management and refactoring
Dependency security and patching
Uptime and disaster recovery implementation

The single most important thing to understand

Every layer above exists in a production app — even one with 1,000 users. The difference between a startup and FAANG is scale, redundancy, and team size — not the presence or absence of these layers. A solo engineer builds a simplified version of all 18. A FAANG team of 500 engineers builds a more sophisticated version of the same 18. If you are building a startup and your system has no observability, no caching, no async processing, and no CI/CD — those aren't features you haven't got to yet. They are technical debt accumulating interest every day. Vibe coding gets you a demo. Architecture gets you a product.
