Kubernetes Design & Code Review Checklist
1. Architecture Review
Cluster Design
[ ] Multi-node cluster (avoid single point of failure)
[ ] Separate environments (dev/staging/prod)
[ ] Proper namespace strategy
[ ] ResourceQuota configured
[ ] LimitRange configured
[ ] RBAC enabled
[ ] Network Policies enabled
[ ] Audit logging enabled
[ ] High availability control plane
Namespace Design
[ ] One namespace per application/domain
[ ] Environment isolation
[ ] Consistent naming convention
[ ] Quotas per namespace
Example:
production-payment
production-order
staging-payment
staging-order
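The namespace, quota, and limit items above can be sketched as one manifest. This is a minimal illustration, not a recommended sizing: the object names `payment-quota` and `payment-defaults` and every numeric value are assumptions chosen for the example.

```yaml
# Namespace following the <environment>-<application> convention shown above.
apiVersion: v1
kind: Namespace
metadata:
  name: production-payment
  labels:
    env: prod
---
# Caps aggregate resource consumption in the namespace (illustrative values).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payment-quota            # hypothetical name
  namespace: production-payment
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
# Supplies default requests and limits for containers that omit them.
apiVersion: v1
kind: LimitRange
metadata:
  name: payment-defaults         # hypothetical name
  namespace: production-payment
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      default:
        cpu: 500m
        memory: 512Mi
```

Once a ResourceQuota constrains CPU or memory, Pods that specify no requests or limits for those resources are rejected, which is why the quota is usually paired with a LimitRange that fills in defaults.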
2. Workload Review
Workload Type Selection
| Requirement | Kubernetes Resource |
| Stateless API | Deployment |
| Background Worker | Deployment |
| Database | StatefulSet |
| Cache | StatefulSet |
| Scheduled Task | CronJob |
| One-time Task | Job |
| Daemon on every node | DaemonSet |
Checklist:
[ ] Correct workload type selected
[ ] One responsibility per workload
[ ] Horizontal scaling supported
3. Deployment Review
Deployment Configuration
[ ] replicas > 1 in production
[ ] RollingUpdate strategy used
[ ] revisionHistoryLimit configured
[ ] Labels applied to both the Deployment and its Pod template
[ ] Selector matches the Pod template labels
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
Labels
Required labels:
labels:
  app: payment-api
  version: v1.2.0
  env: prod
  team: backend
Checklist:
[ ] app
[ ] version
[ ] env
[ ] team
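To show where the Deployment settings and labels from this section sit relative to each other, here is a sketch of a complete Deployment. The `revisionHistoryLimit` value, the namespace, and the image name are assumptions made for the example.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production-payment    # assumed namespace
  labels:
    app: payment-api
    version: v1.2.0
    env: prod
    team: backend
spec:
  replicas: 2                      # more than one replica in production
  revisionHistoryLimit: 5          # assumed value; old ReplicaSets kept for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-api             # must match the Pod template labels below
  template:
    metadata:
      labels:
        app: payment-api
        version: v1.2.0
        env: prod
        team: backend
    spec:
      containers:
        - name: payment-api
          image: api:v1.2.0        # hypothetical image, pinned to a version tag
```

The selector uses only the stable `app` label. A Deployment's selector cannot be changed after creation in `apps/v1`, so including `version` there would block later upgrades.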
4. Container Review
Container Image
Checklist:
Bad:
image: api:latest
Good:
image: api:v1.3.5
Security Context
Checklist:
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
5. Resource Management
CPU & Memory
Checklist:
[ ] CPU request
[ ] CPU limit
[ ] Memory request
[ ] Memory limit
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Autoscaling
Checklist:
Example:
minReplicas: 2
maxReplicas: 10
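The two replica bounds above belong to a HorizontalPodAutoscaler. A minimal sketch of the full object follows, assuming a Deployment named `payment-api` and a 70% CPU utilization target; both are illustrative choices, not values from this checklist.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api              # assumed target Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target, percent of requested CPU
```

Utilization is measured against the containers' CPU requests, so this only works when the requests from the CPU & Memory checklist are set.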
6. Health Checks
Liveness Probe
Checklist:
[ ] Configured
[ ] Fast endpoint
livenessProbe:
  httpGet:
    path: /health
    port: 8080
Readiness Probe
Checklist:
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
Startup Probe
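A startup probe is one way to protect slow-starting containers from being killed by the liveness probe. The sketch below reuses the `/health` endpoint and port from the liveness example; the period and threshold are assumptions and should be sized to the application's real startup time.

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 x 10s = 300s for startup
```

Liveness and readiness probes do not run until the startup probe has succeeded, so their own thresholds can stay tight.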
7. Networking Review
Service Review
Ingress Review
Network Policies
Checklist:
[ ] Default deny policy
[ ] Explicit allow rules
[ ] Namespace isolation
kind: NetworkPolicy
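The default-deny and explicit-allow items above could look roughly like the following pair of policies. The namespace names come from the earlier naming example; the policy names, the `app: payment-api` label, and port 8080 are assumptions for illustration.

```yaml
# Default deny: selects every Pod in the namespace and allows no traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all           # hypothetical name
  namespace: production-payment
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Explicit allow: only Pods in production-order may reach payment-api on 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-payment     # hypothetical name
  namespace: production-payment
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production-order
      ports:
        - protocol: TCP
          port: 8080
```

Two caveats apply to this sketch: policies are only enforced when the cluster's network plugin supports NetworkPolicy, and denying all egress also blocks DNS lookups until a rule allows them.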
8. Storage Review
Persistent Volumes
Stateful Applications
9. Configuration Management
ConfigMap
Secret Management
Checklist:
[ ] No secrets in Git
[ ] No secrets in ConfigMap
[ ] Rotation process defined
[ ] External secret manager preferred
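As a small illustration of keeping secret values out of manifests and ConfigMaps, a container can reference a Secret by name and key. The Secret `payment-db`, its key `password`, and the variable `DB_PASSWORD` are hypothetical; the Secret itself is assumed to be created outside Git, for example by an external secret manager.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-api
spec:
  containers:
    - name: payment-api
      image: api:v1.3.5
      env:
        - name: DB_PASSWORD        # hypothetical variable name
          valueFrom:
            secretKeyRef:
              name: payment-db     # hypothetical Secret, created out of band
              key: password
```

Only the reference lives in version control. Rotating the value means updating the Secret and restarting the Pods, since environment variables are read once at container start.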
10. Security Review
RBAC
Checklist:
[ ] Least privilege principle
[ ] Dedicated ServiceAccounts
[ ] No cluster-admin usage
Bad:
cluster-admin
Good:
Role
RoleBinding
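The "Good" pattern above can be sketched as a dedicated ServiceAccount bound to a namespaced Role. The names and the single read-only rule on ConfigMaps are assumptions; the actual rules depend on what the workload needs.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: production-payment
---
# Namespaced Role granting read-only access to ConfigMaps and nothing else.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payment-config-reader      # hypothetical name
  namespace: production-payment
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
# Binds the Role to the dedicated ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payment-config-reader
  namespace: production-payment
subjects:
  - kind: ServiceAccount
    name: payment-api
    namespace: production-payment
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payment-config-reader
```

Because both Role and RoleBinding are namespaced, the permissions stop at the `production-payment` boundary, unlike a ClusterRoleBinding to `cluster-admin`.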
Pod Security
Checklist:
[ ] Non-root containers
[ ] No privileged mode
[ ] Seccomp profile
[ ] AppArmor profile
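A sketch of how the first three items map onto Pod and container fields is shown below; the Pod name and image are placeholders. AppArmor is left out on purpose because its configuration differs by Kubernetes version (newer releases expose a `securityContext` field for it, older ones rely on annotations), so check the documentation for the cluster version in use.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-api
spec:
  securityContext:
    runAsNonRoot: true             # non-root containers
    seccompProfile:
      type: RuntimeDefault         # seccomp profile from the container runtime
  containers:
    - name: payment-api
      image: api:v1.3.5
      securityContext:
        privileged: false          # no privileged mode
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```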
Supply Chain Security
11. Reliability Review
High Availability
Checklist:
podAntiAffinity:
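The fragment above can be expanded as follows. This is a sketch that assumes the Pods carry the label `app: payment-api` and that a soft (preferred) rule is acceptable; a required rule would instead leave replicas unschedulable when no separate node is free.

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payment-api
          # Spread across nodes; use topology.kubernetes.io/zone to spread across zones.
          topologyKey: kubernetes.io/hostname
```

This block goes under `spec.template.spec` of the Deployment.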
Pod Disruption Budget
Checklist:
minAvailable: 1
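For context, `minAvailable` is a field of a PodDisruptionBudget. A minimal sketch, assuming the Pods are labeled `app: payment-api`:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payment-api
```

A PodDisruptionBudget limits voluntary disruptions such as node drains. It does not protect against node crashes, and with a single replica `minAvailable: 1` blocks drains entirely.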
Graceful Shutdown
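One common way to approach graceful shutdown is a short `preStop` delay combined with a termination grace period. The sketch below is an assumption-laden example: the 5-second sleep is arbitrary, and the `exec` hook requires a shell in the image.

```yaml
spec:
  terminationGracePeriodSeconds: 30   # 30 is the Kubernetes default
  containers:
    - name: payment-api
      image: api:v1.3.5
      lifecycle:
        preStop:
          exec:
            # Brief pause so endpoints can be removed before SIGTERM arrives.
            command: ["sh", "-c", "sleep 5"]
```

The kubelet runs the `preStop` hook, then sends SIGTERM, and sends SIGKILL once the grace period expires. Time spent in the hook counts against the grace period, and the application still has to handle SIGTERM by finishing in-flight requests.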
12. Observability Review
Logging
Metrics
Checklist:
[ ] CPU metrics
[ ] Memory metrics
[ ] Request metrics
[ ] Error metrics
[ ] Business metrics
Tracing
13. CI/CD Review
Deployment Pipeline
Checklist:
[ ] Automated build
[ ] Automated test
[ ] Automated deployment
[ ] Rollback support
GitOps
Checklist:
[ ] Git as source of truth
[ ] Pull-based deployment
[ ] Drift detection enabled
Deployment Strategies
14. Cost Optimization
Checklist:
[ ] Requests properly sized
[ ] HPA configured
[ ] Cluster Autoscaler configured
[ ] Spot instances evaluated
[ ] Unused resources removed
15. Disaster Recovery
Backup
Recovery
Checklist:
[ ] Restore procedure documented
[ ] Recovery tested regularly
[ ] Recovery Time Objective (RTO) defined
[ ] Recovery Point Objective (RPO) defined
16. Production Readiness Scorecard
| Category | Target Score |
| Security | 9/10 |
| Reliability | 9/10 |
| Scalability | 9/10 |
| Observability | 9/10 |
| Maintainability | 9/10 |
| Cost Optimization | 8/10 or higher |
| Disaster Recovery | 8/10 or higher |
Final Production Review Questions
[ ] Will it survive a Pod crash?
[ ] Will it survive a Node crash?
[ ] Will it survive a Zone failure?
[ ] Can it scale automatically?
[ ] Can it be deployed with zero downtime?
[ ] Can it be rolled back safely?
[ ] Is it secure by default?
[ ] Is it observable?
[ ] Can another engineer maintain it?
[ ] Can it run at 3 AM without waking me up?
If all answers are YES, the Kubernetes platform/workload is considered Production Ready.