====== Kubernetes Design & Code Review Checklist ======
===== 1. Architecture Review =====
==== Cluster Design ====
* [ ] Multi-node cluster (avoid single point of failure)
* [ ] Separate environments (dev/staging/prod)
* [ ] Proper namespace strategy
* [ ] ResourceQuota configured
* [ ] LimitRange configured
* [ ] RBAC enabled
* [ ] Network Policies enabled
* [ ] Audit logging enabled
* [ ] High availability control plane
==== Namespace Design ====
* [ ] One namespace per application/domain
* [ ] Environment isolation
* [ ] Consistent naming convention
* [ ] Quotas per namespace
Example:
  production-payment
  production-order
  staging-payment
  staging-order
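
The quota items in this section can be sketched for one of the namespaces above. This is a minimal illustration, not a recommended sizing: the object names (''compute-quota'', ''container-defaults'') and all numeric values are assumptions chosen for the example.

<code yaml>
apiVersion: v1
kind: Namespace
metadata:
  name: production-payment
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota            # hypothetical name
  namespace: production-payment
spec:
  hard:
    requests.cpu: "4"            # illustrative values; size to the workload
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults       # hypothetical name
  namespace: production-payment
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container sets no requests
        cpu: 250m
        memory: 256Mi
      default:                   # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
</code>

Once a ResourceQuota covers CPU or memory, Pods that declare no requests or limits are rejected unless a LimitRange supplies defaults, which is why the two are usually reviewed together.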
----
===== 2. Workload Review =====
==== Workload Type Selection ====
^ Requirement ^ Kubernetes Resource ^
| Stateless API | Deployment |
| Background Worker | Deployment |
| Database | StatefulSet |
| Cache (persistent or clustered) | StatefulSet |
| Scheduled Task | CronJob |
| One-time Task | Job |
| Daemon on every node | DaemonSet |
Checklist:
* [ ] Correct workload type selected
* [ ] One responsibility per workload
* [ ] Horizontal scaling supported
----
===== 3. Deployment Review =====
==== Deployment Configuration ====
* [ ] replicas > 1 in production
* [ ] RollingUpdate strategy used
* [ ] revisionHistoryLimit configured
* [ ] Proper labels
* [ ] Proper selectors
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
==== Labels ====
Required labels:
  labels:
    app: payment-api
    version: v1.2.0
    env: prod
    team: backend
Checklist:
* [ ] app
* [ ] version
* [ ] env
* [ ] team
----
===== 4. Container Review =====
==== Container Image ====
Checklist:
* [ ] No latest tag
* [ ] Immutable version tag (never re-pushed), or image pinned by digest
* [ ] Trusted registry
* [ ] Vulnerability scan performed
Bad:
  image: api:latest
Good:
  image: api:v1.3.5
==== Security Context ====
Checklist:
* [ ] runAsNonRoot
* [ ] readOnlyRootFilesystem
* [ ] allowPrivilegeEscalation=false
* [ ] capabilities dropped
  securityContext:
    runAsNonRoot: true
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
----
===== 5. Resource Management =====
==== CPU & Memory ====
Checklist:
* [ ] CPU request
* [ ] CPU limit
* [ ] Memory request
* [ ] Memory limit
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"
==== Autoscaling ====
Checklist:
* [ ] HPA configured
* [ ] minReplicas defined
* [ ] maxReplicas defined
* [ ] CPU target configured
* [ ] Memory target configured
Example:
  minReplicas: 2
  maxReplicas: 10
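
For context, the two fields above belong to a HorizontalPodAutoscaler. A minimal sketch using the ''autoscaling/v2'' API follows; the target Deployment name ''payment-api'' is borrowed from the labels example, and the utilization thresholds are assumptions, not recommendations.

<code yaml>
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api            # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed threshold, percent of CPU request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # assumed threshold, percent of memory request
</code>

Utilization targets are computed against container requests, so the HPA only behaves predictably when the requests from the CPU & Memory checklist are set.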
----
===== 6. Health Checks =====
==== Liveness Probe ====
Checklist:
* [ ] Configured
* [ ] Fast endpoint
  livenessProbe:
    httpGet:
      path: /health
      port: 8080
==== Readiness Probe ====
Checklist:
* [ ] Configured
* [ ] Service traffic blocked until ready
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
==== Startup Probe ====
Checklist:
* [ ] Configured for slow startup applications
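
A startup probe for a slow-starting container might look like the sketch below. It reuses the ''/health'' endpoint and port from the liveness example; the timing values are assumptions to be tuned per application.

<code yaml>
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30    # allows up to 30 x 10s = 300s for startup
</code>

Liveness and readiness probes do not run until the startup probe has succeeded, so a generous startup budget does not slow down failure detection afterwards.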
----
===== 7. Networking Review =====
==== Service Review ====
Checklist:
* [ ] ClusterIP for internal services
* [ ] LoadBalancer only when required
* [ ] NodePort avoided
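
As an illustration of the first item, an internal Service could be declared as follows. The name, selector, and ports are assumptions carried over from the earlier examples.

<code yaml>
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  type: ClusterIP          # the default type; reachable only inside the cluster
  selector:
    app: payment-api       # must match the Pod labels
  ports:
    - name: http
      port: 80             # port exposed by the Service
      targetPort: 8080     # container port, as used in the probe examples
</code>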
==== Ingress Review ====
Checklist:
* [ ] TLS enabled
* [ ] HTTPS redirect enabled
* [ ] Rate limiting configured
* [ ] WAF considered
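
The TLS, redirect, and rate-limiting items can be sketched as a single Ingress. This example assumes the ingress-nginx controller, since the two annotations are specific to it; other controllers use different mechanisms. The host name, TLS Secret name, and rate value are hypothetical.

<code yaml>
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api
  annotations:
    # Both annotations are ingress-nginx specific (assumed controller).
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rps: "20"    # illustrative limit
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - payments.example.com                   # hypothetical host
      secretName: payments-example-com-tls       # hypothetical TLS Secret
  rules:
    - host: payments.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-api
                port:
                  number: 80
</code>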
==== Network Policies ====
Checklist:
* [ ] Default deny policy
* [ ] Explicit allow rules
* [ ] Namespace isolation
kind: NetworkPolicy
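
A fuller sketch of the default-deny and explicit-allow pattern is shown below. The policy names are hypothetical, and the namespaces and port are taken from earlier examples in this checklist.

<code yaml>
# Deny all ingress and egress for every Pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all               # hypothetical name
  namespace: production-payment
spec:
  podSelector: {}                      # empty selector matches all Pods
  policyTypes:
    - Ingress
    - Egress
---
# Explicitly allow the order namespace to reach the payment API.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-payment         # hypothetical name
  namespace: production-payment
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production-order
      ports:
        - protocol: TCP
          port: 8080
</code>

A default-deny egress rule also blocks DNS lookups, so a real policy set normally adds an egress allowance for the cluster DNS service. Policies take effect only when the cluster's network plugin enforces NetworkPolicy.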
----
===== 8. Storage Review =====
==== Persistent Volumes ====
Checklist:
* [ ] Dynamic provisioning
* [ ] StorageClass used
* [ ] Backup strategy exists
* [ ] Recovery tested
==== Stateful Applications ====
Checklist:
* [ ] StatefulSet used
* [ ] PVC attached
* [ ] Data persistence verified
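
The StatefulSet and PVC items can be combined through ''volumeClaimTemplates'', which creates one claim per replica. In this sketch the workload name, image, StorageClass name (''fast-ssd''), and size are all assumptions.

<code yaml>
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payment-db                     # hypothetical name
spec:
  serviceName: payment-db              # headless Service that governs Pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: payment-db
  template:
    metadata:
      labels:
        app: payment-db
    spec:
      containers:
        - name: db
          image: registry.example.com/postgres:16.3   # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd     # assumed StorageClass (dynamic provisioning)
        resources:
          requests:
            storage: 20Gi
</code>

Claims created this way are not deleted when the StatefulSet is scaled down or removed by default, which helps persistence but means unused volumes need explicit cleanup.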
----
===== 9. Configuration Management =====
==== ConfigMap ====
Checklist:
* [ ] Only non-sensitive data
* [ ] Version controlled
* [ ] Environment specific
==== Secret Management ====
Checklist:
* [ ] No secrets in Git
* [ ] No secrets in ConfigMap
* [ ] Rotation process defined
* [ ] External secret manager preferred
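
To illustrate the split between the two checklists, a container can read sensitive values from a Secret and non-sensitive values from a ConfigMap. The object names and keys below are hypothetical; the Secret itself is assumed to be created outside Git, for example by an external secret manager.

<code yaml>
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: payment-db-credentials   # hypothetical Secret
        key: password
  - name: LOG_LEVEL
    valueFrom:
      configMapKeyRef:
        name: payment-api-config       # hypothetical ConfigMap
        key: log-level
</code>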
----
===== 10. Security Review =====
==== RBAC ====
Checklist:
* [ ] Least privilege principle
* [ ] Dedicated ServiceAccounts
* [ ] No cluster-admin usage
Bad:
  Workload ServiceAccount bound to cluster-admin
Good:
  Role and RoleBinding scoped to the workload's namespace
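
A least-privilege setup along these lines could look like the sketch below: a dedicated ServiceAccount that may only read ConfigMaps in its own namespace. All names and the chosen permissions are assumptions for illustration.

<code yaml>
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: production-payment
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: config-reader                  # hypothetical name
  namespace: production-payment
rules:
  - apiGroups: [""]                    # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payment-api-config-reader      # hypothetical name
  namespace: production-payment
subjects:
  - kind: ServiceAccount
    name: payment-api
    namespace: production-payment
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: config-reader
</code>

The workload opts in by setting ''serviceAccountName: payment-api'' in its Pod spec.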
==== Pod Security ====
Checklist:
* [ ] Non-root containers
* [ ] No privileged mode
* [ ] Seccomp profile
* [ ] AppArmor profile
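
The items above map onto Pod-level and container-level ''securityContext'' fields roughly as sketched here. The container name and image are reused from earlier examples. The ''appArmorProfile'' field is assumed to be available (Kubernetes v1.30 or later); older clusters configure AppArmor through an annotation instead.

<code yaml>
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault       # container runtime's default seccomp filter
    appArmorProfile:             # assumes Kubernetes v1.30+
      type: RuntimeDefault
  containers:
    - name: api
      image: api:v1.3.5
      securityContext:
        privileged: false
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
</code>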
==== Supply Chain Security ====
Checklist:
* [ ] Image signing
* [ ] SBOM generated
* [ ] Vulnerability scanning
----
===== 11. Reliability Review =====
==== High Availability ====
Checklist:
* [ ] Multiple replicas
* [ ] Pod anti-affinity
* [ ] Multi-zone deployment
podAntiAffinity:
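
Expanded, the anti-affinity and multi-zone items might be expressed in the Pod spec as follows. This is a sketch using soft (preferred) rules so that scheduling still succeeds on small clusters; the ''app: payment-api'' label is assumed from the labels example.

<code yaml>
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payment-api
          topologyKey: kubernetes.io/hostname       # spread replicas across nodes
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone        # spread replicas across zones
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: payment-api
</code>

Switching to ''requiredDuringSchedulingIgnoredDuringExecution'' or ''whenUnsatisfiable: DoNotSchedule'' makes the spread a hard guarantee, at the cost of Pods staying Pending when no suitable node exists.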
==== Pod Disruption Budget ====
Checklist:
* [ ] PDB configured
minAvailable: 1
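
In a complete manifest, that field sits inside a PodDisruptionBudget such as the sketch below. The name and selector are assumptions based on the earlier labels example.

<code yaml>
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payment-api
</code>

With a single replica, ''minAvailable: 1'' blocks voluntary evictions entirely and can stall node drains, so it pairs with the "replicas > 1" item in the Deployment review.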
==== Graceful Shutdown ====
Checklist:
* [ ] SIGTERM handled
* [ ] preStop hook configured
* [ ] terminationGracePeriodSeconds set
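
The last two items could be configured as in the sketch below. The durations are assumptions, and the ''exec'' hook assumes the image ships a ''sleep'' binary.

<code yaml>
spec:
  terminationGracePeriodSeconds: 45    # total budget: preStop plus SIGTERM handling
  containers:
    - name: api
      image: api:v1.3.5
      lifecycle:
        preStop:
          exec:
            # Short pause so endpoints and load balancers stop routing
            # traffic before the container receives SIGTERM.
            command: ["sleep", "10"]
</code>

The grace period has to cover both the preStop delay and the time the application needs to finish in-flight requests after SIGTERM; when it expires, the container is killed with SIGKILL.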
----
===== 12. Observability Review =====
==== Logging ====
Checklist:
* [ ] Centralized logging
* [ ] Structured JSON logs
* [ ] Correlation ID support
==== Metrics ====
Checklist:
* [ ] CPU metrics
* [ ] Memory metrics
* [ ] Request metrics
* [ ] Error metrics
* [ ] Business metrics
==== Tracing ====
Checklist:
* [ ] Distributed tracing enabled
* [ ] Request correlation supported
----
===== 13. CI/CD Review =====
==== Deployment Pipeline ====
Checklist:
* [ ] Automated build
* [ ] Automated test
* [ ] Automated deployment
* [ ] Rollback support
==== GitOps ====
Checklist:
* [ ] Git as source of truth
* [ ] Pull-based deployment
* [ ] Drift detection enabled
==== Deployment Strategies ====
Checklist:
* [ ] Rolling deployment
* [ ] Canary deployment
* [ ] Blue-Green deployment
----
===== 14. Cost Optimization =====
Checklist:
* [ ] Requests properly sized
* [ ] HPA configured
* [ ] Cluster Autoscaler configured
* [ ] Spot instances evaluated
* [ ] Unused resources removed
----
===== 15. Disaster Recovery =====
==== Backup ====
Checklist:
* [ ] Database backup
* [ ] Persistent volume backup
* [ ] Secret backup
* [ ] Configuration backup
==== Recovery ====
Checklist:
* [ ] Restore procedure documented
* [ ] Recovery tested regularly
* [ ] Recovery Time Objective (RTO) defined
* [ ] Recovery Point Objective (RPO) defined
----
===== 16. Production Readiness Scorecard =====
^ Category ^ Minimum Target Score ^
| Security | 9/10 |
| Reliability | 9/10 |
| Scalability | 9/10 |
| Observability | 9/10 |
| Maintainability | 9/10 |
| Cost Optimization | 8/10 |
| Disaster Recovery | 8/10 |
----
===== Final Production Review Questions =====
- [ ] Will it survive a Pod crash?
- [ ] Will it survive a Node crash?
- [ ] Will it survive a Zone failure?
- [ ] Can it scale automatically?
- [ ] Can it be deployed with zero downtime?
- [ ] Can it be rolled back safely?
- [ ] Is it secure by default?
- [ ] Is it observable?
- [ ] Can another engineer maintain it?
- [ ] Can it run at 3 AM without waking me up?
If all answers are YES, the Kubernetes platform/workload is considered Production Ready.