====== Kubernetes Design & Code Review Checklist ======

===== 1. Architecture Review =====

==== Cluster Design ====

  * [ ] Multi-node cluster (avoid single point of failure)
  * [ ] Separate environments (dev/staging/prod)
  * [ ] Proper namespace strategy
  * [ ] ResourceQuota configured
  * [ ] LimitRange configured
  * [ ] RBAC enabled
  * [ ] Network Policies enforced (the CNI plugin must support NetworkPolicy)
  * [ ] Audit logging enabled
  * [ ] High availability control plane

==== Namespace Design ====

  * [ ] One namespace per application/domain
  * [ ] Environment isolation
  * [ ] Consistent naming convention
  * [ ] Quotas per namespace

Example:

<code yaml>
production-payment
production-order
staging-payment
staging-order
</code>
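
The per-namespace quota items above can be sketched as a ResourceQuota plus a LimitRange. This is a minimal illustration, not a recommended sizing: the object names and every number are assumptions, and only the namespace name comes from the example above.

<code yaml>
# Caps the total compute the namespace may request (all values are illustrative).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota            # hypothetical name
  namespace: production-payment
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
# Supplies default requests/limits to containers that omit them.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults       # hypothetical name
  namespace: production-payment
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      default:
        cpu: 500m
        memory: 512Mi
</code>

Once a ResourceQuota constrains CPU or memory, Pods that do not declare those values are rejected, which is why the two objects are usually paired.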

----

===== 2. Workload Review =====

==== Workload Type Selection ====

^ Requirement ^ Kubernetes Resource ^
| Stateless API | Deployment |
| Background Worker | Deployment |
| Database | StatefulSet |
| Cache (persistent or clustered) | StatefulSet |
| Scheduled Task | CronJob |
| One-time Task | Job |
| Daemon on every node | DaemonSet |

Checklist:

  * [ ] Correct workload type selected
  * [ ] One responsibility per workload
  * [ ] Horizontal scaling supported

----

===== 3. Deployment Review =====

==== Deployment Configuration ====

  * [ ] replicas > 1 in production
  * [ ] RollingUpdate strategy used
  * [ ] revisionHistoryLimit configured
  * [ ] Proper labels
  * [ ] Proper selectors

<code yaml>
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
</code>

==== Labels ====

Required labels:

<code yaml>
labels:
  app: payment-api
  version: v1.2.0
  env: prod
  team: backend
</code>

Checklist:

  * [ ] app
  * [ ] version
  * [ ] env
  * [ ] team

----

===== 4. Container Review =====

==== Container Image ====

Checklist:

  * [ ] No latest tag
  * [ ] Immutable version tag or pinned image digest
  * [ ] Trusted registry
  * [ ] Vulnerability scan performed

Bad:

<code yaml>
image: api:latest
</code>

Good:

<code yaml>
image: api:v1.3.5
</code>

==== Security Context ====

Checklist:

  * [ ] runAsNonRoot
  * [ ] readOnlyRootFilesystem
  * [ ] allowPrivilegeEscalation=false
  * [ ] capabilities dropped

<code yaml>
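# Container-level securityContext
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]   # add back only the capabilities the process needs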
</code>

----

===== 5. Resource Management =====

==== CPU & Memory ====

Checklist:

  * [ ] CPU request
  * [ ] CPU limit
  * [ ] Memory request
  * [ ] Memory limit

<code yaml>
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
</code>

==== Autoscaling ====

Checklist:

  * [ ] HPA configured
  * [ ] minReplicas defined
  * [ ] maxReplicas defined
  * [ ] CPU target configured
  * [ ] Memory target configured

Example:

<code yaml>
minReplicas: 2
maxReplicas: 10
</code>
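
For context, the two fields above live inside a HorizontalPodAutoscaler. The following is a minimal sketch using the autoscaling/v2 API; the target Deployment name, namespace, and utilization percentages are assumptions.

<code yaml>
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
  namespace: production-payment   # assumed namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api             # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # illustrative target
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80  # illustrative target
</code>

Utilization targets are computed relative to each container's resource requests, so the HPA only works as intended when CPU and memory requests are set.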

----

===== 6. Health Checks =====

==== Liveness Probe ====

Checklist:

  * [ ] Configured
  * [ ] Fast endpoint

<code yaml>
livenessProbe:
  httpGet:
    path: /health
    port: 8080
</code>

==== Readiness Probe ====

Checklist:

  * [ ] Configured
  * [ ] Pod receives Service traffic only after the probe passes

<code yaml>
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
</code>

==== Startup Probe ====

Checklist:

  * [ ] Configured for slow startup applications
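
A startup probe for a slow-starting application could be sketched as follows. The path and port reuse the liveness example above; the timing values are assumptions and should be sized to the application's real startup time.

<code yaml>
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5        # illustrative
  failureThreshold: 30    # allows up to 5 s x 30 = 150 s for startup
</code>

Liveness and readiness probes do not run until the startup probe succeeds, which keeps a slow boot from being mistaken for a hung container.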

----

===== 7. Networking Review =====

==== Service Review ====

Checklist:

  * [ ] ClusterIP for internal services
  * [ ] LoadBalancer only when required
  * [ ] NodePort avoided

==== Ingress Review ====

Checklist:

  * [ ] TLS enabled
  * [ ] HTTPS redirect enabled
  * [ ] Rate limiting configured
  * [ ] WAF considered
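
As an illustration of the TLS, redirect, and rate-limiting items, here is a sketch that assumes the ingress-nginx controller. The annotations are specific to that controller and other controllers use different mechanisms. The hostname, TLS Secret name, backend Service, and rate value are all hypothetical.

<code yaml>
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api
  namespace: production-payment   # assumed namespace
  annotations:
    # ingress-nginx specific; other controllers differ
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rps: "20"   # illustrative rate
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - payments.example.com            # hypothetical host
      secretName: payments-example-com-tls   # hypothetical TLS Secret
  rules:
    - host: payments.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-api         # assumed Service name
                port:
                  number: 80
</code>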

==== Network Policies ====

Checklist:

  * [ ] Default deny policy
  * [ ] Explicit allow rules
  * [ ] Namespace isolation

<code yaml>
kind: NetworkPolicy
</code>
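
A default-deny policy followed by one explicit allow rule might look like this sketch. The namespaces reuse the naming example from section 1; the policy names, the app label, and the port are assumptions.

<code yaml>
# Deny all ingress and egress for every Pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all          # hypothetical name
  namespace: production-payment
spec:
  podSelector: {}                 # empty selector matches all Pods
  policyTypes:
    - Ingress
    - Egress
---
# Explicitly allow the order namespace to reach payment-api on one port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-payment    # hypothetical name
  namespace: production-payment
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production-order
      ports:
        - protocol: TCP
          port: 8080
</code>

With egress denied by default, workloads also need an explicit egress rule for DNS before name resolution works.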

----

===== 8. Storage Review =====

==== Persistent Volumes ====

Checklist:

  * [ ] Dynamic provisioning
  * [ ] StorageClass used
  * [ ] Backup strategy exists
  * [ ] Recovery tested

==== Stateful Applications ====

Checklist:

  * [ ] StatefulSet used
  * [ ] PVC attached
  * [ ] Data persistence verified
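
The StatefulSet and PVC items can be sketched with volumeClaimTemplates, which create one PersistentVolumeClaim per replica through dynamic provisioning. Every name here (the workload, image, mount path, and StorageClass) and the storage size are hypothetical.

<code yaml>
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: order-db                  # hypothetical workload
  namespace: production-order
spec:
  serviceName: order-db           # headless Service assumed to exist
  replicas: 3
  selector:
    matchLabels:
      app: order-db
  template:
    metadata:
      labels:
        app: order-db
    spec:
      containers:
        - name: db
          image: registry.example.com/order-db:v1.0.0   # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data                  # hypothetical path
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd   # hypothetical StorageClass
        resources:
          requests:
            storage: 20Gi            # illustrative size
</code>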

----

===== 9. Configuration Management =====

==== ConfigMap ====

Checklist:

  * [ ] Only non-sensitive data
  * [ ] Version controlled
  * [ ] Environment specific

==== Secret Management ====

Checklist:

  * [ ] No secrets in Git
  * [ ] No secrets in ConfigMap
  * [ ] Rotation process defined
  * [ ] External secret manager preferred
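
To illustrate the split between the two, a container spec can load non-sensitive settings from a ConfigMap and a credential from a Secret. This is a fragment of a Pod template, and the ConfigMap name, Secret name, key, and variable name are hypothetical.

<code yaml>
containers:
  - name: payment-api
    image: payment-api:v1.2.0         # hypothetical image reference
    envFrom:
      - configMapRef:
          name: payment-api-config    # non-sensitive settings only
    env:
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: payment-api-db      # Secret created outside Git
            key: password
</code>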

----

===== 10. Security Review =====

==== RBAC ====

Checklist:

  * [ ] Least privilege principle
  * [ ] Dedicated ServiceAccounts
  * [ ] No cluster-admin usage

Bad:

<code yaml>
cluster-admin
</code>

Good:

<code yaml>
Role
RoleBinding
</code>
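
A namespaced, least-privilege setup along the lines of the "Good" case could be sketched as below. The ServiceAccount, Role name, and the specific permission (read-only access to ConfigMaps) are assumptions chosen for illustration.

<code yaml>
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: production-payment
---
# Grants read-only access to ConfigMaps in this namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payment-config-reader     # hypothetical name
  namespace: production-payment
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payment-config-reader
  namespace: production-payment
subjects:
  - kind: ServiceAccount
    name: payment-api
    namespace: production-payment
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payment-config-reader
</code>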

==== Pod Security ====

Checklist:

  * [ ] Non-root containers
  * [ ] No privileged mode
  * [ ] Seccomp profile
  * [ ] AppArmor profile
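
One way to enforce several of these items is Pod Security Admission, which is configured through namespace labels. The sketch below assumes the built-in restricted profile is acceptable for the workload; the namespace name reuses the earlier example.

<code yaml>
apiVersion: v1
kind: Namespace
metadata:
  name: production-payment
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant Pods
    pod-security.kubernetes.io/warn: restricted      # also warn on apply
</code>

A matching Pod-level securityContext fragment, using the container runtime's default seccomp profile:

<code yaml>
securityContext:
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault
</code>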

==== Supply Chain Security ====

Checklist:

  * [ ] Image signing
  * [ ] SBOM generated
  * [ ] Vulnerability scanning

----

===== 11. Reliability Review =====

==== High Availability ====

Checklist:

  * [ ] Multiple replicas
  * [ ] Pod anti-affinity
  * [ ] Multi-zone deployment

<code yaml>
podAntiAffinity:
</code>
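
Expanded into a Pod template fragment, anti-affinity across nodes combined with spreading across zones might look like this sketch. The app label is an assumption, and the soft (preferred) variants are used so that scheduling still succeeds on small clusters.

<code yaml>
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payment-api
          topologyKey: kubernetes.io/hostname   # prefer different nodes
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone    # spread across zones
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: payment-api
</code>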

==== Pod Disruption Budget ====

Checklist:

  * [ ] PDB configured

<code yaml>
minAvailable: 1
</code>
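
The field above belongs to a PodDisruptionBudget. A complete object could be sketched as follows, with the name, namespace, and selector label as assumptions.

<code yaml>
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api
  namespace: production-payment   # assumed namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payment-api            # must match the workload's Pod labels
</code>

A PDB limits voluntary disruptions such as node drains; it does not protect against node crashes, so it complements rather than replaces multiple replicas.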

==== Graceful Shutdown ====

Checklist:

  * [ ] SIGTERM handled
  * [ ] preStop hook configured
  * [ ] terminationGracePeriodSeconds set
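
The shutdown items can be sketched in a Pod template fragment. The durations are assumptions, and the preStop sleep assumes the image ships a shell; its purpose is to give load balancers and endpoints time to stop routing traffic before the container receives SIGTERM.

<code yaml>
spec:
  terminationGracePeriodSeconds: 45   # illustrative; must cover preStop plus app shutdown
  containers:
    - name: payment-api
      image: payment-api:v1.2.0       # hypothetical image reference
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]   # assumes sh exists in the image
</code>

The grace period starts when termination begins and includes the time spent in the preStop hook, so it should exceed the hook duration plus the application's own drain time.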

----

===== 12. Observability Review =====

==== Logging ====

Checklist:

  * [ ] Centralized logging
  * [ ] Structured JSON logs
  * [ ] Correlation ID support

==== Metrics ====

Checklist:

  * [ ] CPU metrics
  * [ ] Memory metrics
  * [ ] Request metrics
  * [ ] Error metrics
  * [ ] Business metrics

==== Tracing ====

Checklist:

  * [ ] Distributed tracing enabled
  * [ ] Request correlation supported

----

===== 13. CI/CD Review =====

==== Deployment Pipeline ====

Checklist:

  * [ ] Automated build
  * [ ] Automated test
  * [ ] Automated deployment
  * [ ] Rollback support

==== GitOps ====

Checklist:

  * [ ] Git as source of truth
  * [ ] Pull-based deployment
  * [ ] Drift detection enabled

==== Deployment Strategies ====

Checklist:

  * [ ] Rolling deployment
  * [ ] Canary deployment
  * [ ] Blue-Green deployment

----

===== 14. Cost Optimization =====

Checklist:

  * [ ] Requests properly sized
  * [ ] HPA configured
  * [ ] Cluster Autoscaler configured
  * [ ] Spot instances evaluated
  * [ ] Unused resources removed

----

===== 15. Disaster Recovery =====

==== Backup ====

Checklist:

  * [ ] Database backup
  * [ ] Persistent volume backup
  * [ ] Secret backup
  * [ ] Configuration backup

==== Recovery ====

Checklist:

  * [ ] Restore procedure documented
  * [ ] Recovery tested regularly
  * [ ] Recovery Time Objective (RTO) defined
  * [ ] Recovery Point Objective (RPO) defined

----

===== 16. Production Readiness Scorecard =====

^ Category ^ Target Score ^
| Security | 9/10+ |
| Reliability | 9/10+ |
| Scalability | 9/10+ |
| Observability | 9/10+ |
| Maintainability | 9/10+ |
| Cost Optimization | 8/10+ |
| Disaster Recovery | 8/10+ |

----

===== Final Production Review Questions =====

  - [ ] Will it survive a Pod crash?
  - [ ] Will it survive a Node crash?
  - [ ] Will it survive a Zone failure?
  - [ ] Can it scale automatically?
  - [ ] Can it be deployed with zero downtime?
  - [ ] Can it be rolled back safely?
  - [ ] Is it secure by default?
  - [ ] Is it observable?
  - [ ] Can another engineer maintain it?
  - [ ] Can it run at 3 AM without waking me up?

If all answers are YES, the Kubernetes platform/workload is considered Production Ready.