10 Things You Must Get Right When Building Kubernetes Infrastructure in 2026
The Kubernetes infrastructure checklist every team needs: RBAC, network policies, resource limits, secrets management, and the production readiness audit.
We followed every Kubernetes tutorial on the internet. We still got breached. A low-privilege developer account accessed the production etcd backup. Payment history of 40,000 customers was exposed. The tutorials taught us how to deploy apps. They never taught us how to secure a cluster. That knowledge cost us.
The Problem
The gap between "getting Kubernetes running" and "running Kubernetes correctly" is vast. Most teams are stuck between these two states without knowing it. They can deploy services. They cannot recover from a security breach. They have observability. They do not have secrets management. The cluster runs. But it is built on sand.
Why This Happens
Kubernetes documentation shows what Kubernetes can do. It does not show what you must do before going to production. The mental model teams carry is: get it working, then secure it. This is backwards. Security must be built in from the start. By the time teams realize this, they have 50 services depending on the insecure foundation. Refactoring is expensive and risky.
The Solution — 10 Things You Must Get Right
1. Set Resource Requests AND Limits on Every Container
The Kubernetes scheduler uses requests to place pods. Limits cap what a running container can consume. Without requests, the scheduler is blind. Without limits, noisy neighbors starve or crash each other.
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
Also enforce this at the namespace level using ResourceQuota so a single team cannot consume all cluster resources.
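A minimal sketch of such a quota, assuming the team works in the production namespace. The quota name and every number are placeholders, not recommendations.

```yaml
# Hypothetical quota for the production namespace; all values are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"       # sum of CPU requests across all pods
    requests.memory: 40Gi    # sum of memory requests
    limits.cpu: "40"         # sum of CPU limits
    limits.memory: 80Gi      # sum of memory limits
    pods: "100"              # maximum number of pods in the namespace
```

Once a quota covers CPU and memory, pods in that namespace that omit requests or limits for those resources can be rejected at creation time, which turns item 1 from a convention into a rule.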
2. RBAC: Least-Privilege Service Accounts for Every Workload
Kubernetes RBAC is the difference between a cluster where any pod can access any secret and one where each pod can only access what it needs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-service
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-service-role
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get"]
    resourceNames: ["api-service-config"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
    resourceNames: ["api-service-secrets"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-service-binding
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-service-role
subjects:
  - kind: ServiceAccount
    name: api-service
    namespace: production
This service account can ONLY read one specific ConfigMap and one specific Secret. Nothing else. A compromised pod running this account has minimal damage potential.
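The Role only takes effect if the workload actually runs under that service account. A pod that does not name one falls back to the namespace's default account. Here is a sketch of the wiring, assuming a Deployment called api-service with three replicas. The image tag is borrowed from the scanning example in item 7 and is only illustrative.

```yaml
# Illustrative Deployment; replica count and image are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      serviceAccountName: api-service   # run under the restricted account
      containers:
        - name: api-service
          image: myregistry.azurecr.io/api-service:v1.2.3
```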
3. Network Policies: Default-Deny Then Permit What Is Needed
By default, every pod in a Kubernetes cluster can talk to every other pod on any port. In a 50-service microservices cluster, that is 50 × 49 = 2,450 directed service-to-service paths with zero restrictions.
# Deny all traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
# Allow only payment service to call database service on port 5432
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: payment-service
      ports:
        - protocol: TCP
          port: 5432
---
# Allow DNS egress (required for service discovery)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP   # DNS falls back to TCP for large responses
          port: 53
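Start with deny-all. Then selectively allow what is needed. This prevents lateral movement if a pod is compromised. NetworkPolicy objects are only enforced when the cluster's CNI plugin supports them (Calico and Cilium do, for example), so verify enforcement rather than assuming it.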
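One detail is easy to miss: the default-deny policy above blocks egress as well as ingress. The ingress rule on the database is therefore not enough by itself, and the payment service also needs permission to send that traffic. A sketch of the matching egress rule, reusing the labels from the policies above (the policy name is made up):

```yaml
# Companion to allow-payment-to-database: lets payment-service pods
# open connections to database pods on 5432 despite default-deny egress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-egress-to-database   # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - protocol: TCP
          port: 5432
```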
4. Never Run Containers as Root
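A process running as root inside a compromised container has a much easier path to escaping the container or tampering with mounted volumes. Running as non-root, with privilege escalation disabled and all capabilities dropped, limits the damage.
# Set on each container in the pod spec
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL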
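Per-container settings are easy to forget, so it can help to have the cluster reject non-compliant pods. One option is the built-in Pod Security Admission controller, driven by namespace labels. This is a sketch and assumes the production namespace is managed as a manifest.

```yaml
# Enforce the "restricted" Pod Security Standard for every pod in the namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/warn: restricted      # warn on kubectl apply
    pod-security.kubernetes.io/audit: restricted     # annotate audit events
```

The restricted profile also expects a seccompProfile of type RuntimeDefault or Localhost, so pods using only the securityContext shown above would need that field added before they are admitted.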
5. Secrets Management: Use External Secrets Operator or Vault
Kubernetes Secrets are base64-encoded, not encrypted. Unless encryption at rest is configured, they are stored in etcd in plaintext. When Secret manifests are committed, the values end up in git history. Never hardcode secrets in Deployment manifests.
Use External Secrets Operator (ESO) to sync secrets from AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault:
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-store
  namespace: production   # same namespace as the workload that reads the Secret
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-service-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: SecretStore
  target:
    name: api-service-secrets
    creationPolicy: Owner
  data:
    - secretKey: database-password
      remoteRef:
        key: prod/api-service/db-password
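ESO re-syncs the secret on every refreshInterval (1h here), so a rotation performed in the external store reaches the cluster without manual steps. ESO does not rotate the secret itself; rotation is the external store's job. Secret values never appear in git.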
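On the consuming side, the workload reads the synced Secret like any other Kubernetes Secret. A sketch, assuming the application expects the password in an environment variable. The pod name and the variable name DATABASE_PASSWORD are made up for illustration.

```yaml
# Illustrative pod reading the Secret that ESO creates.
apiVersion: v1
kind: Pod
metadata:
  name: api-service-example        # hypothetical name
  namespace: production
spec:
  serviceAccountName: api-service
  containers:
    - name: api-service
      image: myregistry.azurecr.io/api-service:v1.2.3
      env:
        - name: DATABASE_PASSWORD  # hypothetical variable name
          valueFrom:
            secretKeyRef:
              name: api-service-secrets   # target.name of the ExternalSecret
              key: database-password      # secretKey of the ExternalSecret
```

Environment variables are read once at container start, so a pod has to be restarted to pick up a rotated value.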
6. Pod Disruption Budgets for Critical Services
During cluster maintenance, Kubernetes evicts pods so they can be rescheduled on other nodes. Without a PDB, all replicas might be evicted at the same time, causing downtime.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-service
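This PDB ensures at least 2 replicas of api-service stay running during voluntary disruptions such as node drains and upgrades. Kubernetes will refuse an eviction that would violate the PDB. A PDB does not protect against involuntary disruptions such as a node crash.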
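A minAvailable of 2 only works when the service runs at least 3 replicas. With exactly 2 replicas, no pod can ever be evicted and node drains will hang. When replica counts vary, expressing the budget as maxUnavailable may be easier to reason about. A sketch of that variant (the production namespace is assumed here; a PDB accepts either minAvailable or maxUnavailable, not both):

```yaml
# Alternative budget: allow at most one api-service pod to be down at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
  namespace: production   # assumed namespace
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: api-service
```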
7. Image Scanning in CI Before Deployment
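Scan container images for vulnerabilities before they are deployed. Trivy is a free, open-source scanner that detects known CVEs in OS packages and application dependencies.
trivy image --exit-code 1 --severity HIGH,CRITICAL myregistry.azurecr.io/api-service:v1.2.3
# --exit-code 1 makes the command fail if any HIGH or CRITICAL vulnerabilities are found
# (by default Trivy reports findings but still exits 0)
# Gate the deployment on scan success in CI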
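What the CI gate looks like depends on your CI system. The sketch below assumes GitHub Actions and a Dockerfile at the repository root. The workflow, job, and image names are placeholders. It runs Trivy from its published container image so nothing has to be installed on the runner.

```yaml
# Hypothetical GitHub Actions workflow; adapt to your CI system.
name: image-scan
on:
  pull_request:
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t api-service:${{ github.sha }} .
      - name: Scan image with Trivy
        # A non-zero exit code fails the job and blocks the merge.
        run: |
          docker run --rm \
            -v /var/run/docker.sock:/var/run/docker.sock \
            aquasec/trivy:latest image \
            --exit-code 1 --severity HIGH,CRITICAL \
            api-service:${{ github.sha }}
```

Mounting the Docker socket lets the Trivy container see the image that was just built on the runner. Scanning an image in a private registry would need registry credentials instead.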
8. Separate Node Pools for Different Workload Types
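Run cluster add-ons such as CoreDNS and the monitoring stack on dedicated nodes, and run application pods on separate nodes, so that noisy applications cannot starve them. Per-node agents such as kube-proxy are the exception: they run as DaemonSets on every node.
# In the pod spec of an application workload
nodeSelector:
  workload-type: application
---
# Label carried by nodes in the application pool. It is normally set through
# the node pool configuration or kubectl label, not by creating Node objects.
apiVersion: v1
kind: Node
metadata:
  labels:
    workload-type: application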
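A nodeSelector only attracts pods to a pool. It does not stop other pods from landing on the system nodes. The usual complement is a taint on the system pool plus a matching toleration on the system workloads. The sketch below assumes the system nodes carry the label workload-type=system and the taint workload-type=system:NoSchedule. Both are hypothetical names chosen to match the label scheme above, and the pod is a placeholder.

```yaml
# Placeholder system workload that is allowed onto the tainted system pool.
apiVersion: v1
kind: Pod
metadata:
  name: monitoring-agent      # hypothetical
  namespace: monitoring       # hypothetical
spec:
  nodeSelector:
    workload-type: system     # attract the pod to the system pool
  tolerations:
    - key: workload-type      # tolerate the pool's taint
      operator: Equal
      value: system
      effect: NoSchedule
  containers:
    - name: agent
      image: example.com/monitoring-agent:1.0.0   # placeholder image
```

Application pods carry no such toleration, so the scheduler keeps them off the system nodes.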
9. Cluster Autoscaling with Karpenter (Not Cluster Autoscaler)
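Cluster Autoscaler resizes pre-defined node groups, which tends to make it slower and less flexible. Karpenter provisions nodes directly, picks an instance type that fits the pending pods, and consolidates underutilized nodes, which usually makes it faster and cheaper. Karpenter relies on a cloud-provider integration (for example AWS or Azure), so confirm that your platform is supported before committing to it.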
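For orientation, a Karpenter node pool might look roughly like the sketch below. It assumes the karpenter.sh/v1 API on AWS and an existing EC2NodeClass named default. The pool name, the CPU limit, and the consolidation delay are placeholders. Check the field names against the Karpenter version you run, since the API has changed between releases.

```yaml
# Sketch only: assumes Karpenter v1 on AWS with an EC2NodeClass called "default".
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: application            # hypothetical pool name
spec:
  template:
    metadata:
      labels:
        workload-type: application
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "100"                 # cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m       # wait before consolidating a node
```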
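10. Cluster Backup Automation with Velero and Etcd Snapshots
Etcd is the Kubernetes database. If etcd is corrupted or deleted, the entire cluster state is lost. Velero backs up cluster resources through the Kubernetes API, along with persistent volume data, on a schedule. It does not snapshot etcd itself, so on a self-managed control plane also take regular etcd snapshots (for example with etcdctl snapshot save).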
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"   # 2 AM daily
  template:
    ttl: 720h   # Keep for 30 days
    storageLocation: default
    volumeSnapshotLocations:
      - default
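A backup that has never been restored is only a hope. One way to rehearse is to restore the latest scheduled backup into a scratch namespace. The sketch below assumes Velero and the Schedule above both live in the velero namespace. The restore name and the restore-test namespace are hypothetical.

```yaml
# Restore drill: pull the most recent daily-backup into a scratch namespace.
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-drill          # hypothetical name
  namespace: velero            # assumed Velero install namespace
spec:
  scheduleName: daily-backup   # use the latest successful backup of this schedule
  includedNamespaces:
    - production
  namespaceMapping:
    production: restore-test   # restore into a scratch namespace, not over production
```

After the restore completes, compare the workloads in restore-test with production and then delete the scratch namespace.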
The Production Readiness Checklist
Before promoting any cluster from staging to production, verify:
- ☐ All containers have resource requests and limits
- ☐ RBAC is configured with least-privilege service accounts
- ☐ Network policies are implemented (default deny, selective allow)
- ☐ No containers run as root
- ☐ Secrets are managed via External Secrets Operator or Vault
- ☐ Pod Disruption Budgets protect critical services
- ☐ Image scanning gates deployments in CI/CD
- ☐ Separate node pools for system vs application workloads
- ☐ Karpenter autoscaling is configured
- ☐ Velero backups are running and tested
- ☐ Monitoring and alerting are operational (Prometheus + Grafana)
- ☐ Audit logging is enabled on the API server
- ☐ TLS certificates are valid, monitored for expiry, and renewed automatically
- ☐ Disaster recovery runbooks are documented
- ☐ On-call escalation procedures are defined
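Of the items above, audit logging is the one the numbered list does not cover. On a self-managed control plane it is driven by a policy file passed to kube-apiserver with --audit-policy-file. Managed services usually expose it as a provider setting instead. A minimal sketch of such a policy follows. The choice of what to log at which level is an assumption to adapt.

```yaml
# Minimal audit policy sketch; the levels chosen here are assumptions.
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived            # skip the duplicate early-stage event
rules:
  # Record who touched Secrets and ConfigMaps without logging their contents
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Log RBAC changes together with the request body
  - level: Request
    resources:
      - group: "rbac.authorization.k8s.io"
  # Everything else: metadata only
  - level: Metadata
```

Rules are evaluated in order and the first match wins, so the catch-all rule goes last.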
Common Mistakes to Avoid
- Treating security as an afterthought. Build RBAC, network policies, and secrets management from day 1, not 6 months later.
- Running all workloads on the same nodes. A rogue pod can crash the entire monitoring stack and make debugging impossible.
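- No backup strategy for cluster state. Losing etcd without a backup means losing the cluster's state permanently. Test restores regularly.
- Setting only one of requests and limits. Without requests, the scheduler cannot make good placement decisions (when only a limit is set, Kubernetes falls back to using the limit as the request). Without limits, a single container can exhaust node memory and trigger OOM kills.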
- Hardcoding secrets in git. Even in "private" repos, secrets in git are findable and exploitable. Use External Secrets.
Key Takeaways
- The 10 things are not optional: They are the difference between a safe cluster and a vulnerable one.
- Resource management is foundational: Requests let the scheduler decide, limits prevent cascading failures.
- Security must be built in from day 1: Adding RBAC after 50 services are deployed is painful.
- Network policies prevent lateral movement: Default deny, selective allow is the zero-trust model.
- Automation prevents human error: Velero backups, image scanning, and Karpenter scaling all run without manual steps.