Kubernetes Monitoring with Prometheus and Grafana: Visualizing Logs and Metrics in Production
Monitor Kubernetes clusters with Prometheus metrics and Grafana dashboards. Integrate Loki for centralized logging, set up alerting, and track the 7 critical metrics every team needs.
A critical pod was silently OOMKilled, over and over, for 3 days. It restarted every 30 minutes, causing cascading errors throughout the system. Users reported random errors. The team got paged three separate times. Nobody connected the dots until day 3, when an engineer debugging something else noticed the pod's restart count. By then, the company had lost thousands of dollars in failed transactions and customer trust. A single metric, monitored and alerted on, would have caught this on day 1.
The Problem
Kubernetes tells you almost nothing about what is happening inside your cluster by default. A pod crashes; the only trace is an event that expires within the hour. A pod is OOMKilled; unless you happen to describe it at the right moment, you never find out. A node runs out of disk; Kubernetes keeps working until suddenly everything fails. Events expire. Logs vanish on restart. No historical data. No baseline. No context.
Teams deploy to Kubernetes, everything works for 2 weeks, then something breaks silently. An engineer spends 8 hours debugging. The answer was in the logs all along, but the logs were never collected or retained; the metrics existed, but nobody was looking at them.
Why This Happens
Kubernetes provides the infrastructure to run applications. It does not provide observability. Observability, the ability to understand what is happening inside your systems, is a separate concern that teams treat as optional. They are wrong. Observability is a prerequisite, not a luxury. By the time teams realize they need it, they are in production crisis mode with no time to set it up properly. They cobble together temporary solutions that fall apart 6 months later.
The Solution
Step 1: Install kube-prometheus-stack with Helm
kube-prometheus-stack is the easiest way to get production-ready monitoring running. It includes Prometheus (metrics collection), Alertmanager (alerting), Grafana (visualization), and pre-configured dashboards.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=SuperSecurePassword123 \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
# Verify installation
kubectl get pods -n monitoring
kubectl get svc -n monitoring
This single command installs everything. Retention is set to 30 days (the default of 15 is too short for trend analysis), and 50Gi of metrics storage is provisioned; adjust both for your cluster size.
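Not sure how much storage you need? Prometheus's own metrics give you the ingestion rate, and disk need is roughly samples per second × retention seconds × 1-2 bytes per sample. Two queries you can run in the Prometheus UI once it is up:
# Samples ingested per second across the whole server
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Disk currently used by stored blocks
prometheus_tsdb_storage_blocks_bytes
If ingestion runs at, say, 50,000 samples/s, 30 days at ~1.5 bytes/sample works out to roughly 190GB, so 50Gi only fits a small cluster.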
Step 2: Access Grafana
# Forward the Grafana service to localhost
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Now visit: http://localhost:3000
# Log in as admin with the password set at install time
Grafana opens with the Prometheus data source pre-configured and a set of cluster dashboards already loaded. You now have visibility into your entire cluster.
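If you skip --set grafana.adminPassword at install time, the chart generates a password for you; you can read it back from the secret the chart creates (the name follows the release name used above):
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d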
Step 3: Import Essential Dashboards
Grafana has a library of pre-made dashboards. Import these three for immediate cluster visibility:
Dashboard 315: Kubernetes Cluster Overview
- Overall cluster CPU and memory usage
- Node status and capacity
- Pod count by namespace
Dashboard 6417: Kubernetes Pod/Namespace Breakdown
- CPU and memory per pod
- Network I/O per pod
- Restart count per pod
Dashboard 1860: Node Exporter Full
- CPU usage per core
- Memory and swap usage
- Disk I/O and space
- Network I/O
To import: Dashboard menu → Import → enter dashboard ID → select Prometheus → Import.
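Dashboards imported through the UI live only in Grafana's internal database, and this Grafana has no persistence by default. For dashboards you care about, the stack's Grafana sidecar can load them from any ConfigMap carrying the grafana_dashboard label, which keeps them in version control. A minimal sketch (the JSON body here is a placeholder, not a working dashboard):
apiVersion: v1
kind: ConfigMap
metadata:
  name: team-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the Grafana sidecar watches for this label
data:
  team-dashboard.json: |
    { "title": "Team Dashboard", "panels": [] }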
Step 4: Add Loki for Log Aggregation
Prometheus collects metrics (numbers). Loki collects logs (text). Together they give you complete visibility.
helm repo add grafana https://grafana.github.io/helm-charts
# Install Loki
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.enabled=true \
  --set promtail.enabled=true \
  --set grafana.enabled=false \
  --set prometheus.enabled=false
# Promtail now runs on every node and ships logs from all pods to Loki.
# Grafana and Prometheus are disabled here because the kube-prometheus-stack
# release already provides both.
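Because the bundled Grafana was disabled, Loki still has to be registered as a Grafana data source. You can add it by hand (type Loki, URL http://loki:3100), or let the same sidecar mechanism provision it from a labeled ConfigMap; a sketch assuming default sidecar settings:
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # the Grafana sidecar watches for this label
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki:3100
        access: proxy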
Loki uses a label-based approach similar to Prometheus. Query logs using labels:
{namespace="production", pod_name=~"api-service.*"}
Step 5: Create Critical Alerts
Metrics without alerts are useless. Alerts need to fire when things are actually broken and stay silent when things are normal.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # required so the operator picks up this rule
spec:
  groups:
    - name: kubernetes.rules
      interval: 30s
      rules:
        # Alert: pod is crash-looping (restarting frequently)
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash-looping"
            description: "Pod has restarted {{ $value }} times in the last 15 minutes. Check logs."
        # Alert: pod memory close to its limit (working set is what the OOM killer acts on);
        # the "and" clause skips containers that have no memory limit set
        - alert: HighMemoryUsage
          expr: |
            (container_memory_working_set_bytes{container!=""}
              / container_spec_memory_limit_bytes{container!=""} > 0.85)
            and (container_spec_memory_limit_bytes{container!=""} > 0)
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }} of its limit"
        # Alert: node filesystem is running out of space
        - alert: NodeDiskPressure
          expr: |
            (node_filesystem_avail_bytes{fstype!=""} / node_filesystem_size_bytes{fstype!=""}) < 0.15
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has {{ $value | humanizePercentage }} space remaining"
        # Alert: API server p99 latency is high (long-lived WATCH/CONNECT requests excluded)
        - alert: APIServerLatency
          expr: |
            histogram_quantile(0.99,
              sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))
            ) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "API server p99 latency for {{ $labels.verb }} is {{ $value | humanizeDuration }}"
---
# The Prometheus operator expects the Alertmanager config in a Secret named
# alertmanager-<alertmanager name>, under the key alertmanager.yaml
# (alternatively, set alertmanager.config in the Helm values, which renders this same Secret)
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      receiver: 'slack-alerts'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
    receivers:
      - name: 'slack-alerts'
        slack_configs:
          - channel: '#alerts'
            title: 'Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
This creates four essential alerts, each carrying a severity label, and routes them all to Slack. Adjust thresholds to your environment, and write a runbook (response documentation) for each alert so the team knows what to do when paged.
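One convention worth adopting from day one: give every alert a runbook_url annotation, so the notification itself tells the responder where to look. The URL below is a hypothetical placeholder:
- alert: PodCrashLooping
  # ... expr/for/labels as above ...
  annotations:
    summary: "Pod {{ $labels.pod }} is crash-looping"
    runbook_url: "https://wiki.example.com/runbooks/pod-crash-looping"  # hypothetical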
The 7 Metrics Every Kubernetes Team Must Monitor
- Pod Restart Count — If a pod restarts frequently, something is wrong. Monitor `kube_pod_container_status_restarts_total` and alert on restarts > 3 in 15 minutes.
- CPU/Memory Request vs Usage — If actual usage is consistently far below requests, you are wasting resources. If memory usage hits the limit, pods get OOMKilled (CPU merely gets throttled). Both are problems (see the PromQL sketch after this list).
- Node Disk Pressure — When nodes run out of disk, Kubernetes evicts pods aggressively. Monitor `node_filesystem_avail_bytes` and alert at 15% remaining.
- API Server Request Latency — Slow API server means slow pod scheduling and status updates. Monitor `apiserver_request_duration_seconds` p99 latency.
- PersistentVolume Claim Usage — PVCs fill up without warning. When full, pods fail. Monitor `kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes` per PVC.
- Network Error Rate Between Services — Packet loss or network issues show up as error rate spikes between services. Monitor HTTP error rates and DNS resolution failures.
- Etcd Leader Changes — etcd is the Kubernetes database. Frequent leader changes mean the etcd cluster is unhealthy. Monitor `etcd_server_has_leader` and `etcd_server_leader_changes_seen_total`.
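Two of these, sketched as PromQL (the metric names come from cAdvisor, kube-state-metrics, and the kubelet, all scraped by kube-prometheus-stack out of the box):
# Metric 2: CPU used vs requested, per pod; values far below 1 mean over-provisioning
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
  / sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
# Metric 5: PVC fill level; alert when any claim crosses 85%
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85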
Common Mistakes to Avoid
- Only monitoring nodes, not pods. The node looks healthy while a single pod consumes 100% of its CPU. Node-level metrics miss pod-level issues.
- Default Prometheus retention of 15 days is too short. You cannot see trends or correlations. Set retention to 30-90 days for production.
- No alerting configured — Prometheus without Alertmanager is useless. You cannot watch dashboards 24/7. Alerts must reach humans on-call.
- Alert fatigue: alerting on everything, ignoring everything. If you get paged 10 times per day, you will ignore alerts. Alert only on things that actually require human intervention (a routing sketch follows this list).
- Not monitoring PersistentVolume Claim usage until the disk is full. PVCs fill up silently, the database stops accepting writes, and the incident only gets traced back to a full volume hours later.
- No runbook linked in alert annotations. An engineer gets paged at 2 AM. They need to know immediately what to do. The alert annotation should link to a runbook.
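One practical defence against alert fatigue is severity-based routing: warnings land in Slack for working-hours review, and only critical alerts page a human. A sketch of the relevant Alertmanager config (assuming your rules set a severity label, as the examples above do):
route:
  receiver: 'slack-alerts'        # default: reviewed during working hours
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'pagerduty'       # only critical alerts page someone
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - channel: '#alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'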
Key Takeaways
- Observability is a prerequisite, not a luxury: Kubernetes provides no insight by default. Prometheus + Grafana + Loki is the production standard.
- kube-prometheus-stack is the easiest starting point: Helm installation takes 5 minutes. Pre-configured dashboards give immediate visibility.
- Retention matters: 15 days is too short. Use 30+ days for trend analysis.
- Metrics without alerts are useless: Set up Alertmanager and route alerts to Slack or PagerDuty.
- Monitor the 7 critical metrics: Restart count, CPU/memory, disk pressure, API latency, PVC usage, network errors, etcd health.
Struggling with Kubernetes monitoring or need to set up production-grade observability? The Skillzmist team has solved this exact problem for engineering teams across the US, UK, and Europe. Reach out for a free technical consultation — we respond within 24 hours.
Related: 10 Kubernetes Infrastructure Best Practices | Internal Service Communication in Kubernetes