Kubernetes Observability: Best Practices for Monitoring and Troubleshooting

May 2, 2026

Running Kubernetes without proper observability is like flying blind. When something breaks at 2 AM — and it will — you need to know *what* broke, *why* it broke, and *where* in your cluster it happened. That's exactly what observability gives you: the ability to understand your system's internal state from the outside.

Observability in Kubernetes isn't just "add some metrics and call it a day." It's three interconnected pillars: metrics, logs, and traces. Skip any one of them and you'll spend twice as long debugging the next incident.

Let's build this out properly.

The Three Pillars (and Why You Need All of Them)

Metrics tell you *that* something is wrong — CPU spiking, pod restarts climbing, request latency increasing.

Logs tell you *what* happened — the error message, the stack trace, the context around the failure.

Traces tell you *where* it went wrong across services — which microservice in a chain of ten is actually responsible for that 3-second response time.

You can debug with just metrics and logs for a while, but once you have more than five or six services talking to each other, distributed tracing becomes non-negotiable.

Setting Up Metrics with Prometheus

Prometheus is the de facto standard for Kubernetes metrics. It scrapes metrics endpoints, stores time-series data, and integrates with Kubernetes service discovery out of the box.

The easiest way to get started is with the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, and Grafana together:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

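For your own applications, expose a /metrics endpoint and annotate your pods. Keep in mind that these annotations are a convention, not a built-in feature: Prometheus only honors them if its scrape configuration contains a job that relabels on them. The kube-prometheus-stack chart runs the Prometheus Operator, which discovers targets through ServiceMonitor and PodMonitor resources, so with that chart you either add an annotation-based scrape job yourself or create a PodMonitor for the app. The annotation approach looks like this: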

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: my-app
      image: my-app:latest
      ports:
        - containerPort: 8080

A few metrics you should be tracking from day one:

kube_pod_container_status_restarts_total — pod restart counts (a climbing number here is always a red flag)

container_cpu_usage_seconds_total — CPU usage per container

container_memory_working_set_bytes — actual memory in use

http_request_duration_seconds — your application's request latency (instrument this yourself with a client library)
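To make that last metric concrete, here's a rough sketch of what a latency histogram looks like once it's rendered in the Prometheus text format. It's hand-rolled in plain Node.js purely for illustration; in a real service you'd reach for a Prometheus client library instead. The createHistogram helper and the bucket boundaries are assumptions made up for this example.

```javascript
// Hand-rolled histogram for illustration only; a client library does this for you.
// Bucket upper bounds (in seconds) are an assumption; tune them to your latency profile.
const DEFAULT_BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5];

function createHistogram(name, buckets = DEFAULT_BUCKETS) {
  const counts = new Array(buckets.length).fill(0); // cumulative count per bucket
  let sum = 0;
  let count = 0;

  return {
    // Record one observation, e.g. the duration of a finished request.
    observe(seconds) {
      buckets.forEach((le, i) => {
        if (seconds <= le) counts[i] += 1; // every bucket at or above the value counts it
      });
      sum += seconds;
      count += 1;
    },
    // Render the series in the Prometheus text exposition format.
    render() {
      const lines = [`# TYPE ${name} histogram`];
      buckets.forEach((le, i) => {
        lines.push(`${name}_bucket{le="${le}"} ${counts[i]}`);
      });
      lines.push(`${name}_bucket{le="+Inf"} ${count}`);
      lines.push(`${name}_sum ${sum}`);
      lines.push(`${name}_count ${count}`);
      return lines.join('\n') + '\n';
    },
  };
}

const latency = createHistogram('http_request_duration_seconds');
latency.observe(0.2);  // a fast request
latency.observe(0.75); // a slower one
console.log(latency.render());
```

The cumulative buckets are what let Prometheus estimate percentiles later with histogram_quantile, which is why a histogram is usually a better fit for latency than a plain average.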

Alerting That Actually Works

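Alerts that fire in Prometheus but never get routed to the right person by Alertmanager are just noise. Here's a practical alert rule that fires when a container has restarted more than 5 times in the last hour and the condition has held for 5 minutes (with kube-prometheus-stack, this groups block goes under spec in a PrometheusRule resource):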

groups:
  - name: pod-health
    rules:
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Container {{ $labels.container }} in pod {{ $labels.pod }} has restarted {{ $value }} times in the last hour."

Keep your alerts actionable. Every alert should have a clear owner and a runbook link. If your on-call engineer can't do anything about an alert at 3 AM, it shouldn't page them.
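One way to enforce this is a small check in CI. The sketch below assumes your rules have already been parsed from YAML into plain objects, and it assumes a convention where ownership lives in a team label and the runbook link in a runbook_url annotation. Both names, and the findUnactionableAlerts helper itself, are hypothetical; adapt them to whatever your team uses.

```javascript
// Flag alert rules that lack an owner or a runbook link.
// Assumed convention (not a Prometheus requirement): `team` label, `runbook_url` annotation.
function findUnactionableAlerts(groups) {
  const problems = [];
  for (const group of groups) {
    for (const rule of group.rules || []) {
      if (!rule.alert) continue; // recording rules have no `alert` field; skip them
      const labels = rule.labels || {};
      const annotations = rule.annotations || {};
      if (!labels.team) {
        problems.push(`${rule.alert}: missing team label`);
      }
      if (!annotations.runbook_url) {
        problems.push(`${rule.alert}: missing runbook_url annotation`);
      }
    }
  }
  return problems;
}

// The PodCrashLooping rule from above, as it would look after YAML parsing.
const ruleGroups = [
  {
    name: 'pod-health',
    rules: [
      {
        alert: 'PodCrashLooping',
        expr: 'increase(kube_pod_container_status_restarts_total[1h]) > 5',
        for: '5m',
        labels: { severity: 'warning' },
        annotations: { summary: 'Pod {{ $labels.pod }} is crash looping' },
      },
    ],
  },
];

console.log(findUnactionableAlerts(ruleGroups));
```

Run against the rule as written above, this reports both problems, which is a fair reminder that the example rule still needs an owner and a runbook before it should page anyone.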

Centralized Logging with the EFK Stack

Kubernetes logs are ephemeral — when a pod dies, its logs go with it unless you ship them somewhere first. The EFK stack (Elasticsearch, Fluentd, Kibana) is a solid choice for centralized logging.

Deploy Fluentd as a DaemonSet so it runs on every node and captures all container logs:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          effect: NoSchedule
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
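            # Only relevant on nodes running the Docker runtime; containerd and CRI-O
            # write container logs under /var/log/pods, which the varlog mount covers.
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true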
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers

Structured logging matters here. If your application logs plain text, you're making your own life harder. Log JSON instead:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "service": "payment-service",
  "trace_id": "abc123",
  "message": "Payment processing failed",
  "error": "connection timeout",
  "user_id": "u-789"
}

Structured logs are searchable, filterable, and can be correlated with traces using that trace_id field.
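If you don't already have a logging library in place, a structured logger can be very small. Here's a minimal sketch in plain Node.js that emits one JSON object per line with the same fields as the example above. The createLogger helper and its write parameter are made up for this post; a production service would more likely use an established logging library.

```javascript
// Minimal JSON-lines logger. `write` is injectable so output can be captured in tests;
// by default each entry goes to stdout, where the node-level log collector picks it up.
function createLogger(service, write = (line) => process.stdout.write(line + '\n')) {
  const log = (level, message, fields = {}) => {
    const entry = {
      ...fields, // caller-supplied context such as trace_id or user_id
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
    };
    write(JSON.stringify(entry));
  };

  return {
    info: (message, fields) => log('info', message, fields),
    error: (message, fields) => log('error', message, fields),
  };
}

const logger = createLogger('payment-service');
logger.error('Payment processing failed', {
  trace_id: 'abc123',
  error: 'connection timeout',
  user_id: 'u-789',
});
```

Writing to stdout rather than to a file is deliberate: the container runtime captures stdout, and the log collector on each node ships it from there.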

Distributed Tracing with Jaeger

Once you have more than a handful of services, you need traces. Jaeger integrates well with Kubernetes and supports OpenTelemetry, which is the instrumentation standard you should be using.

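Deploy Jaeger with the all-in-one image for development, or use the Jaeger Operator for production. The operator relies on cert-manager for its webhook certificates, so install cert-manager in the cluster first: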

kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml \
  -n observability

Then create a Jaeger instance:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch.logging.svc.cluster.local:9200

Instrument your application with OpenTelemetry (here's a Node.js example):

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
// Export spans over OTLP/HTTP to the Jaeger collector (4318 is the OTLP/HTTP port).
const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://jaeger-collector.observability.svc.cluster.local:4318/v1/traces',
  }),
  serviceName: 'payment-service',
});
// Start the SDK before the application loads the modules you want traced.
sdk.start();
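The SDK handles context propagation for you, but it helps to know what actually travels between services. By default OpenTelemetry passes trace context in the W3C traceparent HTTP header, and the trace ID inside it is the value you'd want in your logs. The sketch below parses that header in plain Node.js; parseTraceparent is a hypothetical helper for illustration, and it only handles the version 00 layout.

```javascript
// W3C traceparent layout (version 00): version-traceid-parentid-flags, lowercase hex.
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_PATTERN.exec(String(header || '').trim());
  if (!match) return null;

  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid according to the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }

  return {
    traceId,  // shared by every span in the request
    parentId, // the calling span's ID
    sampled: (parseInt(flags, 16) & 0x01) === 0x01, // lowest bit is the sampled flag
  };
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
console.log(parseTraceparent(incoming));
```

When a trace seems to stop at a service boundary, checking whether this header survives the hop (through proxies, queues, or hand-written HTTP clients) is usually the first thing to look at.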

Grafana: Tying It All Together

Grafana is your single pane of glass. Connect it to Prometheus for metrics, Elasticsearch for logs, and Jaeger for traces. The kube-prometheus-stack chart already includes Grafana with pre-built dashboards for cluster health.

A few dashboards worth importing immediately:

Kubernetes Cluster Overview (ID: 7249) — nodes, pods, namespaces at a glance

Node Exporter Full (ID: 1860) — deep dive into node-level metrics

Kubernetes Pod Monitoring (ID: 6336) — per-pod resource usage

Practical Tips That Save You Time

Set resource requests and limits on everything. Without requests, your utilization dashboards have no baseline to compare usage against, and a pod without limits can starve its neighbors while those dashboards give you no hint why.

Use namespace-level separation for your observability stack. Keep monitoring, logging, and observability namespaces separate from your application namespaces. It makes RBAC and resource management much cleaner.

Correlate across pillars. The real power comes when you can jump from a Grafana alert → filter logs by time range → find the trace ID → follow the trace through your services. Build this workflow before you need it in an incident.
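The middle of that workflow (time range to trace ID) is easy to rehearse. Here's a rough sketch that takes JSON log lines shaped like the structured-logging example earlier, keeps the errors inside an alert's time window, and returns the trace IDs worth opening in Jaeger. In practice you'd run the equivalent query in Kibana; the traceIdsInWindow helper and the sample log lines are invented for illustration.

```javascript
// Collect the distinct trace IDs of log entries at a given level inside a time window.
function traceIdsInWindow(logLines, startIso, endIso, level = 'error') {
  const start = Date.parse(startIso);
  const end = Date.parse(endIso);
  const ids = new Set();

  for (const line of logLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not JSON (plain-text logs, partial writes)
    }
    if (entry === null || typeof entry !== 'object') continue;

    const ts = Date.parse(entry.timestamp);
    if (Number.isNaN(ts) || ts < start || ts > end) continue;
    if (entry.level === level && entry.trace_id) ids.add(entry.trace_id);
  }
  return [...ids];
}

// Made-up sample lines in the same shape as the structured log example.
const sampleLogs = [
  '{"timestamp":"2024-01-15T10:29:00Z","level":"info","trace_id":"aaa111","message":"ok"}',
  '{"timestamp":"2024-01-15T10:30:00Z","level":"error","trace_id":"abc123","message":"Payment processing failed"}',
  '{"timestamp":"2024-01-15T10:30:05Z","level":"error","trace_id":"abc123","message":"retry failed"}',
  '{"timestamp":"2024-01-15T11:45:00Z","level":"error","trace_id":"zzz999","message":"unrelated"}',
  'plain text line that is not JSON',
];

console.log(traceIdsInWindow(sampleLogs, '2024-01-15T10:25:00Z', '2024-01-15T10:35:00Z'));
```

This only works if every service writes the trace ID into its logs, which is one more reason to settle on the field name early.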

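Don't skip liveness and readiness probes. They're not just for Kubernetes health checks: a failing liveness probe restarts the container, which shows up in kube_pod_container_status_restarts_total, and a failing readiness probe pulls the pod out of its Service endpoints. Both feed directly into your metrics and your restart alerts.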
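On the application side, the endpoints behind those probes can be tiny. The sketch below separates "the process is alive" from "the process can take traffic" in plain Node.js. The /healthz and /readyz paths, the state object, and both helper functions are assumptions for this example; use whatever paths your probe configuration points at.

```javascript
const http = require('http');

// Hypothetical shared state: set `ready` to true once dependencies are reachable.
const appState = { ready: false };

// Map a probe path to an HTTP status and body.
function probeStatus(path, state) {
  if (path === '/healthz') {
    return { status: 200, body: 'ok' }; // liveness: the process can still answer
  }
  if (path === '/readyz') {
    return state.ready
      ? { status: 200, body: 'ready' }
      : { status: 503, body: 'not ready' }; // readiness: keep traffic away for now
  }
  return { status: 404, body: 'not found' };
}

// Build (but do not start) an HTTP server that answers the probe paths.
function createProbeServer(state) {
  return http.createServer((req, res) => {
    const { status, body } = probeStatus(req.url, state);
    res.writeHead(status, { 'Content-Type': 'text/plain' });
    res.end(body);
  });
}

console.log(probeStatus('/readyz', appState)); // before dependencies are up
appState.ready = true;
console.log(probeStatus('/readyz', appState)); // after
```

To serve it, call createProbeServer(appState).listen(8080) during startup. Keeping the liveness check free of dependency calls is a deliberate choice here: if a database outage fails liveness, Kubernetes restarts pods that were never the problem.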

Your Next Steps

Deploy kube-prometheus-stack if you haven't already — it gets you metrics and dashboards in under 10 minutes

Switch your application logging to structured JSON — this pays dividends immediately in searchability

Add OpenTelemetry instrumentation to your two or three most critical services first, then expand

Write a runbook for your top three most likely failure modes and link them from your alerts

Run a game day — deliberately kill a pod or inject latency and practice using your observability tools to find the problem

Observability isn't a one-time setup. It's a practice. The teams that do it well are the ones that treat it as a first-class engineering concern, not an afterthought bolted on after something breaks.