This page documents significant outages with root cause analysis, resolution steps, and preventive measures.
Date: 2026-02 | Severity: High | Duration: ~30 minutes
K8s workloads using TrueNAS NFS PVCs became completely unresponsive due to NFS server thread exhaustion. TrueNAS's KVM process pinned at 100% CPU and the VM became unreachable via SSH and web UI.
TrueNAS SCALE defaults to 2 NFS server threads (servers: 2). This is sufficient for a home NAS with a few SMB clients, but dangerously low for Kubernetes:
The cluster had approximately 15+ pods with NFS PVCs at the time, each capable of concurrent I/O.
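A quick way to confirm thread exhaustion is to compare the configured thread count against the server's pool statistics. A minimal sketch, assuming root SSH access to the TrueNAS VM and the standard Linux knfsd proc interface (paths may differ slightly between SCALE versions):
# On the TrueNAS VM, while it is still reachable
ssh [email protected] "cat /proc/fs/nfsd/threads"      # configured thread count (was 2)
ssh [email protected] "cat /proc/fs/nfsd/pool_stats"   # packets arrived vs threads woken / timed out
# From any K8s worker, rising RPC retransmissions also point to a saturated server
nfsstat -rc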
| Affected | Not Affected |
|---|---|
| All pods with nfs-subdir-retain PVCs | Pods using Longhorn storage |
| All pods with nfs-subdir-delete PVCs | Pods with no persistent storage |
| Grafana (was on nfs-subdir-retain) | |
| VictoriaMetrics (was also NFS at the time) | |
| Harbor Registry, LifeOps DB | Vault, Authentik (Longhorn) |
Compounding factor: monitoring (VictoriaMetrics) was also on TrueNAS NFS at the time, so the outage took down monitoring itself, which made it harder to diagnose what was failing.
# From Proxmox host — graceful reboot first
qm reboot 109
# If graceful times out after ~60s, force reset
qm reset 109
# After TrueNAS comes back, immediately increase NFS threads
curl -u "andy:<password>" -X PUT https://192.168.88.230/api/v2.0/nfs \
-H "Content-Type: application/json" \
-d '{"servers": 8}'
# Verify in TrueNAS web UI: Services > NFS > Edit > Servers = 8
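The same API endpoint can be read back to confirm the change without the web UI. A sketch; assumes jq is available, and -k is included on the assumption the TrueNAS certificate is self-signed (drop it if the cert is trusted):
# Read back the NFS config via the API
curl -sk -u "andy:<password>" https://192.168.88.230/api/v2.0/nfs | jq '.servers'
# Expected: 8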
K8s pods recover automatically once NFS is available again — no manual pod restarts needed (kubelet retries NFS mounts).
If TrueNAS NFS becomes unresponsive and SSH to the VM is also down:
# From Mac or any machine with Proxmox access
ssh [email protected] "qm reboot 109"
# Wait 60s, check if TrueNAS web UI at https://192.168.88.230 responds
# If not: qm reset 109
Date: 2026-02 | Severity: High | Duration: Several days (silent degradation)
K8s worker nodes silently lost RAM over several days due to Proxmox's memory balloon driver. This caused intermittent pod OOMKills and incorrect scheduling decisions. The degradation was invisible in both K8s and Proxmox dashboards — it looked like application bugs, not infrastructure.
Why kubelet doesn't notice: Kubelet reads allocatable memory at startup and caches it. The balloon driver shrinks the VM's physical memory without notifying the guest OS in a way that kubelet responds to. Kubelet continues advertising 12GB of allocatable memory to the scheduler even though the VM only has 4GB.
Why it's invisible in Proxmox: The Proxmox UI shows the VM's configured memory (12GB), not the current ballooned size. The balloon value only appears if you run pvesh get /nodes/andy/qemu/<vmid>/status/current.
Default Proxmox VM configuration sets balloon: 4096 (4GB minimum). When the Proxmox host experienced any memory pressure, the balloon driver silently reduced all 3 worker VMs from 12GB to 4GB over several hours.
The control plane (VMID 107) was not affected because it has fewer pods and less memory pressure — its balloon had not yet triggered.
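The live ballooned size can be read straight from the QEMU status API. A sketch, run on the Proxmox host; assumes jq is installed, and the balloon / maxmem field names are taken from the QEMU status output and may vary by Proxmox version:
# Read the current balloon size (bytes) for each worker VM
for vmid in 103 104 108; do
  pvesh get /nodes/andy/qemu/${vmid}/status/current --output-format json \
    | jq "{vmid: ${vmid}, balloon: .balloon, maxmem: .maxmem}"
done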
# Check current balloon config on all VMs (run on Proxmox host)
pvesh get /nodes/andy/qemu --output-format=text | grep -E "name|mem |balloon"
# From K8s side — check what kubelet thinks is allocatable
kubectl describe node k8s-node1 | grep -A5 "Allocatable:"
# Compare to what the VM actually has
# Expected: ~11Gi allocatable on a 12GB VM (minus kernel/system)
# Actual during incident: ~3.5Gi (balloon shrank to 4GB)
# Disable ballooning on all K8s worker VMs (live — no VM restart required)
# Run on Proxmox host (192.168.88.100)
qm set 103 --balloon 0 # k8s-node2
qm set 104 --balloon 0 # k8s-node3
qm set 108 --balloon 0 # k8s-node1
# Restart kubelet on each worker to refresh allocatable resources
ssh [email protected] "qm guest exec 103 -- systemctl restart kubelet"
ssh [email protected] "qm guest exec 104 -- systemctl restart kubelet"
ssh [email protected] "qm guest exec 108 -- systemctl restart kubelet"
# Verify allocatable memory is now correct
kubectl describe node k8s-node1 | grep -A5 "Allocatable:"
# Should show ~11Gi
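To confirm the config change persisted, a quick check on the Proxmox host (sketch):
# Confirm ballooning is off in each worker VM config
for vmid in 103 104 108; do
  echo -n "VM ${vmid}: "; qm config ${vmid} | grep -i balloon
done
# Expected: balloon: 0 for each worker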
| VMID | Name | RAM | Balloon | Notes |
|---|---|---|---|---|
| 103 | k8s-node2 | 12GB | 0 (off) | Fixed |
| 104 | k8s-node3 | 12GB | 0 (off) | Fixed |
| 108 | k8s-node1 | 12GB | 0 (off) | Fixed |
| 107 | k8s-controlplane | 8GB | 2048 | Kept — fewer critical pods |
| 109 | TrueNAS-Scale | 16GB | 2048 | Kept — not K8s workload |
Detection: compare pvesh get on the Proxmox host with kubectl describe node to spot an allocatable vs configured memory mismatch.

Date: 2026-02 | Severity: Medium | Duration: Until OTEL collector restored
LifeOps backend entered a crash loop (CrashLoopBackOff) due to a cascade: OTEL collector went down → backend goroutines blocked waiting for OTEL connection → /api/health responses slowed → liveness probe timeout → kubelet restart → repeat.
The LifeOps backend initialises an OTEL gRPC exporter at startup. When the OTEL collector endpoint is unreachable, the gRPC client enters a reconnection backoff loop. During this backoff, any goroutine that tries to export a span calls into the gRPC layer and blocks on a channel send waiting for the connection to become available.
In Go, if the OTEL span export call is synchronous (not fire-and-forget), the HTTP request handler goroutine blocks until the OTEL call either succeeds or times out. If there is no explicit timeout, it blocks indefinitely — or until the OTEL client eventually gives up (which may take 30+ seconds).
The liveness probe calls /api/health with a 1-second timeout. If the handler goroutine is blocked on OTEL, the probe times out and kubelet restarts the pod.
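Because the probe window (1 second) is far shorter than the OTEL client's backoff (30+ seconds), a blocked handler reliably gets the pod killed. As a stopgap only, the probe can be given more headroom. A hedged sketch that assumes the liveness probe timeoutSeconds and failureThreshold fields already exist on the first container of the deployment:
# Stopgap: widen the liveness probe while the OTEL dependency is unhealthy
kubectl patch deployment lifeops-backend -n life-ops --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 6}
]'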
# Pod in crash loop
kubectl get pods -n life-ops
# lifeops-backend-xxx 0/1 CrashLoopBackOff 12 1h
# Logs show liveness probe failures
kubectl logs -n life-ops <pod> --previous
# context deadline exceeded (liveness probe timeout)
# OTEL collector is the root cause
kubectl get pods -n monitoring | grep otel
# otel-collector-xxx 0/1 ImagePullBackOff 0 2h
# Option 1: Fix the OTEL collector (preferred)
kubectl describe pod -n monitoring <otel-pod> # find the root cause
# Fix the image pull, OOM, config issue, etc.
# Option 2: Temporarily disable OTEL to stop the crash loop
kubectl set env deployment/lifeops-backend -n life-ops \
OTEL_EXPORTER_OTLP_ENDPOINT=""
# Once OTEL collector is healthy again, re-enable
kubectl set env deployment/lifeops-backend -n life-ops \
OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.monitoring:4317"
Prevention: configure the OTEL exporter with WithTimeout (or use context cancellation) in the OTEL SDK config so a span export can never block a request handler indefinitely. When diagnosing a crash loop, check --previous pod logs AND look for unhealthy pods in other namespaces that the crashing pod depends on.

Date: 2026-03-06 → 2026-03-07 | Severity: High | Duration: ~12h detection blindspot + silent Telegram failure since deployment
All 4 CrowdSec agents and the AppSec pod entered CrashLoopBackOff with "machine not found" errors after the LAPI pod was replaced. This was compounded by two pre-existing silent bugs discovered during investigation: Telegram notifications were never being sent due to a wrong template field name, and HTTP detection was completely blind due to a Traefik traffic policy misconfiguration.
The LAPI's SQLite database was stored in an emptyDir volume. emptyDir persists across container restarts within the same pod, but is destroyed when the pod itself is replaced.
When the LAPI pod was replaced (due to node reschedule after Longhorn RWO multi-attach error):
The new LAPI pod came up with an empty SQLite database, so the agents' existing credentials returned "machine not found". The wait-for-lapi-and-register init containers do not re-run on container restarts — only on pod replacement. The agents therefore looped in CrashLoopBackOff and stayed stuck until their pods were explicitly deleted (triggering new pods with fresh init containers).
Original Longhorn RWO issue: Before the emptyDir phase, the LAPI used a Longhorn RWO PVC. When the pod rescheduled to a different node, Longhorn's Multi-Attach error caused the new pod to start without the PVC (emptyDir fallback), corrupting the SQLite WAL.
The HTTP notifier template in values.yaml used {{.Value}} and {{.Duration}}:
# Wrong — causes silent template error
IP: {{.Value}}\nDuration: {{.Duration}}
# Correct — models.Alert fields
IP: {{.Source.Value}}\nDuration: {{(index .Decisions 0).Duration}}
The models.Alert type does not have .Value or .Duration at the top level. On every alert, the LAPI logged:
level=error msg="format alerts for notification: template: :1:69:
executing "" at <.Value>: can't evaluate field Value in type *models.Alert"
No Telegram messages were ever sent. This had been broken since the initial deployment.
externalTrafficPolicy: Cluster (Silent Since Deployment)

With externalTrafficPolicy: Cluster, kube-proxy routes external traffic through any node and applies SNAT — rewriting the client source IP to the pod-network gateway (10.244.x.x). Traefik logs this internal IP as ClientHost.
The crowdsecurity/whitelist-good-actors parser whitelists 10.0.0.0/8. Every Traefik log line was silently whitelisted — 0 events ever reached any HTTP detection scenario.
Verified: 1130+ Traefik log lines processed, 1130 whitelisted, 0 poured to any bucket.
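The quickest way to confirm whether log lines are actually reaching detection scenarios, rather than just being parsed and whitelisted, is the acquisition metrics on any agent. A sketch; the label selector is an assumption, so substitute an actual agent pod name if it does not match, and older cscli versions print the same tables via plain cscli metrics:
# Pick any agent pod and dump acquisition metrics
AGENT=$(kubectl get pods -n crowdsec -l k8s-app=crowdsec -o name | head -1)
kubectl exec -n crowdsec ${AGENT} -- cscli metrics show acquisition
# Healthy: non-zero "lines poured to bucket" for the Traefik source
# During this incident: all lines parsed, all whitelisted, 0 poured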
| Component | Status During Incident |
|---|---|
| Traefik bouncer (IP ban check) | ✅ Working (stream cache from last sync) |
| AppSec WAF (per-request block) | ✅ Working (blocks still applied) |
| HTTP scenario detection | ❌ Dead since deployment (all IPs whitelisted) |
| SSH brute force detection | ✅ Working (not affected by LAPI restart) |
| Telegram notifications | ❌ Dead since deployment (template bug) |
| Community blocklist (CAPI) | ✅ Working (pulled periodically) |
Step 1 — Immediate recovery (agent re-registration):
kubectl rollout restart ds/crowdsec-agent -n crowdsec
kubectl rollout restart deploy/crowdsec-appsec -n crowdsec
# All 6 pods Running, 0 restarts within ~2 minutes
Step 2 — Permanent fix (NFS PVCs):
# values.yaml — switched from Longhorn RWO to NFS RWX
lapi:
persistentVolume:
data:
enabled: true
storageClassName: nfs-synology
accessModes: [ReadWriteMany]
size: 1Gi
config:
enabled: true
storageClassName: nfs-synology
accessModes: [ReadWriteMany]
size: 100Mi
Verification: LAPI pod manually deleted → new pod came up → cscli machines list showed all agents still registered. No rollout restart needed.
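A quick way to repeat that verification after any future LAPI pod replacement (sketch; the LAPI deployment name is an assumption, adjust to the actual release):
kubectl exec -n crowdsec deploy/crowdsec-lapi -- cscli machines list
# Every agent should still be listed as validated, with a recent heartbeat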
Step 3 — Fix Telegram template:
# Before (broken):
format: '... IP: {{.Value}}\nDuration: {{.Duration}} ...'
# After (correct):
format: '... IP: {{.Source.Value}}\nDuration: {{(index .Decisions 0).Duration}} ...'
Step 4 — Fix Traefik source IP preservation:
# traefik values.yaml
service:
spec:
externalTrafficPolicy: Local # was Cluster
After fix: Traefik logs real client IPs. LAN traffic (192.168.88.x) is still RFC1918-whitelisted (correct). Internet attackers are now detected.
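A quick post-fix verification (sketch; the Traefik service and namespace names are assumptions based on a standard Helm install):
# Confirm the service policy took effect
kubectl get svc traefik -n traefik -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'   # expect: Local
# Spot-check that access logs now show real client IPs instead of 10.244.x.x
kubectl logs -n traefik deploy/traefik --tail=50 | grep -v "10\.244\."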
The template error was logged at error level, but there is no metric or alert on it; consider monitoring the LAPI error log rate. Also run cscli metrics show acquisition to confirm lines are reaching detection buckets, not just being parsed and whitelisted.

Date: 2026-03-15 | Severity: Medium | Duration: ~1h
The backend and reminder-checker pods in life-ops were stuck in ImagePullBackOff, and the Harbor repositories for the application images (lifeops-backend, lifeops-frontend) had 0 artifacts. The Harbor retention policy for the applications project was configured with nDaysSinceLastPush: 7 (TTL-based, delete images older than 7 days). The last CI/CD run that built and pushed images was 2026-03-07 (8 days before). The daily midnight retention job on 2026-03-15 deleted all images since they were outside the 7-day window.
The Terraform module (modules/harbor/harbor.tf) had been updated to include a safety net Rule 2 (most_recently_pushed = 5 — always keep the 5 most recently pushed images regardless of age), but Terraform Cloud had not been re-applied since that change was written. Only Rule 1 existed in Harbor.
Additionally, the retention_days for applications in main.tf was already updated to 90 (correct), but again not applied.
Summary: two-layer protection existed in code but neither was in effect in Harbor.
Retention policies after the Terraform apply:
applications: Rule 1 nDaysSinceLastPush: 90, Rule 2 latestPushedN: 5 ✓
gha-apps: Rule 1 nDaysSinceLastPush: 7, Rule 2 latestPushedN: 5 ✓
tooling-images: Rule 1 nDaysSinceLastPush: 90, Rule 2 latestPushedN: 5 ✓
Images were rebuilt via workflow_dispatch on AnhTran1610/LifeOps CI with component=all, skip_ci=true (commit 8d86bba). The update-manifests job updated applications/lifeops/values.yaml in k8s-cluster-config, and the pods came back to 1/1 Running. The restored images are now covered by the latestPushedN safety net.
Prevention: changes to modules/harbor/harbor.tf are invisible to Harbor until Terraform Cloud actually runs. Verify retention policy rules in the Harbor UI after any Terraform module change. most_recently_pushed: 5 is the critical safety net — it ensures the currently-deployed image is never wiped regardless of push age.
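The live policy can also be read back from the Harbor API instead of the UI after any retention change. A sketch; the registry URL is a placeholder, and looking the retention ID up via project metadata is an assumption, so fall back to the UI if that field is absent:
HARBOR=https://<harbor-host>          # substitute the real registry URL
RID=$(curl -s -u "admin:<password>" "${HARBOR}/api/v2.0/projects/applications" | jq -r '.metadata.retention_id')
curl -s -u "admin:<password>" "${HARBOR}/api/v2.0/retentions/${RID}" | jq '.rules[] | {template, params}'
# Expect both rules: the 90-day age rule and the keep-latest-5 rule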