Three production services have HA configurations implemented as of 2026-03-14:
| Service | Primary | Standby | Strategy |
|---|---|---|---|
| Nextcloud | K8s (2 replicas) | — | Active-active, Redis locking |
| Vaultwarden | NAS Docker | K8s pod (nas-ingress) | Traefik weighted failover |
| Immich | NAS Docker | K8s pod (nas-ingress) | Traefik weighted failover |
Nextcloud runs as a 2-replica Deployment in the nextcloud namespace, backed by a shared NFS PVC (nfs-synology, ReadWriteMany).
With multiple replicas reading/writing the same NFS volume, distributed file locking is required. Redis provides:
- `memcache.locking` — prevents split-brain on file operations
- `memcache.distributed` — shared metadata cache across replicas

Redis is deployed as a subchart (`redis.enabled: true`) and configured in `custom.config.php`:
```php
$CONFIG = array (
  'overwriteprotocol' => 'https',
  'memcache.distributed' => '\OC\Memcache\Redis',
  'memcache.locking' => '\OC\Memcache\Redis',
  'redis' => array(
    'host' => 'nextcloud-redis-master',
    'port' => 6379,
  ),
);
```
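For context, a sketch of the chart values driving this setup; the key names follow the nextcloud/helm chart, but treat the exact layout as an assumption:

```yaml
replicaCount: 2                  # two active pods behind one Service
persistence:
  enabled: true
  existingClaim: nfs-synology    # shared RWX NFS PVC used by both replicas
redis:
  enabled: true                  # deploys the Redis subchart (nextcloud-redis-master)
```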
The Nextcloud Helm chart runs the container as uid/gid 1024 (Synology NFS `root_squash` requirement). Two known issues:
1. `redis-session.ini` permission denied — The Nextcloud entrypoint writes `/usr/local/etc/php/conf.d/redis-session.ini` when it detects `REDIS_HOST`. This path is container-internal (root-owned), and uid 1024 can't write to it.
Fix: an `extraInitContainers` entry copies the directory to an `emptyDir` and chowns it to 1024; the main container then mounts the `emptyDir` over the original path:
```yaml
extraInitContainers:
  - name: fix-php-conf-perms
    image: nextcloud:32.0.5-apache
    command:
      - sh
      - -c
      - cp -a /usr/local/etc/php/conf.d/. /php-conf-d/ && chown -R 1024:1024 /php-conf-d
    volumeMounts:
      - name: php-conf-d
        mountPath: /php-conf-d
extraVolumes:
  - name: php-conf-d
    emptyDir: {}
extraVolumeMounts:
  - name: php-conf-d
    mountPath: /usr/local/etc/php/conf.d
```
2. Duplicate `redis.config.php` volumeMount — The chart auto-generates `redis.config.php` when `redis.enabled: true`, which conflicts with the `configs` ConfigMap mount. Suppressed with:
```yaml
nextcloud:
  defaultConfigs:
    redis.config.php: false
```
`spec.failover` was removed in Traefik v3. Nextcloud doesn't use failover (it's K8s-native), but the NAS ingress for Immich/Vaultwarden uses `spec.weighted` instead (see below).
Both Vaultwarden and Immich use the same pattern in the nas-ingress namespace:
```
Internet → Traefik → TraefikService (weighted)
                      ├── NAS primary (weight 100)
                      └── K8s standby (weight 1)
```
How failover works: Traefik uses a passive circuit breaker — it tracks error rates on each backend. When the NAS backend starts failing requests, Traefik temporarily removes it and sends 100% of traffic to the K8s standby. Failover is not instant (it takes a handful of failed requests to trigger), but it's automatic.
Why weighted instead of failover? Traefik v3 removed `spec.failover`.
Why no healthCheck? In Traefik v3, `healthCheck` inside a weighted TraefikService only works for the `ExternalName` K8s Service type. Our NAS services are ClusterIP with custom Endpoints (external IPs). Setting `healthCheck` on them raises "healthCheck allowed only for ExternalName services", which makes Traefik fail to build the entire TraefikService and return 404 for every route using it. The fix is to omit `healthCheck` entirely and rely on the passive circuit breaker.
Future improvement: Convert the NAS Services to `type: ExternalName` pointing at a DNS hostname (e.g. add `nas.homelab.vyanh.uk → 192.168.88.19` in Technitium, then set `externalName: nas.homelab.vyanh.uk`). This would re-enable active health checks with instant failover.
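A minimal sketch of that future setup, using Vaultwarden as the example. Resource names, the `/alive` probe path, and the interval are assumptions, not applied config:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vaultwarden
  namespace: nas-ingress
spec:
  type: ExternalName
  externalName: nas.homelab.vyanh.uk   # A record in Technitium → 192.168.88.19
  ports:
    - port: 8843
---
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: vaultwarden-weighted           # name is an assumption
  namespace: nas-ingress
spec:
  weighted:
    services:
      - name: vaultwarden
        port: 8843
        weight: 100
        healthCheck:                   # legal again once the Service is ExternalName
          path: /alive                 # Vaultwarden health endpoint (assumption)
          interval: 10s
      - name: vaultwarden-k8s
        port: 80
        weight: 1
```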
Each service has two files:
```
core-components/nas-ingress/resources/
  immich.yaml               # Service+Endpoints (NAS IP), TraefikService, Certificate, IngressRoute
  immich-standby.yaml       # NFS PV/PVC, Secret, Deployment, Service (K8s standby)
  vaultwarden.yaml          # Service+Endpoints (NAS IP), TraefikService, Certificates, IngressRoutes
  vaultwarden-standby.yaml  # ConfigMap (litestream), Secrets, Deployment, Service (K8s standby)
```
- NAS primary: Vaultwarden Docker container at `192.168.88.19:8843`
- Litestream replicates the `db.sqlite3` WAL to MinIO continuously (60s sync interval, 72h retention): `db-backups/vaultwarden/litestream/` (separate from the rclone daily snapshots at `vaultwarden/YYYY-MM-DD.sqlite3`)

Deployment `vaultwarden-k8s` in `nas-ingress`:
- Init container (`litestream-restore`): restores `db.sqlite3` from MinIO on pod startup (`-if-db-not-exists`, `-if-replica-exists` flags ensure an idempotent restore); see the sketch below
- Main container (`vaultwarden/server:latest`): identical config to the NAS, including SMTP, rate limiting, domain
- Storage: `emptyDir` — the restored DB lives only in pod memory; the Litestream restore init container populates it on every start
- Secrets: `VaultStaticSecret` CRs in `vaultwarden-vso.yaml`. Vault paths:
  - `kv/nas-ingress/vaultwarden-minio` → K8s secret `vaultwarden-minio` (`access_key`, `secret_key`)
  - `kv/nas-ingress/vaultwarden-smtp` → K8s secret `vaultwarden-smtp` (`SMTP_PASSWORD`)

Incident 2026-03-15: The pod was stuck in `Init:CrashLoopBackOff` because ArgoCD was re-applying PLACEHOLDER secrets from Git on every sync, overwriting manually-set values. Root cause: the Secrets were defined inline in `vaultwarden-standby.yaml` with hardcoded PLACEHOLDERs. Fix: removed the Secret definitions from the YAML and replaced them with VSO `VaultStaticSecret` resources in `vaultwarden-vso.yaml`. Vault policy `vso-nas-ingress-read` and K8s auth role `vso-nas-ingress` were created manually; `nas-ingress` was added to the Terraform namespaces list for future drift prevention.
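A minimal sketch of the restore init container: the container name and flags come from this doc, while the image tag, config path, and volume names are assumptions:

```yaml
initContainers:
  - name: litestream-restore
    image: litestream/litestream:latest   # pin a real tag in practice
    args:
      - restore
      - -if-db-not-exists     # no-op if the emptyDir already holds a DB
      - -if-replica-exists    # no-op on first boot, before any replica exists
      - -config
      - /etc/litestream.yml   # mounted from the litestream ConfigMap
      - /data/db.sqlite3      # restore target on the shared emptyDir
    volumeMounts:
      - name: data            # emptyDir shared with the vaultwarden container
        mountPath: /data
      - name: litestream-config
        mountPath: /etc/litestream.yml
        subPath: litestream.yml
```

And a sketch of one of the `VaultStaticSecret` CRs; the field names follow the Vault Secrets Operator CRD, but the KV version and auth ref name are assumptions:

```yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: vaultwarden-minio
  namespace: nas-ingress
spec:
  mount: kv                          # KV mount from the path above
  type: kv-v2                        # assumption: KV v2 engine
  path: nas-ingress/vaultwarden-minio
  destination:
    name: vaultwarden-minio          # K8s Secret created and owned by VSO
    create: true
  vaultAuthRef: vso-nas-ingress      # assumption: VaultAuth matching the role
```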
The weighted TraefikService for Vaultwarden (from `vaultwarden.yaml`):

```yaml
spec:
  weighted:
    services:
      - name: vaultwarden       # NAS (ClusterIP + Endpoints → 192.168.88.19:8843)
        port: 8843
        weight: 100
      - name: vaultwarden-k8s   # K8s Deployment Service
        port: 80
        weight: 1
    # NOTE: healthCheck omitted — only works for ExternalName services in Traefik v3.
    # ClusterIP+Endpoints causes TraefikService build failure → 404.
```
The Immich stack on the NAS exposes:
- Immich server: `192.168.88.19:2283`
- PostgreSQL: `192.168.88.19:5434` (non-standard port, mapped from container 5432)
- Redis: `192.168.88.19:6379`

Synology gotcha: Containers on `internal: true` Docker networks don't get host port publishing on Synology. Both the `redis` and `database` services must also be on a non-internal network (e.g. `app-net`) for the port binding to activate at runtime; see the compose sketch below.
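A compose-level sketch of that fix; service and network names other than `app-net` are from this doc or assumed:

```yaml
services:
  database:
    ports:
      - "5434:5432"            # host publish only activates off-internal networks
    networks: [backend, app-net]
  redis:
    ports:
      - "6379:6379"
    networks: [backend, app-net]
networks:
  backend:                     # internal network name is an assumption
    internal: true             # blocks host port publishing on Synology
  app-net: {}                  # non-internal network that re-enables the binding
```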
Deployment `immich-k8s` in `nas-ingress`:
- `IMMICH_WORKERS_INCLUDE=api` — only the API server, no background job processing (microservices stay on the NAS)
- PostgreSQL: `192.168.88.19:5434`
- Redis: `192.168.88.19:6379`
- Photo library: NFS mount of `192.168.88.19:/volume1/photos_immich`
- Secret `immich-db` with `DB_PASSWORD`, managed by ArgoCD from git (private repo)

`IMMICH_PORT` gotcha: K8s injects service discovery env vars for every Service in a namespace into all pods, so the `immich` Service causes `IMMICH_PORT` to be injected as `tcp://10.x.x.x:2283`; Immich reads `IMMICH_PORT` expecting a number and gets `NaN`. Fix: explicitly set `IMMICH_PORT: "2283"` in the Deployment env to override the injected value (snippet below).
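The override in context (the container name is an assumption):

```yaml
containers:
  - name: immich-server       # assumption
    env:
      - name: IMMICH_PORT
        value: "2283"         # explicit env wins over the injected tcp://… link var
```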
The `photos_immich` share is exported via NFS to the K8s subnet. The config is written to all three Synology NFS files for persistence: `/etc/exports`, `/etc/exports_syno`, and `/etc/exports_map`:

```
/volume1/photos_immich 192.168.88.0/24(rw,async,no_wdelay,insecure,all_squash,insecure_locks,sec=sys,anonuid=1024,anongid=100)
```

After editing, reload with `sudo exportfs -ra`.
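The K8s standby mounts this export through the PV in `immich-standby.yaml`; a minimal sketch, with the PV name, capacity, and access mode assumed:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: immich-photos             # name is an assumption
spec:
  capacity:
    storage: 1Ti                  # placeholder size
  accessModes: [ReadWriteMany]
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.88.19
    path: /volume1/photos_immich
```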
The weighted TraefikService for Immich (from `immich.yaml`):

```yaml
spec:
  weighted:
    services:
      - name: immich       # NAS (ClusterIP + Endpoints → 192.168.88.19:2283)
        port: 2283
        weight: 100
      - name: immich-k8s   # K8s Deployment Service
        port: 2283
        weight: 1
    # NOTE: healthCheck omitted — only works for ExternalName services in Traefik v3.
    # ClusterIP+Endpoints causes TraefikService build failure → 404.
```
| Scenario | Nextcloud | Vaultwarden | Immich |
|---|---|---|---|
| NAS Docker crash (hardware up) | No impact | K8s standby activates after a few failed requests (passive circuit breaker) | Same |
| Full NAS hardware failure | No impact | K8s standby can't restore DB (MinIO on NAS) | K8s standby fails (NFS + PG down) |
| K8s cluster failure | Nextcloud down | NAS primary still serves | NAS primary still serves |
| Single K8s node failure | One replica continues | No impact | No impact |
Limitation: The warm standby for Immich and Vaultwarden only handles NAS Docker failures. If the NAS hardware is down, the K8s standby also loses access to its dependencies (MinIO, NFS, PostgreSQL). MinIO could be migrated to K8s in the future to improve this.