observability: migrate VictoriaMetrics to operator CRDs + Consul SD #234

Merged
benvin merged 1 commit from benvin/vm-operator-migration into main 2026-07-05 22:17:32 +10:00

Why

The k8s au-syd1 VictoriaMetrics stack ran as two helm charts and only scraped in-cluster targets. The victoria-metrics-operator already runs in vm-system, so this moves the stack onto operator-managed CRDs. That unlocks VMServiceScrape/VMPodScrape (auto-converted from Prometheus ServiceMonitors, used by a follow-up PR) and adds Consul service discovery so the cluster scrapes the same puppet-prod targets as the puppet vmagent. Also shrinks vmstorage 3 → 2 (Ceph-backed, replicationFactor 2).

Changes

  • Add VMCluster main: vmstorage 2 replicas (cephrbd-fast-delete 200Gi, 180d retention, replicationFactor 2), vminsert/vmselect 2 replicas + HPA (2–10, 60% cpu).
  • Add VMAgent main: retains the kubernetes SD jobs (apiservers/nodes/cadvisor), sets selectAllByDefault for VMServiceScrape/VMPodScrape, and adds a Consul SD job against consul.service.consul (resolves to the puppet Consul from pods) that replicates the puppet vmagent relabels: keep targets tagged metrics, set __scheme__ from metrics_scheme, and set job from metrics_job. TLS is verified against the reflected vault-ca-cert (no insecure skip-verify).
  • Expose vmselect/vminsert/vmagent via Gateway API (traefik-internal Gateway + HTTPRoute, http→https redirect), same hostnames.
  • Remove the two helm charts, their values files, and vendored charts.
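
For reference, a minimal sketch of what the VMCluster described above might look like. The values (replica counts, 200Gi, 180d, replicationFactor 2, HPA 2–10 at 60% CPU) come from this description; the namespace, access mode, and the reading of cephrbd-fast-delete as a StorageClass name are assumptions, and the manifest actually merged in this PR may differ.

```yaml
# Sketch only, not the merged manifest.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: main
  namespace: vm-system            # assumed; the PR only says the operator runs here
spec:
  retentionPeriod: "180d"
  replicationFactor: 2
  vmstorage:
    replicaCount: 2               # down from 3
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: cephrbd-fast-delete   # assumed to be a StorageClass
          accessModes: [ReadWriteOnce]            # assumed
          resources:
            requests:
              storage: 200Gi
  vmselect:
    replicaCount: 2
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
  vminsert:
    replicaCount: 2
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
```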

Notes

  • Data wipe on cutover is acceptable (confirmed), so the old helm PVCs can be deleted.
  • Verify at rollout: pods resolve *.main.unkin.net node FQDNs (needed for CA SAN match on scrape targets); /targets shows job=consul.
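
When checking /targets, it may help to have the shape of the Consul SD job in mind. The following is a hedged sketch in Prometheus-compatible scrape config syntax, not the config from this PR: the Consul port, the CA file path, and the assumption that metrics_scheme and metrics_job are Consul service metadata keys are all illustrative guesses. How the job is wired into the VMAgent is not shown.

```yaml
# Sketch only; names marked "assumed" or "hypothetical" are not from the PR.
- job_name: consul
  consul_sd_configs:
    - server: consul.service.consul:8500       # port assumed
  tls_config:
    ca_file: /etc/vmagent/vault-ca/ca.crt      # hypothetical mount path for vault-ca-cert
  relabel_configs:
    # Keep only services that carry the "metrics" tag.
    # __meta_consul_tags is the tag list joined and surrounded by ",".
    - action: keep
      source_labels: [__meta_consul_tags]
      regex: .*,metrics,.*
    # Take the scrape scheme from service metadata when it is set (assumed key).
    - source_labels: [__meta_consul_service_metadata_metrics_scheme]
      regex: (.+)
      target_label: __scheme__
    # Take the job label from service metadata when it is set (assumed key).
    - source_labels: [__meta_consul_service_metadata_metrics_job]
      regex: (.+)
      target_label: job
```

Because there is no insecure skip-verify, each target's address has to match a SAN in its certificate, which is why the FQDN resolution check above matters.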
unkinben added 1 commit 2026-07-05 22:12:21 +10:00
observability: migrate VictoriaMetrics from helm charts to operator CRDs
ci/woodpecker/pr/pre-commit Pipeline was successful
ci/woodpecker/pr/kubeconform Pipeline was successful
41ab3ff614
The k8s au-syd1 VictoriaMetrics stack ran as two helm charts
(victoria-metrics-cluster + victoria-metrics-agent) and only scraped
in-cluster targets. The victoria-metrics-operator already runs in
vm-system, so move the stack onto operator-managed CRDs. This lets the
VMAgent consume VMServiceScrape/VMPodScrape (auto-converted from
Prometheus ServiceMonitors) and adds Consul service discovery so the
cluster scrapes the same puppet-prod targets as the puppet vmagent.

Changes:
- Add VMCluster `main`: vmstorage 2 replicas (down from 3, replicationFactor
  2, cephrbd-fast-delete 200Gi, 180d retention), vminsert/vmselect 2 replicas
  + HPA (2-10, 60% cpu).
- Add VMAgent `main`: keeps the kubernetes SD jobs (apiservers/nodes/cadvisor),
  selectAllByDefault for VMServiceScrape/VMPodScrape, and a Consul SD job
  against consul.service.consul (puppet Consul) replicating the puppet vmagent
  relabels (keep tag `metrics`, scheme from `metrics_scheme`, job from
  `metrics_job`). TLS verified against the reflected vault-ca-cert (no
  insecure skip-verify).
- Expose vmselect/vminsert/vmagent via Gateway API (traefik-internal Gateway +
  HTTPRoute, http->https redirect), same hostnames as before.
- Remove the two helm charts, their values files, and vendored charts.
benvin merged commit 5f8f2b1c70 into main 2026-07-05 22:17:32 +10:00
benvin deleted branch benvin/vm-operator-migration 2026-07-05 22:17:32 +10:00
Reference: unkin/argocd-apps#234