observability: migrate VictoriaMetrics to operator CRDs + Consul SD (#234)

## Why

The k8s au-syd1 VictoriaMetrics stack ran as two helm charts and only scraped in-cluster targets. The victoria-metrics-operator already runs in vm-system, so this moves the stack onto operator-managed CRDs. That unlocks VMServiceScrape/VMPodScrape (auto-converted from Prometheus ServiceMonitors, used by a follow-up PR) and adds Consul service discovery so the cluster scrapes the **same puppet-prod targets** as the puppet vmagent. Also shrinks vmstorage 3 → 2 (Ceph-backed, replicationFactor 2).

## Changes

- Add **VMCluster `main`**: vmstorage 2 replicas (cephrbd-fast-delete 200Gi, 180d retention, replicationFactor 2), vminsert/vmselect 2 replicas + HPA (2–10, 60% CPU).
- Add **VMAgent `main`**: retains the kubernetes SD jobs (apiservers/nodes/cadvisor), `selectAllByDefault` for VMServiceScrape/VMPodScrape, and a **Consul SD job** against `consul.service.consul` (resolves to the puppet Consul from pods) replicating the puppet vmagent relabels: keep tag `metrics`, `__scheme__` from `metrics_scheme`, `job` from `metrics_job`. TLS is **verified against the reflected `vault-ca-cert`** (no insecure skip-verify).
- Expose vmselect/vminsert/vmagent via **Gateway API** (traefik-internal Gateway + HTTPRoute, http→https redirect), same hostnames.
- Remove the two helm charts, their values files, and vendored charts.

## Notes

- Data wipe on cutover is acceptable (confirmed); old helm PVCs can be deleted.
- Verify at rollout: pods resolve `*.main.unkin.net` node FQDNs (needed for CA SAN match on scrape targets); `/targets` shows `job=consul`.

Reviewed-on: #234
Co-authored-by: Ben Vincent <ben@unkin.net>
Co-committed-by: Ben Vincent <ben@unkin.net>
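
For reviewers unfamiliar with the puppet vmagent relabels, the Consul SD job described in the commit message could look roughly like the Prometheus-compatible scrape config below. This is a sketch under assumptions, not the config from this PR: the Consul port, the CA mount path, and the idea that `metrics_scheme` and `metrics_job` are Consul service metadata keys are all guesses made for illustration.

```yaml
# Hypothetical sketch of the Consul SD scrape job; values marked "assumed"
# do not come from this PR.
- job_name: consul
  consul_sd_configs:
    - server: consul.service.consul:8500   # assumed port
  tls_config:
    # assumed mount path for the reflected vault-ca-cert; no insecure_skip_verify
    ca_file: /etc/vmagent/tls/ca.crt
  relabel_configs:
    # Keep only services carrying the `metrics` tag. __meta_consul_tags is the
    # tag list joined and wrapped by the separator, e.g. ",metrics,prod,".
    - source_labels: [__meta_consul_tags]
      regex: .*,metrics,.*
      action: keep
    # Take the scrape scheme from `metrics_scheme` (assumed service metadata).
    # With regex (.+), an empty value leaves the default scheme untouched.
    - source_labels: [__meta_consul_service_metadata_metrics_scheme]
      regex: (.+)
      target_label: __scheme__
    # Override the job label from `metrics_job` (assumed service metadata).
    - source_labels: [__meta_consul_service_metadata_metrics_job]
      regex: (.+)
      target_label: job
```

Because the CA is verified rather than skipped, each target address must match a SAN in its certificate, which is why the rollout check on node FQDN resolution matters.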
2026-07-05 22:17:32 +10:00
parent 53b55419a7
commit 5f8f2b1c70
8 changed files with 523 additions and 301 deletions
@@ -0,0 +1,115 @@
+---
+apiVersion: operator.victoriametrics.com/v1beta1
+kind: VMCluster
+metadata:
+  name: main
+  namespace: observability
+spec:
+  retentionPeriod: "180d"
+  replicationFactor: 2
+  vmstorage:
+    replicaCount: 2
+    extraArgs:
+      dedup.minScrapeInterval: 15s
+      loggerFormat: json
+    storage:
+      volumeClaimTemplate:
+        spec:
+          storageClassName: cephrbd-fast-delete
+          accessModes:
+            - ReadWriteOnce
+          resources:
+            requests:
+              storage: 200Gi
+    resources:
+      requests:
+        cpu: "1"
+        memory: 2Gi
+      limits:
+        cpu: "2"
+        memory: 8Gi
+  vmselect:
+    replicaCount: 2
+    extraArgs:
+      dedup.minScrapeInterval: 15s
+      loggerFormat: json
+    resources:
+      requests:
+        cpu: 50m
+        memory: 128Mi
+      limits:
+        cpu: 500m
+        memory: 1024Mi
+    hpa:
+      minReplicas: 2
+      maxReplicas: 10
+      metrics:
+        - type: Resource
+          resource:
+            name: cpu
+            target:
+              type: Utilization
+              averageUtilization: 60
+      behavior:
+        scaleUp:
+          stabilizationWindowSeconds: 0
+          selectPolicy: Max
+          policies:
+            - type: Percent
+              value: 100
+              periodSeconds: 30
+            - type: Pods
+              value: 4
+              periodSeconds: 30
+        scaleDown:
+          stabilizationWindowSeconds: 300
+          selectPolicy: Min
+          policies:
+            - type: Percent
+              value: 10
+              periodSeconds: 60
+            - type: Pods
+              value: 2
+              periodSeconds: 60
+  vminsert:
+    replicaCount: 2
+    extraArgs:
+      loggerFormat: json
+    resources:
+      requests:
+        cpu: 50m
+        memory: 128Mi
+      limits:
+        cpu: 500m
+        memory: 1024Mi
+    hpa:
+      minReplicas: 2
+      maxReplicas: 10
+      metrics:
+        - type: Resource
+          resource:
+            name: cpu
+            target:
+              type: Utilization
+              averageUtilization: 60
+      behavior:
+        scaleUp:
+          stabilizationWindowSeconds: 0
+          selectPolicy: Max
+          policies:
+            - type: Percent
+              value: 100
+              periodSeconds: 30
+            - type: Pods
+              value: 4
+              periodSeconds: 30
+        scaleDown:
+          stabilizationWindowSeconds: 300
+          selectPolicy: Min
+          policies:
+            - type: Percent
+              value: 10
+              periodSeconds: 60
+            - type: Pods
+              value: 2
+              periodSeconds: 60
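
The `hpa.behavior` block used for both vmselect and vminsert above can be read as follows. This is an annotated copy for illustration, not an additional manifest. The arithmetic assumes standard `autoscaling/v2` semantics and ignores any scale events that already happened inside the policy period.

```yaml
# Annotated reading of the hpa.behavior block (illustrative only).
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # act on the newest recommendation immediately
    selectPolicy: Max                # use the policy that allows the larger change
    policies:
      - type: Percent                # at 2 replicas: +100% allows up to 4
        value: 100
        periodSeconds: 30
      - type: Pods                   # at 2 replicas: +4 pods allows up to 6 (wins)
        value: 4
        periodSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 300  # use the highest recommendation of the last 5 min
    selectPolicy: Min                # use the policy that allows the smaller change
    policies:
      - type: Percent                # at 10 replicas: -10% allows down to 9 (wins)
        value: 10
        periodSeconds: 60
      - type: Pods                   # at 10 replicas: -2 pods allows down to 8
        value: 2
        periodSeconds: 60
```

Under these assumptions a sustained load spike can take a component from 2 to 6 replicas in one 30-second period and to the `maxReplicas` cap of 10 in the next, while scale-down proceeds by about one pod per minute and never goes below `minReplicas: 2`.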