observability: migrate VictoriaMetrics to operator CRDs + Consul SD (#234)

## Why
The k8s au-syd1 VictoriaMetrics stack ran as two helm charts and only scraped in-cluster targets. The victoria-metrics-operator already runs in vm-system, so this moves the stack onto operator-managed CRDs. That unlocks VMServiceScrape/VMPodScrape (auto-converted from Prometheus ServiceMonitors, used by a follow-up PR) and adds Consul service discovery so the cluster scrapes the **same puppet-prod targets** as the puppet vmagent. Also shrinks vmstorage 3 → 2 (Ceph-backed, replicationFactor 2).

## Changes
- Add **VMCluster `main`**: vmstorage 2 replicas (cephrbd-fast-delete 200Gi, 180d retention, replicationFactor 2), vminsert/vmselect 2 replicas + HPA (2–10, 60% cpu).
- Add **VMAgent `main`**: retains the kubernetes SD jobs (apiservers/nodes/cadvisor), `selectAllByDefault` for VMServiceScrape/VMPodScrape, and a **Consul SD job** against `consul.service.consul` (resolves to the puppet Consul from pods) replicating the puppet vmagent relabels — keep tag `metrics`, `__scheme__` from `metrics_scheme`, `job` from `metrics_job`. TLS is **verified against the reflected `vault-ca-cert`** (no insecure skip-verify).
- Expose vmselect/vminsert/vmagent via **Gateway API** (traefik-internal Gateway + HTTPRoute, http→https redirect), same hostnames.
- Remove the two helm charts, their values files, and vendored charts.

## Notes
- Data wipe on cutover is acceptable (confirmed) — old helm PVCs can be deleted.
- Verify at rollout: pods resolve `*.main.unkin.net` node FQDNs (needed for CA SAN match on scrape targets); `/targets` shows `job=consul`.

Reviewed-on: #234
Co-authored-by: Ben Vincent <ben@unkin.net>
Co-committed-by: Ben Vincent <ben@unkin.net>
This commit was merged in pull request #234.
This commit is contained in:
2026-07-05 22:17:32 +10:00
committed by BenVincent
parent 53b55419a7
commit 5f8f2b1c70
8 changed files with 523 additions and 301 deletions
+115
View File
@@ -0,0 +1,115 @@
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
name: main
namespace: observability
spec:
retentionPeriod: "180d"
replicationFactor: 2
vmstorage:
replicaCount: 2
extraArgs:
dedup.minScrapeInterval: 15s
loggerFormat: json
storage:
volumeClaimTemplate:
spec:
storageClassName: cephrbd-fast-delete
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 200Gi
resources:
requests:
cpu: "1"
memory: 2Gi
limits:
cpu: "2"
memory: 8Gi
vmselect:
replicaCount: 2
extraArgs:
dedup.minScrapeInterval: 15s
loggerFormat: json
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 500m
memory: 1024Mi
hpa:
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
behavior:
scaleUp:
stabilizationWindowSeconds: 0
selectPolicy: Max
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
scaleDown:
stabilizationWindowSeconds: 300
selectPolicy: Min
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60
vminsert:
replicaCount: 2
extraArgs:
loggerFormat: json
resources:
requests:
cpu: 50m
memory: 128Mi
limits:
cpu: 500m
memory: 1024Mi
hpa:
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
behavior:
scaleUp:
stabilizationWindowSeconds: 0
selectPolicy: Max
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 4
periodSeconds: 30
scaleDown:
stabilizationWindowSeconds: 300
selectPolicy: Min
policies:
- type: Percent
value: 10
periodSeconds: 60
- type: Pods
value: 2
periodSeconds: 60