argocd-apps/apps/base/observability/vmcluster.yaml
unkinben 41ab3ff614
ci/woodpecker/pr/pre-commit Pipeline was successful
ci/woodpecker/pr/kubeconform Pipeline was successful
observability: migrate VictoriaMetrics from helm charts to operator CRDs
The k8s au-syd1 VictoriaMetrics stack ran as two helm charts
(victoria-metrics-cluster + victoria-metrics-agent) and only scraped
in-cluster targets. The victoria-metrics-operator already runs in
vm-system, so move the stack onto operator-managed CRDs. This lets the
VMAgent consume VMServiceScrape/VMPodScrape (auto-converted from
Prometheus ServiceMonitors) and adds Consul service discovery so the
cluster scrapes the same puppet-prod targets as the puppet vmagent.

Changes:
- Add VMCluster `main`: vmstorage with 2 replicas (down from 3),
  replicationFactor 2, 200Gi cephrbd-fast-delete volumes, 180d retention;
  vminsert/vmselect with 2 replicas each + HPA (2-10 replicas, 60% cpu target).
- Add VMAgent `main`: keeps the kubernetes SD jobs (apiservers/nodes/cadvisor),
  selectAllByDefault for VMServiceScrape/VMPodScrape, and a Consul SD job
  against consul.service.consul (puppet Consul) replicating the puppet vmagent
  relabels (keep tag `metrics`, scheme from `metrics_scheme`, job from
  `metrics_job`). TLS verified against the reflected vault-ca-cert (no
  insecure skip-verify).
- Expose vmselect/vminsert/vmagent via Gateway API (traefik-internal Gateway +
  HTTPRoute, http->https redirect), same hostnames as before.
- Remove the two helm charts, their values files, and vendored charts.
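
For illustration, the VMAgent and Gateway API bullets above might translate into
manifests roughly like the sketch below. This is not the manifest from this
commit: the Consul port, the placement of `tls_config`, the Secret name and key,
the treatment of `metrics_scheme`/`metrics_job` as Consul service metadata, the
listener names, the route names, and the hostname are all assumptions or
placeholders. The `vminsert-main`/`vmselect-main` Service names assume the
operator's usual `<component>-<cluster name>` naming.

```yaml
# Sketch only. Anything marked "assumed" is not taken from the commit.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: main
  namespace: observability
spec:
  selectAllByDefault: true          # pick up every VMServiceScrape/VMPodScrape
  remoteWrite:
    # assumed: operator-created vminsert Service for VMCluster `main`
    - url: http://vminsert-main.observability.svc:8480/insert/0/prometheus/api/v1/write
  secrets:
    - vault-ca-cert                 # assumed Secret name, mounted under /etc/vm/secrets/
  inlineScrapeConfig: |
    - job_name: consul-puppet       # assumed job name
      consul_sd_configs:
        - server: consul.service.consul:8500    # assumed port
      tls_config:
        ca_file: /etc/vm/secrets/vault-ca-cert/ca.crt   # assumed key name
      relabel_configs:
        # keep only services carrying the `metrics` tag
        - source_labels: [__meta_consul_tags]
          regex: .*,metrics,.*
          action: keep
        # scrape scheme from `metrics_scheme` (assumed to be service metadata)
        - source_labels: [__meta_consul_service_metadata_metrics_scheme]
          regex: (.+)
          target_label: __scheme__
        # job label from `metrics_job` (assumed to be service metadata)
        - source_labels: [__meta_consul_service_metadata_metrics_job]
          regex: (.+)
          target_label: job
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vmselect                    # assumed name
  namespace: observability
spec:
  parentRefs:
    - name: traefik-internal
      sectionName: websecure        # assumed HTTPS listener name
  hostnames:
    - vmselect.example.internal     # placeholder hostname
  rules:
    - backendRefs:
        - name: vmselect-main       # assumed operator-created Service
          port: 8481
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: vmselect-redirect           # assumed name
  namespace: observability
spec:
  parentRefs:
    - name: traefik-internal
      sectionName: web              # assumed HTTP listener name
  hostnames:
    - vmselect.example.internal     # placeholder hostname
  rules:
    - filters:
        - type: RequestRedirect     # http -> https redirect
          requestRedirect:
            scheme: https
            statusCode: 301
```

The `regex: (.+)` guards mean a target without the metadata key keeps the
default scheme and job name instead of being relabelled to an empty value.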
2026-07-05 22:12:04 +10:00

116 lines
2.6 KiB
YAML

---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: main
  namespace: observability
spec:
  retentionPeriod: "180d"
  replicationFactor: 2
  vmstorage:
    replicaCount: 2
    extraArgs:
      dedup.minScrapeInterval: 15s
      loggerFormat: json
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: cephrbd-fast-delete
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 200Gi
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 8Gi
  vmselect:
    replicaCount: 2
    extraArgs:
      dedup.minScrapeInterval: 15s
      loggerFormat: json
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 1024Mi
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          selectPolicy: Max
          policies:
            - type: Percent
              value: 100
              periodSeconds: 30
            - type: Pods
              value: 4
              periodSeconds: 30
        scaleDown:
          stabilizationWindowSeconds: 300
          selectPolicy: Min
          policies:
            - type: Percent
              value: 10
              periodSeconds: 60
            - type: Pods
              value: 2
              periodSeconds: 60
  vminsert:
    replicaCount: 2
    extraArgs:
      loggerFormat: json
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 1024Mi
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          selectPolicy: Max
          policies:
            - type: Percent
              value: 100
              periodSeconds: 30
            - type: Pods
              value: 4
              periodSeconds: 30
        scaleDown:
          stabilizationWindowSeconds: 300
          selectPolicy: Min
          policies:
            - type: Percent
              value: 10
              periodSeconds: 60
            - type: Pods
              value: 2
              periodSeconds: 60