5f8f2b1c70
## Why

The k8s au-syd1 VictoriaMetrics stack ran as two helm charts and only scraped in-cluster targets. The victoria-metrics-operator already runs in vm-system, so this moves the stack onto operator-managed CRDs. That unlocks VMServiceScrape/VMPodScrape (auto-converted from Prometheus ServiceMonitors, used by a follow-up PR) and adds Consul service discovery so the cluster scrapes the **same puppet-prod targets** as the puppet vmagent. Also shrinks vmstorage 3 → 2 (Ceph-backed, replicationFactor 2).

## Changes

- Add **VMCluster `main`**: vmstorage 2 replicas (cephrbd-fast-delete 200Gi, 180d retention, replicationFactor 2), vminsert/vmselect 2 replicas + HPA (2–10, 60% cpu).
- Add **VMAgent `main`**: retains the kubernetes SD jobs (apiservers/nodes/cadvisor), `selectAllByDefault` for VMServiceScrape/VMPodScrape, and a **Consul SD job** against `consul.service.consul` (resolves to the puppet Consul from pods) replicating the puppet vmagent relabels: keep tag `metrics`, `__scheme__` from `metrics_scheme`, `job` from `metrics_job`. TLS is **verified against the reflected `vault-ca-cert`** (no insecure skip-verify).
- Expose vmselect/vminsert/vmagent via **Gateway API** (traefik-internal Gateway + HTTPRoute, http→https redirect), same hostnames.
- Remove the two helm charts, their values files, and vendored charts.

## Notes

- Data wipe on cutover is acceptable (confirmed); old helm PVCs can be deleted.
- Verify at rollout: pods resolve `*.main.unkin.net` node FQDNs (needed for CA SAN match on scrape targets); `/targets` shows `job=consul`.

Reviewed-on: #234
Co-authored-by: Ben Vincent <ben@unkin.net>
Co-committed-by: Ben Vincent <ben@unkin.net>
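For orientation, the Consul SD relabels described in the commit message could look roughly like the Prometheus-style scrape job below. This is only a sketch, not the VMAgent manifest from this change: the Consul port, the CA file path, and the assumption that `metrics_scheme` and `metrics_job` are Consul service metadata keys are all illustrative guesses.

```yaml
# Hypothetical sketch of a Consul SD scrape job; not taken from the actual VMAgent manifest.
- job_name: consul
  consul_sd_configs:
    - server: consul.service.consul:8500   # assumption: default Consul HTTP port
  tls_config:
    ca_file: /path/to/vault-ca.crt         # assumption: placeholder path for the mounted vault-ca-cert
  relabel_configs:
    # Keep only services that carry the `metrics` tag.
    # __meta_consul_tags is the tag list joined and wrapped with commas.
    - source_labels: [__meta_consul_tags]
      regex: .*,metrics,.*
      action: keep
    # Take the scrape scheme from the service's metrics_scheme value (assumed to be service metadata).
    - source_labels: [__meta_consul_service_metadata_metrics_scheme]
      regex: (.+)
      target_label: __scheme__
    # Take the job label from the service's metrics_job value (assumed to be service metadata).
    - source_labels: [__meta_consul_service_metadata_metrics_job]
      regex: (.+)
      target_label: job
```

The `regex: (.+)` guards leave `__scheme__` and `job` at their defaults when a service does not set the corresponding value.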
116 lines
2.6 KiB
YAML
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: main
  namespace: observability
spec:
  retentionPeriod: "180d"
  replicationFactor: 2
  vmstorage:
    replicaCount: 2
    extraArgs:
      dedup.minScrapeInterval: 15s
      loggerFormat: json
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: cephrbd-fast-delete
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 200Gi
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 8Gi
  vmselect:
    replicaCount: 2
    extraArgs:
      dedup.minScrapeInterval: 15s
      loggerFormat: json
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 1024Mi
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          selectPolicy: Max
          policies:
            - type: Percent
              value: 100
              periodSeconds: 30
            - type: Pods
              value: 4
              periodSeconds: 30
        scaleDown:
          stabilizationWindowSeconds: 300
          selectPolicy: Min
          policies:
            - type: Percent
              value: 10
              periodSeconds: 60
            - type: Pods
              value: 2
              periodSeconds: 60
  vminsert:
    replicaCount: 2
    extraArgs:
      loggerFormat: json
    resources:
      requests:
        cpu: 50m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 1024Mi
    hpa:
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          selectPolicy: Max
          policies:
            - type: Percent
              value: 100
              periodSeconds: 30
            - type: Pods
              value: 4
              periodSeconds: 30
        scaleDown:
          stabilizationWindowSeconds: 300
          selectPolicy: Min
          policies:
            - type: Percent
              value: 10
              periodSeconds: 60
            - type: Pods
              value: 2
              periodSeconds: 60