Commit Graph

173 Commits

Author SHA1 Message Date
unkinben 0199a422a0 Merge branch 'main' into fix-kanidm-replication-config
ci/woodpecker/pr/pre-commit Pipeline was successful
ci/woodpecker/pr/kubeconform Pipeline was successful
2026-05-30 23:33:36 +10:00
unkinben f11ec1056d fix(kanidm): remove invalid automatic_refresh from replication config (#179)
Reviewed-on: #179
2026-05-30 23:20:48 +10:00
unkinben 1b4b22cad8 fix(kanidm): remove invalid automatic_refresh from replication config
ci/woodpecker/pr/pre-commit Pipeline was successful
ci/woodpecker/pr/kubeconform Pipeline was successful
2026-05-30 23:15:41 +10:00
unkinben ed7feaf19a Update apps/base/kanidm/vaultauth.yaml (#177)
Fix the VaultAuth object

Reviewed-on: #177
2026-05-30 23:11:38 +10:00
unkinben 4d594fbde7 feat(kanidm): vault-managed replication certs with auto-restart (#176)
- Store per-pod replication certs in Vault (kv/kubernetes/namespace/kanidm/default/repl-certs)
- VaultAuth + VaultStaticSecret sync certs to kanidm-repl-certs Secret
- busybox config-init init container injects peer certs from Secret into server.toml at startup
- Remove hardcoded partner_cert entries from per-pod server.toml templates
- Add automatic_refresh = true to all replication configs
- Add reloader.stakater.com/auto annotation to trigger rolling restart on ConfigMap/Secret changes
- Document domain UUID mismatch resolution and cert rotation in README

Reviewed-on: #176
2026-05-30 23:00:46 +10:00
unkinben 1b781e0885 feat(woodpecker): set workflow pod priority class to power (#175)
## Summary
Sets `WOODPECKER_BACKEND_K8S_PRIORITY_CLASS: power` on the Woodpecker agent so all CI pipeline pods are scheduled with the `power` PriorityClass (value 100, preemptionPolicy: Never).

This means pipeline pods can be evicted when the cluster is under pressure but won't preempt other workloads.

## Dependency
Requires the `power` PriorityClass to exist on the cluster — deploy PR #174 (priority-classes app) first.

## Test plan
- Trigger a pipeline run and confirm pods are created with `priorityClassName: power`
- `kubectl get pod -n woodpecker -o jsonpath='{.items[*].spec.priorityClassName}'`

Reviewed-on: #175
2026-05-26 23:58:57 +10:00
unkinben ede25a3858 feat(platform): add priority-classes app with low/power/medium/high classes (#174)
## Summary
- New `apps/base/priority-classes/` app with four `PriorityClass` objects managed via the `platform` ArgoCD project
- Adds `apps/overlays/*/priority-classes` to the platform ApplicationSet generator
- Adds `priority-classes` namespace to platform AppProject destinations (required even for cluster-scoped resources)

| Class | Value | PreemptionPolicy | Intent |
|---|---|---|---|
| `low` | 100 | Never | Background work; evictable, won't preempt others |
| `power` | 100 | Never | Compute-heavy but expendable (e.g. AI/ML workloads) |
| `medium` | 10000 | PreemptLowerPriority | Standard services |
| `high` | 100000 | PreemptLowerPriority | Critical services; preempts lower-priority pods |

`PriorityClass` is already in the platform project's `clusterResourceWhitelist` so no project policy changes were needed.

## Test plan
- ArgoCD syncs `platform-priority-classes` successfully
- `kubectl get priorityclasses low power medium high` shows all four classes

Reviewed-on: #174
2026-05-26 23:41:54 +10:00
unkinben f5f713fe86 feat(artifactapi): add open-webui/open-webui to ghcr immutable patterns (#173)
Part of #155 (prerequisite for open-webui deployment PR #172).

## Summary
- Adds `^open-webui/open-webui` to the `ghcr` remote's `immutable_patterns` in `remote-docker.yaml` so version-pinned open-webui image pulls are cached indefinitely through artifactapi

## Test plan
- artifactapi serves `ghcr.io/open-webui/open-webui:<version>` with `X-Artifact-Source: cache` on second fetch

Reviewed-on: #173
2026-05-26 23:28:27 +10:00
unkinben 3990fbfe06 feat(vault): switch to Kubernetes service registration (#171)
Replaces Consul service registration with the native Kubernetes provider so Vault labels its own pods with active/standby/perf-standby status without requiring a Consul dependency.

## Changes
- `values.yaml`: swap `service_registration "consul"` for `service_registration "kubernetes" {}`, add `VAULT_K8S_NAMESPACE` and `VAULT_K8S_POD_NAME` env vars via downward API
- `role_k8s-service-registration.yaml`: Role + RoleBinding granting the `vault` service account `get`/`update`/`patch` on pods
- `kustomization.yaml`: include new RBAC file

Reviewed-on: #171
2026-05-26 00:06:56 +10:00
unkinben d358098fff chore: update replication certs (#170)
- add replication certs for kanidm-0, kanidm-1 and kanidm-2

Reviewed-on: #170
2026-05-25 23:52:06 +10:00
unkinben 201e601737 feat: update kanidm replicaiton (#169)
- split to per-server configs
- remove init containers that attempted to automate the replication config
- add README.md

Reviewed-on: #169
2026-05-25 23:25:48 +10:00
unkinben d230d87ec9 feat(artifactapi): add conftest to GitHub generic remote cache (#168)
## Summary

- Adds `open-policy-agent/conftest/.*/conftest_.*_Linux_x86_64.tar.gz$` to the `github` remote immutable patterns in artifactapi

## Why

conftest v0.68.2 (https://github.com/open-policy-agent/conftest/releases/tag/v0.68.2) is now used for OPA policy checks in CI (see #167). Caching the release tarball in artifactapi reduces external dependency on GitHub during builds.

Reviewed-on: #168
2026-05-25 22:44:57 +10:00
unkinben 6497dab25e fix(puppet): remove explicit clusterIP: null from puppetdb Service (#166)
## Summary

- Removes `clusterIP: null` from the `puppetdb` Service spec

## Why

Setting `clusterIP: null` makes ArgoCD's desired state explicit about the field being null. Kubernetes assigns a real IP on creation and the field is immutable afterward. The null vs assigned-IP mismatch causes permanent OutOfSync on the puppetdb Service. Removing the field means ArgoCD no longer claims ownership of `clusterIP`, so the API server's value is authoritative.

Reviewed-on: #166
2026-05-25 22:44:24 +10:00
unkinben f403c6b05d fix(kanidm): add explicit group/kind/weight to TLSRoute refs (#165)
## Summary

- Adds `group: gateway.networking.k8s.io` and `kind: Gateway` to `parentRefs`
- Adds `group: ""`, `kind: Service`, and `weight: 1` to `backendRefs`

## Why

The Gateway API controller defaults these fields when creating/updating TLSRoute objects, so the live state always has them. ArgoCD diffs desired vs live by string comparison, causing the `kanidm` TLSRoute to show permanent OutOfSync. Same root cause as #162 (HTTPRoutes).

Reviewed-on: #165
2026-05-25 22:43:52 +10:00
unkinben ac8b8212bd fix(consul): normalize cpu limit to canonical string form (#164)
## Summary

- Changes `server.resources.limits.cpu` from `1000m` to `"1"` in consul Helm values

## Why

`1000m` (1000 milliCPU) is equivalent to `1` CPU, but Kubernetes normalizes the value to `"1"` when storing. ArgoCD diffs desired vs live by string comparison, so the mismatch causes a permanent OutOfSync on the `consul-server` StatefulSet. Same root cause as #163.

Reviewed-on: #164
2026-05-25 22:43:35 +10:00
unkinben dd282f59fb fix(litellm): normalize postgres cluster resource values (#163)
## Summary

- Changes `limits.memory` from `1024Mi` to `1Gi` (same value, canonical form)
- Changes `limits.cpu` from `1` (integer) to `"1"` (string, canonical form)

## Why

Kubernetes normalizes resource quantities on write — `1024Mi` becomes `1Gi` and integer `1` becomes string `"1"`. ArgoCD diffs by string comparison, so these equivalent values cause a permanent OutOfSync on the `litellm-postgres` Cluster.

Reviewed-on: #163
2026-05-24 23:30:10 +10:00
unkinben 1890dd4bda fix(gateways): add explicit group/kind/weight to all HTTPRoute refs (#162)
## Summary

- Adds `group: gateway.networking.k8s.io` and `kind: Gateway` to all `parentRefs` entries
- Adds `group: ""`, `kind: Service`, and `weight: 1` to all `backendRefs` entries
- Affects 9 HTTPRoute files across artifactapi, cattle-system, consul, kanidm, litellm, paperclip, puppet, and vault

## Why

ArgoCD diffs the desired manifest against the live Kubernetes object. The Gateway API controller defaults these fields when creating/updating objects, so the live state always has them — causing persistent OutOfSync for every HTTPRoute. Same root cause as #153 (certificateRefs).

## Test plan

- [ ] All affected ArgoCD applications show Synced after merge

Reviewed-on: #162
2026-05-24 20:32:37 +10:00
unkinben 6815b66010 fix(kanidm): use dockerhub image instead of ghcr.io (#161)
## Summary

- Changes both `config-init` init container and `kanidm` container images from `ghcr.io/kanidm/server:1.10.3` to `kanidm/server:1.10.3`

## Why

`kanidm/server` is published on Docker Hub, not ghcr.io. RKE2 rewrites dockerhub pulls through the artifactapi mirror automatically.

## Test plan

- [ ] Pods roll successfully after ArgoCD sync
- [ ] Verify kanidm cluster replication still healthy

Reviewed-on: #161
2026-05-24 20:27:21 +10:00
unkinben 7cbec33588 fix(artifactapi): move kanidm to dockerhub remote (#160)
## Summary

- Removes `^kanidm/` from the `ghcr` remote immutable_patterns
- Adds `^kanidm/` to the `dockerhub` remote immutable_patterns

## Why

`kanidm/server` is published on Docker Hub, not ghcr.io. Pulling via the `ghcr` cache was failing with 403 on anonymous token fetch → 502 Bad Gateway.

## Test plan

- [ ] `docker pull artifactapi.k8s.syd1.au.unkin.net/dockerhub/kanidm/server:1.10.3` succeeds after artifactapi redeploys

Reviewed-on: #160
2026-05-24 20:24:33 +10:00
unkinben 3756208ccd benvin/kanidm (#159)
Reviewed-on: #159
2026-05-24 19:55:22 +10:00
unkinben 6ce92e8ead benvin/artifactapi-mail-images (#158)
Reviewed-on: #158
2026-05-24 14:44:38 +10:00
unkinben af79d86db6 feat(artifactapi): cache stalwart webadmin zip (#157)
## Summary

- Adds \`stalwartlabs/webadmin/releases/latest/download/webadmin.zip\` to \`mutable_patterns\` in the \`github\` generic remote so the stalwart webadmin UI can be fetched through artifactapi rather than directly from GitHub.

## Notes

- Uses \`mutable_patterns\` (not \`immutable\`) because \`releases/latest\` resolves to whichever release is current and changes over time.
- Access URL: \`https://artifactapi.k8s.syd1.au.unkin.net/generic/github/stalwartlabs/webadmin/releases/latest/download/webadmin.zip\`

Reviewed-on: #157
2026-05-24 12:55:16 +10:00
unkinben 5f4c9225bb feat(artifactapi): add mail stack images to docker registry cache (#156)
- ghcr: stalwartlabs/stalwart (Stalwart mail server)
- dockerhub: rspamd/rspamd (spam filter), tozd/postfix (MTA gateway)

Reviewed-on: #156
2026-05-24 12:42:27 +10:00
unkinben cbc2c1cb9f fix(gateways): add explicit group: "" to all certificateRefs entries (#153)
The Gateway API admission server defaults certificateRefs[].group to ""
when it is omitted. ArgoCD diffed the desired state (no group field) against
the live state (group: "") and flagged every gateway as out of sync.

Fix: explicitly set group: "" in all certificateRefs entries so the
rendered manifest matches the API server's canonical form exactly.

Affected: artifactapi, cattle-system, consul, litellm, paperclip,
puppet (puppetboard + puppetdb), vault.

Reviewed-on: #153
2026-05-23 23:47:24 +10:00
unkinben c6f9893804 fix(argocd): add vault and consul to platform project destinations (#152)
Vault and consul namespaces were missing from the platform AppProject
allowed destinations, causing ArgoCD sync failures with:
  destination server 'https://kubernetes.default.svc' and namespace
  'vault' do not match any of the allowed destinations in project 'platform'

Reviewed-on: #152
2026-05-23 23:27:24 +10:00
unkinben e43fb742ad feat(artifactapi): add kanidm to ghcr docker immutable patterns (#151)
Prerequisite for kanidm deployment (PR benvin/kanidm).

Reviewed-on: #151
2026-05-23 23:09:38 +10:00
unkinben 11ac2ae91e feat(consul): deploy HashiCorp Consul 1.22.7 via Helm chart (5-replica cluster) (#149)
## Summary

- Deploys HashiCorp Consul 1.22.7 using Helm chart 1.9.7 with 5 server replicas
- Configuration modelled on production consul: \`datacenter=au-syd1\`, \`connect=true\`, \`raft_multiplier=10\`, HTTP on 8500, GRPC on 8502, HTTPS disabled
- 5-replica server cluster with \`bootstrapExpect=5\`
- 10Gi cephrbd-fast-delete PVC per server pod
- Gateway API: HTTPS gateway + HTTPRoute (443→consul-consul-ui:80→8500) at \`consul.k8s.syd1.au.unkin.net\`
- PodDisruptionBudget patched from \`policy/v1beta1\` to \`policy/v1\` (k8s 1.25+ compatibility)
- ArgoCD platform ApplicationSet updated to include consul overlay path
- Clients disabled (server-only deployment)
- ConnectInject disabled (can be enabled later for service mesh)

## Requires

- PR #147 (artifactapi: add hashicorp/consul to docker immutable patterns) to be merged first

## Test plan

- [ ] Sandbox tested in \`sandbox-consul\`: all 5 server pods 1/1 Running, cluster formed
- [ ] After merge: ArgoCD syncs consul namespace
- [ ] Verify \`consul.k8s.syd1.au.unkin.net\` is accessible via Gateway

Reviewed-on: #149
2026-05-23 22:40:49 +10:00
unkinben d2be521878 feat(vault): deploy HashiCorp Vault 2.0.1 via Helm chart (5-replica HA raft) (#148)
## Summary

- Deploys HashiCorp Vault 2.0.1 using Helm chart 0.32.0 in HA raft mode (5 replicas)
- Configuration modelled on production vault: \`disable_mlock=true\`, headless-DNS retry_join for all 5 pods
- IPC_LOCK capability added via \`server.statefulSet.securityContext.container\`
- 10Gi cephrbd-fast-delete PVC per pod via \`dataStorage\`
- Gateway API: HTTPS gateway + HTTPRoute (443→vault service port 8200) at \`vault.k8s.syd1.au.unkin.net\`
- ArgoCD platform ApplicationSet updated to include vault overlay path
- Injector disabled (no agent sidecar injection needed)

## Requires

- PR #147 (artifactapi: add hashicorp/vault to docker immutable patterns) to be merged first

## Test plan

- [ ] Sandbox tested in \`sandbox-vault\`: all 5 pods Running, raft cluster forming
- [ ] After merge: ArgoCD syncs vault namespace
- [ ] Operator runs \`vault operator init\` to initialize, then unseals all 5 nodes
- [ ] Verify \`vault.k8s.syd1.au.unkin.net\` is accessible via Gateway

Reviewed-on: #148
2026-05-23 22:39:41 +10:00
unkinben bcd4c1a722 feat(cert-manager): upgrade to v1.20.2 and enable Gateway API support (#150)
## Summary

- Upgrades cert-manager from v1.19.2 to v1.20.2
- Enables `enableGatewayAPI: true` via the `ControllerConfiguration` config block

## Why

cert-manager's Gateway API integration was not enabled. Without it, `cert-manager.io/*` annotations on Gateway resources are ignored and no certificates are issued. This is required for the consul and vault PRs (#148, #149) to have their TLS certs automatically provisioned from their Gateway annotations.

In v1.20.2, `ExperimentalGatewayAPISupport` is BETA and defaults to true — enabling `enableGatewayAPI` in the controller config activates the gateway-shim controller.

## Test plan

- [ ] cert-manager rolls out cleanly (v1.20.2 pods become Ready)
- [ ] After rollout, existing Gateway-annotated services (artifactapi, puppet, litellm) retain valid certs
- [ ] New Gateway resources with `cert-manager.io/cluster-issuer` annotations trigger Certificate creation

Reviewed-on: #150
2026-05-23 22:38:39 +10:00
unkinben 6d9530b1ee feat(artifactapi): add hashicorp/consul and hashicorp/vault to docker immutable patterns (#147)
## Summary

- Adds \`^hashicorp/consul\` and \`^hashicorp/vault\` to the dockerhub immutable_patterns in artifactapi's remote-docker.yaml
- Replaces the more specific \`^hashicorp/vault-secrets-operator\` pattern since \`^hashicorp/vault\` subsumes it
- Required for the benvin/vault and benvin/consul branches (vault:2.0.1 and consul:1.22.7)

## Test plan

- [ ] Verify artifactapi accepts requests for hashicorp/vault and hashicorp/consul images after merge

Reviewed-on: #147
2026-05-23 18:21:25 +10:00
unkinben dcea768c15 feat(woodpecker): upgrade to v3.14.1 (chart 3.6.3) (#146)
Reviewed-on: #146
2026-05-23 18:00:55 +10:00
unkinben e05f9bfd83 feat: increase litellm resources (#144)
finding litellm performance has dropped, crashed in multiple cases, and
then it had scaled to the maximum level using the majority of memory in
cluster.

- reduce the rate at which litellm autoscales
- increase the requests/limits to match usage

Reviewed-on: #144
2026-05-23 17:59:43 +10:00
unkinben 445d8b6e7e feat: add HTTP→HTTPS redirect to Gateway API services (#145)
Add port 80 HTTP listener and redirect HTTPRoute to artifactapi,
cattle-system (rancher), litellm, paperclip, and puppetboard — restoring
the redirect behaviour that existed on the previous nginx/traefik Ingress
resources.

Reviewed-on: #145
2026-05-23 17:34:07 +10:00
unkinben c2637da068 feat(artifactapi): migrate Ingress to Gateway API (#129)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway

## Notes

The original Ingress had nginx-specific annotations (`proxy-body-size: 10g`, `proxy-read-timeout: 600`) which are not portable to Gateway API. These can be re-introduced via a Traefik `Middleware` CRD if needed.

## Test plan

- [ ] ArgoCD syncs the app cleanly
- [ ] cert-manager issues the `artifactapi-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://artifactapi.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #129
2026-05-23 16:06:33 +10:00
unkinben 90ddd932fe feat(puppet): migrate puppetdb Ingress to Gateway API (#131)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
- `ingress_puppetboard.yaml` is unchanged in this PR (separate PR)

## Test plan

- [ ] ArgoCD syncs the puppet app cleanly
- [ ] cert-manager issues the `puppetdb-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://puppetdb.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #131
2026-05-23 16:05:26 +10:00
unkinben 2c6d88aa6b feat(puppet): migrate puppetboard Ingress to Gateway API (#130)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
- `ingress_puppetdb.yaml` is unchanged in this PR (separate PR)

## Test plan

- [ ] ArgoCD syncs the puppet app cleanly
- [ ] cert-manager issues the `puppetboard-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://puppetboard.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #130
2026-05-23 01:31:28 +10:00
unkinben 58368948d9 feat(paperclip): migrate Ingress to Gateway API (#133)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway

## Test plan

- [ ] ArgoCD syncs the paperclip app cleanly
- [ ] cert-manager issues the `paperclip-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://paperclip.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #133
2026-05-23 01:31:03 +10:00
unkinben 4f5c3f7ea0 feat(litellm): migrate Ingress to Gateway API (#134)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway

## Test plan

- [ ] ArgoCD syncs the litellm app cleanly
- [ ] cert-manager issues the `litellm-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://litellm.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #134
2026-05-23 01:29:54 +10:00
unkinben fd87cb96b5 feat(externaldns): upgrade to 1.21.1, fix sources for installed CRDs (#143)
## Changes

- Upgrade external-dns chart from 1.19.0 → 1.21.1 (app v0.19.0 → v0.21.0)
- Remove `gateway-tcproute` source — `TCPRoute` CRD is not installed, causing crash-loop
- Add `gateway-tlsroute` source — `TLSRoute` CRD is installed (Gateway API 1.5.1)

## Why

The pod was crash-looping every minute with `failed to list *v1alpha2.TCPRoute: the server could not find the requested resource` because the TCPRoute CRD doesn't exist in this cluster. TLSRoute was previously removed but its CRD does exist.

Reviewed-on: #143
2026-05-23 01:28:20 +10:00
unkinben d619f9195e benvin/externaldns_compatability (#142)
Reviewed-on: #142
2026-05-23 01:17:20 +10:00
unkinben 1944dbbfcd temp: enable debug logging on externaldns to diagnose TLSRoute sync timeout (#140)
Temporary: enable --log-level=debug to understand why the TLSRoute informer never reports HasSynced within the 1m interval. To be closed/reverted after root cause is found.
Reviewed-on: #140
2026-05-23 01:07:45 +10:00
unkinben 0940cc20f8 fix(traefik): listen on port 443 directly for Gateway API compatibility (#138)
## Problem

Gateway listeners with `port: 443` were rejected with `PortUnavailable: Cannot find entryPoint for Gateway: no matching entryPoint for port 443 and protocol "HTTPS"`.

Traefik matches Gateway listener ports against its internal entryPoint ports (pod-level), not the Service's `exposedPort`. The `websecure` entryPoint was configured on port `8443`, so port `443` listeners had no match.

## Fix

- `ports.websecure.port: 443` — Traefik now binds directly on 443
- `securityContext.capabilities.add: [NET_BIND_SERVICE]` — allows a non-root process to bind to privileged ports (<1024)

The Service `exposedPort` stays at `443`, so external connectivity is unchanged. All existing Gateway listeners (`port: 443`) are correct as-is.

Applies to both internal and external Traefik instances.

## Test plan

- [ ] Traefik pods restart cleanly
- [ ] `kubectl get gateway -A` shows listeners as `Programmed: True`
- [ ] `https://rancher.k8s.syd1.au.unkin.net` (already merged) is reachable

Reviewed-on: #138
2026-05-23 00:44:13 +10:00
unkinben 20ce2b1b92 feat(cattle-system): migrate rancher Ingress to Gateway API (#132)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway

## Test plan

- [ ] ArgoCD syncs the cattle-system app cleanly
- [ ] cert-manager issues the `rancher-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://rancher.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #132
2026-05-23 00:24:57 +10:00
unkinben 64dc5a0242 fix(traefik): add instance labels to GatewayClasses (#137)
## Problem

GatewayClasses were `Unknown` even after controllerName was fixed. The `kubernetesGateway` `labelSelector` applies to all watched resources, including GatewayClasses themselves. Since neither GatewayClass had a `traefik.io/instance` label, both Traefik instances filtered them out and never accepted them.

## Fix

- `gatewayclass-internal.yaml`: add `traefik.io/instance: internal`
- `gatewayclass-external.yaml`: add `traefik.io/instance: external`

## Test plan

- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`

Reviewed-on: #137
2026-05-23 00:23:18 +10:00
unkinben 57c14d32c0 fix(traefik): remove invalid controllerName flag causing CrashLoopBackOff (#136)
## URGENT — Traefik pods are CrashLoopBackOff

The merged PR #135 added `--providers.kubernetesgateway.controllerName` as an `additionalArguments` entry. Traefik v3.7.0 does not support this flag and fails immediately on startup.

Old replica sets are still running (one pod each) but new pods cannot come up.

## Fix

- Remove `additionalArguments` from both `values-internal.yaml` and `values-external.yaml`
- Revert GatewayClass `controllerName` back to `traefik.io/gateway-controller` (the hardcoded Traefik default — no override mechanism exists in v3.7.0)

## After merge

GatewayClasses will remain `Unknown` until a separate solution for internal/external separation is implemented (the `labelSelector` approach needs further investigation).

Reviewed-on: #136
2026-05-22 23:58:56 +10:00
unkinben 2df359c4a9 fix(traefik): set controllerName on GatewayClasses and Traefik providers (#135)
## Problem

Both GatewayClasses (`traefik-internal`, `traefik-external`) were stuck as `Unknown`. Neither Traefik deployment had `controllerName` set in `kubernetesGateway`, so both defaulted to `traefik.io/gateway-controller` — which matched neither GatewayClass.

## Fix

- `gatewayclass-internal.yaml`: `controllerName: traefik.io/gateway-controller-internal`
- `gatewayclass-external.yaml`: `controllerName: traefik.io/gateway-controller-external`
- `values-internal.yaml`: added `controllerName: traefik.io/gateway-controller-internal`
- `values-external.yaml`: added `controllerName: traefik.io/gateway-controller-external`

## Test plan

- [ ] ArgoCD syncs traefik-system cleanly
- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`

Reviewed-on: #135
2026-05-22 23:44:06 +10:00
unkinben f53a2dc4f8 fix: terraform_vault must be RFC1123 compliant (#128)
Reviewed-on: #128
2026-05-21 23:19:20 +10:00
unkinben c5dd3cc5cb feat: add terraform_vault role (#127)
this adds a service account that can be used to run the terraform_vault
workflows with, so that we can access the jwt to generate a token

Reviewed-on: #127
2026-05-21 23:13:48 +10:00
unkinben 462b2b3f4f feat(externaldns): add Gateway API sources for httproute, tlsroute, grpcroute, tcproute, udproute (#126)
Reviewed-on: #126
2026-05-18 00:11:33 +10:00
unkinben 73c9b3f603 fix(traefik): replace invalid controllername flag with labelSelector for v3 (#125)
Remove --providers.kubernetesgateway.controllername which does not exist in
Traefik v3, update GatewayClass controllerName to the standard v3 value, and
use labelSelector on each instance's kubernetesGateway provider to differentiate
internal vs external traffic.

Reviewed-on: #125
2026-05-18 00:03:12 +10:00