## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
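
A minimal sketch of what the resulting pair of resources might look like. The namespace, backend Service name and port are hypothetical, and `vault-issuer` is assumed to be a `ClusterIssuer`; the external-dns annotations are not reproduced because their exact keys are not stated here.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: litellm
  namespace: litellm                 # hypothetical namespace
  annotations:
    # Assumes vault-issuer is a ClusterIssuer; use cert-manager.io/issuer otherwise.
    cert-manager.io/cluster-issuer: vault-issuer
    # The external-dns annotations moved from the Ingress would also live here.
spec:
  gatewayClassName: traefik-internal
  listeners:
    - name: https
      hostname: litellm.k8s.syd1.au.unkin.net
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate              # TLS ends at the Gateway listener
        certificateRefs:
          - name: litellm-tls        # Secret that cert-manager populates
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: litellm
  namespace: litellm                 # hypothetical namespace
spec:
  parentRefs:
    - name: litellm
      sectionName: https
  hostnames:
    - litellm.k8s.syd1.au.unkin.net
  rules:
    - backendRefs:
        - name: litellm              # hypothetical Service name
          port: 4000                 # hypothetical Service port
```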
## Test plan
- [ ] ArgoCD syncs the litellm app cleanly
- [ ] cert-manager issues the `litellm-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://litellm.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #134
## Changes
- Upgrade external-dns chart from 1.19.0 → 1.21.1 (app v0.19.0 → v0.21.0)
- Remove `gateway-tcproute` source — `TCPRoute` CRD is not installed, causing crash-loop
- Add `gateway-tlsroute` source — `TLSRoute` CRD is installed (Gateway API 1.5.1)
## Why
The pod was crash-looping every minute with `failed to list *v1alpha2.TCPRoute: the server could not find the requested resource` because the TCPRoute CRD doesn't exist in this cluster. TLSRoute was previously removed but its CRD does exist.
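
As a rough illustration, assuming the upstream kubernetes-sigs external-dns chart and its `sources` value, the change amounts to something like the following. Every entry other than the two Gateway route sources named above is an assumption about this cluster's configuration.

```yaml
# values.yaml fragment for the external-dns chart (sketch)
sources:
  - service               # assumed to be enabled already
  - gateway-httproute     # assumed to be enabled already
  - gateway-tlsroute      # added: the TLSRoute CRD is installed
  # gateway-tcproute removed: the TCPRoute CRD is not installed
```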
Reviewed-on: #143
Temporary: enable --log-level=debug to understand why the TLSRoute informer never reports HasSynced within the 1m interval. To be closed/reverted after root cause is found.
Reviewed-on: #140
## Problem
Gateway listeners with `port: 443` were rejected with `PortUnavailable: Cannot find entryPoint for Gateway: no matching entryPoint for port 443 and protocol "HTTPS"`.
Traefik matches Gateway listener ports against its internal entryPoint ports (pod-level), not the Service's `exposedPort`. The `websecure` entryPoint was configured on port `8443`, so port `443` listeners had no match.
## Fix
- `ports.websecure.port: 443` — Traefik now binds directly on 443
- `securityContext.capabilities.add: [NET_BIND_SERVICE]` — allows a non-root process to bind to privileged ports (<1024)
The Service `exposedPort` stays at `443`, so external connectivity is unchanged. All existing Gateway listeners (`port: 443`) are correct as-is.
Applies to both internal and external Traefik instances.
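
A sketch of the relevant Helm values, assuming the upstream Traefik chart layout. The `drop: [ALL]` line is an assumption carried over from the chart defaults and is not part of this change.

```yaml
# values-internal.yaml / values-external.yaml fragment (sketch)
ports:
  websecure:
    port: 443          # entryPoint port inside the pod (previously 8443)
    exposedPort: 443   # Service port, unchanged
securityContext:
  capabilities:
    drop: [ALL]                # assumed chart default
    add: [NET_BIND_SERVICE]    # permit binding to ports below 1024
```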
## Test plan
- [ ] Traefik pods restart cleanly
- [ ] `kubectl get gateway -A` shows listeners as `Programmed: True`
- [ ] `https://rancher.k8s.syd1.au.unkin.net` (already merged) is reachable
Reviewed-on: #138
## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
## Test plan
- [ ] ArgoCD syncs the cattle-system app cleanly
- [ ] cert-manager issues the `rancher-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://rancher.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #132
## Problem
GatewayClasses were `Unknown` even after controllerName was fixed. The `kubernetesGateway` `labelSelector` applies to all watched resources, including GatewayClasses themselves. Since neither GatewayClass had a `traefik.io/instance` label, both Traefik instances filtered them out and never accepted them.
## Fix
- `gatewayclass-internal.yaml`: add `traefik.io/instance: internal`
- `gatewayclass-external.yaml`: add `traefik.io/instance: external`
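
For illustration, the internal GatewayClass would then look roughly like this. The `controllerName` shown is the Traefik default that #136 reverted to; the external class is the same apart from its name and label value.

```yaml
# gatewayclass-internal.yaml (sketch)
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: traefik-internal
  labels:
    traefik.io/instance: internal   # must match the instance's labelSelector
spec:
  controllerName: traefik.io/gateway-controller
```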
## Test plan
- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`
Reviewed-on: #137
## URGENT — Traefik pods are CrashLoopBackOff
The merged PR #135 added `--providers.kubernetesgateway.controllerName` as an `additionalArguments` entry. Traefik v3.7.0 does not support this flag and fails immediately on startup.
Old replica sets are still running (one pod each) but new pods cannot come up.
## Fix
- Remove `additionalArguments` from both `values-internal.yaml` and `values-external.yaml`
- Revert GatewayClass `controllerName` back to `traefik.io/gateway-controller` (the hardcoded Traefik default — no override mechanism exists in v3.7.0)
## After merge
GatewayClasses will remain `Unknown` until a separate solution for internal/external separation is implemented (the `labelSelector` approach needs further investigation).
Reviewed-on: #136
## Problem
Both GatewayClasses (`traefik-internal`, `traefik-external`) were stuck as `Unknown`. Neither Traefik deployment had `controllerName` set in `kubernetesGateway`, so both defaulted to `traefik.io/gateway-controller` — which matched neither GatewayClass.
## Fix
- `gatewayclass-internal.yaml`: `controllerName: traefik.io/gateway-controller-internal`
- `gatewayclass-external.yaml`: `controllerName: traefik.io/gateway-controller-external`
- `values-internal.yaml`: added `controllerName: traefik.io/gateway-controller-internal`
- `values-external.yaml`: added `controllerName: traefik.io/gateway-controller-external`
## Test plan
- [ ] ArgoCD syncs traefik-system cleanly
- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`
Reviewed-on: #135
Adds a service account for running the terraform_vault workflows, so
that they can access the JWT needed to generate a token.
Reviewed-on: #127
Remove --providers.kubernetesgateway.controllername which does not exist in
Traefik v3, update GatewayClass controllerName to the standard v3 value, and
use labelSelector on each instance's kubernetesGateway provider to differentiate
internal vs external traffic.
Reviewed-on: #125
Deploy Traefik for internal and external applications. The external
routers will port-forward only to the IP of the traefik-external
service.
- traefik-internal and traefik-external added
- each is a different deployment
Reviewed-on: #119
Adds immutable patterns for yannh/kubeconform and kubernetes-sigs/kustomize
to fix 403 Forbidden errors when downloading their Linux amd64 releases.
Reviewed-on: #123
- Patch argocd-repo-server to mount vault-ca-cert and set SSL_CERT_DIR
so helm subprocesses trust the internal CA when pulling charts
- Add argocd Application pointing at clusters/au-syd1/bootstrap so
ArgoCD manages its own install going forward
Reviewed-on: #112
Patches argocd-tls-certs-cm with the Vault CA chain so ArgoCD can
verify TLS when pulling Helm charts from artifactapi.k8s.syd1.au.unkin.net.
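
A minimal sketch of such a patch, assuming ArgoCD's convention that each key in `argocd-tls-certs-cm` is a server hostname whose value is the PEM chain to trust. The namespace is an assumption and the certificate body is a placeholder.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-tls-certs-cm
  namespace: argocd                  # assumed namespace
data:
  # Key is the hostname ArgoCD connects to; value is the CA chain in PEM form.
  artifactapi.k8s.syd1.au.unkin.net: |
    -----BEGIN CERTIFICATE-----
    <Vault CA chain goes here>
    -----END CERTIFICATE-----
```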
Reviewed-on: #111
We have seen contention between Woodpecker jobs because they are not
being scheduled correctly. Setting accurate limits/requests lets the
scheduler place them properly.
- set limits/requests for all workflows
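
As a sketch of what this looks like in a single workflow, assuming the Woodpecker Kubernetes backend and its `backend_options.kubernetes.resources` setting. The step name, image, command and all resource values are hypothetical.

```yaml
steps:
  - name: validate                   # hypothetical step
    image: alpine                    # hypothetical image
    commands:
      - echo "running checks"
    backend_options:
      kubernetes:
        resources:
          requests:                  # what the scheduler reserves for the pod
            cpu: 100m
            memory: 128Mi
          limits:                    # hard ceiling enforced at runtime
            cpu: 500m
            memory: 256Mi
```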
Reviewed-on: #110
Mount the vault-ca-cert secret and set NODE_EXTRA_CA_CERTS so Node.js
trusts the internal CA chain when making outbound TLS connections.
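
A sketch of the Deployment patch this describes. The container name, mount path and the `ca.crt` key inside the secret are assumptions; only the secret name and the environment variable come from the description above.

```yaml
# Strategic-merge patch fragment for the Deployment (sketch)
spec:
  template:
    spec:
      containers:
        - name: app                              # hypothetical container name
          env:
            - name: NODE_EXTRA_CA_CERTS
              value: /etc/ssl/vault-ca/ca.crt    # hypothetical path and key
          volumeMounts:
            - name: vault-ca-cert
              mountPath: /etc/ssl/vault-ca       # hypothetical mount path
              readOnly: true
      volumes:
        - name: vault-ca-cert
          secret:
            secretName: vault-ca-cert
```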
Reviewed-on: #108
The privateHostnameGuard middleware blocks requests where the Host header
is not in the allowlist. Kubelet httpGet probes use the pod IP as the
Host header, which is never in the allowlist. Setting Host: localhost
ensures probes are always permitted.
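
The probe override can be sketched as follows. The path and port are hypothetical; the `httpHeaders` entry is the part that matters.

```yaml
# Container probe fragment (sketch)
readinessProbe:
  httpGet:
    path: /healthz          # hypothetical health endpoint
    port: 3000              # hypothetical container port
    httpHeaders:
      - name: Host
        value: localhost    # replaces the default pod-IP Host header
```

The same `httpHeaders` block would apply to any liveness or startup probe that uses `httpGet`.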
Reviewed-on: #107
Adds base manifests and au-syd1 overlay for Paperclip (AI agent
orchestration platform), following the litellm deployment pattern.
Updates aitooling ApplicationSet to include the paperclip path.
Closes #99
Reviewed-on: #100
Deploys LiteLLM proxy with CNPG PostgreSQL (3-instance HA), PgBouncer
pooler, and Redis cache. Introduces a dedicated aitooling AppProject and
ApplicationSet to keep AI tooling services separate from platform infra.
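
A rough sketch of the database side, assuming CloudNativePG's `Cluster` and `Pooler` resources. All names, the storage size, the pooler instance count and the pool mode are hypothetical; only the 3-instance cluster and the PgBouncer pooler come from the description above.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: litellm-db              # hypothetical name
spec:
  instances: 3                  # one primary plus two replicas
  storage:
    size: 10Gi                  # hypothetical size
---
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: litellm-db-pooler-rw    # hypothetical name
spec:
  cluster:
    name: litellm-db
  instances: 2                  # hypothetical pooler replica count
  type: rw                      # route connections to the primary
  pgbouncer:
    poolMode: session           # hypothetical pool mode
```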
Reviewed-on: #94
Split monolithic remotes.yaml into per-type-package files under
resources/conf.d/ to align with artifactapi v2.7.1 directory loading.
Updated schema: virtuals/locals use dedicated top-level keys, type field
removed. Added helm remotes for all kustomize helmCharts repos and
OCI patterns to docker remotes. CONFIG_PATH now points to the directory.
Reviewed-on: #92