Commit Graph

136 Commits

Author SHA1 Message Date
unkinben 4f5c3f7ea0 feat(litellm): migrate Ingress to Gateway API (#134)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway

## Test plan

- [ ] ArgoCD syncs the litellm app cleanly
- [ ] cert-manager issues the `litellm-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://litellm.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #134
2026-05-23 01:29:54 +10:00
unkinben fd87cb96b5 feat(externaldns): upgrade to 1.21.1, fix sources for installed CRDs (#143)
## Changes

- Upgrade external-dns chart from 1.19.0 → 1.21.1 (app v0.19.0 → v0.21.0)
- Remove `gateway-tcproute` source — `TCPRoute` CRD is not installed, causing crash-loop
- Add `gateway-tlsroute` source — `TLSRoute` CRD is installed (Gateway API 1.5.1)

## Why

The pod was crash-looping every minute with `failed to list *v1alpha2.TCPRoute: the server could not find the requested resource` because the TCPRoute CRD doesn't exist in this cluster. TLSRoute was previously removed but its CRD does exist.

Reviewed-on: #143
2026-05-23 01:28:20 +10:00
unkinben d619f9195e benvin/externaldns_compatability (#142)
Reviewed-on: #142
2026-05-23 01:17:20 +10:00
unkinben 1944dbbfcd temp: enable debug logging on externaldns to diagnose TLSRoute sync timeout (#140)
Temporary: enable --log-level=debug to understand why the TLSRoute informer never reports HasSynced within the 1m interval. To be closed/reverted after root cause is found.
Reviewed-on: #140
2026-05-23 01:07:45 +10:00
unkinben 0940cc20f8 fix(traefik): listen on port 443 directly for Gateway API compatibility (#138)
## Problem

Gateway listeners with `port: 443` were rejected with `PortUnavailable: Cannot find entryPoint for Gateway: no matching entryPoint for port 443 and protocol "HTTPS"`.

Traefik matches Gateway listener ports against its internal entryPoint ports (pod-level), not the Service's `exposedPort`. The `websecure` entryPoint was configured on port `8443`, so port `443` listeners had no match.

## Fix

- `ports.websecure.port: 443` — Traefik now binds directly on 443
- `securityContext.capabilities.add: [NET_BIND_SERVICE]` — allows a non-root process to bind to privileged ports (<1024)

The Service `exposedPort` stays at `443`, so external connectivity is unchanged. All existing Gateway listeners (`port: 443`) are correct as-is.

Applies to both internal and external Traefik instances.

## Test plan

- [ ] Traefik pods restart cleanly
- [ ] `kubectl get gateway -A` shows listeners as `Programmed: True`
- [ ] `https://rancher.k8s.syd1.au.unkin.net` (already merged) is reachable

Reviewed-on: #138
2026-05-23 00:44:13 +10:00
unkinben 20ce2b1b92 feat(cattle-system): migrate rancher Ingress to Gateway API (#132)
## Summary

- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway

## Test plan

- [ ] ArgoCD syncs the cattle-system app cleanly
- [ ] cert-manager issues the `rancher-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://rancher.k8s.syd1.au.unkin.net` is reachable

Reviewed-on: #132
2026-05-23 00:24:57 +10:00
unkinben 64dc5a0242 fix(traefik): add instance labels to GatewayClasses (#137)
## Problem

GatewayClasses were `Unknown` even after controllerName was fixed. The `kubernetesGateway` `labelSelector` applies to all watched resources, including GatewayClasses themselves. Since neither GatewayClass had a `traefik.io/instance` label, both Traefik instances filtered them out and never accepted them.

## Fix

- `gatewayclass-internal.yaml`: add `traefik.io/instance: internal`
- `gatewayclass-external.yaml`: add `traefik.io/instance: external`

## Test plan

- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`

Reviewed-on: #137
2026-05-23 00:23:18 +10:00
unkinben 57c14d32c0 fix(traefik): remove invalid controllerName flag causing CrashLoopBackOff (#136)
## URGENT — Traefik pods are CrashLoopBackOff

The merged PR #135 added `--providers.kubernetesgateway.controllerName` as an `additionalArguments` entry. Traefik v3.7.0 does not support this flag and fails immediately on startup.

Old replica sets are still running (one pod each) but new pods cannot come up.

## Fix

- Remove `additionalArguments` from both `values-internal.yaml` and `values-external.yaml`
- Revert GatewayClass `controllerName` back to `traefik.io/gateway-controller` (the hardcoded Traefik default — no override mechanism exists in v3.7.0)

## After merge

GatewayClasses will remain `Unknown` until a separate solution for internal/external separation is implemented (the `labelSelector` approach needs further investigation).

Reviewed-on: #136
2026-05-22 23:58:56 +10:00
unkinben 2df359c4a9 fix(traefik): set controllerName on GatewayClasses and Traefik providers (#135)
## Problem

Both GatewayClasses (`traefik-internal`, `traefik-external`) were stuck as `Unknown`. Neither Traefik deployment had `controllerName` set in `kubernetesGateway`, so both defaulted to `traefik.io/gateway-controller` — which matched neither GatewayClass.

## Fix

- `gatewayclass-internal.yaml`: `controllerName: traefik.io/gateway-controller-internal`
- `gatewayclass-external.yaml`: `controllerName: traefik.io/gateway-controller-external`
- `values-internal.yaml`: added `controllerName: traefik.io/gateway-controller-internal`
- `values-external.yaml`: added `controllerName: traefik.io/gateway-controller-external`

## Test plan

- [ ] ArgoCD syncs traefik-system cleanly
- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`

Reviewed-on: #135
2026-05-22 23:44:06 +10:00
unkinben f53a2dc4f8 fix: terraform_vault must be RFC1123 compliant (#128)
Reviewed-on: #128
2026-05-21 23:19:20 +10:00
unkinben c5dd3cc5cb feat: add terraform_vault role (#127)
this adds a service account that can be used to run the terraform_vault
workflows with, so that we can access the jwt to generate a token

Reviewed-on: #127
2026-05-21 23:13:48 +10:00
unkinben 462b2b3f4f feat(externaldns): add Gateway API sources for httproute, tlsroute, grpcroute, tcproute, udproute (#126)
Reviewed-on: #126
2026-05-18 00:11:33 +10:00
unkinben 73c9b3f603 fix(traefik): replace invalid controllername flag with labelSelector for v3 (#125)
Remove --providers.kubernetesgateway.controllername which does not exist in
Traefik v3, update GatewayClass controllerName to the standard v3 value, and
use labelSelector on each instance's kubernetesGateway provider to differentiate
internal vs external traffic.

Reviewed-on: #125
2026-05-18 00:03:12 +10:00
unkinben 9a01a9ef19 fix: enable gateway/ingress class on platform project (#124)
- add missing classes to platform required to deploy traefik system

Reviewed-on: #124
2026-05-17 23:56:12 +10:00
unkinben 53553ddcfd feat: deploy internal/external traefik routers (#119)
deploy traefik for internal and external applications. port forwarding
from the external routers will only occur to the IP of the
traefik-external service.

- traefik-internal and traefik-external added
- each is a different deployment

Reviewed-on: #119
2026-05-17 23:44:50 +10:00
unkinben 5d3ff3a0f4 feat(artifactapi): allow kubeconform and kustomize from GitHub (#123)
Adds immutable patterns for yannh/kubeconform and kubernetes-sigs/kustomize
to fix 403 Forbidden errors when downloading their Linux amd64 releases.

Reviewed-on: #123
2026-05-17 12:19:27 +10:00
unkinben c3002dc3c1 feat(artifactapi): allow kubecolor releases from GitHub (#122)
Reviewed-on: #122
2026-05-11 23:39:48 +10:00
unkinben 27db33536a feat(artifactapi): allow almalinux, debian, and fedora from Docker Hub (#121)
Reviewed-on: #121
2026-05-10 22:56:39 +10:00
unkinben 8a7068a1c4 feat(artifactapi): add argo-helm as a remote and virtual helm member (#120)
Reviewed-on: #120
2026-05-10 22:53:43 +10:00
unkinben 1cefd3b78e feat: change argocd crds source to artifactapi (#118)
- migrate argocd crds to come from the artifactapi service

Reviewed-on: #118
2026-05-10 21:12:44 +10:00
unkinben 842d774fc3 feat: deploy gatewayapi crds (#117)
- enable gateway api crds

Reviewed-on: #117
2026-05-10 21:05:56 +10:00
unkinben 4c8827ce35 feat: add traefik/gatewayapi (#116)
enable access to charts/containers/api-specs so that we can migrate from
nginx-ingress to gateway api and traefik

Reviewed-on: #116
2026-05-10 17:07:33 +10:00
unkinben 5e03215f4d chore: migrate reloader/reflector to virtual/helm (#115)
Reviewed-on: #115
2026-05-05 21:42:23 +10:00
unkinben 02ee82da1e feat: update vso to 1.3.0 (#114)
- updates the vso helm chart from 1.2.0 to 1.3.0

Reviewed-on: #114
2026-05-05 00:01:58 +10:00
unkinben 18c519f979 chore: remove hashicorp helm repo (#113)
- no longer required, this is in virtual/helm repo in artifactapi

Reviewed-on: #113
2026-05-03 23:51:44 +10:00
unkinben dd0e297c14 chore: mount vault CA for helm TLS trust and add ArgoCD self-management (#112)
- Patch argocd-repo-server to mount vault-ca-cert and set SSL_CERT_DIR
  so helm subprocesses trust the internal CA when pulling charts
- Add argocd Application pointing at clusters/au-syd1/bootstrap so
  ArgoCD manages its own install going forward

Reviewed-on: #112
2026-05-03 22:47:53 +10:00
unkinben 6fb98d66b0 chore: add vault CA cert to argocd-tls-certs-cm for helm TLS trust (#111)
Patches argocd-tls-certs-cm with the Vault CA chain so ArgoCD can
verify TLS when pulling Helm charts from artifactapi.k8s.syd1.au.unkin.net.

Reviewed-on: #111
2026-05-03 17:13:25 +10:00
unkinben bcea7df925 chore: swap vso to virtual helm repo (#109)
- testing if there will be any changes after merging, before merging all of them

Reviewed-on: #109
2026-05-03 16:49:53 +10:00
unkinben f45194282b chore: add resource requests/limits to workflows (#110)
have seen some contention on woodpecker jobs, because they are not being
scheduled correctly. we need to set correct limits/requests so that they
can be accurately scheduled.

- set limits/requests for all workflows

Reviewed-on: #110
2026-05-03 16:49:46 +10:00
unkinben 260b2d4364 chore: mount vault CA cert for Node.js TLS trust in paperclip (#108)
Mount the vault-ca-cert secret and set NODE_EXTRA_CA_CERTS so Node.js
trusts the internal CA chain when making outbound TLS connections.

Reviewed-on: #108
2026-05-03 00:10:08 +10:00
unkinben 156b545249 fix: set Host header on paperclip health probes to bypass hostname guard (#107)
The privateHostnameGuard middleware blocks requests where the Host header
is not in the allowlist. Kubelet httpGet probes use the pod IP as the
Host header, which is never in the allowlist. Setting Host: localhost
ensures probes are always permitted.

Reviewed-on: #107
2026-05-02 23:01:59 +10:00
unkinben 0883f327e9 chore: update trusted hostnames (#106)
- remove scheme from paperclip.k8s..
- add localhost (what probe is hitting)

Reviewed-on: #106
2026-05-02 22:40:21 +10:00
unkinben 04b7c04366 chore: fix livenessProbe for paperclip (#105)
Reviewed-on: #105
2026-05-02 22:28:52 +10:00
unkinben 9914186fd5 chore: additional papaerclip environemnt variables (#104)
https://github.com/paperclipai/paperclip/issues/3121
Reviewed-on: https://git.unkin.net/unkin/argocd-apps/pulls/104
2026-05-02 22:11:38 +10:00
unkinben f55b7065f1 fix: rename pgpooler to include rw (#103)
- undo previous change (target pgcluster name)
- actually rename the pgpooler

Reviewed-on: #103
2026-05-02 21:39:51 +10:00
unkinben 87a5a271c3 fix: set pgpooler name to include -rw (#102)
- this matches the credentials set for paperclip

Reviewed-on: #102
2026-05-02 21:35:23 +10:00
unkinben 8e7bc289f6 chore: enable access to paperclip namespace (#101)
Reviewed-on: #101
2026-05-02 21:30:59 +10:00
unkinben e156cd10bd feat: deploy paperclip to au-syd1 via ArgoCD (aitooling project) (#100)
Adds base manifests and au-syd1 overlay for Paperclip (AI agent
orchestration platform), following the litellm deployment pattern.
Updates aitooling ApplicationSet to include the paperclip path.

Closes #99

Reviewed-on: #100
2026-05-02 21:27:51 +10:00
unkinben fe714694bf chore: bump artifactapi to 2.7.2 (#98)
Reviewed-on: #98
2026-05-02 17:19:56 +10:00
unkinben 6138afb98b feat: add litellm-env configmap with STORE_MODEL_IN_DB=True (#97)
Reviewed-on: #97
2026-05-01 22:17:53 +10:00
unkinben 949ddb76e4 chore: litellm ooming (#95)
- update memory and cpu resources

Reviewed-on: #95
2026-05-01 21:54:00 +10:00
unkinben 5372914803 feat: add litellm to new aitooling ArgoCD project (#94)
Deploys LiteLLM proxy with CNPG PostgreSQL (3-instance HA), PgBouncer
pooler, and Redis cache. Introduces a dedicated aitooling AppProject and
ApplicationSet to keep AI tooling services separate from platform infra.

Reviewed-on: #94
2026-05-01 21:40:26 +10:00
unkinben 67bb54f092 fix: artifactapi remotes (#93)
- split each yaml into its own mount

Reviewed-on: #93
2026-05-01 21:17:16 +10:00
unkinben fc568dc8b5 feat: split artifactapi config into conf.d and update to v2.7.1 (#92)
Split monolithic remotes.yaml into per-type-package files under
resources/conf.d/ to align with artifactapi v2.7.1 directory loading.
Updated schema: virtuals/locals use dedicated top-level keys, type field
removed. Added helm remotes for all kustomize helmCharts repos and
OCI patterns to docker remotes. CONFIG_PATH now points to the directory.

Reviewed-on: #92
2026-04-30 23:59:01 +10:00
unkinben 1c2c18697d feat: update artifactapi to 2.3.0 (#91)
- update to mutable/immutable ttl/patterns
- reoganised paths to correct patterns

Reviewed-on: #91
2026-04-27 13:16:02 +10:00
unkinben f2af65bc92 fix: update include patterns (#90)
- hadolint and nvim were wrong, updating

Reviewed-on: #90
2026-04-26 16:20:53 +10:00
unkinben fdca69d99a feat: update github remotes (#89)
- enable access to all tagged, master and main branches as tar/gzip
- enable access to additional tool releases

Reviewed-on: #89
2026-04-26 16:05:57 +10:00
unkinben f80be18220 benvin/dockerremotes (#88)
Reviewed-on: #88
2026-04-25 22:34:59 +10:00
unkinben 3a6d93bc3c feat: add woodpeckerci/plugin-docker-buildx to WOODPECKER_PLUGINS_PRIVILEGED (#87)
Plugin is no longer privileged by default in Woodpecker; explicitly list
both the standard and latest-insecure variants.

Reviewed-on: #87
2026-04-25 20:48:46 +10:00
unkinben 7535d655fe feat: add docker remotes to artifactapi (#86)
- set artifactapi to specific version
- add dockerhub and ghcr to remotes

Reviewed-on: #86
2026-04-25 17:40:35 +10:00