Remove the data keys from kanidm-repl-certs in git so ArgoCD never takes
SSA ownership of them. Add ignoreDifferences for /data on that ConfigMap
in the ApplicationSet template so ArgoCD doesn't flag sidecar-patched
cert values as out-of-sync.
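A minimal sketch of the ignoreDifferences entry described above, assuming it sits under the ApplicationSet's `spec.template.spec` and that the ConfigMap lives in the kanidm namespace. The surrounding ApplicationSet fields are omitted and the actual layout in this repo may differ.

```yaml
# Fragment of an ApplicationSet (assumed placement: spec.template.spec).
spec:
  template:
    spec:
      ignoreDifferences:
        - group: ""                 # core API group
          kind: ConfigMap
          name: kanidm-repl-certs
          namespace: kanidm         # assumed namespace
          jsonPointers:
            - /data                 # ignore sidecar-patched cert values
```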
Add a native sidecar (bitnami/kubectl, restartPolicy: Always) that runs
kanidmd renew-replication-certificate on each pod and patches the result
into the kanidm-repl-certs ConfigMap (certs are public keys, not secrets).
The config-init init container reads peer certs from the ConfigMap at
startup, building the replication stanza automatically — no manual cert
exchange required after first deploy.
Add RBAC (Role + RoleBinding) granting the kanidm service account
pods/exec and configmap patch permissions scoped to the kanidm namespace.
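The Role could look roughly like the following. This is a sketch, not the committed manifest: the Role name, the `resourceNames` restriction, and the extra `get` verb are assumptions.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kanidm-repl-certs          # hypothetical name
  namespace: kanidm
rules:
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]              # exec is a create on the pods/exec subresource
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["kanidm-repl-certs"]   # assumed: limit to the one ConfigMap
    verbs: ["get", "patch"]        # "get" is an assumption; the commit only names patch
```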
- Increase replicas from 2 to 3
- Add kanidm-2 headless DNS SAN to TLS certificate
- Add PodDisruptionBudget (maxUnavailable: 1) to maintain quorum during
node drains
- Add requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity
on kubernetes.io/hostname to spread replicas across distinct hosts
- Update replication peers comment to include kanidm-2 cert exchange step
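For illustration, the PodDisruptionBudget and anti-affinity rule from the list above might be expressed as below. The resource name and the `app.kubernetes.io/name: kanidm` selector label are assumptions; the real manifests may use different labels.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kanidm                     # hypothetical name
  namespace: kanidm
spec:
  maxUnavailable: 1                # at most one replica evicted at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: kanidm   # assumed pod label
---
# Fragment of the StatefulSet pod template (spec.template.spec).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: kanidm   # assumed pod label
        topologyKey: kubernetes.io/hostname  # one replica per host
```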
## Summary
- Deploys Kanidm 1.10.3 (ghcr.io/kanidm/server:1.10.3) as a 2-replica
StatefulSet with built-in replication enabled
- Domain: auth.unkin.net (primary WebAuthn origin); au.auth.unkin.net
is an additional hostname for this au-syd1 instance
- TLS: cert-manager Certificate (vault-issuer) covering auth.unkin.net,
au.auth.unkin.net, kanidm.k8s.syd1.au.unkin.net, and internal
headless pod DNS names — Kanidm terminates TLS itself (passthrough)
- Gateway: TLS passthrough on port 443 via TLSRoute; HTTP on port 80
redirects to HTTPS; external-dns creates kanidm.k8s.syd1.au.unkin.net
(not used in server.toml; canonical origin is auth.unkin.net only)
- Replication: init container generates per-pod server.toml with the
correct repl:// origin (kanidm-{N}.kanidm-headless.kanidm.svc);
populate kanidm-repl-peers ConfigMap post-deployment after running
`kanidmd show-replication-certificate` on each pod
- Storage: 10Gi cephrbd-fast-delete PVC per pod via volumeClaimTemplates
- Security: runAsUser/runAsGroup 1000, runAsNonRoot,
allowPrivilegeEscalation=false
- ArgoCD: platform ApplicationSet and Project updated for kanidm namespace
## Requires
- PR benvin/kanidm-artifactapi (add ^kanidm/ to ghcr immutable patterns)
to be merged first so artifactapi can cache ghcr.io/kanidm/server
## Post-deployment steps
1. Wait for both pods to reach Running state
2. Exchange replication certificates between pods:
kubectl exec -n kanidm kanidm-0 -- kanidmd show-replication-certificate
kubectl exec -n kanidm kanidm-1 -- kanidmd show-replication-certificate
3. Edit kanidm-repl-peers ConfigMap with both nodes' certs
(see template in configmap.yaml comments)
4. kubectl rollout restart statefulset/kanidm -n kanidm
## Test plan
- [x] Sandbox tested in sandbox-kanidm: all 11 resources server dry-run OK
- [ ] After merge: ArgoCD syncs kanidm namespace
- [ ] Verify auth.unkin.net and au.auth.unkin.net reachable via Gateway
- [ ] Verify kanidm.k8s.syd1.au.unkin.net DNS record created by external-dns
- [ ] Complete replication cert exchange and verify replication active
## Summary
- Adds `stalwartlabs/webadmin/releases/latest/download/webadmin.zip` to `mutable_patterns` in the `github` generic remote so the stalwart webadmin UI can be fetched through artifactapi rather than directly from GitHub.
## Notes
- Uses `mutable_patterns` (not `immutable`) because `releases/latest` resolves to whichever release is current and changes over time.
- Access URL: `https://artifactapi.k8s.syd1.au.unkin.net/generic/github/stalwartlabs/webadmin/releases/latest/download/webadmin.zip`
Reviewed-on: #157
The Gateway API admission server defaults certificateRefs[].group to ""
when it is omitted. ArgoCD diffed the desired state (no group field) against
the live state (group: "") and flagged every gateway as out of sync.
Fix: explicitly set group: "" in all certificateRefs entries so the
rendered manifest matches the API server's canonical form exactly.
Affected: artifactapi, cattle-system, consul, litellm, paperclip,
puppet (puppetboard + puppetdb), vault.
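A sketch of the resulting listener shape, using a hypothetical Gateway named `example` with a hypothetical `example-tls` Secret. Only the explicit `group: ""` line reflects the fix described above.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: example                    # hypothetical
spec:
  gatewayClassName: traefik-internal
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - group: ""              # set explicitly to match the live object
            kind: Secret
            name: example-tls      # hypothetical Secret name
```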
Reviewed-on: #153
## Summary
- Deploys HashiCorp Consul 1.22.7 using Helm chart 1.9.7 with 5 server replicas
- Configuration modelled on production consul: `datacenter=au-syd1`, `connect=true`, `raft_multiplier=10`, HTTP on 8500, GRPC on 8502, HTTPS disabled
- 5-replica server cluster with `bootstrapExpect=5`
- 10Gi cephrbd-fast-delete PVC per server pod
- Gateway API: HTTPS gateway + HTTPRoute (443→consul-consul-ui:80→8500) at `consul.k8s.syd1.au.unkin.net`
- PodDisruptionBudget patched from `policy/v1beta1` to `policy/v1` (k8s 1.25+ compatibility)
- ArgoCD platform ApplicationSet updated to include consul overlay path
- Clients disabled (server-only deployment)
- ConnectInject disabled (can be enabled later for service mesh)
## Requires
- PR #147 (artifactapi: add hashicorp/consul to docker immutable patterns) to be merged first
## Test plan
- [ ] Sandbox tested in `sandbox-consul`: all 5 server pods 1/1 Running, cluster formed
- [ ] After merge: ArgoCD syncs consul namespace
- [ ] Verify `consul.k8s.syd1.au.unkin.net` is accessible via Gateway
Reviewed-on: #149
## Summary
- Deploys HashiCorp Vault 2.0.1 using Helm chart 0.32.0 in HA raft mode (5 replicas)
- Configuration modelled on production vault: `disable_mlock=true`, headless-DNS retry_join for all 5 pods
- IPC_LOCK capability added via `server.statefulSet.securityContext.container`
- 10Gi cephrbd-fast-delete PVC per pod via `dataStorage`
- Gateway API: HTTPS gateway + HTTPRoute (443→vault service port 8200) at `vault.k8s.syd1.au.unkin.net`
- ArgoCD platform ApplicationSet updated to include vault overlay path
- Injector disabled (no agent sidecar injection needed)
## Requires
- PR #147 (artifactapi: add hashicorp/vault to docker immutable patterns) to be merged first
## Test plan
- [ ] Sandbox tested in `sandbox-vault`: all 5 pods Running, raft cluster forming
- [ ] After merge: ArgoCD syncs vault namespace
- [ ] Operator runs `vault operator init` to initialize, then unseals all 5 nodes
- [ ] Verify `vault.k8s.syd1.au.unkin.net` is accessible via Gateway
Reviewed-on: #148
## Summary
- Adds `^hashicorp/consul` and `^hashicorp/vault` to the dockerhub immutable_patterns in artifactapi's remote-docker.yaml
- Replaces the more specific `^hashicorp/vault-secrets-operator` pattern since `^hashicorp/vault` subsumes it
- Required for the benvin/vault and benvin/consul branches (vault:2.0.1 and consul:1.22.7)
## Test plan
- [ ] Verify artifactapi accepts requests for hashicorp/vault and hashicorp/consul images after merge
Reviewed-on: #147
litellm performance has dropped and it has crashed several times; it then
autoscaled to its maximum level, using the majority of the memory in the
cluster.
- reduce the rate at which litellm autoscales
- increase the requests/limits to match usage
Reviewed-on: #144
Add port 80 HTTP listener and redirect HTTPRoute to artifactapi,
cattle-system (rancher), litellm, paperclip, and puppetboard — restoring
the redirect behaviour that existed on the previous nginx/traefik Ingress
resources.
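A sketch of the redirect route, assuming the port 80 listener is named `http` on a Gateway called `example` (both names hypothetical). `RequestRedirect` is the standard Gateway API filter; the 301 status code is an assumption.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-redirect           # hypothetical
spec:
  parentRefs:
    - name: example                # hypothetical Gateway
      sectionName: http            # assumed name of the port 80 listener
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301        # assumed; 302 is the API default
```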
Reviewed-on: #145
## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
## Notes
The original Ingress had nginx-specific annotations (`proxy-body-size: 10g`, `proxy-read-timeout: 600`) which are not portable to Gateway API. These can be re-introduced via a Traefik `Middleware` CRD if needed.
## Test plan
- [ ] ArgoCD syncs the app cleanly
- [ ] cert-manager issues the `artifactapi-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://artifactapi.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #129
## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
- `ingress_puppetboard.yaml` is unchanged in this PR (separate PR)
## Test plan
- [ ] ArgoCD syncs the puppet app cleanly
- [ ] cert-manager issues the `puppetdb-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://puppetdb.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #131
## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
- `ingress_puppetdb.yaml` is unchanged in this PR (separate PR)
## Test plan
- [ ] ArgoCD syncs the puppet app cleanly
- [ ] cert-manager issues the `puppetboard-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://puppetboard.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #130
## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
## Test plan
- [ ] ArgoCD syncs the paperclip app cleanly
- [ ] cert-manager issues the `paperclip-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://paperclip.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #133
## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
## Test plan
- [ ] ArgoCD syncs the litellm app cleanly
- [ ] cert-manager issues the `litellm-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://litellm.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #134
## Summary
- Replace `Ingress` (nginx) with `Gateway` + `HTTPRoute` using `traefik-internal` GatewayClass
- TLS terminated at the Gateway listener; cert-manager provisions the certificate via `vault-issuer`
- external-dns annotations moved to the Gateway
## Test plan
- [ ] ArgoCD syncs the cattle-system app cleanly
- [ ] cert-manager issues the `rancher-tls` certificate
- [ ] external-dns creates the DNS record
- [ ] `https://rancher.k8s.syd1.au.unkin.net` is reachable
Reviewed-on: #132
## Problem
GatewayClasses were `Unknown` even after controllerName was fixed. The `kubernetesGateway` `labelSelector` applies to all watched resources, including GatewayClasses themselves. Since neither GatewayClass had a `traefik.io/instance` label, both Traefik instances filtered them out and never accepted them.
## Fix
- `gatewayclass-internal.yaml`: add `traefik.io/instance: internal`
- `gatewayclass-external.yaml`: add `traefik.io/instance: external`
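For reference, the internal GatewayClass would then look roughly like this sketch. The `controllerName` is the Traefik default noted in #136; the external class is assumed to be identical apart from its name and label value.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: traefik-internal
  labels:
    traefik.io/instance: internal  # must match the instance's labelSelector
spec:
  controllerName: traefik.io/gateway-controller
```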
## Test plan
- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`
Reviewed-on: #137
## URGENT — Traefik pods are CrashLoopBackOff
The merged PR #135 added `--providers.kubernetesgateway.controllerName` as an `additionalArguments` entry. Traefik v3.7.0 does not support this flag and fails immediately on startup.
Old replica sets are still running (one pod each) but new pods cannot come up.
## Fix
- Remove `additionalArguments` from both `values-internal.yaml` and `values-external.yaml`
- Revert GatewayClass `controllerName` back to `traefik.io/gateway-controller` (the hardcoded Traefik default — no override mechanism exists in v3.7.0)
## After merge
GatewayClasses will remain `Unknown` until a separate solution for internal/external separation is implemented (the `labelSelector` approach needs further investigation).
Reviewed-on: #136
## Problem
Both GatewayClasses (`traefik-internal`, `traefik-external`) were stuck as `Unknown`. Neither Traefik deployment had `controllerName` set in `kubernetesGateway`, so both defaulted to `traefik.io/gateway-controller` — which matched neither GatewayClass.
## Fix
- `gatewayclass-internal.yaml`: `controllerName: traefik.io/gateway-controller-internal`
- `gatewayclass-external.yaml`: `controllerName: traefik.io/gateway-controller-external`
- `values-internal.yaml`: added `controllerName: traefik.io/gateway-controller-internal`
- `values-external.yaml`: added `controllerName: traefik.io/gateway-controller-external`
## Test plan
- [ ] ArgoCD syncs traefik-system cleanly
- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`
Reviewed-on: #135
Add a service account for running the terraform_vault workflows, so that
the workflows can access its JWT to generate a token.
Reviewed-on: #127
Remove --providers.kubernetesgateway.controllername which does not exist in
Traefik v3, update GatewayClass controllerName to the standard v3 value, and
use labelSelector on each instance's kubernetesGateway provider to differentiate
internal vs external traffic.
Reviewed-on: #125
Deploy Traefik for internal and external applications. Port forwarding
from the external routers will only target the IP of the
traefik-external service.
- traefik-internal and traefik-external added
- each is a separate deployment
Reviewed-on: #119
Adds immutable patterns for yannh/kubeconform and kubernetes-sigs/kustomize
to fix 403 Forbidden errors when downloading their Linux amd64 releases.
Reviewed-on: #123
Mount the vault-ca-cert secret and set NODE_EXTRA_CA_CERTS so Node.js
trusts the internal CA chain when making outbound TLS connections.
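A sketch of the pod-spec fragment this implies. The container name, volume name, mount path, and the `ca.crt` key inside the secret are all assumptions.

```yaml
# Fragment of a Deployment pod template (spec.template.spec).
containers:
  - name: app                      # hypothetical container name
    env:
      - name: NODE_EXTRA_CA_CERTS
        value: /etc/ssl/vault-ca/ca.crt   # assumed path and key name
    volumeMounts:
      - name: vault-ca
        mountPath: /etc/ssl/vault-ca      # assumed mount path
        readOnly: true
volumes:
  - name: vault-ca
    secret:
      secretName: vault-ca-cert
```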
Reviewed-on: #108
The privateHostnameGuard middleware blocks requests where the Host header
is not in the allowlist. Kubelet httpGet probes use the pod IP as the
Host header, which is never in the allowlist. Setting Host: localhost
ensures probes are always permitted.
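A sketch of the probe override, using the standard `httpHeaders` field of an httpGet probe. The path and port are hypothetical placeholders.

```yaml
# Fragment of a container spec.
readinessProbe:
  httpGet:
    path: /health                  # hypothetical path
    port: 3000                     # hypothetical port
    httpHeaders:
      - name: Host
        value: localhost           # passes the privateHostnameGuard allowlist
```

The same `httpHeaders` entry would apply to liveness and startup probes if they also use httpGet.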
Reviewed-on: #107
Adds base manifests and au-syd1 overlay for Paperclip (AI agent
orchestration platform), following the litellm deployment pattern.
Updates aitooling ApplicationSet to include the paperclip path.
Closes #99
Reviewed-on: #100
Deploys LiteLLM proxy with CNPG PostgreSQL (3-instance HA), PgBouncer
pooler, and Redis cache. Introduces a dedicated aitooling AppProject and
ApplicationSet to keep AI tooling services separate from platform infra.
Reviewed-on: #94
Split monolithic remotes.yaml into per-type-package files under
resources/conf.d/ to align with artifactapi v2.7.1 directory loading.
Updated schema: virtuals/locals use dedicated top-level keys, type field
removed. Added helm remotes for all kustomize helmCharts repos and
OCI patterns to docker remotes. CONFIG_PATH now points to the directory.
Reviewed-on: #92