Commit Graph

44 Commits

Author SHA1 Message Date
unkinben eef4c2cd49 feat(vault): deploy HashiCorp Vault 2.0.1 with raft HA (5 replicas)
StatefulSet with templated PVC (cephrbd-fast-delete, 10Gi), headless
service for raft cluster communication, HTTPS gateway (443→8200), and
kubernetes provider retry_join for automatic cluster formation.
2026-05-23 18:22:25 +10:00
unkinben dcea768c15 feat(woodpecker): upgrade to v3.14.1 (chart 3.6.3) (#146)
Reviewed-on: #146
2026-05-23 18:00:55 +10:00
unkinben fd87cb96b5 feat(externaldns): upgrade to 1.21.1, fix sources for installed CRDs (#143)
## Changes

- Upgrade external-dns chart from 1.19.0 → 1.21.1 (app v0.19.0 → v0.21.0)
- Remove `gateway-tcproute` source — `TCPRoute` CRD is not installed, causing crash-loop
- Add `gateway-tlsroute` source — `TLSRoute` CRD is installed (Gateway API 1.5.1)
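
As a rough sketch, the source change could look like this in the chart values. The `sources` key layout and the placeholder for other entries are assumptions, since this description does not show the values file itself:

```yaml
# Hypothetical excerpt of the external-dns Helm values (layout assumed).
sources:
  # - gateway-tcproute   # removed: TCPRoute CRD is not installed
  - gateway-tlsroute     # added: TLSRoute CRD is installed
  # ...any other existing sources stay unchanged
```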

## Why

The pod was crash-looping every minute with `failed to list *v1alpha2.TCPRoute: the server could not find the requested resource` because the TCPRoute CRD doesn't exist in this cluster. TLSRoute was previously removed but its CRD does exist.

Reviewed-on: #143
2026-05-23 01:28:20 +10:00
unkinben d619f9195e benvin/externaldns_compatability (#142)
Reviewed-on: #142
2026-05-23 01:17:20 +10:00
unkinben 1944dbbfcd temp: enable debug logging on externaldns to diagnose TLSRoute sync timeout (#140)
Temporary: enable --log-level=debug to understand why the TLSRoute informer never reports HasSynced within the 1m interval. To be closed/reverted after root cause is found.
Reviewed-on: #140
2026-05-23 01:07:45 +10:00
unkinben 0940cc20f8 fix(traefik): listen on port 443 directly for Gateway API compatibility (#138)
## Problem

Gateway listeners with `port: 443` were rejected with `PortUnavailable: Cannot find entryPoint for Gateway: no matching entryPoint for port 443 and protocol "HTTPS"`.

Traefik matches Gateway listener ports against its internal entryPoint ports (pod-level), not the Service's `exposedPort`. The `websecure` entryPoint was configured on port `8443`, so port `443` listeners had no match.

## Fix

- `ports.websecure.port: 443` — Traefik now binds directly on 443
- `securityContext.capabilities.add: [NET_BIND_SERVICE]` — allows a non-root process to bind to privileged ports (<1024)

The Service `exposedPort` stays at `443`, so external connectivity is unchanged. All existing Gateway listeners (`port: 443`) are correct as-is.

Applies to both internal and external Traefik instances.
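
A minimal sketch of the two settings listed under Fix, assuming the usual nesting of the Traefik Helm chart values. The actual `values-internal.yaml` and `values-external.yaml` are not shown here and may differ:

```yaml
# Hypothetical excerpt; the same change applies to both Traefik instances.
ports:
  websecure:
    port: 443          # entryPoint port inside the pod, matched against Gateway listeners
    exposedPort: 443   # Service port, unchanged
securityContext:
  capabilities:
    add:
      - NET_BIND_SERVICE   # lets the non-root process bind to ports below 1024
```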

## Test plan

- [ ] Traefik pods restart cleanly
- [ ] `kubectl get gateway -A` shows listeners as `Programmed: True`
- [ ] `https://rancher.k8s.syd1.au.unkin.net` (already merged) is reachable

Reviewed-on: #138
2026-05-23 00:44:13 +10:00
unkinben 57c14d32c0 fix(traefik): remove invalid controllerName flag causing CrashLoopBackOff (#136)
## URGENT: Traefik pods are in CrashLoopBackOff

The merged PR #135 added `--providers.kubernetesgateway.controllerName` as an `additionalArguments` entry. Traefik v3.7.0 does not support this flag and fails immediately on startup.

Old replica sets are still running (one pod each) but new pods cannot come up.

## Fix

- Remove `additionalArguments` from both `values-internal.yaml` and `values-external.yaml`
- Revert GatewayClass `controllerName` back to `traefik.io/gateway-controller` (the hardcoded Traefik default — no override mechanism exists in v3.7.0)
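
For illustration, a reverted GatewayClass might look like the sketch below. The resource name comes from PR #135; the `apiVersion` is an assumption about the Gateway API version in use, and the real manifest may carry more fields:

```yaml
# Hypothetical GatewayClass after the revert (not copied from the repository).
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: traefik-internal
spec:
  controllerName: traefik.io/gateway-controller   # Traefik's fixed default
```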

## After merge

GatewayClasses will remain `Unknown` until a separate solution for internal/external separation is implemented (the `labelSelector` approach needs further investigation).

Reviewed-on: #136
2026-05-22 23:58:56 +10:00
unkinben 2df359c4a9 fix(traefik): set controllerName on GatewayClasses and Traefik providers (#135)
## Problem

Both GatewayClasses (`traefik-internal`, `traefik-external`) were stuck as `Unknown`. Neither Traefik deployment had `controllerName` set in `kubernetesGateway`, so both defaulted to `traefik.io/gateway-controller` — which matched neither GatewayClass.

## Fix

- `gatewayclass-internal.yaml`: `controllerName: traefik.io/gateway-controller-internal`
- `gatewayclass-external.yaml`: `controllerName: traefik.io/gateway-controller-external`
- `values-internal.yaml`: added `controllerName: traefik.io/gateway-controller-internal`
- `values-external.yaml`: added `controllerName: traefik.io/gateway-controller-external`

## Test plan

- [ ] ArgoCD syncs traefik-system cleanly
- [ ] `kubectl get gatewayclass` shows both as `Accepted: True`

Reviewed-on: #135
2026-05-22 23:44:06 +10:00
unkinben 462b2b3f4f feat(externaldns): add Gateway API sources for httproute, tlsroute, grpcroute, tcproute, udproute (#126)
Reviewed-on: #126
2026-05-18 00:11:33 +10:00
unkinben 73c9b3f603 fix(traefik): replace invalid controllername flag with labelSelector for v3 (#125)
Remove --providers.kubernetesgateway.controllername which does not exist in
Traefik v3, update GatewayClass controllerName to the standard v3 value, and
use labelSelector on each instance's kubernetesGateway provider to differentiate
internal vs external traffic.

Reviewed-on: #125
2026-05-18 00:03:12 +10:00
unkinben 53553ddcfd feat: deploy internal/external traefik routers (#119)
deploy traefik for internal and external applications. port forwarding
from the external routers will only occur to the IP of the
traefik-external service.

- traefik-internal and traefik-external added
- each is a different deployment

Reviewed-on: #119
2026-05-17 23:44:50 +10:00
unkinben 5e03215f4d chore: migrate reloader/reflector to virtual/helm (#115)
Reviewed-on: #115
2026-05-05 21:42:23 +10:00
unkinben 02ee82da1e feat: update vso to 1.3.0 (#114)
- updates the vso helm chart from 1.2.0 to 1.3.0

Reviewed-on: #114
2026-05-05 00:01:58 +10:00
unkinben bcea7df925 chore: swap vso to virtual helm repo (#109)
- testing if there will be any changes after merging, before merging all of them

Reviewed-on: #109
2026-05-03 16:49:53 +10:00
unkinben e156cd10bd feat: deploy paperclip to au-syd1 via ArgoCD (aitooling project) (#100)
Adds base manifests and au-syd1 overlay for Paperclip (AI agent
orchestration platform), following the litellm deployment pattern.
Updates aitooling ApplicationSet to include the paperclip path.

Closes #99

Reviewed-on: #100
2026-05-02 21:27:51 +10:00
unkinben 5372914803 feat: add litellm to new aitooling ArgoCD project (#94)
Deploys LiteLLM proxy with CNPG PostgreSQL (3-instance HA), PgBouncer
pooler, and Redis cache. Introduces a dedicated aitooling AppProject and
ApplicationSet to keep AI tooling services separate from platform infra.

Reviewed-on: #94
2026-05-01 21:40:26 +10:00
unkinben 3a6d93bc3c feat: add woodpeckerci/plugin-docker-buildx to WOODPECKER_PLUGINS_PRIVILEGED (#87)
Plugin is no longer privileged by default in Woodpecker; explicitly list
both the standard and latest-insecure variants.

Reviewed-on: #87
2026-04-25 20:48:46 +10:00
unkinben 7d555cd31a feat: migrate purelb to ArgoCD (#84)
Migrate PureLB load balancer from Terragrunt to ArgoCD/Kustomize.
Deploys purelb v0.13.0 with two LBNodeAgent and two ServiceGroup CRs
(common: 198.18.200.0/24, dmz: 198.18.199.0/24).
Adds LBNodeAgent and ServiceGroup to kubeconform skip list (no CRD catalog schema).

💘 Generated with Crush

Assisted-by: Claude Sonnet 4.6 via Crush <crush@charm.land>

Reviewed-on: #84
2026-04-07 19:52:17 +10:00
unkinben f0bdc0231a feat: migrate vso-system to ArgoCD (#81)
Migrate Vault Secrets Operator from Terragrunt to ArgoCD/Kustomize.
Deploys vault-secrets-operator v1.2.0 with 3 replicas, plus ClusterRole,
ClusterRoleBindings, and vault-admin ServiceAccount.

Note: static service account tokens (kubernetes.io/service-account-token)
cannot be stored in git; create manually or via Vault after deployment.

💘 Generated with Crush

Assisted-by: Claude Sonnet 4.6 via Crush <crush@charm.land>

Reviewed-on: #81
2026-04-07 19:33:50 +10:00
unkinben b100f3034e feat: migrate observability to ArgoCD (#82)
Migrate Victoria Metrics cluster and agent from Terragrunt to ArgoCD/Kustomize.
Creates new observability AppProject and ApplicationSet.
Deploys victoria-metrics-cluster v0.33.0 (vmselect/vminsert/vmstorage with
HPA, PDB, ingress) and victoria-metrics-agent v0.30.0 (3 replicas, k8s scrape
configs) in the observability namespace.

💘 Generated with Crush

Assisted-by: Claude Sonnet 4.6 via Crush <crush@charm.land>

Reviewed-on: #82
2026-04-07 19:15:45 +10:00
unkinben c3a145acbf feat: remove jfrog container registry (#83)
it's not used and was never really installed correctly. going to change to
artifact-keeper, which promises to have the same capabilities and is open
source.

Reviewed-on: #83
2026-04-07 19:03:32 +10:00
unkinben 181bc152e7 feat: migrate vm-system to ArgoCD (#80)
Migrate Victoria Metrics operator from Terragrunt to ArgoCD/Kustomize.
Deploys victoria-metrics-operator v0.57.1 with 2 replicas in vm-system.

💘 Generated with Crush

Assisted-by: Claude Sonnet 4.6 via Crush <crush@charm.land>

Reviewed-on: #80
2026-03-27 17:04:15 +11:00
unkinben 5bcbd7e1ba feat: migrate elastic-system to ArgoCD (#79)
Migrate ECK operator from Terragrunt to ArgoCD/Kustomize.
Deploys eck-operator v3.2.0 with 2 replicas and PodDisruptionBudget
in the elastic-system namespace.

💘 Generated with Crush

Assisted-by: Claude Sonnet 4.6 via Crush <crush@charm.land>

Reviewed-on: #79
2026-03-27 17:00:05 +11:00
unkinben 02195e6235 feat: migrate reposync to ArgoCD (#78)
Migrate repository sync cronjobs from Terragrunt to ArgoCD/Kustomize.
Adds four daily CronJobs (almalinux9-baseos, almalinux9-appstream, epel9,
openvox7) with associated PVCs and ConfigMaps in the reposync namespace.

💘 Generated with Crush

Assisted-by: Claude Sonnet 4.6 via Crush <crush@charm.land>

Reviewed-on: #78
2026-03-27 16:26:35 +11:00
unkinben dfbb315522 feat: migrate node-feature-discovery and inteldeviceplugins-system to platform project (#48)
- Add node-feature-discovery and inteldeviceplugins-system to platform project
- Convert intel-nfd-rules from local Helm chart to static NodeFeatureRule manifests
- Add required Helm repositories (NFD OCI registry and Intel charts)
- Create base configurations with Helm charts and overlay structures
- Update platform ApplicationSet and project permissions

Reviewed-on: #48
2026-03-19 02:14:45 +11:00
unkinben 90f793464b feat: migrate CSI drivers to dedicated storage project (#45)
- Migrate csi-cephfs from Terraform to ArgoCD
- Migrate csi-cephrbd from Terraform to ArgoCD
- Create dedicated storage project and ApplicationSet for CSI drivers
- Add csi-* pattern matching in storage ApplicationSet
- Remove CSI apps from platform project to separate concerns

Reviewed-on: #45
2026-03-19 01:29:31 +11:00
unkinben 06a8f98b5c feat: migrate cnpg-system from Terraform to ArgoCD (#44)
- Add cnpg-system base ArgoCD application with namespace
- Create cnpg-system overlay for au-syd1 with CloudNativePG Helm chart
- Update platform ApplicationSet to include cnpg-system deployment
- Configure cloudnative-pg operator v0.27.0 with HA and resource limits
- Maintain one-to-one migration from Terraform configuration

Reviewed-on: #44
2026-03-19 01:25:50 +11:00
unkinben 0bf6e80d6f feat: migrate externaldns from Terraform to ArgoCD (#43)
- Add externaldns base ArgoCD application with namespace and Vault integration
- Create externaldns overlay for au-syd1 with Helm chart configuration
- Update platform ApplicationSet to include externaldns deployment
- Configure external-dns v1.19.0 with RFC2136 provider for DNS updates
- Maintain one-to-one migration from Terraform configuration including TSIG secrets

Reviewed-on: #43
2026-03-19 01:22:39 +11:00
unkinben ed300fabed feat: migrate cert-manager from Terraform to ArgoCD (#42)
- Add cert-manager base ArgoCD application with namespace, RBAC resources
- Create cert-manager overlay for au-syd1 with Helm chart configuration
- Update platform ApplicationSet to include cert-manager deployment
- Configure cert-manager v1.19.2 with jetstack Helm repository
- Maintain one-to-one migration from Terraform configuration

Reviewed-on: #42
2026-03-19 01:18:19 +11:00
unkinben ea71ebb55b feat: migrate cattle-system (Rancher) from Terraform to ArgoCD (#39)
- Add cattle-system base ArgoCD application with namespace, Vault integration, and ingress
- Create cattle-system overlay for au-syd1 with Rancher Helm chart configuration
- Update platform ApplicationSet to include cattle-system deployment
- Update platform project to include Rancher Helm repository as source
- Configure Rancher v2.13.1 with HA, TLS, audit logging, and bootstrap secret from Vault
- Maintain one-to-one migration from Terraform configuration

Reviewed-on: #39
2026-03-19 00:56:39 +11:00
unkinben 3f282fbdc2 feat: migrate certificates from Terraform to ArgoCD (#37)
- Add certificates base ArgoCD application with namespace and Vault CA certificate secret
- Create certificates overlay for au-syd1 with static certificate configuration
- Update platform ApplicationSet to include certificates deployment
- Configure Vault CA certificate with reflector annotations for cross-namespace replication
- Maintain one-to-one migration from Terraform configuration

Note: Skip no_plain_secrets hook as this is a public CA certificate that needs
to be replicated via reflector, not a sensitive secret

Reviewed-on: #37
2026-03-19 00:16:33 +11:00
unkinben 14e3946d4b feat: initial puppet deployment (#25)
working towards a larger, redundant, autoscaling and simple puppet
implementation in kubernetes. this was originally based on the openvox
helm chart with several improvements (not all in this pr)

- use of cnpg instead of single bitnamilegacy postgres container
- use of g10k instead of r10k
- run one instance of g10k per namespace, instead of per-pod
- keep only one copy of the environments/branches (instead of one per pod)
- change g10k to native cronjob instead of hacky implementation
- use vault secrets

part one adds:

- cnpg puppetdb pgsql cluster
- cnpg puppetdb pgpooler
- persistent volume claims for puppet, puppetdb, the code repository, etc

Reviewed-on: #25
2026-03-09 01:10:30 +11:00
unkinben 68b753d7fa chore: reload woodpecker (#24)
- add reloader annotations to woodpecker agent/server

Reviewed-on: #24
2026-03-07 16:02:39 +11:00
unkinben d7b661a619 chore: set WOODPECKER_ADMIN (#23)
- enable admin features for myself

Reviewed-on: #23
2026-03-07 15:47:42 +11:00
unkinben 05a88459a5 chore: migrate artifactapi to kustomize (#18)
- migrate terraform deployment to kustomize

Reviewed-on: #18
2026-03-06 21:35:47 +11:00
unkinben f9a8dca060 chore: change max workflows to string (#16)
WOODPECKER_MAX_WORKFLOWS shows no value in the pod's environment, trying
as a string instead

Reviewed-on: #16
2026-03-03 23:14:05 +11:00
unkinben 46e11dd05e chore: increase agents to 3 (#15)
- increase woodpecker agents to 3 for parallel jobs

Reviewed-on: #15
2026-03-03 23:02:15 +11:00
unkinben dbd8914013 feat: migrate woodpecker to argocd (#13)
- move woodpecker helm chart deployment to argocd
- move cnpg resources
- move vault resources

Reviewed-on: #13
2026-03-03 22:24:17 +11:00
unkinben be9d485bfe feat: testing jfrog-container-registry (#11)
- trialing jfrog container registry

Reviewed-on: #11
2026-03-02 23:07:47 +11:00
unkinben ebb47348fe fix: resolve issues with helm deployments (#8)
- remove helm-patch files that are unused
- change platform namespaces allowed to *-system
- change chart name

Reviewed-on: #8
2026-03-01 18:55:47 +11:00
unkinben e873935634 feat: add reloader (#6)
- deploy reloader via helm
- only watch configmaps; secrets are reloaded by vso

Reviewed-on: #6
2026-03-01 16:34:01 +11:00
unkinben c52af7eb11 fix: helm-charts in overlay only (#5)
weird issues with kustomize not being able to merge helm-charts between
base/overlays

- move the helm-charts to the overlay only

Reviewed-on: #5
2026-03-01 16:01:32 +11:00
unkinben 9094cea77d fix: patches must contain path: (#2)
- update the patches for reflector-system to include the path field

Reviewed-on: #2
2026-03-01 14:46:28 +11:00
unkinben 971835f845 feat: initial commit
- add structure to clusters, apps and argocd objects
- add bootstrapping features
2026-03-01 14:31:16 +11:00