diff --git a/README.md b/README.md index 508c9d9..8138ee3 100644 --- a/README.md +++ b/README.md @@ -161,27 +161,101 @@ remotes: ### Include Patterns -Include patterns are regular expressions that control which files can be accessed: +Include patterns are regular expressions that control which files can be accessed. Patterns use Python `re.search`, so they match anywhere in the path unless anchored with `^` or `$`. Only files matching at least one pattern are served; all others return HTTP 403. ```yaml include_patterns: - # Specific project patterns + # Exact project + architecture — most restrictive + - "^gruntwork-io/terragrunt/releases/download/.*/terragrunt_linux_amd64$" + + # Any release asset for a project, any version - "gruntwork-io/terragrunt/.*terragrunt_linux_amd64.*" - # File extension patterns + # File extension only — allow all files of a given type from any path - ".*\\.tar\\.gz$" - - ".*\\.zip$" - ".*\\.rpm$" + - ".*\\.zip$" - # Architecture-specific patterns + # Architecture subtree — allow everything under x86_64/ - ".*/x86_64/.*" - - ".*/linux-amd64/.*" - # Version-specific patterns - - "prometheus/node_exporter/.*/node_exporter-.*\\.linux-amd64\\.tar\\.gz$" + # Combined: architecture + extension + - ".*/x86_64/.*\\.rpm$" + - ".*/noarch/.*\\.rpm$" + + # Docker image names (used with package: docker remotes) + - "^library/nginx" # nginx official images only + - "^rancher/" # all rancher/* images + - "^rancher/rke2-runtime" # specific image + + # Repodata directories — allow all metadata for an RPM repo + - ".*/repodata/.*$" ``` -**Security Note**: Only files matching at least one include pattern are accessible. Files not matching any pattern return HTTP 403. +**Security note**: Omitting `include_patterns` entirely allows all files from that remote. Index files (e.g. `APKINDEX.tar.gz`, `repomd.xml`, tag manifests) always bypass pattern enforcement — they are served unconditionally so clients can discover available packages. 
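The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the helper name `is_allowed`, the sample paths, and the assumption that paths are matched without a leading slash are all hypothetical.

```python
import re


def is_allowed(path, include_patterns, index_patterns=()):
    """Hypothetical helper: would this path be served under the rules above?"""
    # Index files bypass include_patterns entirely.
    if any(re.search(p, path) for p in index_patterns):
        return True
    # No include_patterns configured: everything from the remote is allowed.
    if not include_patterns:
        return True
    # Otherwise at least one pattern must match somewhere in the path.
    return any(re.search(p, path) for p in include_patterns)


# Sample patterns taken from the YAML above; paths are made up for illustration.
include = [r".*/x86_64/.*\.rpm$", r"^library/nginx"]
index = [r"repomd\.xml$"]

x86_rpm = is_allowed("8/BaseOS/x86_64/os/Packages/bash-5.1.rpm", include, index)
arm_rpm = is_allowed("8/BaseOS/aarch64/os/Packages/bash-5.1.rpm", include, index)
arm_index = is_allowed("8/BaseOS/aarch64/os/repodata/repomd.xml", include, index)
nested_nginx = is_allowed("mirror/library/nginx", include, index)
no_patterns = is_allowed("anything/at/all.bin", [], index)

print(x86_rpm, arm_rpm, arm_index, nested_nginx, no_patterns)
```

The `^library/nginx` pattern rejects `mirror/library/nginx` because the anchor pins the match to the start of the path. Without the `^`, `re.search` would accept it.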
+ +### Index Patterns + +Index patterns identify repository metadata files. Index files get special treatment: +- **Always served** regardless of `include_patterns` +- **Cached with `index_ttl`** instead of `file_ttl` +- **Automatically refreshed** when the TTL expires — the cached copy is evicted and re-fetched on next request + +Built-in defaults per package type: + +| Package type | Built-in index patterns | +|---|---| +| `alpine` | `APKINDEX\.tar\.gz$` | +| `rpm` | `repomd\.xml$`, `repodata/` metadata (xml, sqlite, yaml, asc, txt variants), `Packages\.gz$` | +| `docker` | Tag manifests (non-digest refs), `/tags/list` | +| `generic` | *(none)* | + +Use `index_patterns` to add extra patterns on top of the defaults. Duplicates are ignored automatically. + +```yaml +remotes: + helm-charts: + base_url: "https://charts.example.com" + type: "remote" + package: "generic" + include_patterns: + - ".*\\.tgz$" # chart archives + index_patterns: + - "index\\.yaml$" # Helm repo index — re-fetched on every TTL expiry + cache: + file_ttl: 0 + index_ttl: 600 # re-check the index every 10 minutes + + apt-mirror: + base_url: "https://apt.example.com" + type: "remote" + package: "generic" + include_patterns: + - ".*\\.deb$" + index_patterns: + - "InRelease$" # signed APT release file + - "Release$" # unsigned APT release file + - "Packages\\.gz$" # compressed package list + - "Packages\\.xz$" + cache: + file_ttl: 0 + index_ttl: 3600 # hourly index refresh + + almalinux-with-extras: + base_url: "https://mirror.example.com/almalinux" + type: "remote" + package: "rpm" # inherits repomd.xml + repodata/* defaults + include_patterns: + - ".*/x86_64/.*\\.rpm$" + - ".*/noarch/.*\\.rpm$" + index_patterns: + - "comps\\.xml$" # optional group metadata (adds to rpm defaults) + cache: + file_ttl: 0 + index_ttl: 7200 +``` + +Pattern matching uses `re.search`, so `"index\\.yaml$"` matches `/stable/index.yaml` and `/index.yaml`. Anchor with `^` to restrict to the path root. 
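To illustrate how extra `index_patterns` combine with the built-in defaults and decide which TTL applies, here is a rough Python sketch. The function names and the shortened `BUILTIN_INDEX_PATTERNS` table are assumptions made for this example. The real defaults are longer, and the actual implementation may differ.

```python
import re

# Illustrative subset of the defaults listed in the table above (assumption).
BUILTIN_INDEX_PATTERNS = {
    "alpine": [r"APKINDEX\.tar\.gz$"],
    "rpm": [r"repomd\.xml$", r"Packages\.gz$"],
    "generic": [],
}


def effective_index_patterns(package, extra=()):
    """Hypothetical helper: defaults first, then extras, duplicates dropped."""
    merged = list(BUILTIN_INDEX_PATTERNS.get(package, []))
    for pattern in extra:
        if pattern not in merged:
            merged.append(pattern)
    return merged


def ttl_for(path, package, extra, file_ttl, index_ttl):
    """Hypothetical helper: pick index_ttl for index files, file_ttl otherwise."""
    patterns = effective_index_patterns(package, extra)
    if any(re.search(p, path) for p in patterns):
        return index_ttl
    return file_ttl


rpm_patterns = effective_index_patterns("rpm", [r"comps\.xml$", r"repomd\.xml$"])
helm_index_ttl = ttl_for("/stable/index.yaml", "generic", [r"index\.yaml$"], 0, 600)
helm_chart_ttl = ttl_for("/stable/mychart-1.0.0.tgz", "generic", [r"index\.yaml$"], 0, 600)

# Anchoring with ^ restricts the match to the path root.
root_only = r"^/index\.yaml$"
matches_root = re.search(root_only, "/index.yaml") is not None
matches_nested = re.search(root_only, "/stable/index.yaml") is not None

print(rpm_patterns, helm_index_ttl, helm_chart_ttl, matches_root, matches_nested)
```

In this sketch, the duplicate `repomd\.xml$` is dropped and `comps\.xml$` is appended after the rpm defaults, mirroring the "adds on top of the defaults" behaviour described above.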
### Cache Configuration @@ -661,4 +735,195 @@ curl "http://localhost:8000/api/github/gruntwork-io/terragrunt/releases/download - Use external managed databases for production workloads - Configure backup strategies for persistent volumes - Set up proper TLS certificates for ingress -- Consider using StatefulSets for databases with persistent storage \ No newline at end of file +- Consider using StatefulSets for databases with persistent storage + +## Docker Image Rewriting with RKE2 + +RKE2 can route container image pulls through registry mirrors using `/etc/rancher/rke2/registries.yaml`. The artifact API implements the Docker Registry HTTP API v2 at `/v2/`, so it acts as a transparent caching mirror for any upstream registry. + +### How it works + +1. A pod requests `docker.io/library/nginx:latest` +2. RKE2 intercepts the pull and rewrites the image path using the `rewrite` rules +3. The rewritten request hits the artifact API (`/v2/dockerhub/library/nginx/manifests/latest`) +4. On first access the API fetches the manifest and layers from Docker Hub and caches them in S3 +5. Subsequent pulls are served directly from cache, with no upstream traffic + +### registries.yaml + +Place this file on every RKE2 node at `/etc/rancher/rke2/registries.yaml`. The `rewrite` field maps the original image path (as the upstream registry sees it) to the path the artifact API expects under `/v2/{remote_name}/...`. + +#### Docker Hub + +Docker Hub resolves unqualified image names like `nginx` as `library/nginx`. The rewrite prepends the remote name so the request lands on the correct remote. 
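As a rough illustration of that mapping, the Python sketch below applies a rewrite rule of the form `"^(.*)$": "dockerhub/$1"` to an image name and builds the resulting manifest path. It is not containerd's or RKE2's implementation. The helper names are hypothetical, and the `$1` to `\1` translation exists only because Python's `re` module uses a different group-reference syntax.

```python
import re


def normalize_dockerhub_repo(name):
    """Docker Hub treats single-segment names as official images under library/."""
    return name if "/" in name else f"library/{name}"


def apply_rewrite(repo, pattern, template):
    """Hypothetical helper: apply one rewrite rule to a repository path."""
    # registries.yaml uses $1-style group references; Python's re uses \1.
    py_template = re.sub(r"\$(\d+)", r"\\\1", template)
    return re.sub(pattern, py_template, repo, count=1)


def manifest_path(repo, tag):
    """Manifest URL path under the Docker Registry HTTP API v2."""
    return f"/v2/{repo}/manifests/{tag}"


repo = normalize_dockerhub_repo("nginx")
rewritten = apply_rewrite(repo, r"^(.*)$", "dockerhub/$1")
path = manifest_path(rewritten, "latest")

print(repo)
print(rewritten)
print(path)
```

The final path matches step 3 of the flow described earlier: `/v2/dockerhub/library/nginx/manifests/latest`.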
+ +```yaml +# /etc/rancher/rke2/registries.yaml +mirrors: + docker.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "dockerhub/$1" +``` + +Corresponding `remotes.yaml` entry: + +```yaml +remotes: + dockerhub: + base_url: "https://registry-1.docker.io" + type: "remote" + package: "docker" + username: "your-dockerhub-username" + password: "your-dockerhub-token" # PAT with read scope + cache: + file_ttl: 0 + index_ttl: 300 +``` + +A pull of `nginx:latest` becomes `/v2/dockerhub/library/nginx/manifests/latest` on the artifact API. + +#### GitHub Container Registry (ghcr.io) + +```yaml +mirrors: + ghcr.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "ghcr/$1" +``` + +```yaml +remotes: + ghcr: + base_url: "https://ghcr.io" + type: "remote" + package: "docker" + username: "your-github-username" + password: "ghp_your_github_pat" # read:packages scope required + cache: + file_ttl: 0 + index_ttl: 300 +``` + +A pull of `ghcr.io/rancher/rke2-runtime:v1.30.0-rke2r1` becomes `/v2/ghcr/rancher/rke2-runtime/manifests/v1.30.0-rke2r1`. + +#### Multiple registries + +```yaml +# /etc/rancher/rke2/registries.yaml +mirrors: + docker.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "dockerhub/$1" + + ghcr.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "ghcr/$1" + + registry.k8s.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "k8s-registry/$1" + + quay.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "quay/$1" +``` + +Each entry needs a matching remote in `remotes.yaml` using the name from the rewrite target (e.g. `k8s-registry`, `quay`). + +#### Restricting which images are cached + +Use `include_patterns` on the remote to allow only specific images through the proxy. Requests for images not matching any pattern return HTTP 403 to the node. 
+ +```yaml +remotes: + dockerhub: + base_url: "https://registry-1.docker.io" + type: "remote" + package: "docker" + include_patterns: + - "^library/nginx" # official nginx only + - "^library/redis" # official redis only + - "^rancher/" # all rancher images + - "^grafana/grafana" # specific image + cache: + file_ttl: 0 + index_ttl: 300 +``` + +Omit `include_patterns` to allow all images from that registry. + +#### TLS configuration + +If the artifact API uses a private CA certificate, tell containerd about it in `registries.yaml`: + +```yaml +mirrors: + docker.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "dockerhub/$1" + +configs: + "artifacts.example.com": + tls: + ca_file: /etc/ssl/certs/internal-ca.crt +``` + +### Applying the configuration + +```bash +# Write registries.yaml on each node (server and agent) +sudo mkdir -p /etc/rancher/rke2 +sudo tee /etc/rancher/rke2/registries.yaml <<'EOF' +mirrors: + docker.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "dockerhub/$1" + ghcr.io: + endpoint: + - "https://artifacts.example.com" + rewrite: + "^(.*)$": "ghcr/$1" +EOF + +# Restart the RKE2 service (server nodes) +sudo systemctl restart rke2-server + +# Or on agent nodes +sudo systemctl restart rke2-agent + +# Confirm containerd picked up the mirror config +sudo /var/lib/rancher/rke2/bin/crictl info | jq '.config.registry.mirrors' +``` + +### Verifying pulls go through the cache + +```bash +# Pull an image on a node +sudo /var/lib/rancher/rke2/bin/crictl pull nginx:latest + +# Check the artifact API received the request +kubectl logs deployment/artifactapi -n artifact-storage | grep "nginx" +# Expect: Cache MISS on first pull, Cache HIT on subsequent pulls + +# Query the manifest endpoint directly — 200 means it's cached +curl -I https://artifacts.example.com/v2/dockerhub/library/nginx/manifests/latest + +# Check what's stored in the cache +curl https://artifacts.example.com/ | jq '.remotes' +``` \ No newline 
at end of file