Virtual helm: cache parsed member indexes (not raw bytes) to skip re-parse on rebuild #36

Closed
opened 2026-05-01 23:44:44 +10:00 by unkinben · 2 comments
Owner

Performance Issue

On a virtual repo cache miss, each member's index.yaml is fetched and stored as raw YAML bytes in S3. When the virtual index needs to be rebuilt (another cache miss), every member's raw bytes must be parsed again from scratch.

For large member repos (e.g. grafana, victoriametrics with hundreds of charts), re-parsing the same YAML on every virtual rebuild is wasteful.

Approach

Consider storing the parsed + URL-rewritten entries as a compact binary format (e.g. msgpack) alongside the raw YAML in S3. Key: {member_name}/parsed/index.msgpack with the same TTL as the raw YAML.

On virtual rebuild:

  1. Try to load the parsed msgpack for each member (fast binary deserialization).
  2. If not available, fall back to loading + parsing the raw YAML (current behaviour).

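This replaces YAML parsing with msgpack deserialization on each rebuild. Both scale linearly with input size, but binary deserialization does far less work per byte than a YAML parser, so the per-member cost should drop substantially, at the cost of an extra S3 key per member.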
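
The lookup order described above could be sketched roughly as follows. This is a minimal sketch, not the project's code: `load_member_entries`, the dict-backed `store`, the raw key name, and the injected `unpack` / `parse_yaml` callables are all hypothetical. In a real handler they would presumably be something like `msgpack.unpackb` and a PyYAML safe loader.

```python
from typing import Any, Callable, Mapping, Optional


def load_member_entries(
    store: Mapping[str, bytes],
    member_name: str,
    unpack: Callable[[bytes], Any],
    parse_yaml: Callable[[bytes], Any],
) -> Optional[tuple[Any, str]]:
    """Return (entries, source) for one member, preferring the parsed cache.

    `store` stands in for S3; `source` is "msgpack" or "yaml" so callers
    can log how many members hit the fast path.
    """
    parsed_key = f"{member_name}/parsed/index.msgpack"  # key from the proposal
    raw_key = f"{member_name}/index.yaml"  # assumed name for the raw key

    blob = store.get(parsed_key)
    if blob is not None:
        try:
            return unpack(blob), "msgpack"
        except Exception:
            # Corrupt or incompatible parsed cache: fall through to raw YAML.
            pass

    raw = store.get(raw_key)
    if raw is None:
        return None  # nothing cached for this member
    return parse_yaml(raw), "yaml"
```

Passing the decoders in as arguments keeps the fallback logic testable without S3 or either serialization library installed.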

Tradeoff

Adds storage complexity and a dependency on msgpack. Only worth it if issue #34 (CSafeLoader) doesn't close the gap sufficiently.

Author
Owner

Can this msgpack cache be used for all index management: virtual, remote, and local repos?

Author
Owner

Resolved in PR #40.

How it was resolved: After fetching each member's index.yaml, the handler now parses it and stores a compact msgpack file (index.msgpack) alongside the raw YAML in S3. On the next rebuild (virtual TTL expired, member caches still valid), msgpack is loaded instead of re-parsing raw YAML — eliminating the 6.3s parse phase that accounted for 60% of merge time.

Phase profiling (19 members, 14 MB total): parse=6314ms (60%), rewrite+dedup=33ms (0.3%), dump=4124ms (39%). Msgpack eliminates the parse phase.
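
Per-phase numbers like these can be gathered with a small timing helper. The sketch below is an assumption about how such profiling might be done, not the instrumentation used in the PR; `PhaseTimer` and its methods are hypothetical names.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class PhaseTimer:
    """Accumulates wall-clock time per named phase."""

    def __init__(self) -> None:
        self.elapsed_ms: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            spent = (time.perf_counter() - start) * 1000.0
            self.elapsed_ms[name] = self.elapsed_ms.get(name, 0.0) + spent

    def summary(self) -> str:
        # Guard against a zero total so the percentage never divides by zero.
        total = sum(self.elapsed_ms.values()) or 1.0
        return ", ".join(
            f"{name}={ms:.0f}ms ({ms / total:.0%})"
            for name, ms in self.elapsed_ms.items()
        )
```

Wrapping each stage in `with timer.phase("parse"):`, `with timer.phase("rewrite+dedup"):` and `with timer.phase("dump"):` and logging `timer.summary()` would produce a line in the same shape as the one above.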

Results:

  • Warm rebuild: 9.6s → 5.9s (38% faster)
  • Cold rebuild: ~21s → ~26s (one-time overhead to build msgpack cache)
  • msgpack=19/19 confirmed in logs

Issues encountered: datetime objects in YAML-parsed index entries cannot be directly msgpack-serialized — required _entries_to_msgpack_safe() to convert to ISO strings before packing.
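
The body of `_entries_to_msgpack_safe()` is not shown in this thread. A conversion of this kind could look roughly like the sketch below, where `to_msgpack_safe` is a hypothetical stand-in that recursively swaps `datetime` and `date` values for ISO 8601 strings and leaves everything else untouched.

```python
from datetime import date, datetime, timezone
from typing import Any


def to_msgpack_safe(value: Any) -> Any:
    """Recursively replace datetime/date values with ISO 8601 strings."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: to_msgpack_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_msgpack_safe(item) for item in value]
    return value


# Illustrative data only; the chart name and version are made up.
entries = {
    "grafana": [
        {
            "version": "8.0.0",
            "created": datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
        }
    ]
}
safe = to_msgpack_safe(entries)
# safe["grafana"][0]["created"] is now the string "2025-01-02T03:04:05+00:00"
```

One consequence of this design is that entries loaded from the msgpack cache carry timestamps as strings while entries parsed from raw YAML may carry `datetime` objects, so any code comparing or sorting on those fields has to handle both.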

Potential future improvements: Currently the raw YAML is still downloaded from S3 even when msgpack is available (used as a fallback). Skipping the raw YAML download when msgpack is valid would further reduce the warm rebuild S3 I/O overhead. Also, the YAML dump phase (4.1s, 39%) is the next bottleneck — a pre-serialized msgpack of the merged virtual index could eliminate that too, though it would require cache invalidation on any member change.


Reference: unkin/artifactapi#36