Virtual helm: cache parsed member indexes (not raw bytes) to skip re-parse on rebuild #36
Performance Issue
On a virtual repo cache miss, each member's index.yaml is fetched and stored as raw YAML bytes in S3. When the virtual index needs to be rebuilt (another cache miss), every member's raw bytes must be parsed again from scratch.
For large member repos (e.g. grafana, victoriametrics with hundreds of charts), re-parsing the same YAML on every virtual rebuild is wasteful.
Approach
Consider storing the parsed + URL-rewritten entries in a compact binary format (e.g. msgpack) alongside the raw YAML in S3. Key: {member_name}/parsed/index.msgpack, with the same TTL as the raw YAML.
On virtual rebuild:
This reduces the per-member parse cost on rebuilds from O(YAML size) to O(msgpack size), at the cost of an extra S3 key per member.
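The lookup order described above could be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name load_member_entries, the dict-like store standing in for S3, and the raw key layout {member_name}/index.yaml are all assumptions. The serializer is passed in so the sketch stays dependency-free; in practice pack/unpack would presumably be msgpack.packb and msgpack.unpackb, and parse_yaml a PyYAML load.

```python
from typing import Callable, MutableMapping


def load_member_entries(
    store: MutableMapping[str, bytes],   # hypothetical stand-in for the S3 bucket
    member_name: str,
    parse_yaml: Callable[[bytes], dict],  # e.g. a PyYAML-based parser
    pack: Callable[[dict], bytes],        # e.g. msgpack.packb
    unpack: Callable[[bytes], dict],      # e.g. msgpack.unpackb
) -> dict:
    """Return a member's parsed index, preferring the parsed cache."""
    parsed_key = f"{member_name}/parsed/index.msgpack"  # key proposed in this issue
    raw_key = f"{member_name}/index.yaml"               # assumed raw key layout

    cached = store.get(parsed_key)
    if cached is not None:
        try:
            return unpack(cached)
        except Exception:
            pass  # unreadable cache entry: fall back to the raw YAML

    # Slow path: parse the raw YAML once, then store the parsed form.
    entries = parse_yaml(store[raw_key])
    store[parsed_key] = pack(entries)
    return entries
```

TTL handling is left out here on the assumption that the storage layer expires both keys together.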
Tradeoff
Adds storage complexity and a dependency on msgpack. Only worth it if issue #34 (CSafeLoader) doesn't close the gap sufficiently.
Can this msgpack cache be used for all virtual, remote, and local index management?
Resolved in PR #40.
How it was resolved: After fetching each member's index.yaml, the handler now parses it and stores a compact msgpack file (index.msgpack) alongside the raw YAML in S3. On the next rebuild (virtual TTL expired, member caches still valid), msgpack is loaded instead of re-parsing raw YAML, eliminating the 6.3s parse phase that accounted for 60% of merge time.
Phase profiling (19 members, 14 MB total): parse=6314ms (60%), rewrite+dedup=33ms (0.3%), dump=4124ms (39%). Msgpack eliminates the parse phase.
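A per-phase breakdown like the one above can be collected with a small timing helper. The sketch below is one possible way to do it, using only the standard library; the PhaseTimer class and its method names are hypothetical and not taken from PR #40.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock milliseconds per named phase."""

    def __init__(self):
        self.ms = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            self.ms[name] = self.ms.get(name, 0.0) + elapsed

    def summary(self):
        # Guard against division by zero when nothing was timed.
        total = sum(self.ms.values()) or 1.0
        return ", ".join(
            f"{name}={value:.0f}ms ({100 * value / total:.1f}%)"
            for name, value in self.ms.items()
        )
```

Each stage of the merge would be wrapped as `with timer.phase("parse"): ...`, and `timer.summary()` logged once the rebuild finishes.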
Results:
Issues encountered: datetime objects in YAML-parsed index entries cannot be directly msgpack-serialized; this required _entries_to_msgpack_safe() to convert them to ISO strings before packing.
Potential future improvements: Currently the raw YAML is still downloaded from S3 even when msgpack is available (it is used as a fallback). Skipping the raw YAML download when msgpack is valid would further reduce the S3 I/O overhead of a warm rebuild. Also, the YAML dump phase (4.1s, 39%) is the next bottleneck. A pre-serialized msgpack of the merged virtual index could eliminate that too, though it would require cache invalidation on any member change.
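For readers hitting the same datetime problem, the conversion step mentioned under "Issues encountered" might look something like the sketch below. It is an illustrative reimplementation under assumptions, not the actual _entries_to_msgpack_safe() from the PR: the function name is hypothetical, and it assumes the parsed index is made of plain dicts, lists, and scalars.

```python
from datetime import date, datetime


def entries_to_msgpack_safe_sketch(obj):
    """Recursively replace datetime/date values with ISO 8601 strings."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, dict):
        return {key: entries_to_msgpack_safe_sketch(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [entries_to_msgpack_safe_sketch(value) for value in obj]
    return obj  # str, int, float, bool, None pass through unchanged
```

The input is left unmodified and a converted copy is returned. One consequence of this approach is that timestamps come back as strings after unpacking, so any consumer that expects datetime objects would need to parse them again.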