Remove SPEC, ARCHITECTURE, TODO from tracking

Add to .gitignore — these are working documents, not part of the shipped codebase.
2026-05-04 22:16:55 +10:00
parent 2309e9f43a
commit d2e9e4df59
3 changed files with 0 additions and 169 deletions
-112
@@ -1,112 +0,0 @@
# StreamStack Architecture
## Services
| Service | Replicas | Backing stores | Responsibility |
|---------|----------|----------------|----------------|
| **auth** | 2 | Postgres, NATS KV | User accounts, JWT issue/refresh/revoke |
| **catalogue** | 2 | Postgres, NATS pub | Media metadata CRUD, stream token requests |
| **streaming** | 2 | NATS KV, S3 | Token issuance, byte-range video delivery |
| **ingest** | 2 | S3, (catalogue HTTP) | Upload video, extract metadata/thumbnail, register in catalogue |
| **nginx** | 1 | — | Reverse proxy + React SPA |
## Infrastructure
| Component | Purpose |
|-----------|---------|
| **Postgres** | Persistent store for user accounts (auth) and media metadata (catalogue) |
| **NATS JetStream KV** | Short-lived stream tokens (1h TTL); revoked-token list for JWT blacklisting |
| **S3 / MinIO** | Binary storage — `media/` bucket for video files, `thumbnails/` bucket for JPEG thumbnails |
---
## Request flows
### Login
```
Browser → nginx → auth
auth reads Postgres (verify credentials)
auth writes nothing to NATS
auth returns access_token (JWT, RS256, 30min) + refresh_token (7 days)
```
### Browse catalogue
```
Browser → nginx → catalogue
catalogue reads Postgres (published media items)
returns list of metadata (title, duration, thumbnail_s3_key, etc.)
no NATS, no S3
```
### Request a stream token
```
Browser → nginx → catalogue POST /catalogue/{id}/stream-token
catalogue reads Postgres → gets s3_key + size_bytes for the item
catalogue → streaming POST /stream/token {media_id, s3_key, size_bytes}
streaming verifies JWT (public key, local)
streaming writes NATS KV: token → "media_id|user_id|timestamp|s3_key|size_bytes"
streaming returns {stream_url: "/api/v1/stream/<token>"}
catalogue returns stream_url to browser
```
Token TTL: 1 hour. After that, NATS discards it automatically.
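The pipe-delimited KV value above could be encoded and decoded roughly as follows. This is a minimal sketch, not the service's actual code: the `StreamTokenRecord` class, the string types for the IDs, and the Unix-seconds timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamTokenRecord:
    """Hypothetical in-memory form of the value stored under a stream token."""

    media_id: str
    user_id: str
    timestamp: int  # assumed: Unix epoch seconds at token issue
    s3_key: str
    size_bytes: int

    def encode(self) -> str:
        """Serialise to "media_id|user_id|timestamp|s3_key|size_bytes"."""
        fields = [self.media_id, self.user_id, self.s3_key]
        # A "|" inside any string field would make the record ambiguous.
        if any("|" in field for field in fields):
            raise ValueError("fields must not contain '|'")
        return "|".join(
            [self.media_id, self.user_id, str(self.timestamp),
             self.s3_key, str(self.size_bytes)]
        )

    @classmethod
    def decode(cls, value: str) -> "StreamTokenRecord":
        """Parse a stored value back into its five fields."""
        media_id, user_id, timestamp, s3_key, size_bytes = value.split("|")
        return cls(media_id, user_id, int(timestamp), s3_key, int(size_bytes))
```

Because `s3_key` and `size_bytes` travel inside the record, resolving a token needs no Postgres or catalogue lookup.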
### Play video (each range request)
```
Browser → nginx → streaming GET /stream/<token> Range: bytes=X-Y
streaming reads NATS KV (resolve token → s3_key + size_bytes)
streaming → S3 GET object with byte range (aiobotocore, fully async)
streams bytes back to browser
no Postgres, no catalogue HTTP call
```
The browser sends many range requests for a single video. Each one costs only a NATS lookup + S3 range-get.
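To illustrate why `size_bytes` is kept alongside the token, here is a hedged sketch of resolving a single-range `Range` header into inclusive byte offsets. The function names `parse_range` and `content_range` are hypothetical; the streaming service's real parsing is not shown in this document.

```python
from __future__ import annotations


def parse_range(header: str, size_bytes: int) -> tuple[int, int]:
    """Return inclusive (start, end) offsets for a single 'bytes=X-Y' range.

    Raises ValueError for malformed, multi-range, or unsatisfiable headers.
    """
    unit, sep, spec = header.partition("=")
    if unit.strip().lower() != "bytes" or not sep or "," in spec:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, dash, last = spec.strip().partition("-")
    if not dash:
        raise ValueError(f"malformed Range header: {header!r}")
    if first == "":
        # Suffix form "bytes=-N": the last N bytes of the object.
        if not last.isdigit() or int(last) == 0:
            raise ValueError(f"malformed Range header: {header!r}")
        start = max(size_bytes - int(last), 0)
        end = size_bytes - 1
    else:
        if not first.isdigit() or (last and not last.isdigit()):
            raise ValueError(f"malformed Range header: {header!r}")
        start = int(first)
        # An open-ended or oversized range is clamped to the final byte.
        end = min(int(last), size_bytes - 1) if last else size_bytes - 1
    if start > end:
        raise ValueError("range not satisfiable")
    return start, end


def content_range(start: int, end: int, size_bytes: int) -> str:
    """Build the Content-Range value for a 206 response."""
    return f"bytes {start}-{end}/{size_bytes}"
```

The same inclusive `start`/`end` pair can be forwarded to S3 as `bytes=start-end`, since S3 range-gets follow the HTTP range syntax.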
### Ingest a video (admin only)
```
curl/frontend → nginx → ingest POST /ingest/upload (multipart)
ingest verifies JWT (admin role required)
ingest → S3 upload file → media/{uuid}.ext
ingest → S3 head_object → size_bytes
ingest runs PyAV (in threadpool):
- reads S3 via range-gets → extracts duration, codec, width, height, fps
- decodes first video frame → JPEG → S3 thumbnails/{uuid}.jpg
ingest → catalogue POST /catalogue/ {s3_key, size_bytes, metadata...}
catalogue writes Postgres
catalogue publishes NATS: catalogue.events.media.published
returns catalogue item JSON
```
---
## JWT flow
Auth uses **RS256** (asymmetric). The private key signs tokens; all other services hold only the public key and verify locally — no auth HTTP call on every request.
Revoked tokens are stored as keys in a NATS KV bucket (`revoked-tokens`). Streaming checks this bucket on token issue, not on every range request.
---
## Data ownership
```
Postgres auth users, hashed passwords, roles
catalogue media items, all metadata fields
NATS KV streaming stream tokens (s3_key + size_bytes embedded)
auth revoked JWT list
S3 ingest video files → media/
ingest thumbnails → thumbnails/
(read) streaming reads media/ for range delivery
(read) ingest/PyAV reads media/ for metadata extraction
```
---
## Inter-service HTTP calls
| Caller | Callee | When |
|--------|--------|------|
| catalogue | streaming | Stream token request — passes s3_key + size_bytes |
| ingest | catalogue | After upload — registers the media item |
Beyond these two calls, services either access their own data stores directly or communicate over NATS pub/sub. Services do **not** query each other's databases.
-41
@@ -1,41 +0,0 @@
welcome to stream stack.
this project is to build a media streaming service composed of a number of microservices and multiple frontends (desktop, mobile, admin). the aim is that every component is highly available and load balanced. state is shared between all processes through NATS, and persistent data is stored in pgsql or s3 (depending on the data). the backends should all be built using fastapi. each backend service should be able to run independently (should it be one pypi package with features we enable for different modules, or a different pypi package for each system?)
the frontend services should be in a fast and responsive language that will consume the fastapi services (react maybe?). there should be a "router" service that the frontend talks to, which proxies connections to the appropriate backend, or should we put different services on different dns addresses?
question: can we stream media from s3? will that enable skipping forward/backwards?
ensure there are unit tests for all files (in tests/)
add a makefile target that runs the unit tests (make test) using uvx (so we don't need to install any requirements permanently)
add makefile test for linting with ruff
add Dockerfile to run the streamstack (with booleans to enable different microservices)
- this should use git.unkin.net/unkin/almalinux9-base:latest to build, then the uv container (dhi.io/uv:0.11) to run
add docker-compose for e2e testing of the stack (with makefile targets to start/stop)
required projects:
- https://github.com/pyav-org/pyav
- https://github.com/fastapi/fastapi
- https://github.com/nats-io/nats.py
phase 1:
- build a backend microservice that can read media files with ffmpeg (pyav) from s3 and stream them. the url to stream the media should not include the name of the media. the url should be openable in mpv for testing.
phase 2:
- build a microservice that presents the media catalogue. this will be used by the frontend later to list media available.
phase 3:
- build auth microservice. it should be a jwt provider. when a user authenticates, they have a jwt kept somewhere that is passed to each microservice for each request. each microservice should then verify the jwt against the auth microservice.
phase 4:
- import microservice, for importing video into s3, adding to catalogue, finding metadata (thumbnail, actors, etc)
phase 5:
- simple react frontend (this is just for testing. no auth. just show catalogue and when you click on an item, play that video)
- the frontend should be its own container, so that it can be run in a DMZ
additional requirements:
keep track of where a user is up to in a given video, so that when they replay it, it starts from a few seconds before where they stopped.
when streaming video, send bursts of video to the user so that it caches on the client side
-16
@@ -1,16 +0,0 @@
# TODO
- Transcode MKV uploads to MP4 during ingest — browsers (Firefox/Chrome) cannot natively play MKV containers, so Jellyfish-style uploads fail to load in the video player.
- IMDB metadata microservice — subscribe to `catalogue.events.media.published` (durable consumer `"imdb-fetcher"`), look up title/year against IMDB API, patch catalogue with enriched metadata (rating, genre, plot, cast).
- Subtitle fetcher microservice — subscribe to `catalogue.events.media.published` (durable consumer `"subtitle-fetcher"`), fetch subtitles (e.g. OpenSubtitles API), store as `.vtt` in S3, update catalogue with subtitle_s3_key. Frontend `<video>` supports `<track>` elements for native subtitle display.
## TV show metadata identification
For a file like `Clarkson's.Farm.S01E01.Tractoring.WEBRip-1080p.mp4`, metadata can be identified via:
- **Filename parsing** — extract show name, season, episode number, and episode title from the filename using a regex (e.g. `S(\d+)E(\d+)` pattern). The ingest service or a dedicated parser microservice could do this automatically at upload time, pre-filling `show_name`, `season`, `episode`, `episode_title` fields so the user doesn't have to type them.
- **TheTVDB API** — given `show_name` + `season` + `episode`, look up the canonical title, air date, plot, guest cast, network, and a high-quality episode thumbnail. Free API key available. Subscribe to `catalogue.events.media.published` as a durable consumer `"tvdb-fetcher"`.
- **TMDB (The Movie Database)** — also covers TV series (`/tv/{series_id}/season/{n}/episode/{n}`). Has episode stills, show banners, cast photos. Free API key.
- **IMDb / Cinemagoer** — Python library (`cinemagoer`, formerly IMDbPY) that scrapes IMDb data without an API key. Slower but no key required. IMDb series ID can be cross-referenced from TheTVDB.
- **Video container metadata** — MKV/MP4 files sometimes embed title, show name, season/episode in container tags (readable via PyAV `container.metadata`). Worth checking before hitting external APIs — already have the file open during ingest.
- **Suggested flow**: parse filename → check container tags → query TheTVDB with (show_name, season, episode) → fall back to TMDB → patch catalogue via service JWT.
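
The filename-parsing step could look roughly like the sketch below. It is an illustration under stated assumptions: the function name `parse_episode_filename`, the separator handling, and the list of release tags (`WEBRip`, `1080p`, and so on) used to decide where the episode title ends are all hypothetical heuristics, not an agreed design.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Show name, then SxxEyy, then an optional remainder (title + release tags).
EPISODE_RE = re.compile(
    r"^(?P<show>.+?)[. _-]+S(?P<season>\d+)E(?P<episode>\d+)"
    r"(?:[. _-]+(?P<rest>.*))?$",
    re.IGNORECASE,
)
# Assumed, non-exhaustive list of tokens that mark the end of the title.
RELEASE_TAG_RE = re.compile(
    r"^(?:WEB(?:Rip|-?DL)?|HDTV|BluRay|BDRip|DVDRip|\d{3,4}p|[xh]26[45])\b",
    re.IGNORECASE,
)


def parse_episode_filename(filename: str) -> Optional[dict]:
    """Extract show_name, season, episode, episode_title; None if no SxxEyy."""
    stem = PurePosixPath(filename).stem  # drop the file extension
    match = EPISODE_RE.match(stem)
    if match is None:
        return None
    title_words = []
    for token in re.split(r"[. _]+", match.group("rest") or ""):
        # Stop at the first release tag; everything before it is the title.
        if not token or RELEASE_TAG_RE.match(token):
            break
        title_words.append(token)
    return {
        "show_name": re.sub(r"[._]+", " ", match.group("show")).strip(),
        "season": int(match.group("season")),
        "episode": int(match.group("episode")),
        "episode_title": " ".join(title_words) or None,
    }
```

For the example file above, this yields `show_name` "Clarkson's Farm", season 1, episode 1, and `episode_title` "Tractoring". Files without an `SxxEyy` marker return `None`, so the flow can fall through to container tags.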