She watches without knowing
what she’s meant to find.
A frame, a motion, a light reading, sometimes a sound, sometimes a person. She returns the description segment by segment. No genre fitted, no story imposed, no kindness offered to the upload.
What this is: a video-reading prototype that reports what is actually on screen rather than what the title implied.
The architecture is honest: Gemini 2.5 Pro does the seeing, Aura does the speaking. The upload lands on the Gemini Files API (48-hour retention, then it's gone), the cold-eye prompt segments the clip, the JSON comes back, and Aura's voice rewrites each per-segment description.

The 360 mode is a separate prompt that tracks position on the sphere; it works on equirectangular sources, but I'm not pretending the spatial reasoning is production-grade yet. The reader sometimes misses small objects at frame edges, gets confused by very fast cuts, and won't identify named people on principle.

The voice rewrite and ElevenLabs audio synth land in a second pass once the base prompts read right. This is a prototype; treat its output as a draft, not a verdict.
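Tracking position on the sphere rests on the standard equirectangular mapping: horizontal pixel position becomes yaw, vertical becomes pitch. How the prototype's prompt actually encodes this is not shown here; the sketch below is just the fixed geometry, with a function name of my own choosing.

```python
def equirect_to_sphere(x: float, y: float, width: int, height: int) -> tuple[float, float]:
    """Map a pixel in an equirectangular frame to (yaw, pitch) in degrees.

    yaw:   -180 (left edge) .. +180 (right edge), 0 = frame centre
    pitch:  +90 (top, zenith) .. -90 (bottom, nadir)
    """
    yaw = (x / width - 0.5) * 360.0
    pitch = (0.5 - y / height) * 180.0
    return yaw, pitch
```

On a 4096×2048 source the frame centre maps to (0, 0), dead ahead; a point at mid-height on the far-left edge maps to (-180, 0), directly behind the camera, which is exactly the back half of the sphere the 360 mode is meant to read.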
- You want a frame-by-frame description of a clip without the editorial spin a human reviewer would add.
- You’re testing a 360° capture and you want a read on what’s on the back half of the sphere.
- You want a second pair of eyes on a rough cut that won’t flatter you about what’s on screen.
- Are my uploads private?
- No, not in the strict sense. The file is sent to Google’s Gemini Files API, held for 48 hours, then deleted. Don’t paste in anything you wouldn’t put on a public URL. The studio doesn’t retain the upload; Google’s retention is what you’re trusting.
- Is the reading accurate?
- No, not infallibly. The reader is good at frame composition, motion, and broad light readings, weaker at small objects, named entities, and very dense edits. Treat the output as a draft description that the model will defend reasonably well, not as ground truth. If the segment description is wrong, it’s wrong; I’d rather you saw that than not.
- Can I use this for festival captioning?
- No. Festival accessibility captioning is a discipline with audit standards and human review; this is a prototype that returns a cold description and won’t pass a deaf-or-hard-of-hearing audience check. It’s a useful tool for the editor on the early-cut side; it is not an access deliverable.
Free during the prototype window at /watch. Once the voice rewrite and audio synth land, it meters at £0.20 per minute of source video with a £4 minimum. No subscription, no retainer; pay per clip or upload as part of a wider commission.
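The metering is a single floor-or-rate calculation: per-minute rate times source length, never below the minimum. A sketch (the function name is mine, not the studio's):

```python
def price_gbp(source_minutes: float) -> float:
    """Metered price: £0.20 per minute of source video, £4 minimum."""
    RATE_PER_MINUTE = 0.20
    MINIMUM = 4.00
    return max(MINIMUM, round(RATE_PER_MINUTE * source_minutes, 2))
```

So a 10-minute clip sits on the £4 floor, and the rate only starts to bite past 20 minutes of source.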
Upload → Gemini Files API (48-hour retention) → Gemini 2.5 Pro generateContent with the cold-eye prompt → JSON segmentation. The Aura voice rewrite + ElevenLabs audio synth land in a second pass once the prompts read right. Required env: GOOGLE_AI_API_KEY.
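The JSON segmentation step above implies a parse-and-validate pass before the voice rewrite ever sees a segment. The schema below is my assumption (a "segments" list of start/end/description objects), not the prototype's documented contract; the real shape may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    start: float       # seconds into the clip
    end: float         # seconds into the clip
    description: str   # the cold-eye read for this span


def parse_segments(raw: str) -> list[Segment]:
    """Parse the model's JSON segmentation, tolerating a bare top-level list.

    Assumed shape: {"segments": [{"start": 0.0, "end": 4.2, "description": "..."}]}
    """
    data = json.loads(raw)
    items = data["segments"] if isinstance(data, dict) else data
    segments = []
    for item in items:
        seg = Segment(float(item["start"]), float(item["end"]), str(item["description"]))
        if seg.end < seg.start:
            raise ValueError(f"segment ends before it starts: {seg}")
        segments.append(seg)
    return segments
```

Validating here, rather than trusting the model, matches the doc's own stance: the output is a draft, and a malformed or impossible segment should fail loudly instead of reaching the rewrite pass.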
- London 360 — walking the camera evolution — the source material this reader was built to describe.
- The stack — the AI / voice line by name (Gemini, Whisper, ElevenLabs, F5-TTS).
- Aerial — the 360 source line the reader was sharpened against.
- Services — see the full commercial surface, every service in one place.