This is a shot in the dark, but has anyone built, or does anyone know of, an existing library/project that uses LLMs to describe the contents of videos? (Not Whisper transcriptions.) I have many videos from the 90s/00s that are cassette-tape length (~2 hrs), and I'd love to have a text form I could query, so I can search something like "boy eating pear" and it will tell me which video that scene was in (and ideally even a rough timestamp of that scene!).
Alternatively, is this currently possible with today's tech?
You could take a look at the bots Mastodon folks use to add alt-text descriptions to images for the visually impaired.
Maybe use ffmpeg or something to extract an image from your video every so often, and then have those images labelled?
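The extract-and-label idea above can be sketched roughly like this. It shells out to ffmpeg's `fps` filter to grab one frame every N seconds; the interval, output naming, and the timestamp helper are my assumptions, not part of any existing tool:

```python
import subprocess

FRAME_INTERVAL = 10  # assumed: seconds between extracted frames

def extract_frames(video_path, out_dir, interval=FRAME_INTERVAL):
    """Extract one frame every `interval` seconds with ffmpeg's fps filter."""
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{interval}",     # one frame per `interval` seconds
        f"{out_dir}/frame_%06d.jpg",    # frames numbered from 000001
    ]
    subprocess.run(cmd, check=True)

def frame_timestamp(frame_number, interval=FRAME_INTERVAL):
    """Rough timestamp (seconds) of an extracted frame; ffmpeg numbers from 1."""
    return (frame_number - 1) * interval
```

Because the frame index encodes its position, any label you attach to `frame_000361.jpg` can be mapped back to a rough timestamp in the original two-hour tape.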
Dustin said:
This is a shot in the dark, but has anyone built, or does anyone know of, an existing library/project that uses LLMs to describe the contents of videos? (Not Whisper transcriptions.) I have many videos from the 90s/00s that are cassette-tape length (~2 hrs), and I'd love to have a text form I could query, so I can search something like "boy eating pear" and it will tell me which video that scene was in (and ideally even a rough timestamp of that scene!).
Alternatively, is this currently possible with today's tech?
There are cloud API services that do this for images; I'm not sure about video. I saw some repos on GitHub, like this one that takes keyframe snapshots: https://github.com/byjlw/video-analyzer. I haven't used any of these, but I'd be interested to hear if anyone else in the community has experience with them. AWS and GCP both have services for images, but I want to try some video and image tools for doing this analysis work self-hosted.
Keyframes set me off in a great direction, thanks!
I'm only familiar with Gemini, but you should be able to chop your videos into 50-minute chunks (I think that's the limit) with ffmpeg and then ask the model for timestamps of scene changes. Then take those timestamps, use ffmpeg to cut the videos into scenes, and ask Gemini to describe each scene. That way you get more manageable clips instead of two-hour videos where you don't know where the part you're looking for is.
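A minimal sketch of the chunking step, assuming the 50-minute limit mentioned above. It uses ffmpeg's segment muxer with stream copy (fast, but cuts land on keyframes, so chunk lengths are approximate), plus a helper to map a timestamp the model reports inside a chunk back to the full video:

```python
import subprocess

CHUNK_SECONDS = 50 * 60  # assumed: 50-minute chunks for the model's limit

def split_into_chunks(video_path, out_pattern="chunk_%03d.mp4"):
    """Split a long video into fixed-length chunks without re-encoding."""
    cmd = [
        "ffmpeg", "-i", video_path,
        "-c", "copy",                 # stream copy: no re-encode, cuts at keyframes
        "-f", "segment",
        "-segment_time", str(CHUNK_SECONDS),
        "-reset_timestamps", "1",     # each chunk's timeline starts at 0
        out_pattern,
    ]
    subprocess.run(cmd, check=True)

def global_timestamp(chunk_index, local_seconds):
    """Convert a timestamp inside chunk N (0-indexed) to the original timeline."""
    return chunk_index * CHUNK_SECONDS + local_seconds
```

So if the model says a scene change happens at 0:30 into the third chunk, `global_timestamp(2, 30)` gives the offset in the original two-hour video.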
This just popped up in my news feed:
https://www.theverge.com/2025/1/22/24349299/adobe-premiere-pro-after-effects-media-intelligence-search
Search in Premiere Pro has been updated with AI-powered visual recognition, allowing users to find videos by describing the contents of the footage. Users can enter search terms like “a person skating with a lens flare” to find corresponding clips within their media library.
I managed to figure it out!
I'm doing a pretty simple pipeline in Python.
table.search("query string")
and LanceDB handles embedding the text into the same space as the image embeddings and then doing cosine similarity to find the nearest records!
Last updated: Apr 03 2025 at 23:38 UTC
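For anyone curious what that search is doing under the hood, here is a toy pure-Python sketch of cosine-similarity nearest-record lookup. The 2-d vectors and record fields are made up for illustration; real CLIP-style embeddings have hundreds of dimensions, and LanceDB's own indexing is far more efficient:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search(records, query_vec, top_k=3):
    """Return the top_k records whose embeddings best match the query vector.

    Each record is a dict with an 'embedding' plus metadata such as
    which video and timestamp the frame came from.
    """
    scored = sorted(
        records,
        key=lambda r: cosine_similarity(r["embedding"], query_vec),
        reverse=True,
    )
    return scored[:top_k]
```

With frame metadata stored alongside each embedding, the top hit carries you straight back to the video and rough timestamp of the matching scene.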