Stream: general

Topic: Labeling home videos with LLMs


Dustin (Jan 20 2025 at 20:29):

This is a shot in the dark, but has anyone built, or does anyone know of, an existing library/project that uses LLMs to describe the contents of videos? (Not Whisper transcriptions.) I have many videos from the 90s/00s that are cassette-tape length (~2 hrs each), and I'd love to have a text form I could query, so I can search for something like "boy eating pear" and it will pick out which video that scene was in (and ideally even a rough timestamp for that scene!).

Alternatively, is this even possible with today's tech?

Ron Waldon-Howe (Jan 20 2025 at 23:14):

You could take a look at the bots that Mastodon folks use to add alt-text descriptions to images for the visually impaired.
Maybe use ffmpeg or something to extract an image from your video every so often, and then have those images labelled?
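
Something like this might work for the extraction step (a rough sketch, assuming ffmpeg is installed and on the PATH; the 10-second interval and function name are arbitrary):

    import subprocess

    def extract_frames(video_path: str, out_dir: str, every_seconds: int = 10) -> None:
        # fps=1/N keeps one frame every N seconds; the %06d counter starts at 1,
        # so frame K lands at roughly (K - 1) * every_seconds into the video.
        subprocess.run(
            [
                "ffmpeg", "-i", video_path,
                "-vf", f"fps=1/{every_seconds}",
                f"{out_dir}/frame_%06d.jpg",
            ],
            check=True,
        )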

Don MacKinnon (Jan 21 2025 at 17:00):

Dustin said:

This is a shot in the dark, but has anyone built, or does anyone know of, an existing library/project that uses LLMs to describe the contents of videos? (Not Whisper transcriptions.) I have many videos from the 90s/00s that are cassette-tape length (~2 hrs each), and I'd love to have a text form I could query, so I can search for something like "boy eating pear" and it will pick out which video that scene was in (and ideally even a rough timestamp for that scene!).

Alternatively, is this even possible with today's tech?

There are cloud API services that do this for images; I'm not sure about video. I saw some repos on GitHub, like this one that takes keyframe snapshots: https://github.com/byjlw/video-analyzer. I haven't used any of these, but I'd be interested to hear if anyone else in the community has experience with them. AWS and GCP both have services for images, but I want to try some self-hosted video and image tools for this analysis work.

Dustin (Jan 21 2025 at 22:19):

Keyframes set me off in a great direction, thanks!

Alden (Jan 21 2025 at 22:50):

I’m only familiar with Gemini, but you should be able to chop your videos into 50-minute chunks (I think that’s the limit) with ffmpeg and then ask the model for timestamps of scene changes. Then take those timestamps, use ffmpeg to chop the videos into scenes, and ask Gemini to describe each scene. That way you’d get more manageable videos instead of two-hour-long ones where you don’t know where the part you’re looking for is.
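
Something like this ffmpeg invocation should do the chunking (an untested sketch; the 50-minute figure is just my recollection of Gemini's limit, so double-check it):

    import subprocess

    def split_video(video_path: str, out_pattern: str, chunk_minutes: int = 50) -> None:
        # -f segment slices the input into fixed-length pieces; -c copy avoids
        # re-encoding, so cuts snap to the nearest keyframe rather than being
        # frame-exact. -reset_timestamps 1 makes each chunk start at t=0.
        subprocess.run(
            [
                "ffmpeg", "-i", video_path,
                "-c", "copy",
                "-f", "segment",
                "-segment_time", str(chunk_minutes * 60),
                "-reset_timestamps", "1",
                out_pattern,  # e.g. "chunk_%03d.mp4"
            ],
            check=True,
        )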

James Thurley (Jan 22 2025 at 15:53):

This just popped up in my news feed:
https://www.theverge.com/2025/1/22/24349299/adobe-premiere-pro-after-effects-media-intelligence-search

Search in Premiere Pro has been updated with AI-powered visual recognition, allowing users to find videos by describing the contents of the footage. Users can enter search terms like “a person skating with a lens flare” to find corresponding clips within their media library.

Dustin (Feb 01 2025 at 20:59):

I managed to figure it out!

I'm doing a pretty simple pipeline in Python; there's a rough sketch after the steps below.

  1. Use OpenCV to capture a frame at every second of the video as a JPEG.
  2. I'm using https://lancedb.github.io/lancedb/ for storage. It has really easy built-in support for the CLIP model, so I'm storing the bytes of the JPEG, the CLIP embeddings (computed for me by LanceDB), the path to the video, and the timestamp (captured in the original OpenCV step).
  3. Now it's as easy as table.search("query string"): LanceDB embeds the text into the same space as the image embeddings and then does cosine similarity to find the nearest records!
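
Roughly, the pipeline looks like this (a simplified sketch: it stores a path to each saved frame rather than the raw bytes, the schema and field names are illustrative, and LanceDB's open-clip embedding function needs its open_clip_torch dependency installed):

    import pathlib

    import cv2
    import lancedb
    from lancedb.embeddings import get_registry
    from lancedb.pydantic import LanceModel, Vector

    clip = get_registry().get("open-clip").create()

    class Frame(LanceModel):
        image_uri: str = clip.SourceField()            # LanceDB embeds this image
        vector: Vector(clip.ndims()) = clip.VectorField()
        video_path: str
        timestamp_sec: float

    db = lancedb.connect("frames.lance")
    table = db.create_table("frames", schema=Frame, exist_ok=True)

    def index_video(video_path: str, frame_dir: str = "frames") -> None:
        pathlib.Path(frame_dir).mkdir(exist_ok=True)
        stem = pathlib.Path(video_path).stem
        cap = cv2.VideoCapture(video_path)
        rows, second = [], 0
        while True:
            # Seek to the next whole second and grab that frame.
            cap.set(cv2.CAP_PROP_POS_MSEC, second * 1000)
            ok, frame = cap.read()
            if not ok:
                break
            uri = f"{frame_dir}/{stem}_{second:06d}.jpg"
            cv2.imwrite(uri, frame)
            rows.append({"image_uri": uri, "video_path": video_path,
                         "timestamp_sec": float(second)})
            second += 1
        cap.release()
        if rows:
            table.add(rows)

    # LanceDB embeds the query text with the same CLIP model and runs a
    # nearest-neighbour search over the stored image vectors.
    hits = table.search("boy eating pear").limit(5).to_list()
    for hit in hits:
        print(hit["video_path"], hit["timestamp_sec"])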
