VideoGPT? Let’s Build One!

Mukul Pathak
4 min read · Mar 14, 2024

In the constantly changing landscape of multimedia, video remains a key medium, engaging viewers with its lively presentation of content. Analyzing and manipulating video, however, poses a distinctive challenge. Text and still images can be processed with relative ease by large language models (LLMs) such as ChatGPT, but video is harder because it is a sequence of visual frames, and current AI tools like ChatGPT are not yet equipped to handle video data directly. Google’s Gemini does attempt to address video inputs, but it struggles, particularly with textual content, which shows how early-stage video comprehension in AI still is. To work around these hurdles, I will describe a simple solution for handling video files. The method acts as a bridge: it transforms dynamic video content into a form that AI is better equipped to understand, setting the stage for future advances in LLM technology.

Flow chart showing how VideoGPT is built

The FPS Concept: Unraveling Videos Frame by Frame

At its core, a video is a sequence of images, displayed at a certain rate to convey motion. This rate, known as frames per…
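To make the frame-rate idea concrete, here is a minimal sketch of the arithmetic involved in picking a subset of frames from a video. It assumes we already know the total frame count and the video's native frame rate. The function names and the "samples per second" parameter are hypothetical choices for illustration, not necessarily the exact method used in this article.

```python
def duration_seconds(total_frames: int, native_fps: float) -> float:
    """Length of the video in seconds, given its frame count and frame rate."""
    if total_frames < 0 or native_fps <= 0:
        raise ValueError("total_frames must be >= 0 and native_fps must be > 0")
    return total_frames / native_fps


def sample_frame_indices(total_frames: int, native_fps: float,
                         samples_per_second: float) -> list[int]:
    """Indices of the frames to keep when sampling at a lower rate.

    Hypothetical helper: keeps roughly `samples_per_second` frames for
    every second of video, starting from frame 0.
    """
    if total_frames < 0 or native_fps <= 0 or samples_per_second <= 0:
        raise ValueError("frame count must be >= 0 and rates must be > 0")

    # Distance between kept frames. It is clamped to 1 because we cannot
    # sample faster than the video's own frame rate.
    step = max(native_fps / samples_per_second, 1.0)

    indices = []
    k = 0
    while True:
        index = int(k * step)  # floor to a whole frame number
        if index >= total_frames:
            break
        indices.append(index)
        k += 1
    return indices


# A 3-second clip recorded at 30 frames per second has 90 frames.
print(duration_seconds(90, 30))         # 3.0
print(sample_frame_indices(90, 30, 1))  # [0, 30, 60]
```

Keeping one frame per second reduces the 90-frame clip to 3 images, which is a far more manageable input for a model that works on still images. The right sampling rate is a trade-off: higher rates capture more motion but produce more images to process.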