Temporal Collage Prompting: A Cost-Effective Simulator-Based Driving Accident Video Recognition With GPT-4o

Conference proceedings article

Authors/Editors

CHAKARIDA NUKOOLKIT

Publication Details

Author list: Pratch Suntichaikul, Pittawat Taveekitworachai, Chakarida Nukoolkit, and Ruck Thawonmas

Publication year: 2024

Start page: 1

End page: 6

Number of pages: 6

Languages: English-United States (EN-US)

Abstract

This paper presents temporal collage prompting, a novel approach for detecting and classifying simulator-based driving accident videos using GPT-4o. Recognizing accident videos is crucial for assessing drivers’ abilities and safety, but the task traditionally relies on human labor, which is time-consuming and inefficient, especially when dealing with numerous videos. Large multi-modal models (LMMs) offer a promising way to reduce processing time but are constrained by context window limits when handling video data. We address this by developing a method that optimizes input efficiency while preserving temporal information. Our approach combines multiple video frames into a collage, significantly reducing input tokens. Testing with custom scenarios generated from CARLA, a driving simulator, we achieve 93% accuracy in accident recognition using a 2x2 collage at 1 frame per second (FPS), outperforming a uniform frames baseline method. This configuration reduces token usage by 91% compared to uniform frames at 3 FPS, while improving accuracy from 72% to 93%. Through ablation studies with various collage layouts and sampling rates, we find that lower frame rates, particularly 1 FPS, are more effective for this task. Our results demonstrate that optimized frame sampling and collage creation can enhance both the efficiency and the accuracy of video recognition using LMMs, offering a promising solution for simulator-based driving video recognition in driving assessment, with potential applications in other domains requiring temporal visual recognition. We provide our data and source code for public use.
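To make the idea in the abstract concrete, the following is a minimal sketch of frame sampling followed by collage tiling. It is not the authors' released code. The function names (`sample_indices`, `chunk`, `make_collage`), the row-major (left to right, top to bottom) temporal ordering, the black padding of unfilled cells, and the representation of a frame as a nested list of pixel values are all assumptions made for illustration; the abstract specifies only that frames are sampled at a given FPS and combined into a collage such as 2x2.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")
Frame = List[List[int]]  # illustrative: a grayscale frame as rows of pixel values


def sample_indices(total_frames: int, video_fps: float, target_fps: float) -> List[int]:
    """Return indices of the frames kept when sampling a video at target_fps."""
    if total_frames < 0 or video_fps <= 0 or target_fps <= 0:
        raise ValueError("frame count must be non-negative and FPS values positive")
    # Never sample more densely than the source video (step of at least 1 frame).
    step = max(1.0, video_fps / target_fps)
    indices: List[int] = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


def chunk(items: Sequence[T], size: int) -> List[List[T]]:
    """Split items into consecutive groups of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def make_collage(frames: Sequence[Frame], rows: int, cols: int) -> Frame:
    """Tile up to rows*cols equally sized frames into one image.

    Frames are placed in row-major order (an assumption), so reading the
    collage left to right, top to bottom follows time. Empty cells are
    padded with black (zero) pixels.
    """
    if not frames or len(frames) > rows * cols:
        raise ValueError("need between 1 and rows*cols frames")
    height, width = len(frames[0]), len(frames[0][0])
    blank: Frame = [[0] * width for _ in range(height)]
    cells = list(frames) + [blank] * (rows * cols - len(frames))
    collage: Frame = []
    for r in range(rows):
        row_cells = cells[r * cols:(r + 1) * cols]
        for y in range(height):
            line: List[int] = []
            for cell in row_cells:
                line.extend(cell[y])  # concatenate the y-th pixel row of each cell
            collage.append(line)
    return collage
```

As a usage illustration with hypothetical numbers, a 4-second clip recorded at 30 FPS yields 12 images when sampled uniformly at 3 FPS (`len(sample_indices(120, 30, 3))`), but only one image when sampled at 1 FPS and grouped with `chunk(sample_indices(120, 30, 1), 4)` into a single 2x2 collage. If each image cost a similar number of tokens, which is an assumption and not the paper's stated accounting, that would be roughly 92% fewer images, in the same range as the reported 91% token reduction.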

Keywords

CARLA, GPT-4o, Large multi-modal models, Prompt Engineering