Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

The University of Queensland
ACM Multimedia Dataset Track 2025 Submission
Architecture Diagram

Overview of the SynTVA pipeline and its application.

Abstract

Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores and examines their predictive power for downstream TVR performance. To explore pathways for scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes.
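
To make the correlation and sample-selection ideas in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the SynTVA implementation: the field names (`vqa`, `object_scene`, `action`, `attribute`, `prompt_fidelity`), the 1 to 5 score scale, the use of Spearman rank correlation, the mean-score threshold rule, and all toy values are hypothetical and not taken from the paper.

```python
from math import sqrt

# Hypothetical keys for the four alignment dimensions named in the abstract.
ALIGNMENT_DIMS = ("object_scene", "action", "attribute", "prompt_fidelity")


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_alignment(sample):
    """Average of the four alignment scores for one video-text pair."""
    return sum(sample[d] for d in ALIGNMENT_DIMS) / len(ALIGNMENT_DIMS)


def select_high_utility(samples, threshold):
    """Keep samples whose mean alignment score meets the threshold."""
    return [s for s in samples if mean_alignment(s) >= threshold]


# Toy, made-up annotations (assumed 1-5 scale) with one generic VQA score each.
samples = [
    {"id": "a", "vqa": 0.9, "object_scene": 5, "action": 4, "attribute": 5, "prompt_fidelity": 4},
    {"id": "b", "vqa": 0.4, "object_scene": 2, "action": 3, "attribute": 2, "prompt_fidelity": 1},
    {"id": "c", "vqa": 0.7, "object_scene": 4, "action": 3, "attribute": 4, "prompt_fidelity": 3},
    {"id": "d", "vqa": 0.6, "object_scene": 3, "action": 2, "attribute": 3, "prompt_fidelity": 4},
]

rho = spearman([s["vqa"] for s in samples], [mean_alignment(s) for s in samples])
kept_ids = [s["id"] for s in select_high_utility(samples, threshold=3.5)]
print(f"Spearman rho between VQA and mean alignment: {rho:.3f}")
print("Selected sample ids:", kept_ids)
```

In this sketch, a high rank correlation would suggest that the generic metric tracks the human alignment annotations, and the threshold filter stands in for choosing which synthetic videos to add to a retrieval training set. Both the aggregation by simple mean and the threshold value are placeholders for whatever criteria a real pipeline would use.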

Advertisement

Text Prompt: Sleek robot showcasing innovative features in a high-tech showroom setting.

Cosmos

Mochi

Wan2.1

Animal

Text Prompt: Fluffy puppy playing fetch with owner in the sunny backyard.

Cosmos

Mochi

Wan2.1

Cooking

Text Prompt: Chef chopping vegetables on a cutting board with a sharp knife.

Cosmos

Mochi

Wan2.1

Documentary

Text Prompt: Hillary Clinton delivers presidential speech at a large campaign event stage.

Cosmos

Mochi

Wan2.1

Education

Text Prompt: Instructor explains step-by-step math problem in a classroom setting.

Cosmos

Mochi

Wan2.1

Family

Text Prompt: Young girl singing live on stage with a bright spotlight.

Cosmos

Mochi

Wan2.1

Fashion

Text Prompt: Models showcasing diverse clothing styles on a shiny black runway stage.

Cosmos

Mochi

Wan2.1

Food

Text Prompt: Chef chopping fresh vegetables in a steaming kitchen setting.

Cosmos

Mochi

Wan2.1

Gaming

Text Prompt: Live gamer reacting to unexpected plot twist on game screen.

Cosmos

Mochi

Wan2.1

How-to

Text Prompt: How to fold a detailed origami star on a classroom desk.

Cosmos

Mochi

Wan2.1

Movies

Text Prompt: Actor discussing latest film in a studio with formal attire.

Cosmos

Mochi

Wan2.1

Music

Text Prompt: Watch the band performing live on a crowded concert stage with bright lights.

Cosmos

Mochi

Wan2.1

People

Text Prompt: Smartphone review in a studio with detailed camera feature demonstration.

Cosmos

Mochi

Wan2.1

Politics

Text Prompt: Formal debate on a television set with politicians discussing policies.

Cosmos

Mochi

Wan2.1

Shows

Text Prompt: Host interviewing celebrity guest on live talk show set with bright lights.

Cosmos

Mochi

Wan2.1

Sports

Text Prompt: Fast-paced soccer player dribbling past defenders on a crowded field.

Cosmos

Mochi

Wan2.1

Technology

Text Prompt: Colorful chemicals bubbling in a laboratory beaker during a science demonstration.

Cosmos

Mochi

Wan2.1

Travel

Text Prompt: Motorcycle weaving through crowded city street during nighttime rush hour.

Cosmos

Mochi

Wan2.1

Vehicles

Text Prompt: Review of Audi's smooth drive on a scenic country road.

Cosmos

Mochi

Wan2.1

BibTeX

Will be updated soon