Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores and examines their predictive power for downstream TVR performance. To explore how this annotation could scale, we further develop an Auto-Evaluator that estimates alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes.
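As a rough illustration of what correlating a VQA metric with the alignment scores could look like, the sketch below computes a Spearman rank correlation between one metric and each of the four alignment dimensions. It is not the SynTVA implementation: the choice of Spearman correlation, the function names, the score scale, and all numbers are assumptions made up for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, invented for illustration only: one generic VQA metric and
# annotated scores (assumed 1-5 scale) for five hypothetical videos.
vqa_metric = [0.62, 0.71, 0.55, 0.80, 0.67]
alignment_scores = {
    "Object & Scene": [3, 4, 2, 5, 4],
    "Action": [2, 3, 3, 4, 2],
    "Attribute": [4, 4, 3, 5, 3],
    "Prompt Fidelity": [3, 4, 2, 5, 3],
}

for dimension, scores in alignment_scores.items():
    print(f"{dimension}: rho = {spearman(vqa_metric, scores):.3f}")
```

A rank-based correlation is used here because annotated alignment scores are typically ordinal; a linear correlation or a learned regressor could be substituted under the same interface.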
Advertisement
Text Prompt: Sleek robot showcasing innovative features in a high-tech showroom setting.
Cosmos
Mochi
Wan2.1
Animal
Text Prompt: Fluffy puppy playing fetch with owner in the sunny backyard.
Cosmos
Mochi
Wan2.1
Cooking
Text Prompt: Chef chopping vegetables on a cutting board with a sharp knife.
Cosmos
Mochi
Wan2.1
Documentary
Text Prompt: Hillary Clinton delivers presidential speech at a large campaign event stage.
Cosmos
Mochi
Wan2.1
Education
Text Prompt: Instructor explains step-by-step math problem in a classroom setting.
Cosmos
Mochi
Wan2.1
Family
Text Prompt: Young girl singing live on stage with a bright spotlight.
Cosmos
Mochi
Wan2.1
Fashion
Text Prompt: Models showcasing diverse clothing styles on a shiny black runway stage.
Cosmos
Mochi
Wan2.1
Food
Text Prompt: Chef chopping fresh vegetables in a steaming kitchen setting.
Cosmos
Mochi
Wan2.1
Gaming
Text Prompt: Live gamer reacting to unexpected plot twist on game screen.
Cosmos
Mochi
Wan2.1
How-to
Text Prompt: How to fold a detailed origami star on a classroom desk.
Cosmos
Mochi
Wan2.1
Movies
Text Prompt: Actor discussing latest film in a studio with formal attire.
Cosmos
Mochi
Wan2.1
Music
Text Prompt: Watch the band performing live on a crowded concert stage with bright lights.
Cosmos
Mochi
Wan2.1
People
Text Prompt: Smartphone review in a studio with detailed camera feature demonstration.
Cosmos
Mochi
Wan2.1
Politics
Text Prompt: Formal debate on a television set with politicians discussing policies.
Cosmos
Mochi
Wan2.1
Shows
Text Prompt: Host interviewing celebrity guest on live talk show set with bright lights.
Cosmos
Mochi
Wan2.1
Sports
Text Prompt: Fast-paced soccer player dribbling past defenders on a crowded field.
Cosmos
Mochi
Wan2.1
Technology
Text Prompt: Colorful chemicals bubbling in a laboratory beaker during a science demonstration.
Cosmos
Mochi
Wan2.1
Travel
Text Prompt: Motorcycle weaving through crowded city street during nighttime rush hour.
Cosmos
Mochi
Wan2.1
Vehicles
Text Prompt: Review of Audi's smooth drive on a scenic country road.
Cosmos
Mochi
Wan2.1