BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

Automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity — enabling sustainable monetization for T2V services.

Zihao Zhu¹ (zihaozhu@link.cuhk.edu.cn), Ruotong Wang¹ (ruotongwang1@link.cuhk.edu.cn), Siwei Lyu³ (siweilyu@buffalo.edu), Min Zhang⁴ (zhangmin2021@hit.edu.cn), Baoyuan Wu¹,²* (wubaoyuan@cuhk.edu.cn)

¹ The Chinese University of Hong Kong, Shenzhen  ·  ² Shenzhen Loop Area Institute  ·  ³ State University of New York at Buffalo  ·  ⁴ Harbin Institute of Technology

BrandFusion Examples — seamless brand integration in generated videos

Overview

The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent.


This task poses three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework with two complementary phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents iteratively refine the user prompt, drawing on the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment.
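The offline phase can be read as a per-brand decision: probe what the T2V model already knows about a brand, and fine-tune only when that prior is weak. The Python sketch below illustrates that reading under stated assumptions. `BrandEntry`, `build_knowledge_base`, the `probe_prior` and `train_adapter` callables, the scalar prior score, the 0.5 threshold, the brand names, and the adapter path format are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class BrandEntry:
    """One knowledge-base record (field names are illustrative)."""
    name: str
    profile: str                                  # textual brand description
    adapter_path: Optional[str] = None            # set only if fine-tuning was needed
    experiences: List[str] = field(default_factory=list)  # experience pool


def build_knowledge_base(
    brands: Dict[str, str],
    probe_prior: Callable[[str], float],
    train_adapter: Callable[[str], str],
    threshold: float = 0.5,                       # hypothetical cut-off
) -> Dict[str, BrandEntry]:
    """Probe each brand and train an adapter only for weakly known ones."""
    kb: Dict[str, BrandEntry] = {}
    for name, profile in brands.items():
        entry = BrandEntry(name=name, profile=profile)
        if probe_prior(name) < threshold:
            # Assumed: a low prior score means the model cannot render the brand.
            entry.adapter_path = train_adapter(name)
        kb[name] = entry
    return kb


# Stub usage with made-up brands and scores.
prior_scores = {"KnownCola": 0.9, "NovelBrew": 0.1}
kb = build_knowledge_base(
    brands={"KnownCola": "a soft drink", "NovelBrew": "a new coffee brand"},
    probe_prior=lambda name: prior_scores[name],
    train_adapter=lambda name: f"adapters/{name}.lora",  # placeholder, no training
)
for entry in kb.values():
    print(entry.name, entry.adapter_path)
```

In this toy run only the weakly known brand receives an adapter path, which mirrors the selective nature of the fine-tuning step.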


Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.

Key Contributions

Three fundamental advances in brand-integrated T2V generation

🎯
Contribution 01

Novel Task Definition

We introduce seamless brand integration in T2V generation as a novel task, accompanied by a comprehensive evaluation protocol covering semantic fidelity, brand visibility, and integration naturalness across 270 (brand, prompt) pairs.

🤖
Contribution 02

BrandFusion Framework

We propose BrandFusion, a multi-agent system with offline brand knowledge construction (prior probing + LoRA adaptation) and online collaborative prompt refinement via five specialized agents: Brand Selector, Strategy Generator, Prompt Refiner, Critic, and Experience Learner.

📊
Contribution 03

Extensive Validation

We conduct comprehensive experiments on 18 well-known and 2 novel brands across Veo3, Sora2, and Kling2.1, achieving state-of-the-art performance. Human evaluations confirm superior user satisfaction over all baselines.

Ecosystem of Brand Integration

A sustainable commercial ecosystem connecting brand owners, T2V providers, and end users

BrandFusion Ecosystem — brand owners, T2V providers, and end users

BrandFusion Framework

Two synergistic phases for seamless brand integration: offline knowledge construction and online multi-agent refinement

BrandFusion Framework Overview — offline and online phases
Phase I (Offline): Brand Knowledge Base Construction
Prior knowledge probing plus selective LoRA fine-tuning for novel brands. The knowledge base stores brand profiles, adapters, and experience pools.

Phase II (Online): Multi-Agent Brand Integration
Five LLM-powered agents collaboratively refine prompts through iterative optimization with dual memory: the brand knowledge base and the session context.

Five Agents: Specialized Agent Roles
Brand Selector, Strategy Generator, Prompt Refiner, Critic, and Experience Learner, each powered by GPT-5 with temperature 0.7.
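One way to picture how the five roles could interact is a propose-and-critique loop. The sketch below is an illustration only, not the authors' implementation. Each agent is a plain callable with an assumed signature, no LLM is called, and `Agents`, `Memory`, `integrate_brand`, `max_rounds`, the stub logic, and the brand "AcmeCola" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Agents:
    """The five roles as plain callables. All signatures are assumptions."""
    select_brand: Callable[[str, Dict[str, str]], str]
    generate_strategy: Callable[[str, str, List[str]], str]
    refine_prompt: Callable[[str, str, str], str]
    critique: Callable[[str, str, str], Tuple[bool, str]]
    learn: Callable[[str, str, bool], str]


@dataclass
class Memory:
    """Dual memory: shared brand knowledge plus per-session context."""
    brand_profiles: Dict[str, str]
    experience_pool: Dict[str, List[str]] = field(default_factory=dict)
    session_history: List[str] = field(default_factory=list)


def integrate_brand(prompt: str, agents: Agents, memory: Memory,
                    max_rounds: int = 3) -> str:
    """Refine `prompt` until the critic accepts or rounds run out."""
    brand = agents.select_brand(prompt, memory.brand_profiles)
    hints = list(memory.experience_pool.get(brand, []))
    refined, accepted = prompt, False
    for _ in range(max_rounds):
        strategy = agents.generate_strategy(prompt, brand, hints)
        refined = agents.refine_prompt(prompt, brand, strategy)
        accepted, feedback = agents.critique(prompt, refined, brand)
        memory.session_history.append(refined)
        if accepted:
            break
        hints.append(feedback)  # critic feedback steers the next round
    lesson = agents.learn(brand, refined, accepted)
    memory.experience_pool.setdefault(brand, []).append(lesson)
    return refined


# Toy stubs: the first strategy is rejected, the second is accepted.
def _strategy(prompt: str, brand: str, hints: List[str]) -> str:
    return "background prop" if hints else "replace the main subject"


def _refine(prompt: str, brand: str, strategy: str) -> str:
    if "replace" in strategy:
        return f"An advertisement for {brand}"
    return f"{prompt}, with a can of {brand} on the table"


def _critique(original: str, refined: str, brand: str) -> Tuple[bool, str]:
    ok = original in refined and brand in refined
    return ok, ("" if ok else "keep the original scene")


agents = Agents(
    select_brand=lambda prompt, profiles: next(iter(profiles)),
    generate_strategy=_strategy,
    refine_prompt=_refine,
    critique=_critique,
    learn=lambda brand, refined, ok: f"{'accepted' if ok else 'rejected'}: {refined}",
)
memory = Memory(brand_profiles={"AcmeCola": "a fictional soft drink"})
result = integrate_brand("A cat naps on a sunny windowsill", agents, memory)
print(result)
```

Keeping the agents as injected callables means the loop can be exercised with stubs, as here, before any model-backed agent is wired in.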

Quantitative Results

BrandFusion significantly outperforms all baselines on semantic fidelity and brand integration while maintaining video quality

We evaluate on 270 (brand, prompt) pairs spanning high, medium, and low compatibility across 18 well-known brands and 3 commercial T2V models. BrandFusion's VBench-Quality score stays within 0.01 of the best baseline on every model, while it scores highest on every other reported metric, covering semantic preservation, brand presence rate (BPR), and integration naturalness.

T2V Model  Method               VBench-Quality ↑  CLIPScore ↑  VQAScore ↑  LLMScore ↑  BPR ↑   Naturalness ↑
Veo3       Direct Append        0.8112            0.2671       0.8342      0.7821      0.7221  2.83
Veo3       Template Rewriting   0.8267            0.2842       0.8756      0.9234      0.8845  3.12
Veo3       Single Rewriting     0.8289            0.2956       0.8891      0.9412      0.8968  3.90
Veo3       BrandFusion (Ours)   0.8283            0.3274       0.9098      0.9556      0.9474  4.70
Sora2      Direct Append        0.7945            0.2645       0.8298      0.8756      0.6434  2.71
Sora2      Template Rewriting   0.8033            0.2868       0.8712      0.9187      0.7845  3.84
Sora2      Single Rewriting     0.8029            0.2968       0.8867      0.9368      0.8278  3.98
Sora2      BrandFusion (Ours)   0.8031            0.3177       0.9231      0.9875      0.9066  4.60
Kling2.1   Direct Append        0.7754            0.2634       0.8276      0.8742      0.6989  2.65
Kling2.1   Template Rewriting   0.7803            0.2855       0.8689      0.9165      0.7812  3.73
Kling2.1   Single Rewriting     0.7883            0.2951       0.8823      0.9351      0.8145  3.84
Kling2.1   BrandFusion (Ours)   0.7818            0.3165       0.9208      0.9853      0.8834  4.48
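To make the size of the brand-presence gap concrete, the short Python sketch below recomputes BrandFusion's BPR margin over the strongest baseline for each T2V model. The numbers are copied from the results table. The helper `gain_over_best_baseline` and the variable names are introduced here for illustration, and the comparison is a reader-side calculation rather than a statistic reported by the authors.

```python
from typing import Dict, Tuple

# BPR values copied from the results table above.
bpr: Dict[str, Dict[str, float]] = {
    "Veo3": {
        "Direct Append": 0.7221, "Template Rewriting": 0.8845,
        "Single Rewriting": 0.8968, "BrandFusion": 0.9474,
    },
    "Sora2": {
        "Direct Append": 0.6434, "Template Rewriting": 0.7845,
        "Single Rewriting": 0.8278, "BrandFusion": 0.9066,
    },
    "Kling2.1": {
        "Direct Append": 0.6989, "Template Rewriting": 0.7812,
        "Single Rewriting": 0.8145, "BrandFusion": 0.8834,
    },
}


def gain_over_best_baseline(scores: Dict[str, float],
                            ours: str = "BrandFusion") -> Tuple[str, float]:
    """Return the strongest baseline and the absolute margin of `ours` over it."""
    baselines = {m: s for m, s in scores.items() if m != ours}
    best = max(baselines, key=baselines.get)
    return best, scores[ours] - baselines[best]


gains = {model: gain_over_best_baseline(scores) for model, scores in bpr.items()}
for model, (baseline, delta) in gains.items():
    print(f"{model}: +{delta:.4f} BPR over {baseline}")
```

On these figures the strongest baseline is Single Rewriting for all three models, and the absolute BPR margin falls between roughly 0.05 and 0.08.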

User Study Results

10 participants scored videos on a 1–5 Likert scale across three dimensions

Method            Semantic Consistency  Integration Naturalness  Overall Acceptability
BrandFusion       4.09                  4.14                     4.22
Single Rewriting  2.78                  2.44                     2.99
Template          1.51                  2.21                     2.00
Direct Append     1.31                  1.15                     1.53
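As a quick reader-side summary of the user study, the Python sketch below averages each method's three Likert scores and ranks the methods. The scores are copied from the results above. The unweighted average across dimensions is an assumption made here for illustration and is not a statistic reported on this page.

```python
from statistics import mean
from typing import Dict

# Mean Likert scores (1 to 5) copied from the user study results above.
user_study: Dict[str, Dict[str, float]] = {
    "BrandFusion": {
        "Semantic Consistency": 4.09,
        "Integration Naturalness": 4.14,
        "Overall Acceptability": 4.22,
    },
    "Single Rewriting": {
        "Semantic Consistency": 2.78,
        "Integration Naturalness": 2.44,
        "Overall Acceptability": 2.99,
    },
    "Template": {
        "Semantic Consistency": 1.51,
        "Integration Naturalness": 2.21,
        "Overall Acceptability": 2.00,
    },
    "Direct Append": {
        "Semantic Consistency": 1.31,
        "Integration Naturalness": 1.15,
        "Overall Acceptability": 1.53,
    },
}

# Unweighted average over the three dimensions (an illustrative choice).
averages = {method: mean(dims.values()) for method, dims in user_study.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

for method in ranking:
    print(f"{method}: {averages[method]:.2f}")
```

The ordering of methods is the same under this average as in each individual dimension, with BrandFusion first and Direct Append last.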

Cite This Work

If you find BrandFusion useful for your research, please cite our paper

BibTeX
@article{zhu2026brandfusion,
  title   = {BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation},
  author  = {Zhu, Zihao and Wang, Ruotong and Lyu, Siwei and Zhang, Min and Wu, Baoyuan},
  journal = {arXiv preprint arXiv:2603.02816},
  year    = {2026}
}