MMA-Bench

Some Modalities are More Equal Than Others

Multimodal LLMs often fail the moment their sensory inputs disagree. Despite impressive capabilities, modern MLLMs show strong textual bias, collapse under audio-video conflict, and struggle to identify which modality the user actually wants them to ground their answer in.
MMA-Bench is our effort to systematically expose and fix these failure modes. We design controlled audio-video-text conflict scenarios, evaluate models under stress, inspect their internal attention behaviour, and introduce a lightweight alignment-aware tuning method that restores proper modality grounding.
Our key finding: MLLMs do not naturally know whether to trust sight, sound, or text when those signals conflict, but with targeted supervision they can learn.

Benchmark

Audio-video-text conflicts with paired visual & audio questions per clip.

Diagnostics

Black-box robustness tests + white-box statistical analysis of layer-wise attention.

Alignment Tuning

Modality-selective fine-tuning that teaches models when to trust which cue.

Illustration of conflicting audio-video-text scenarios in MMA-Bench
Authors
Tianle Chen1,*, Chaitanya Chakka1,*, Arjun Reddy Akula2, Xavier Thomas1, Deepti Ghadiyaram1
1 Boston University    2 Google DeepMind
* Equal contribution
Acknowledgement

We thank our collaborators and colleagues for their valuable feedback and support throughout this project. We also respectfully acknowledge that Arjun Reddy Akula participated in an advisory capacity only.

MMA-Bench: Controlled Modality Conflicts

MMA-Bench instantiates a small set of canonical audio-video-text conflict patterns. The explorer below mirrors the hero examples in the paper: each scenario shows a clip, a question, and the correct audio- and video-grounded answers.

Scenario Explorer

Click a scenario to see an example video, the question we ask, and the answers that are correct if the model grounds itself in audio or video.

Scenario

A church bell video with matching bell sounds and neutral text. Both visual and audio questions have consistent answers; models should behave like ideal multimodal reasoners.

Question

What object is repeatedly making sound in this clip?

Audio Answer

A ringing church bell.

Video Answer

A church bell swinging in the tower.

Dataset Curation Pipeline

Step 1 — Ontology-Based Filtering

We simplify the AudioSet ontology by absorbing overly fine-grained leaves, removing ambiguous, abstract, and restricted nodes, and keeping only visually-grounded, action-bearing classes. This produces a compact ontology well-suited for the task.

Two-step ontology-based filtering pipeline
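As a rough illustration of this step, the sketch below prunes the AudioSet ontology JSON, which stores nodes with id, name, child_ids, and restrictions fields. The merge rule and the allow-list of "visually grounded, action-bearing" classes are placeholders, not the paper's actual criteria.

import json

def load_ontology(path="ontology.json"):
    """Load the AudioSet ontology as a dict keyed by node id."""
    with open(path) as f:
        nodes = json.load(f)
    return {n["id"]: n for n in nodes}

def keep_node(node, allowed_names):
    """Drop abstract / blacklisted nodes and anything outside a hand-curated
    allow-list of visually grounded, action-bearing classes (placeholder)."""
    if set(node.get("restrictions", [])) & {"abstract", "blacklist"}:
        return False
    return node["name"] in allowed_names

def prune(ontology, allowed_names):
    """Keep the allowed nodes and absorb their filtered-out children, so
    overly fine-grained leaves collapse into their parent class."""
    kept = {nid: n for nid, n in ontology.items() if keep_node(n, allowed_names)}
    for n in kept.values():
        # Children that were filtered out are absorbed into the parent label.
        n["absorbed_children"] = [c for c in n.get("child_ids", []) if c not in kept]
        n["child_ids"] = [c for c in n.get("child_ids", []) if c in kept]
    return kept

# Example usage with a toy allow-list (hypothetical class names):
# ontology = load_ontology()
# compact = prune(ontology, allowed_names={"Church bell", "Dog", "Speech"})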

Step 2 — MLLM-Based AV Consistency Filter

For each candidate clip, a multimodal LLM judge answers four consistency queries (visual-only, audio-only, and cross-modal) to verify that the same object is both visible and sounding. Clips that pass are then human-verified and used to form aligned / misaligned MMA-Bench pairs.

LLM-based audio-video consistency filtering pipeline
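A minimal sketch of this consistency check follows, assuming a generic judge(video_path, question) callable that wraps whatever MLLM judge is used; the four questions and the yes/no aggregation are illustrative placeholders, not the exact prompts from the paper.

from typing import Callable

# Placeholder consistency queries: visual-only, audio-only, and cross-modal checks.
CONSISTENCY_QUERIES = [
    "Looking only at the frames, what object is the main subject?",        # visual-only
    "Listening only to the audio, what object is producing the sound?",    # audio-only
    "Is the object you see also the object you hear? Answer yes or no.",   # cross-modal
    "Does the sound occur while the object is visible? Answer yes or no.", # cross-modal
]

def passes_av_consistency(video_path: str, judge: Callable[[str, str], str]) -> bool:
    """Keep a clip only if the judge's answers agree that the same object
    is both visible and sounding (simplified aggregation)."""
    visual_obj = judge(video_path, CONSISTENCY_QUERIES[0]).strip().lower()
    audio_obj = judge(video_path, CONSISTENCY_QUERIES[1]).strip().lower()
    same_object = judge(video_path, CONSISTENCY_QUERIES[2]).lower().startswith("yes")
    temporally_aligned = judge(video_path, CONSISTENCY_QUERIES[3]).lower().startswith("yes")
    return same_object and temporally_aligned and visual_obj == audio_obj

# Clips that pass would still go through human verification before entering MMA-Bench.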

Results on MMA-Bench

We evaluate a range of open- and closed-source MLLMs under controlled modality perturbations and report detailed trends across tasks, models, and perturbation types.

Overall Performance on MMA-Bench

Accuracy on audio- and visual-focused questions under aligned and misaligned settings. This table summarizes how often models select the correct modality-specific answer.

Model | Visual Align (%) | Visual Misalign (%) | Audio Align (%) | Audio Misalign (%)
Closed-Source Baselines
Gemini-2.5-Pro | 97.90 | 95.28 | 60.37 | 24.95
Gemini-2.0-Flash | 96.71 | 91.91 | 57.21 | 9.42
Gemini-2.0-Flash-Lite | 94.89 | 94.11 | 59.19 | 4.04
Open-Source Baselines
Qwen3-Omni-30B-Instruct | 92.88 | 83.73 | 57.39 | 14.58
Qwen2.5-Omni-7B (Base) | 76.68 | 58.72 | 46.60 | 25.16
VideoLLaMA2 | 56.35 | 36.11 | 36.12 | 18.46
ChatBridge | 51.64 | 54.71 | 41.61 | 7.07
PandaGPT | 28.75 | 29.79 | 13.12 | 1.18
Qwen2.5-Omni-7B + Ours | 94.68 | 94.37 | 88.14 | 79.79

Benchmarking against State-of-the-Art. Comparison of our fine-tuned model against a wide range of baselines. Bold indicates best performance, underline indicates second best. Our method achieves the highest audio robustness and strong cross-modal consistency under conflict.

Textual Bias & Misleading Captions

Performance drop when we prepend misleading captions or long irrelevant text, showing how strongly models over-trust language compared to audio-visual evidence.

Condition | Visual Prompt (%) | Audio Prompt (%)
(a) Text Misalignment
Qwen2.5-Omni-7B | 37.81 | 11.98
+ Ours (Fine-tuned) | 91.88 (+54.07) | 28.63 (+16.65)
(b) Long Context (10K Tokens)
Qwen2.5-Omni-7B | 63.65 | 34.75
+ Ours (Fine-tuned) | 78.02 (+14.37) | 28.36 (-6.39)

Effect of misleading textual context on Qwen2.5-Omni-7B. Accuracy before and after modality-aware fine-tuning under: (a) text misalignment (incorrect captions), and (b) long-context (10K irrelevant tokens).

Unimodal Ablations

Accuracy when we remove or corrupt one modality at a time (video blacked out or audio muted), revealing brittle integration and lack of abstention.

Condition | Visual Prompt (%) | Audio Prompt (%)
(a) Audio Removed
Qwen2.5-Omni-7B | 71.49 | 54.39
+ Ours (Fine-tuned) | 95.30 (+23.81) | 16.17 (-38.22)
(b) Frames Zeroed
Qwen2.5-Omni-7B | 45.28 | 33.74
+ Ours (Fine-tuned) | 8.51 (-36.77) | 82.52 (+48.78)

Unimodal ablation results for Qwen2.5-Omni-7B. Accuracy (%) before and after modality-aware fine-tuning when one modality is removed.
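The ablations above (audio removed, frames zeroed) can be implemented as simple tensor-level perturbations applied before the clip is passed to a model. Here is a minimal sketch; the array layouts and the run_model helper are assumptions for illustration, not our evaluation code.

import numpy as np

def ablate_modality(frames: np.ndarray, waveform: np.ndarray, mode: str):
    """Return a perturbed (frames, waveform) pair for black-box diagnostics.

    frames:   (T, H, W, 3) uint8 video frames (assumed layout)
    waveform: (num_samples,) float audio signal (assumed layout)
    mode:     "mute_audio" replaces the audio with silence,
              "zero_frames" blacks out every frame.
    """
    if mode == "mute_audio":
        return frames, np.zeros_like(waveform)
    if mode == "zero_frames":
        return np.zeros_like(frames), waveform
    raise ValueError(f"unknown ablation mode: {mode}")

# Hypothetical usage: compare answers on the clean vs. ablated clip.
# clean_answer   = run_model(frames, waveform, prompt)
# ablated_answer = run_model(*ablate_modality(frames, waveform, "mute_audio"), prompt)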

Black-Box Diagnostics Across Models

Use the model selector to inspect how each model behaves under unimodal ablations, semantic AV conflicts, misleading captions, and long-context perturbations. Check out the paper for more in-depth findings!

Model Comparison Explorer

Robustness Profiles

Our tuned model maintains high accuracy under aligned conditions and shows the smallest degradation under semantic AV conflicts and misleading text, indicating better modality selectivity and grounding.

Unimodal Ablations

Unimodal ablation robustness for selected model

Semantic Misalignment

Semantic AV misalignment robustness for selected model

Misleading Captions

Misleading caption robustness for selected model

Long Context

Long-context robustness for selected model

White-Box Attention Diagnostics

Use the model selector to see each model's attention statistics and heatmaps. Check out the paper for more details!

Choose Model for White-Box View

Qwen2.5-Omni: Modality Selectivity

Qwen2.5-Omni shows strong textual dominance but exhibits noticeable shifts between audio and video tokens under modality-specific prompts, especially on misaligned samples.

Cohen's D & Layer-wise Modality Shifts

Cohen's D measures how far apart two attention distributions are (in standard-deviation units): attention under a visual prompt vs attention under an audio prompt.

D > 0 · visual-prompt attention is higher
D < 0 · audio-prompt attention is higher
|D| large · stronger modality shift
Cohen's D for video tokens

Video tokens: D > 0 → higher attention under the visual prompt.

Cohen's D for audio tokens

Audio tokens: D < 0 → higher attention under the audio prompt.
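For reference, a minimal sketch of how such an effect size can be computed from per-token attention values; the variable names and the pooled-standard-deviation convention are assumptions about implementation details not spelled out on this page.

import numpy as np

def cohens_d(attn_visual_prompt: np.ndarray, attn_audio_prompt: np.ndarray) -> float:
    """Effect size between two attention distributions over the same token group
    (e.g., video tokens) under a visual prompt vs. an audio prompt.

    Positive D -> higher attention under the visual prompt.
    Negative D -> higher attention under the audio prompt.
    """
    mean_diff = attn_visual_prompt.mean() - attn_audio_prompt.mean()
    # Pooled standard deviation (one common convention for Cohen's d).
    n1, n2 = len(attn_visual_prompt), len(attn_audio_prompt)
    pooled_var = ((n1 - 1) * attn_visual_prompt.var(ddof=1)
                  + (n2 - 1) * attn_audio_prompt.var(ddof=1)) / (n1 + n2 - 2)
    return float(mean_diff / np.sqrt(pooled_var))

# Layer-wise profile: one D value per layer for, e.g., the video-token group.
# d_per_layer = [cohens_d(vis_attn[l], aud_attn[l]) for l in range(num_layers)]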

Attention Heatmaps

Representative last-layer heatmaps for visual, audio, and text tokens. These illustrate where the model actually attends when different modalities are emphasized or misaligned.

Attention heatmaps for Qwen
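One way such per-modality attention statistics can be gathered from a Hugging Face-style backbone is sketched below. The spans argument (index ranges for video, audio, and text tokens) is an assumption about how the tokenized sequence is laid out, and this is not the paper's exact analysis code.

import torch

@torch.no_grad()
def modality_attention_shares(model, inputs, spans):
    """Average attention mass placed on each modality's token span, per layer.

    spans: dict like {"video": (v0, v1), "audio": (a0, a1), "text": (t0, t1)}
           giving [start, end) index ranges in the input sequence (assumed
           to be known from the model's preprocessor).
    """
    outputs = model(**inputs, output_attentions=True)
    shares = []
    for layer_attn in outputs.attentions:      # (batch, heads, query, key)
        attn = layer_attn.mean(dim=(0, 1))     # average over batch and heads
        row = attn.mean(dim=0)                 # average over query positions
        shares.append({m: row[s:e].sum().item() for m, (s, e) in spans.items()})
    return shares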

What Do We Learn from MMA-Bench?

We combine black-box evaluation with white-box attention analysis to understand how models actually integrate modalities, not just how they score on classic benchmarks.

When we remove or ablate a modality (e.g., silence the audio or zero out frames), MLLMs rarely degrade gracefully. Some models lean heavily on vision, others on text, and almost none abstain when the requested evidence is missing.

Text tokens absorb most of the attention mass, with visual tokens next and audio last. Misleading captions or long irrelevant context can completely override clear audio-visual cues, mirroring the strong textual attention dominance we observe inside the model.

Using effect sizes between attention distributions under visual vs. audio prompts, we find that models slightly reallocate attention toward the prompted modality, especially in deeper layers — but the shifts are too weak for robust reasoning under conflict until we apply alignment-aware tuning.

Alignment-Aware Tuning

Training Pipeline Overview

🎞️ Filtered training data
🧩 Process videos into 8 FPS, 504x504 frames and generate QA pairs (video- & audio-focused)
⚙️ LoRA SFT on W_Q, W_V (backbone frozen)

We start from the AudioSet training split, pass it through the same two-stage ontology-based curation used for MMA-Bench (skipping the MLLM filtering and manual inspection stages), then construct paired visual- and audio-focused questions and fine-tune with lightweight LoRA adapters.
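As a rough sketch of the final step, this is how LoRA adapters restricted to the attention query and value projections could be configured with the Hugging Face peft library; the rank, scaling, dropout, and module names (q_proj, v_proj) are illustrative assumptions, not the paper's exact hyperparameters.

from peft import LoraConfig, get_peft_model

# Illustrative LoRA settings: adapt only the attention query/value projections
# (the W_Q and W_V matrices) while the rest of the backbone stays frozen.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed value)
    lora_alpha=32,                         # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # module names as used in Qwen-style backbones
    task_type="CAUSAL_LM",
)

# model = <omni backbone loaded elsewhere, omitted here>
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()  # only the LoRA adapters are trainable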

What a Training Sample Looks Like

🎞️
Video preprocessing

Each clip is rescaled and center-cropped to 504x504, matching Qwen's TMRoPE patch grid. Frames are sampled at 8 FPS to form a compact visual token sequence used in training.
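A minimal sketch of this preprocessing using OpenCV; the resize-then-center-crop order and helper names are assumptions about details not specified on this page.

import cv2
import numpy as np

def preprocess_clip(path: str, target_fps: int = 8, size: int = 504) -> np.ndarray:
    """Sample frames at roughly target_fps and center-crop each to size x size."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            h, w = frame.shape[:2]
            scale = size / min(h, w)                      # rescale shorter side to `size`
            frame = cv2.resize(frame, (int(round(w * scale)), int(round(h * scale))))
            h, w = frame.shape[:2]
            top, left = (h - size) // 2, (w - size) // 2  # center crop
            frames.append(frame[top:top + size, left:left + size])
        idx += 1
    cap.release()
    return np.stack(frames)  # (T, 504, 504, 3)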

Pairing QA prompts with samples

For every clip we create two short QA pairs: one visual-focused question about what is seen, and one audio-focused question about what is heard. The same video is reused, but each prompt explicitly asks the model to ground its answer in a single modality.

📊
Training set size

From the curated AudioSet train split, we obtain roughly 26k modality-specified QA pairs, pairing every clip with both a visual- and an audio-focused question.
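For concreteness, a hypothetical training sample built this way might look like the following; the file path, field names, and wording are illustrative, not the dataset's actual schema.

# Two modality-specified QA pairs for the same (hypothetical) clip.
training_samples = [
    {
        "video": "audioset/train/church_bell_001.mp4",
        "prompt": "Answer using only what you SEE in the video: "
                  "what object is the main subject of this clip?",
        "answer": "A church bell swinging in the tower.",
        "grounding_modality": "video",
    },
    {
        "video": "audioset/train/church_bell_001.mp4",
        "prompt": "Answer using only what you HEAR in the audio: "
                  "what object is producing the sound in this clip?",
        "answer": "A ringing church bell.",
        "grounding_modality": "audio",
    },
]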

Attention Reallocation Effect

Layer-wise Cohen's D for video tokens in the misaligned setting, before and after tuning.
Maximum Cohen's D before tuning: 0.6 · Maximum Cohen's D after tuning: 1.6

The jump in Cohen's D magnitudes after training shows that alignment-aware tuning changes internal attention behaviour: the model reallocates substantially more attention toward the queried modality instead of keeping mixed, low-contrast patterns.

Qualitative Demo Gallery

Semantic AV misalignment examples where our tuned model demonstrates robust grounding in the requested sensory input. Each clip is shown once; the rows below compare the baseline Qwen2.5-Omni-7B prediction to our alignment-aware tuned model.

Semantic Misalignment Examples

Toggle between audio-prompt and visual-prompt questions. Text-prompt demos coming soon.

Resources & BibTeX

Links will be updated as the project is released. For now, you can cite the MMA-Bench paper as:

BibTeX

@misc{chen2025modalitiesequalothersdecoding,
  title={Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs},
  author={Tianle Chen and Chaitanya Chakka and Arjun Reddy Akula and Xavier Thomas and Deepti Ghadiyaram},
  year={2025},
  eprint={2511.22826},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.22826},
}

Contact

Questions about MMA-Bench, code, or collaboration? Reach out to any of us below.

Tianle Chen
tianle@bu.edu
Chaitanya Chakka
chvskch@bu.edu
Deepti Ghadiyaram
dghadiya@bu.edu