MMA-Bench · Project Page
Some Modalities are More Equal Than Others
Multimodal LLMs often fail the moment their sensory inputs disagree.
Despite impressive capabilities, modern MLLMs show a strong textual bias, collapse under audio-video conflict,
and struggle to identify which modality the user actually wants them to ground their answer in.
MMA-Bench is our effort to systematically expose and fix these failure modes.
We design controlled audio-video-text conflict scenarios, evaluate models under stress, inspect their internal
attention behaviour, and introduce a lightweight alignment-aware tuning method that restores proper modality
grounding.
Our key finding: MLLMs do not inherently know whether to trust sight, sound, or text when those inputs
conflict, but with targeted supervision, they can learn.
Benchmark
Audio-video-text conflicts with paired visual & audio questions per clip (see the example item below).
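
To make the item format concrete, here is a minimal sketch of what one conflict item might look like; every field name and path below is our illustration, not the released schema.

```python
# Hypothetical layout of one MMA-Bench conflict item. Field names and
# file paths are illustrative, not the released data format.
from dataclasses import dataclass

@dataclass
class ConflictItem:
    clip_id: str
    video_path: str       # visual stream of the clip
    audio_path: str       # audio stream, deliberately mismatched with the video
    visual_question: str  # answerable only from the video
    audio_question: str   # answerable only from the audio
    visual_answer: str
    audio_answer: str

item = ConflictItem(
    clip_id="clip_0001",
    video_path="clips/clip_0001.mp4",
    audio_path="clips/clip_0001_swapped.wav",
    visual_question="What instrument is the person playing?",
    audio_question="What instrument do you hear?",
    visual_answer="guitar",
    audio_answer="piano",
)
```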
Diagnostics
Black-box robustness tests plus white-box statistical analysis of layer-wise attention (sketched below).
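
The white-box side can be pictured as a per-layer probe of where attention mass goes. A minimal sketch, assuming the model returns per-layer attention tensors of shape (batch, heads, query, key) and that the key-token span of each modality is known; all names here are hypothetical.

```python
# Illustrative white-box probe: the share of attention mass each layer
# assigns to the video, audio, and text token spans.
import torch

def modality_attention_mass(attentions, spans):
    """attentions: list of (batch, heads, query, key) tensors, one per layer.
    spans: dict mapping modality name -> (start, end) key-token range."""
    per_layer = []
    for layer_attn in attentions:
        # Average over batch, heads, and query positions -> (key,)
        mass = layer_attn.mean(dim=(0, 1, 2))
        total = mass.sum()
        per_layer.append({name: (mass[s:e].sum() / total).item()
                          for name, (s, e) in spans.items()})
    return per_layer  # one {modality: share} dict per layer

# Example: spans = {"video": (0, 256), "audio": (256, 384), "text": (384, 420)}
```

Tracking these shares across layers is one way to make a textual bias directly visible: text tokens absorbing a disproportionate share of attention even when the question targets the audio stream.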
Alignment Tuning
Modality-selective fine-tuning that teaches models when to trust which cue (sketched below).
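
A minimal sketch of the tuning idea, assuming a HuggingFace-style causal-LM interface: each conflicted clip yields two supervised examples, one instructing the model to answer from the video and one from the audio, with the loss applied only to the answer tokens. The media encoder inputs are elided for brevity, and every name below is illustrative rather than the method's actual implementation.

```python
import torch

def alignment_tuning_step(model, tokenizer, batch, optimizer):
    """One update over a batch of ConflictItem objects (media features elided)."""
    losses = []
    for item in batch:
        for modality in ("visual", "audio"):
            prompt = (f"Answer using only the {modality} stream. "
                      f"{getattr(item, modality + '_question')} ")
            target = getattr(item, modality + "_answer")
            enc = tokenizer(prompt + target, return_tensors="pt")
            labels = enc["input_ids"].clone()
            # Supervise only the answer tokens, not the instruction
            # (token-boundary alignment is approximate in this sketch).
            prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
            labels[:, :prompt_len] = -100
            out = model(**enc, labels=labels)
            losses.append(out.loss)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The intuition behind pairing both instructions on the same clip: the model cannot minimise the loss by defaulting to a favourite modality; it has to learn to read the instruction and ground accordingly.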