ArXiv preprint

Cross-Modal Typographic Attacks

We study how semantically targeted perturbations delivered through audio, visual text, and text prompts influence audio-visual MLLMs. The paper introduces Multi-Modal Typography and shows that spoken injection is a realistic and effective attack channel, with stronger failures under aligned multi-modal perturbations.

Tianle Chen · Deepti Ghadiyaram
Department of Computer Science, Boston University
{tianle, dghadiya}@bu.edu
Audio typography
Visual typography
Text prompt injection
Boston University authors

This page highlights the main quantitative findings, attack settings, ablations, and safety implications from the paper in a concise public-release format.

Headline result 83.43% ASR

Aligned audio + visual perturbations are markedly stronger than single-modality audio attacks.

Audio only: 34.93% ASR · Aligned AV: 83.43% ASR
Figure showing clean audio-video input, injected audio and visual distractors, and resulting model prediction shift.
Safety stress test 8.04% harmful detection

Prompt-style benign speech can suppress harmful-content detection in the MetaHarm evaluation slice.

WorldSense 64.03%

Standalone spoken typography steers Qwen2.5-Omni-7B strongly on audio-visual questions.

Abstract and main contributions

Multi-Modal Typography studies how semantically matched perturbations across audio, visual, and text channels can redirect audio-visual reasoning. The central question is whether MLLMs treat semantically similar signals consistently across modalities, or whether the delivery channel fundamentally changes model behavior.

Speech is a native and realistic attack surface for audio-visual MLLMs. Injected spoken cues can steer audio-grounded tasks, spill over into visually grounded tasks, and become substantially stronger when paired with aligned visual perturbations.
Audio as typography: The paper treats synthesized speech as a primary typographic modality and injects short spoken target cues while keeping the visual stream unchanged.
Cross-modal fragility: Attacks are evaluated across audio, visual, and text channels to measure how different delivery modalities propagate adversarial semantics.
Multi-modal amplification: Aligned audio-visual attacks are stronger than single-modality perturbations, showing that semantic agreement across channels amplifies failure.
Safety relevance: The study extends beyond benchmark accuracy to content moderation, showing that benign spoken cues can reduce harmful-content detection.

Unimodal audio attack

Standalone spoken typography on WorldSense.

64%

On WorldSense, Qwen2.5-Omni-7B reaches 64.03% targeted ASR under spoken semantic injection.

Cross-modal impact

Visual questions can still be affected by speech.

12.85%

On MMA-Bench visual questions, injected speech reduces Qwen2.5-Omni-7B accuracy by 12.85% even though the visual input is unchanged.

Aligned multi-modal attack

Coordinated audio and visual perturbations.

83%

Aligned audio-visual perturbations reach 83.43% ASR on MMA-Bench audio questions for Qwen2.5-Omni-7B.

Audio typography is effective on its own and markedly stronger when aligned across modalities.

These summary cards highlight the strongest quantitative results from the paper across aligned attacks, unimodal spoken attacks, cross-modal spillover, and safety evaluation.

Aligned multimodal attack
83.43%

On MMA-Bench audio questions for Qwen2.5-Omni-7B, aligned audio + visual perturbations outperform audio-only and visual-only attacks.

Standalone audio typography
64.03%

Audio injection alone reaches strong targeted attack success on WorldSense, showing that speech is already a powerful channel.

Cross-modal spillover
12.85%

Speech hurts more than audio-centric reasoning: it also reduces accuracy on visually focused questions in MMA-Bench.

Safety degradation
8.04%

Under stronger prompt-style benign speech, harmful-content detection on the MetaHarm slice falls sharply for Qwen2.5-Omni-7B.

Attack channels

The study compares semantically targeted perturbations delivered through three channels: spoken audio, on-screen visual text, and text prompts.

Audio typography

Spoken target cues serve as the native adversarial channel: short TTS-rendered phrases are mixed into the soundtrack while the visual stream is left untouched.

TTS injection · Volume · Repetition · Timing
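
To make these controls concrete, here is a minimal Python sketch of spoken-cue injection. It is not the paper's released code: it assumes numpy and soundfile are available and that the cue waveform has already been synthesized by any TTS system.

import numpy as np
import soundfile as sf

def inject_spoken_cue(audio, sr, cue, gain_db=0.0, repeats=1, offset_s=0.0, spacing_s=0.5):
    # Mix a spoken target cue into a mono soundtrack.
    #   gain_db   : relative loudness of the cue (volume control)
    #   repeats   : how many times the cue is inserted (repetition control)
    #   offset_s  : where the first insertion starts (timing control)
    #   spacing_s : silence between consecutive insertions
    out = audio.copy()
    scale = 10.0 ** (gain_db / 20.0)               # dB -> linear amplitude
    start = int(offset_s * sr)
    step = len(cue) + int(spacing_s * sr)
    for _ in range(repeats):
        if start >= len(out):
            break
        end = min(start + len(cue), len(out))
        out[start:end] += scale * cue[: end - start]
        start += step
    return np.clip(out, -1.0, 1.0)                 # avoid clipping artifacts

# Hypothetical usage (file names are placeholders):
# audio, sr = sf.read("clip.wav")
# cue, _ = sf.read("spoken_target.wav")            # TTS rendering of the target phrase
# sf.write("clip_attacked.wav", inject_spoken_cue(audio, sr, cue, gain_db=6, repeats=3), sr)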

Visual typography

Overlaid on-screen text is the most legible visual cue: the target phrase is rendered directly onto the frames, enabling prompt hijacking and scene overrides.

On-screen text · Prompt hijack · Scene override
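
As an illustration of the visual channel, the sketch below overlays a target phrase on a single frame with Pillow; the fonts, colors, and placements used in the paper are not reproduced here.

from PIL import Image, ImageDraw, ImageFont

def overlay_text(frame, text, xy=(20, 20), font_size=36):
    # Render the target phrase onto a copy of the frame.
    out = frame.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    try:
        font = ImageFont.truetype("DejaVuSans-Bold.ttf", font_size)  # any available .ttf
    except OSError:
        font = ImageFont.load_default()                              # fallback bitmap font
    draw.text(xy, text, fill=(255, 255, 0), font=font)               # high-contrast overlay
    return out

# Hypothetical usage on an extracted frame:
# frame = Image.open("frame_0001.png")
# overlay_text(frame, "a cat meowing").save("frame_0001_attacked.png")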

Text prompt channel

Prompt text completes the triad: the same target semantics are delivered as instruction text, creating cross-modal conflict with the grounded audio-visual evidence.

Instruction text · Target semantics · Cross-modal conflict
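
The three channels can then be combined into the attack settings compared in the paper. The sketch below is only an expository way to organize aligned and conflicting conditions; the field names and target phrase are assumptions, not the paper's configuration format.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AttackCondition:
    spoken_cue: Optional[str]     # audio typography (TTS-injected speech)
    overlay_text: Optional[str]   # visual typography (on-screen text)
    prompt_text: Optional[str]    # text channel (instruction in the prompt)

TARGET = "a cat meowing"          # illustrative injected target

conditions = {
    "clean":       AttackCondition(None, None, None),
    "audio_only":  AttackCondition(TARGET, None, None),
    "visual_only": AttackCondition(None, TARGET, None),
    "aligned_av":  AttackCondition(TARGET, TARGET, None),           # channels agree
    "conflicting": AttackCondition(TARGET, "a dog barking", None),  # channels disagree
}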

Method pipeline

The core experimental workflow starts from clean audio-video inputs, injects controlled target semantics, and measures both accuracy degradation and targeted steering.

1

Start from clean audio-video inputs

Each trial begins from an unmodified audio-video clip and its question, establishing the clean-accuracy baseline before any perturbation.

2

Inject spoken target cues

A TTS-rendered target cue is mixed into the soundtrack, with controlled volume, repetition, temporal placement, and voice identity.

3

Compose cross-modal conflicts

Audio, visual, and text perturbations are composed into aligned and conflicting settings to measure how semantic agreement across channels changes attack strength.

4

Evaluate ACC and ASR

Both accuracy degradation (ACC) and targeted attack success rate (ASR) are measured, emphasizing targeted redirection rather than only overall degradation; a metric sketch follows after this pipeline.

5

Stress safety-sensitive moderation

The same spoken cues are then applied to safety-sensitive content moderation, previewing why the vulnerability matters beyond benchmark performance.
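
For reference, one common way to compute the two reported metrics is sketched below; the paper's exact definitions of accuracy and targeted ASR may differ in detail, and the prediction lists here are placeholders.

def accuracy(preds, labels):
    # Fraction of questions answered with the ground-truth label.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def targeted_asr(attacked_preds, targets, labels):
    # Fraction of questions redirected to the injected target (and away from the truth).
    hits = sum(p == t and t != y for p, t, y in zip(attacked_preds, targets, labels))
    return hits / len(labels)

# clean_acc    = accuracy(clean_preds, labels)
# attacked_acc = accuracy(attacked_preds, labels)   # degradation = clean_acc - attacked_acc
# asr          = targeted_asr(attacked_preds, targets, labels)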

Selected figures

These figures summarize the main ablations: parameter sensitivity, effectiveness-stealth trade-offs, and prediction redistribution under stronger attacks.

Figure summarizing audio-visual typography takeaways and parameter sensitivity across volume, insertion position, repetition, and voice identity.
Figure focus: how volume, insertion position, repetition, and voice identity affect targeted attack strength.
Figure plotting effectiveness-stealth trade-off for audio questions with relative RMS and speech recognition shift.
Trade-off view: Volume is strongest but least stealthy, while repetition offers a more favorable effectiveness-stealth balance (a relative-RMS sketch follows after these figures).
WorldSense parameter sensitivity figure for Qwen2.5-Omni-7B showing targeted ASR and label accuracy changes.
WorldSense stress test: Gain and repetition remain the dominant controls even in longer, speech-rich videos.
Prediction redistribution figure showing the effect of gain variation on ground-truth and injected target predictions.
Prediction redistribution: stronger attacks reallocate predictions toward the injected target rather than causing only generic corruption.
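
For the stealth axis referenced in the trade-off figure, a relative-RMS style measure can be computed as below. This is the standard RMS ratio in decibels, may differ from the paper's exact definition, and assumes the clean and attacked tracks are sample-aligned.

import numpy as np

def relative_rms_db(original, attacked):
    # Loudness of the injected component relative to the original soundtrack, in dB.
    injected = attacked[: len(original)] - original         # isolate the added cue
    rms = lambda x: np.sqrt(np.mean(np.square(x)) + 1e-12)  # small epsilon avoids log(0)
    return 20.0 * np.log10(rms(injected) / rms(original))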

Experimental scope

The paper evaluates multiple frontier MLLMs, multiple benchmarks, modality-specific question splits, and safety-oriented test settings.

Evaluated models

Qwen2.5-Omni-7B · Qwen3-Omni-30B · PandaGPT · ChatBridge · Gemini-2.5-Flash-Lite · Gemini-3.1-Flash-Lite-preview

The evaluation includes open and closed frontier models with different multimodal architectures and varying levels of speech capability.

Benchmarks and slices

MMA-Bench · Music-AVQA · WorldSense · MetaHarm · I2P · Visual questions · Audio questions · Audio-visual questions

Benchmarks include modality-partitioned QA settings, general audio-visual reasoning, and safety-oriented evaluations.

Safety implications

The paper concludes with content-moderation results showing that spoken benign cues can weaken harmful-content detection even when the harmful visual evidence remains present.

MetaHarm result

For Qwen2.5-Omni-7B, harmful-content detection on MetaHarm drops from 26.16% on clean inputs to 20.41% with a keyword-style benign cue and 8.04% with a stronger prompt-style benign cue.

26.16% Clean harmful-content detection on the MetaHarm slice.
20.41% After a keyword-style benign spoken cue.
8.04% After a stronger prompt-style benign spoken cue.

Interpretation

These results show that spoken semantics can override grounded visual evidence in safety-sensitive settings, not only in standard benchmark classification.


Resources

Direct links to the paper and supplementary release materials.

Project links

Links to the paper and supplementary materials for the public release.

ArXiv preprint: Primary public entry point for the paper and abstract.
Open preprint
Paper PDF: Direct PDF access for readers who want the full manuscript immediately.
Open PDF
Code: Repository for attack generation, evaluation, and visualization.
Coming soon
Data: Benchmark splits, qualitative examples, and supplementary release materials.
Coming soon

Citation

Please cite the paper using the BibTeX entry below. Placeholder links will be replaced with the final arXiv, code, and dataset URLs when they become available.

BibTeX
Contact Department of Computer Science, Boston University
Email {tianle, dghadiya}@bu.edu