MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs

¹Beijing University of Posts and Telecommunications  ²University of California, Santa Barbara  ³MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences

Top: Previous methods often assume a single misinformation source and conduct single-source detection. Bottom: We combine generative models and AI tools to build a mixed-source multimodal misinformation benchmark and achieve mixed-source detection.

Abstract

Current multimodal misinformation detection (MMD) methods often assume a single source and type of forgery for each sample, which is insufficient for real-world scenarios where multiple forgery sources coexist. The lack of a benchmark for mixed-source misinformation has hindered progress in this field.

To address this, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench covers three critical distortion sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, together with 12 sub-categories of misinformation forgery types.
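For concreteness, a minimal sketch of how a mixed-source benchmark sample and its source-level annotation could be represented; the field and class names here are hypothetical and may differ from the released dataset schema.

```python
from dataclasses import dataclass
from enum import Enum

class DistortionSource(Enum):
    # The three top-level distortion sources defined in MMFakeBench.
    TEXTUAL_VERACITY = "textual_veracity_distortion"
    VISUAL_VERACITY = "visual_veracity_distortion"
    CROSS_MODAL = "cross_modal_consistency_distortion"
    NONE = "real"  # genuine image-text pair

@dataclass
class MMFakeSample:
    """One image-text pair with its misinformation annotation (hypothetical schema)."""
    image_path: str
    text: str
    label: str                 # "real" or "fake"
    source: DistortionSource   # which distortion source (if any) is present
    subcategory: str           # one of the 12 fine-grained forgery types
```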

We further conduct an extensive evaluation of 6 prevalent detection methods and 15 large vision-language models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting.

Additionally, we propose MMD-Agent, an innovative unified framework that integrates the rationales, actions, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization.

We believe this study will catalyze future research into more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.

Method


We present a simple yet effective framework called MMD-Agent, which integrates the rationales, actions, and tool-use capabilities of LVLM agents. MMD-Agent involves two main processes: (1) Hierarchical decomposition and (2) Integration of multi-perspective rationales.

We first instruct LVLMs to decompose mixed-source multimodal misinformation detection into three smaller subtasks: textual veracity checking, visual veracity checking, and cross-modal consistency reasoning. In the second stage, each subtask is addressed by generating multi-perspective rationales and integrating them with model actions to support decision-making, as sketched below.
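A minimal sketch of the hierarchical decomposition, assuming an LVLM exposed as a simple callable; the prompts, the `lvlm` wrapper, and the aggregation rule are illustrative and not the released implementation.

```python
# Hierarchical decomposition: each subtask is queried separately, and the final
# verdict is "fake" if any subtask flags a distortion. `lvlm` is assumed to be
# a callable that takes an image and a prompt and returns a text answer.

SUBTASKS = {
    "textual_veracity": "Check whether the claim in the text is factually true.",
    "visual_veracity": "Check whether the image shows signs of generation or manipulation.",
    "cross_modal_consistency": "Check whether the image and text describe the same content consistently.",
}

def mmd_agent(lvlm, image, text):
    verdicts = {}
    for name, instruction in SUBTASKS.items():
        prompt = (
            f"{instruction}\nText: {text}\n"
            "Answer 'consistent' or 'distorted' and give a short rationale."
        )
        verdicts[name] = lvlm(image=image, prompt=prompt)
    # A sample is predicted fake if any subtask reports a distortion.
    is_fake = any("distorted" in v.lower() for v in verdicts.values())
    return ("fake" if is_fake else "real"), verdicts
```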

The rationales provide useful evidence by reasoning within the current context, addressing the need to detect diverse clues in misinformation. They can cover multiple perspectives, such as textual key-entity extraction (Rationale 1), retrieved-knowledge injection (Rationale 2), factual analysis (Rationale 3), and commonsense analysis (Rationales 5 and 7). The model is guided to generate a reasoning path and make a decision for each subtask individually, leveraging its inherent understanding and capabilities.
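To illustrate how rationales and actions interleave within one subtask, here is a sketch of a textual veracity check; the `lvlm` and `search_tool` interfaces and the specific prompts are assumptions for illustration only.

```python
# Sketch of one subtask step: generate rationales from several perspectives,
# take an action (e.g., call a retrieval tool), then decide for this subtask.

def textual_veracity_check(lvlm, search_tool, text):
    # Rationale 1: extract the key entities and the central claim from the text.
    entities = lvlm(prompt=f"List the key entities and the central claim in: {text}")
    # Action: retrieve external evidence about the extracted claim.
    evidence = search_tool(query=entities)
    # Rationale 2: inject the retrieved knowledge and reason over it.
    analysis = lvlm(prompt=(
        "Given the claim and the retrieved evidence, reason step by step about "
        f"whether the claim is factual.\nClaim: {text}\nEvidence: {evidence}"
    ))
    # Subtask decision conditioned on the accumulated rationales.
    decision = lvlm(prompt=f"Based on this analysis, answer 'true' or 'false':\n{analysis}")
    return decision, [entities, analysis]
```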

Experiments


Overall results (%) of different models on the MMFakeBench validation and test sets, comparing the standard prompt (SP) with the proposed MMD-Agent framework.

Visualization Examples

Examples for Textual Veracity Distortion


Examples for Visual Veracity Distortion


Examples for Cross-modal Consistency Distortion


BibTeX

@article{liu2024mmfakebench,
  title={MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs},
  author={Liu, Xuannan and Li, Zekun and Li, Peipei and Xia, Shuhan and Cui, Xing and Huang, Linzhi and Huang, Huaibo and Deng, Weihong and He, Zhaofeng},
  journal={arXiv preprint arXiv:2406.08772},
  year={2024}
}