A Study of Multimodal Perception Mechanisms for Target Detection Accuracy Enhancement in Computer Vision

Tao Li

Abstract

In this paper, a multimodal perception framework is constructed to comprehensively perceive complex environments by integrating data from different modalities. The target detection framework consists of five modules, including a feature fusion block and a feature enhancement module, which fuse visual and textual features. Multi-scale text features are extracted by constructing line-level text embedding maps and converting them into 2D feature maps. During feature extraction, text features are incorporated at multiple stages to achieve deep fusion of multimodal features. In addition, an attention mechanism and contextual information are introduced to refine target features and strengthen detection in complex scenes. The results show that the average accuracy of the multimodal perception mechanism exceeds 95%, with an equilibrium point of approximately 0.96. Under color interference, the average accuracy of multimodal perception is 93.28%, and the mAP improves significantly to 95.4% when the attention mechanism and contextual information are introduced. These results verify that, for target detection in computer vision, the multimodal perception mechanism achieves the highest detection accuracy and the best enhancement performance among contemporaneous methods.
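The core idea, converting line-level text embeddings into a 2D feature map and fusing it with visual features under an attention weighting, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiling layout, the element-wise fusion, and the channel-attention form are all assumptions for demonstration purposes.

```python
import numpy as np

def text_to_feature_map(line_embeddings, height, width):
    """Tile line-level text embeddings into a 2D feature map (H, W, dim).

    Each line's embedding is repeated over a band of rows and broadcast
    across the width, so the text map aligns with the visual feature grid.
    (Hypothetical layout; the paper's exact mapping is not specified.)
    """
    num_lines, dim = line_embeddings.shape
    rows_per_line = max(height // num_lines, 1)
    rows = np.repeat(line_embeddings, rows_per_line, axis=0)[:height]
    if rows.shape[0] < height:  # pad if the lines do not fill the grid
        pad = np.zeros((height - rows.shape[0], dim))
        rows = np.vstack([rows, pad])
    return np.repeat(rows[:, None, :], width, axis=1)

def fuse_with_attention(visual, textual):
    """Fuse visual and textual maps with a simple channel attention.

    A sketch only: element-wise addition followed by a softmax over
    globally pooled channel responses, re-weighting each channel.
    """
    fused = visual + textual                  # element-wise fusion
    channel_scores = fused.mean(axis=(0, 1))  # global average pool -> (dim,)
    weights = np.exp(channel_scores) / np.exp(channel_scores).sum()
    return fused * weights                    # channel-wise re-weighting

# Usage: 4 text lines, 8-dim embeddings, 16x16 visual feature grid
emb = np.random.rand(4, 8)
text_map = text_to_feature_map(emb, 16, 16)
visual_map = np.random.rand(16, 16, 8)
fused = fuse_with_attention(visual_map, text_map)
```

Repeating this fusion at several stages of the backbone, as the abstract describes, would inject text features at multiple scales rather than once at the end.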
