Bridging Vision and Speech: A Novel VGG16-RNN Approach for Arabic Continuous Speech Recognition
Abstract
Automatic Speech Recognition (ASR) has advanced significantly with deep learning, but Arabic remains a challenging language for these systems. Arabic's morphological richness, phonological complexity, and dialectal variation make Continuous Speech Recognition (CSR) for the language particularly difficult. This research explores a hybrid deep learning architecture that combines transfer learning from a pre-trained VGG16 model with Recurrent Neural Networks (RNNs) to improve Arabic CSR performance. Using the MGB-2 dataset, a diverse collection of Arabic broadcast news recordings that presents realistic and challenging variability in accents, speaking styles, and background noise, we focus on the effectiveness of integrating Convolutional Neural Networks (CNNs) for feature extraction from Mel-frequency cepstral coefficients (MFCCs) with Bidirectional Long Short-Term Memory (BiLSTM) layers for capturing temporal dependencies. The proposed model achieves a Word Error Rate (WER) of 13%, significantly outperforming traditional ASR systems and several state-of-the-art models. These results highlight the potential of deep learning and transfer learning to overcome the challenges of Arabic CSR, including dialectal variation and the morphological complexity of the language. The findings indicate that transferring image-based CNN features to speech recognition tasks offers a robust method for feature extraction, contributing to the overall improvement in Arabic CSR. Future work should focus on further optimizing models to achieve human-level transcription accuracy, particularly for low-resource dialects and more diverse speech environments.
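To make the described architecture concrete, the following is a minimal Keras sketch of a VGG16-BiLSTM hybrid of the kind the abstract outlines. It is illustrative only, not the authors' implementation: the input shape, layer widths, vocabulary size, and the use of CTC-style per-frame outputs are all assumptions. MFCC frames are treated as a single-channel "image" and tiled to three channels so the frozen, ImageNet-pre-trained VGG16 convolutional base can serve as the feature extractor, with BiLSTM layers modeling the resulting features over time.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_vgg16_bilstm_asr(time_steps=320, n_mfcc=40, vocab_size=60):
    # MFCCs as a (time x coefficients) single-channel "image"; shapes are illustrative
    inputs = layers.Input(shape=(time_steps, n_mfcc, 1))
    x = layers.Concatenate(axis=-1)([inputs, inputs, inputs])  # tile 1 -> 3 channels for VGG16

    # Frozen ImageNet-pre-trained VGG16 convolutional base as the feature extractor
    vgg = VGG16(include_top=False, weights="imagenet",
                input_shape=(time_steps, n_mfcc, 3))
    vgg.trainable = False  # transfer learning: reuse image features, train only the RNN head
    x = vgg(x)

    # Flatten the frequency and channel axes, keeping the (downsampled) time axis;
    # VGG16's five pooling stages reduce the time axis by a factor of 32
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)

    # BiLSTM layers capture temporal dependencies in the CNN features
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

    # Per-frame distribution over Arabic characters; +1 for a CTC blank token
    # (training with CTC loss is an assumption, not stated in the abstract)
    outputs = layers.Dense(vocab_size + 1, activation="softmax")(x)
    return Model(inputs, outputs)

model = build_vgg16_bilstm_asr()
model.summary()

In a setup like this, the per-frame outputs would be decoded (e.g., with CTC beam search) and scored against reference transcripts to compute the WER reported above; freezing the VGG16 base is what makes this transfer learning rather than training a CNN from scratch.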
Article Details
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.