, vol. 14, issue 6, pp. 2023-2039, 2021.
Background: Multimodal speech recognition has proved to be one of the most promising solutions for robust speech recognition, especially when the audio signal is corrupted by noise. Because the visual speech signal is not affected by acoustic noise, it provides complementary information that can enhance recognition accuracy in noisy conditions. A critical stage in designing a robust
speech recognition system is choosing a reliable classification method from the large variety of available
classification techniques. Deep learning is well known for its ability to
classify nonlinear problems and to take into account the sequential characteristics of the speech
signal. Numerous studies have applied deep learning to Audio-Visual
Speech Recognition (AVSR) because of its impressive achievements in both speech and image
recognition. Although these continuing studies have produced encouraging results, research
on improving accuracy in noisy conditions and on selecting the best classification technique
still attracts considerable attention.
Objective: This paper aims to build an AVSR system that combines acoustic and visual
speech information and uses a deep-learning-based classification technique to improve recognition
performance in both clean and noisy environments.
Methods: Mel-Frequency Cepstral Coefficients (MFCCs) and the Discrete Cosine Transform (DCT) are
used to extract effective features from the audio and visual speech signals, respectively. Because the audio
feature rate is higher than the visual feature rate, linear interpolation is applied to obtain
feature sequences of equal length, which are then early-integrated into a combined feature vector. Bidirectional
Long Short-Term Memory (BiLSTM), one of the deep learning techniques, is used for the classification
process, and the obtained results are compared with those of other classification techniques such as Convolutional
Neural Networks (CNNs) and traditional Hidden Markov Models (HMMs). The effectiveness of
the proposed model is demonstrated on two multi-speaker AVSR datasets, AVletters and
GRID.
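The rate-matching and early-integration steps described above can be sketched as follows; this is a minimal illustration assuming NumPy, and the frame counts and feature dimensions (13 MFCCs per audio frame, 64 DCT coefficients per visual frame) are assumptions for the example, not the paper's exact configuration:

```python
import numpy as np

def upsample_visual(visual_feats, target_len):
    """Linearly interpolate visual features (T_v, D_v) up to target_len frames."""
    T_v, D_v = visual_feats.shape
    src_t = np.linspace(0.0, 1.0, T_v)     # original visual frame times
    dst_t = np.linspace(0.0, 1.0, target_len)  # audio frame times
    # interpolate each feature dimension independently
    return np.stack(
        [np.interp(dst_t, src_t, visual_feats[:, d]) for d in range(D_v)],
        axis=1,
    )

def early_fusion(audio_feats, visual_feats):
    """Concatenate per-frame audio and upsampled visual features (early integration)."""
    vis_up = upsample_visual(visual_feats, audio_feats.shape[0])
    return np.concatenate([audio_feats, vis_up], axis=1)

# Toy example: 100 audio frames of 13 MFCCs, 25 visual frames of 64 DCT coefficients.
audio = np.random.randn(100, 13)
visual = np.random.randn(25, 64)
fused = early_fusion(audio, visual)
print(fused.shape)  # (100, 77)
```

After fusion, each time step carries one combined audio-visual vector, which is what the sequence classifier consumes.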
Results: The proposed model gives promising results. On GRID, the integrated audio-visual
features achieved the highest recognition accuracies of 99.07% and 98.47% for clean and noisy data
respectively, improvements of up to 9.28% and 12.05% over audio-only features.
On AVletters, the highest recognition accuracy is 93.33%, an improvement of up to 8.33% over audio-only
features.
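A minimal sketch of a BiLSTM sequence classifier of the kind described in the Methods, written in PyTorch; the layer sizes, the 77-dimensional fused input, the 26-way output, and the use of the final time step for classification are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Hypothetical BiLSTM over fused audio-visual frames (assumed dimensions)."""
    def __init__(self, input_dim=77, hidden=128, num_classes=26):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # forward + backward states

    def forward(self, x):              # x: (batch, T, input_dim)
        out, _ = self.lstm(x)          # out: (batch, T, 2 * hidden)
        return self.fc(out[:, -1])     # classify from the last time step

logits = BiLSTMClassifier()(torch.randn(4, 100, 77))
print(logits.shape)  # torch.Size([4, 26])
```

The bidirectional recurrence is what lets the classifier exploit both past and future context within an utterance, the sequential property the abstract highlights.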
Conclusion: Based on the obtained results, we conclude that increasing the size of the audio feature
vector from 13 to 39 coefficients does not noticeably improve recognition accuracy in a clean
environment, but it does give better performance in a noisy one. BiLSTM proves to be
the optimal classifier for a robust speech recognition system when compared with CNNs and traditional
HMMs, because it takes into account the sequential characteristics of the speech signal (audio
and visual). The proposed model yields a large improvement in recognition accuracy and a lower
loss value in both clean and noisy environments compared with audio-only features. Comparing
the proposed model with previously published results on the same datasets, we found that our
model gives higher recognition accuracy, confirming its robustness.
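The increase from 13 to 39 coefficients mentioned above refers to appending delta (velocity) and delta-delta (acceleration) coefficients to the 13 static MFCCs; a minimal NumPy sketch, with the regression window width taken as an assumption:

```python
import numpy as np

def add_deltas(mfcc, width=2):
    """Append delta and delta-delta coefficients: (T, 13) -> (T, 39)."""
    def delta(x):
        # Standard regression formula: d_t = sum_n n*(x[t+n] - x[t-n]) / (2*sum_n n^2)
        padded = np.pad(x, ((width, width), (0, 0)), mode="edge")
        T = len(x)
        num = sum(
            n * (padded[width + n:T + width + n] - padded[width - n:T + width - n])
            for n in range(1, width + 1)
        )
        return num / (2 * sum(n * n for n in range(1, width + 1)))

    d1 = delta(mfcc)   # velocity
    d2 = delta(d1)     # acceleration
    return np.concatenate([mfcc, d1, d2], axis=1)

static = np.random.randn(120, 13)   # 120 frames of 13 static MFCCs
full = add_deltas(static)
print(full.shape)  # (120, 39)
```

The deltas capture frame-to-frame dynamics, which is why they help mainly under noise, where the static coefficients alone become less reliable.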