El Maghraby, E. E., A. M. Gody, and H. M. Farouk, "Audio-Visual Speech Recognition Using LSTM and CNN", (Formerly Recent Patents on Computer Science), vol. 14, issue 6, pp. 2023 - 2039, 2021. Abstract

Background: Multimodal speech recognition has proved to be one of the most promising
solutions for robust speech recognition, especially when the audio signal is corrupted by noise.
Because the visual speech signal is not affected by audio noise, it can supply additional
information that improves recognition accuracy in noisy conditions. The critical stage in
designing a robust speech recognition system is choosing a reliable classification method from
the large variety of available classification techniques. Deep learning is well known as a
technique that can classify nonlinear problems and take into account the sequential
characteristics of the speech signal. Numerous studies have applied deep learning to
Audio-Visual Speech Recognition (AVSR) problems because of its achievements in both speech and
image recognition. Although these studies have produced encouraging results, research on
enhancing accuracy in noisy systems and selecting the best classification technique is still
attracting considerable attention.
Objective: This paper aims to build an AVSR system that combines acoustic and visual speech
information and uses a deep-learning-based classification technique to improve recognition
performance in clean and noisy environments.
Methods: Mel Frequency Cepstral Coefficients (MFCC) and the Discrete Cosine Transform (DCT) are
used to extract features from the audio and visual speech signals, respectively. The audio
feature rate is greater than the visual feature rate, so linear interpolation is needed to
obtain feature vectors of equal size, which are then combined by early integration into a single
feature vector. Bidirectional Long Short-Term Memory (BiLSTM), a deep learning technique, is
used for classification, and its results are compared with those of other classification
techniques such as Convolutional Neural Networks (CNN) and traditional Hidden Markov Models
(HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AVSR
datasets, AVletters and GRID.
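The rate-matching and early-integration step described in the Methods can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which stream is resampled, so the sketch assumes the slower visual stream is linearly interpolated up to the audio frame count, and the names `upsample_linear` and `early_fusion` are hypothetical helpers, not the authors' code.

```python
from typing import List

Vector = List[float]


def upsample_linear(frames: List[Vector], target_len: int) -> List[Vector]:
    """Linearly interpolate a feature sequence along time to target_len frames."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input frame and a positive target length")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(frames[0]) for _ in range(target_len)]
    out: List[Vector] = []
    for i in range(target_len):
        # Fractional position of output frame i on the input time axis.
        pos = i * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def early_fusion(audio: List[Vector], visual: List[Vector]) -> List[Vector]:
    """Assumed fusion: resample visual frames to the audio rate, then concatenate per frame."""
    visual_up = upsample_linear(visual, len(audio))
    return [a + v for a, v in zip(audio, visual_up)]


# Toy data: 5 audio frames (2 values each) and 3 visual frames (1 value each).
audio_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
visual_feats = [[0.0], [1.0], [2.0]]
fused = early_fusion(audio_feats, visual_feats)
```

Each fused frame holds the audio coefficients followed by the interpolated visual coefficients, so a sequence classifier such as a BiLSTM would see one combined vector per time step.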
Results: The proposed model gives promising results. For GRID, integrated audio-visual features
achieved the highest recognition accuracies of 99.07% and 98.47%, an enhancement of up to 9.28%
and 12.05% over audio-only features for clean and noisy data, respectively. For AVletters, the
highest recognition accuracy is 93.33%, an enhancement of up to 8.33% over audio-only features.
Conclusion: Based on the obtained results, we conclude that increasing the size of the audio
feature vector from 13 to 39 does not give an effective enhancement in recognition accuracy in a
clean environment, but it gives better performance in a noisy environment. BiLSTM is considered
the optimal classifier for a robust speech recognition system when compared with CNN and
traditional HMM, because it takes into account the sequential characteristics of the speech
signal (audio and visual). Compared with audio-only features, the proposed model greatly
improves recognition accuracy and decreases the loss value in both clean and noisy environments.
Compared with previously reported results on the same datasets, our model gives higher
recognition accuracy, which confirms its robustness.

Farouk, M. H., and N. Yassin, "Application of Quantum-Clustering on Thermograms of WiFi Circuits in Different Operation Modes", Pattern Recognition and Image Analysis, vol. 29, issue July 2019, pp. 565–571, 2019.
Jafar, A., M. Fakhr, and M. Farouk, "Enhanced Clustering-Based Topic Identification of Transcribed Arabic Broadcast News", The International Arab Journal of Information Technology, vol. 14, issue 5, pp. 721-728, 2017.
Farouk, H. M., "On the application of quantum clustering on speech data", International Journal of Speech Technology, vol. 20, issue 4, pp. 891-896, 2017.
Farouk, M. H., M. W. Fakhr, and A. A. Jafar, "Clustering-Based Topic Identification of Transcribed Arabic Broadcast News", New Trends in Networking, Computing, E-learning, Systems Sciences, and Engineering, Cham, Switzerland, Springer, 2015.
Farouk, M. H., "Application of Genetic Algorithms for the Estimation of Ultrasonic Parameters", ch. 3, Computational Intelligence Applications in Modeling and Control, Series: Studies in Computational Intelligence, vol. 575, Berlin, Springer, 2015. Abstract

In this chapter, the use of genetic algorithms (GA) is investigated for estimating
ultrasonic (US) propagation parameters. Recent works are then surveyed, showing the
ever-wider use of GA in different US applications. Specifically, a GA is used to
estimate the propagation parameters of US waves in polycrystalline and composite
materials for different applications. The objective function of the estimation is the
minimization of a rational difference error between the estimated and measured
transfer functions of US-wave propagation. The US propagation parameters may be the
phase velocity and attenuation. Based on tentative experiments, we demonstrate how
the evolution operators and parameters of the GA can be chosen for modeling US
propagation. The GA-based estimation is applied, in a test experiment, to steel alloy
and aluminum specimens with different grain sizes. Comparative results of that
experiment are presented for different evolution operators, with the aim of lower
estimation error and complexity. The results demonstrate the effectiveness of GA in
estimating parameters for US propagation.
Keywords: Genetic algorithm (GA) · Inverse problem characterization · Ultrasonic
(US) non-destructive testing (NDT) · Transfer function (TF) parameter estimation ·
Materials characterization
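
The GA-based estimation summarized in the chapter abstract above can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the plane-wave model H(f) = exp(-(alpha + j*2*pi*f/c)*d), the normalized squared error used as the objective, the tournament selection, blend crossover, Gaussian mutation and elitism, and all numeric settings. The chapter's own transfer-function model and operator choices are not given in the abstract.

```python
import cmath
import math
import random
from typing import List, Tuple

Params = Tuple[float, float]  # (phase velocity c in m/s, attenuation alpha in Np/m)


def transfer_function(freqs: List[float], c: float, alpha: float, d: float) -> List[complex]:
    """Assumed toy model of propagation over a distance d: H(f) = exp(-(alpha + j*2*pi*f/c)*d)."""
    return [cmath.exp(-(alpha + 1j * 2.0 * math.pi * f / c) * d) for f in freqs]


def relative_error(est: List[complex], meas: List[complex]) -> float:
    """Squared difference between the two transfer functions, normalized by the measured energy."""
    num = sum(abs(e - m) ** 2 for e, m in zip(est, meas))
    den = sum(abs(m) ** 2 for m in meas)
    return num / den


def run_ga(freqs, measured, d, bounds, pop_size=40, generations=60,
           mutation_scale=0.05, seed=0):
    """Minimize relative_error over (c, alpha); returns (best_params, best_error_history)."""
    rng = random.Random(seed)
    (c_lo, c_hi), (a_lo, a_hi) = bounds

    def cost(p: Params) -> float:
        return relative_error(transfer_function(freqs, p[0], p[1], d), measured)

    def clip(x: float, lo: float, hi: float) -> float:
        return max(lo, min(hi, x))

    pop = [(rng.uniform(c_lo, c_hi), rng.uniform(a_lo, a_hi)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)
        history.append(cost(pop[0]))
        next_pop = [pop[0]]  # elitism: the best individual survives unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            p1 = min(rng.sample(pop, 3), key=cost)
            p2 = min(rng.sample(pop, 3), key=cost)
            w = rng.random()  # blend crossover weight
            child_c = w * p1[0] + (1 - w) * p2[0] + rng.gauss(0, mutation_scale * (c_hi - c_lo))
            child_a = w * p1[1] + (1 - w) * p2[1] + rng.gauss(0, mutation_scale * (a_hi - a_lo))
            next_pop.append((clip(child_c, c_lo, c_hi), clip(child_a, a_lo, a_hi)))
        pop = next_pop
    best = min(pop, key=cost)
    history.append(cost(best))
    return best, history


# Synthetic "measurement" generated from known parameters (illustrative values only).
demo_freqs = [1e6 * k for k in range(1, 11)]
demo_measured = transfer_function(demo_freqs, 5900.0, 20.0, 0.01)
demo_best, demo_history = run_ga(demo_freqs, demo_measured, 0.01,
                                 ((5000.0, 7000.0), (0.0, 50.0)),
                                 pop_size=20, generations=15, seed=1)
```

Because the elite individual is carried over unchanged, the best error in `demo_history` never increases from one generation to the next. How close the estimate gets to the true parameters depends on the operators and settings, which is the kind of comparison the chapter reports.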