Joint Neural AEC and Beamforming with Double-Talk Detection

Vinay Kothapally 1     Yong Xu 2     Meng Yu 3     Shi-Xiong Zhang 4     Dong Yu 5    

Center for Robust Speech Systems (CRSS), The University of Texas at Dallas, Texas, USA1,   Tencent AI Lab, Bellevue, WA, USA2,3,4,5

 

Figure 1. Architectural details of the all-deep-learning joint echo cancellation and beamforming model, trained using a time-domain scale-invariant SNR objective.

 

Abstract

Acoustic echo cancellation (AEC) eliminates acoustic feedback in full-duplex communication systems. However, nonlinear distortions induced by audio devices, background noise, reverberation, and double-talk reduce the effectiveness of conventional AEC systems. Several hybrid AEC models have been proposed to address this, using deep learning models to suppress the residual echo left by standard adaptive filtering. This paper proposes a deep learning-based joint AEC and beamforming model (JAECBF), building on our previous self-attentive recurrent neural network (RNN) beamformer. The proposed network consists of two modules: (i) a multi-channel neural AEC, and (ii) a joint AEC-RNN beamformer with a double-talk detection (DTD) module that computes time-frequency (T-F) beamforming weights. We train the proposed model in an end-to-end manner to eliminate background noise and echoes from far-end audio devices, including nonlinear distortions. Experimental evaluations show that the proposed network outperforms other multi-channel AEC and denoising systems in terms of speech recognition rate and overall speech quality.
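The end-to-end objective mentioned above is a time-domain scale-invariant SNR (see Figure 1). A minimal NumPy sketch of that metric is shown below; it illustrates the standard SI-SNR formulation, not the paper's exact training code:

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a time-domain estimate and target.

    Both signals are zero-meaned, then the estimate is projected onto the
    target so the metric is invariant to the estimate's overall scale.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Scaled target component of the estimate (projection onto the target).
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

In training, the negative SI-SNR would typically serve as the loss, so maximizing the metric minimizes the loss.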

 

Key contributions

Our main contributions towards the proposed all-deep-learning (ADL) Joint AEC and Beamforming system are summarized as follows:

  1. We propose using a joint spatial covariance matrix, computed from the microphone signals and the far-end speech, as input features. This captures the cross-correlation between the far-end speech and the multiple microphones, which is essential for designing an efficient multi-channel AEC system.
  2. We extend our recently proposed generalized spatio-temporal RNN beamformer (GRNNBF) to a joint spatio-temporal AEC-beamformer (JAECBF) for handling AEC and beamforming simultaneously using original and AEC processed signals.
  3. We employ a double-talk detection (DTD) module based on multi-head attention and recurrent networks, which computes attention over time and leverages the detected double-talk regions to suppress far-end residuals.
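As a rough illustration of contribution 1, a joint spatial covariance can be formed by stacking the far-end reference STFT onto the microphone STFTs and taking per-(time, frequency) outer products across the stacked channels. The shapes and names below are assumptions for illustration only, not the paper's implementation:

```python
import numpy as np

def joint_covariance(mic_stft, farend_stft):
    """Joint spatial covariance across microphones and the far-end reference.

    mic_stft:    (M, T, F) complex STFTs of M microphone channels
    farend_stft: (T, F)    complex STFT of the far-end reference signal
    Returns:     (T, F, M+1, M+1) per-frame covariance; the off-diagonal
                 entries involving the last channel hold the mic/far-end
                 cross-correlations used as AEC input features.
    """
    # Stack the far-end reference as an extra channel: (M+1, T, F).
    x = np.concatenate([mic_stft, farend_stft[None]], axis=0)
    # Outer product over channels at every (t, f) bin.
    return np.einsum('mtf,ntf->tfmn', x, x.conj())
```

In practice such instantaneous covariances would be recursively smoothed or averaged over frames before being fed to the network.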

 

Python Scripts

  1. Proposed joint AEC beamformer model script (model.py)
  2. Run "python model.py" to print architectural details of the proposed network

 

Model Architecture

 

Results on speech quality metrics and word error rate (WER)

 

Enhanced audio samples from the systems in this study

No Processing

SpeexDSP + JAECBF

FTLSTM + JAECBF

Proposed JAECBF w. DTD

Reverberant Near-End Clean Speech


----------------------------------------------------------------------------------------------------------------------------------------------------

More audio samples may be released upon approval of the submission, in accordance with the company's sharing policy.

----------------------------------------------------------------------------------------------------------------------------------------------------

 


Last update: March 24, 2022