Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu
Tencent AI Lab, Bellevue, WA, USA
Figure 2. In-car speech separation task using only two microphones
Abstract
While current deep learning (DL)-based beamforming techniques have proven effective for speech separation, they are often designed to process narrow-band (NB) frequencies independently, which results in high computational cost and inference time, making them unsuitable for real-world use. In this paper, we propose a DL-based mel-subband spatio-temporal beamformer that performs speech separation in a car environment with reduced computational cost and inference time. As opposed to conventional subband (SB) approaches, our framework uses a mel-scale-based subband selection strategy that ensures fine-grained processing of lower frequencies, where most of the speech formant structure is present, and coarse-grained processing of higher frequencies. Robust frame-level beamforming weights are determined recursively for each speaker location/zone in the car from the estimated subband speech and noise covariance matrices. Furthermore, the proposed framework estimates and suppresses any echoes from the loudspeaker(s) using the echo reference signals. We compare the performance of our proposed framework to several NB, SB, and full-band (FB) processing techniques in terms of speech quality and recognition metrics. Based on experimental evaluations on simulated and real-world recordings, we find that our proposed framework achieves better separation performance than all SB and FB approaches, and approaches the performance of NB processing techniques at a lower computational cost.
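To illustrate the mel-scale subband selection strategy described above, the sketch below groups linear STFT bins into mel-spaced subbands, so that low frequencies fall into narrow bands and high frequencies into wide ones. The function name and parameter choices (257 bins, 32 subbands, 16 kHz) are our own assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def mel_band_edges(n_bins, n_bands, sr=16000):
    """Group linear STFT bins into mel-spaced subbands: narrow bands at
    low frequencies (where the formant structure lives), wide bands at
    high frequencies. Returns n_bands + 1 bin-index boundaries."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    f_max = sr / 2.0
    mels = np.linspace(0.0, hz2mel(f_max), n_bands + 1)
    return np.round(mel2hz(mels) / f_max * (n_bins - 1)).astype(int)

# e.g. 257 STFT bins -> 32 subbands: the lowest bands cover only a
# couple of bins, the highest band covers roughly twenty.
edges = mel_band_edges(257, 32)
```

With such a mapping, each subband's channels are processed jointly, which is how the frequency dimension is compressed relative to narrow-band processing.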
Proposed Mel-Subband Beamformer
Figure 1. An overview of the proposed all-deep-learning-based speech separation system for in-car applications
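The frame-level recursion over subband covariance matrices can be sketched with a classical recursive update and the textbook MVDR solution. Note that this is only a hedged stand-in: the paper's beamforming weights are produced by a learned network rather than the closed-form MVDR, and the forgetting factor below is an assumed illustrative value:

```python
import numpy as np

def update_cov(phi, x, alpha=0.95):
    """One recursive frame-level update of a subband spatial covariance
    matrix phi from the multichannel subband vector x of the current
    frame (alpha is a forgetting factor)."""
    return alpha * phi + (1.0 - alpha) * np.outer(x, x.conj())

def mvdr_weights(phi_n, d):
    """Classical MVDR solution w = phi_n^{-1} d / (d^H phi_n^{-1} d)
    for noise covariance phi_n and steering vector d; shown only to
    illustrate how per-frame weights follow from the covariances."""
    v = np.linalg.solve(phi_n, d)
    return v / (d.conj() @ v)
```

Maintaining one such recursion per mel subband (rather than per STFT bin) is what reduces the cost relative to narrow-band processing.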
Key contributions
Our main contributions towards the proposed Deep Neural Mel-Subband Beamformer system are summarized as follows:
Python Scripts
Model Architecture
Results on speech quality metrics and word error rate (WER)
Separated speech signals using the various approaches in the study
----------------------------------------------------------------------------------------------------------------------------------------------------
More audio samples can be released upon approval of the submission, subject to the company's sharing policy.
----------------------------------------------------------------------------------------------------------------------------------------------------
Last update: November 23, 2022