Deep Neural Mel-Subband Beamformer for In-Car Speech Separation

Vinay Kothapally    Yong Xu    Meng Yu    Shi-Xiong Zhang    Dong Yu

Tencent AI Lab, Bellevue, WA, USA

 

 

Problem Definition

We consider the problem of in-car N-speaker speech separation using an M-channel microphone array, under the assumption that at most one speaker is present in each zone and each speaker may move within their designated zone while speaking. We divide the space inside the car into a total of four zones. The trained model is designed to be robust to in-car distortions such as background noise (e.g., music or speech from the loudspeakers), reverberation, and overlapping speech from other zones.
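Under this setup, a standard multichannel mixture model (our notation, which may differ from the paper's) is:

    y_m(t) = \sum_{n=1}^{N} h_{n,m}(t) * s_n(t) + d_m(t) + v_m(t),    m = 1, ..., M,

where s_n is the speech of the speaker in zone n, h_{n,m} is the cabin impulse response from zone n to microphone m, d_m is the loudspeaker echo captured at microphone m, v_m is background noise, and * denotes convolution. The goal is to recover each s_n from the M microphone signals and the echo reference(s).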

Figure 2. In-car speech separation task using only two microphones

 

Abstract

While current deep learning (DL)-based beamforming techniques have proven effective for speech separation, they are often designed to process each narrow-band (NB) frequency independently, which results in high computational costs and inference times and makes them unsuitable for real-world use. In this paper, we propose a DL-based mel-subband spatio-temporal beamformer that performs speech separation in a car environment with reduced computational cost and inference time. In contrast to conventional subband (SB) approaches, our framework uses a mel-scale-based subband selection strategy that ensures fine-grained processing of lower frequencies, where most of the speech formant structure is present, and coarse-grained processing of higher frequencies. Robust frame-level beamforming weights are determined recursively for each speaker location/zone in the car from the estimated subband speech and noise covariance matrices. Furthermore, the proposed framework estimates and suppresses any echoes from the loudspeaker(s) by using the echo reference signals. We compare the performance of our proposed framework to several NB, SB, and full-band (FB) processing techniques in terms of speech quality and recognition metrics. Experimental evaluations on simulated and real-world recordings show that our framework outperforms all SB and FB approaches and approaches the separation performance of NB processing techniques at a lower computational cost.
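To make the subband selection strategy concrete, below is a minimal Python sketch (our illustration; function and parameter names are assumptions, not code from the paper) that groups STFT bins into 64 subbands whose edges are uniformly spaced on the mel scale, yielding single-bin resolution at low frequencies and progressively wider bands at high frequencies:

    # Illustrative sketch of mel-scale subband selection (not the paper's code).
    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_band_edges(num_bins=257, sample_rate=16000, num_subbands=64):
        """Group STFT bins into subbands with edges uniform on the mel scale."""
        nyquist = sample_rate / 2.0
        mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(nyquist), num_subbands + 1)
        hz_edges = mel_to_hz(mel_edges)
        bin_edges = np.round(hz_edges / nyquist * num_bins).astype(int)
        bin_edges[-1] = num_bins  # the last band must reach the top bin
        bands, prev = [], 0
        for b in range(1, num_subbands + 1):
            end = max(bin_edges[b], prev + 1)  # at least one STFT bin per band
            bands.append((prev, end))
            prev = end
        return bands

    bands = mel_band_edges()
    print(bands[:3])   # narrow, single-bin bands at low frequencies
    print(bands[-2:])  # wide, multi-bin bands at high frequencies

With a 512-point FFT at 16 kHz (257 bins), this grouping leaves the lowest bands one bin wide while the highest bands span roughly ten bins, which is what cuts the number of processed bands from 257 to 64.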

 

Proposed Mel-Subband Beamformer

Figure 1. Overview of the proposed all-deep-learning-based speech separation system for in-car applications

 

Key contributions

Our main contributions towards the proposed Deep Neural Mel-Subband Beamformer system are summarized as follows:

  1. We reduce the computational cost of the beamformer while preserving the performance of back-end speech applications as much as possible.
  2. We extend our recently proposed joint spatio-temporal AEC-beamformer (JAECBF), which handles acoustic echo cancellation (AEC) and beamforming simultaneously using unprocessed microphone signals and AEC-processed signals.
  3. We propose mel-scale, convolution-based subband analysis and synthesis filterbanks that allow beamforming in the subband domain, reducing the overall computation for faster inference. A simplified sketch of the per-subband recursion follows this list.
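As referenced above, here is a simplified NumPy sketch of the per-subband recursion: frame-recursive speech/noise covariance smoothing followed by an MVDR-style weight solve. In the actual system the recursion and weight computation are learned by recurrent layers (as in GRNNBF/JAECBF) rather than given in closed form, and all shapes and names below are illustrative assumptions:

    # Simplified frame-recursive subband beamforming (illustration only).
    import numpy as np

    def recursive_subband_beamformer(Y, speech_mask, noise_mask, alpha=0.9, eps=1e-6):
        """
        Y:            (T, B, M) complex subband mixture (frames x subbands x mics)
        speech_mask:  (T, B) estimated target-speech mask for one zone
        noise_mask:   (T, B) estimated noise/interference mask
        returns:      (T, B) complex beamformed target estimate
        """
        T, B, M = Y.shape
        phi_ss = np.tile(np.eye(M, dtype=complex) * eps, (B, 1, 1))
        phi_nn = np.tile(np.eye(M, dtype=complex) * eps, (B, 1, 1))
        out = np.zeros((T, B), dtype=complex)
        for t in range(T):
            y = Y[t]                                         # (B, M)
            outer = y[:, :, None] @ y[:, None, :].conj()     # (B, M, M) rank-1 update
            # Recursive (smoothed) speech and noise covariance estimates per subband.
            phi_ss = alpha * phi_ss + (1 - alpha) * speech_mask[t, :, None, None] * outer
            phi_nn = alpha * phi_nn + (1 - alpha) * noise_mask[t, :, None, None] * outer
            # MVDR-style weights toward the principal speech eigenvector, per subband.
            inv_nn = np.linalg.inv(phi_nn + eps * np.eye(M))
            steer = np.linalg.eigh(phi_ss)[1][:, :, -1]      # (B, M) principal eigenvector
            num = np.einsum('bij,bj->bi', inv_nn, steer)
            den = np.einsum('bi,bi->b', steer.conj(), num) + eps
            w = num / den[:, None]                           # (B, M) beamforming weights
            out[t] = np.einsum('bi,bi->b', w.conj(), y)      # apply w^H y per subband
        return out

Because the loop runs over B = 64 mel subbands instead of all 257 STFT bins, the covariance and weight computations shrink proportionally, which is the source of the reduced cost.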

 

Python Scripts

  1. Proposed Mel-Subband Beamformer model script (model.py)
  2. Run "python model.py" to print the architectural details of the proposed network.
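To inspect the model from your own script instead, a minimal sketch is shown below; the class name MelSubbandBeamformer and its constructor arguments are assumptions for illustration, not the released API (consult model.py for the actual names):

    # Hypothetical inspection script: class name and arguments are assumptions;
    # check model.py for the actual interface.
    from model import MelSubbandBeamformer  # assumed export from model.py

    net = MelSubbandBeamformer(num_mics=2, num_subbands=64, num_zones=4)
    num_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
    print(net)  # layer-by-layer architecture
    print(f"Trainable parameters: {num_params / 1e6:.2f} M")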

 

Model Architecture

 

Results on speech quality metrics and word error rate (WER)

 

Separated speech signals produced by the various approaches in the study

No Processing (Mixture)

Speakers Available in Zone-2, Zone-3, and Zone-4

Reference (S1)

Reference (S2)

Reference (S3)

Reference (S4)

[Full-Band LSTM+MHSA] Zone-1/S1

[Full-Band LSTM+MHSA] Zone-2/S2

[Full-Band LSTM+MHSA] Zone-3/S3

[Full-Band LSTM+MHSA] Zone-4/S4

[Full-Band ConvTasNet] Zone-1/S1

[Full-Band ConvTasNet] Zone-2/S2

[Full-Band ConvTasNet] Zone-3/S3

[Full-Band ConvTasNet] Zone-4/S4

[Full-Band GRNNBF] Zone-1/S1

[Full-Band GRNNBF] Zone-2/S2

[Full-Band GRNNBF] Zone-3/S3

[Full-Band GRNNBF] Zone-4/S4

[Traditional Subband Beamformer (#SB 64)] Zone-1/S1

[Traditional Subband Beamformer (#SB 64)] Zone-2/S2

[Traditional Subband Beamformer (#SB 64)] Zone-3/S3

[Traditional Subband Beamformer (#SB 64)] Zone-4/S4

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-1/S1

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-2/S2

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-3/S3

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-4/S4

[Narrow-Band GRNNBF] Zone-1/S1

[Narrow-Band GRNNBF] Zone-2/S2

[Narrow-Band GRNNBF] Zone-3/S3

[Narrow-Band GRNNBF] Zone-4/S4

No Processing (Mixture)

Speakers Available in Zone-1, Zone-2, and Zone-4

Reference (S1)

Reference (S2)

Reference (S3)

Reference (S4)

[Full-Band LSTM+MHSA] Zone-1/S1

[Full-Band LSTM+MHSA] Zone-2/S2

[Full-Band LSTM+MHSA] Zone-3/S3

[Full-Band LSTM+MHSA] Zone-4/S4

[Full-Band ConvTasNet] Zone-1/S1

[Full-Band ConvTasNet] Zone-2/S2

[Full-Band ConvTasNet] Zone-3/S3

[Full-Band ConvTasNet] Zone-4/S4

[Full-Band GRNNBF] Zone-1/S1

[Full-Band GRNNBF] Zone-2/S2

[Full-Band GRNNBF] Zone-3/S3

[Full-Band GRNNBF] Zone-4/S4

[Traditional Subband Beamformer (#SB 64)] Zone-1/S1

[Traditional Subband Beamformer (#SB 64)] Zone-2/S2

[Traditional Subband Beamformer (#SB 64)] Zone-3/S3

[Traditional Subband Beamformer (#SB 64)] Zone-4/S4

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-1/S1

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-2/S2

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-3/S3

[Proposed Mel-Subband Beamformer (#SB 64)] Zone-4/S4

[Narrow-Band GRNNBF] Zone-1/S1

[Narrow-Band GRNNBF] Zone-2/S2

[Narrow-Band GRNNBF] Zone-3/S3

[Narrow-Band GRNNBF] Zone-4/S4


----------------------------------------------------------------------------------------------------------------------------------------------------

More audio samples may be released upon acceptance of the submission, subject to the company's sharing policy.

----------------------------------------------------------------------------------------------------------------------------------------------------

 

References

[1] Matheja, T., Buck, M., & Fingscheidt, T. (2013). A dynamic multi-channel speech enhancement system for distributed microphones in a car environment. EURASIP Journal on Advances in Signal Processing, 2013(1), 1-21.
[2] Saruwatari, H., Sawai, K., Lee, A., Shikano, K., Kaminuma, A., & Sakata, M. (2003). Speech enhancement and recognition in car environment using blind source separation and subband elimination processing.
[3] Yamada, T., Tawari, A., & Trivedi, M. M. (2012, September). In-vehicle speaker recognition using independent vector analysis. In 2012 15th International IEEE Conference on Intelligent Transportation Systems (pp. 1753-1758). IEEE.
[4] O'Malley, T., Narayanan, A., & Wang, Q. (2022). A universally-deployable ASR frontend for joint acoustic echo cancellation, speech enhancement, and voice separation. arXiv preprint arXiv:2209.06410.
[5] Erdogan, H., Hershey, J. R., Watanabe, S., Mandel, M. I., & Le Roux, J. (2016, September). Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech (pp. 1981-1985).
[6] Zhang, Z., Xu, Y., Yu, M., Zhang, S. X., Chen, L., & Yu, D. (2021, June). ADL-MVDR: All deep learning MVDR beamformer for target speech separation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6089-6093). IEEE.
[7] Zhang, Z., Xu, Y., Yu, M., Zhang, S. X., Chen, L., Williamson, D. S., & Yu, D. (2021). Multi-channel multi-frame ADL-MVDR for target speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3526-3540.
[8] Xu, Y., Zhang, Z., Yu, M., Zhang, S. X., & Yu, D. (2021). Generalized spatio-temporal RNN beamformer for target speech separation. arXiv preprint arXiv:2101.01280.
[9] Kothapally, V., Xu, Y., Yu, M., Zhang, S. X., & Yu, D. (2021). Joint AEC and beamforming with double-talk detection using RNN-transformer. arXiv preprint arXiv:2111.04904.
[10] Tawara, N., Kobayashi, T., & Ogawa, T. (2019, September). Multi-channel speech enhancement using time-domain convolutional denoising autoencoder. In INTERSPEECH (pp. 86-90).
[11] Li, X., & Horaud, R. (2019). Narrow-band deep filtering for multichannel speech enhancement. arXiv preprint arXiv:1911.10791.
[12] Quan, C., & Li, X. (2022, May). Multi-channel narrow-band deep speech separation with full-band permutation invariant training. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 541-545). IEEE.
[13] Lv, S., Hu, Y., Zhang, S., & Xie, L. (2021). DCCRN+: Channel-wise subband DCCRN with SNR estimation for speech enhancement. arXiv preprint arXiv:2106.08672.

Last update: November 23, 2022