Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset

Neil Shah1,2, Shirish Karande2, Vineet Gandhi1

1International Institute of Information Technology, Hyderabad, India

2TCS Research, Pune, India

Accepted at IEEE ICASSP 2025

International Conference on Acoustics, Speech, and Signal Processing

Hyderabad, India

Abstract:

Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to synthesize the ground-truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset.
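The ground-truth simulation described above rests on phoneme-level alignments: each phoneme interval obtained from paired audio and text can be converted into a frame-level duration target for a TTS system. The sketch below is a minimal, hypothetical illustration of that step (it is not the paper's code); the MFA-style `(phoneme, start, end)` intervals, the word "pins", and the 12.5 ms hop size are all illustrative assumptions.

```python
# Sketch: turning MFA-style phoneme intervals into per-phoneme frame
# durations, the kind of alignment signal a duration-informed TTS can
# consume when simulating ground-truth speech. Illustrative only.

def intervals_to_durations(intervals, hop_s=0.0125):
    """Map (phoneme, start_s, end_s) intervals to frame counts at a given hop."""
    durations = []
    for phoneme, start, end in intervals:
        # Each phoneme occupies at least one frame.
        n_frames = max(1, round((end - start) / hop_s))
        durations.append((phoneme, n_frames))
    return durations

# Toy alignment for the word "pins" (timings are made up for illustration).
alignment = [("P", 0.00, 0.05), ("IH", 0.05, 0.15),
             ("N", 0.15, 0.22), ("Z", 0.22, 0.30)]
print(intervals_to_durations(alignment))
```

The same conversion applies whether the intervals come from whisper-text or NAM-text alignment; only the acoustic signal fed to the aligner changes.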


MultiNAM Dataset

Explore our dataset containing paired NAM, whisper, video, and text data:

Access the MultiNAM Dataset

Samples from Our Proposed Dataset

Text | Video | Lip Video | NAM | Whisper
They could, when they exerted themselves, make among them about twelve pounds of pins in a day.
And there is at this day a village in Scotland, where it is not uncommon.
Gate to the last was the debtors' prison for freemen of the city of London.
In England about this time, an attempt was made.

R1: Simulated ground-truth speech using Whisper speech and text as input for the CSTR NAM TIMIT Plus corpus

Note: The HuBERT-Hifi method follows from the paper: N. Shah, S. Karande, and V. Gandhi, "Towards improving NAM-to-speech synthesis intelligibility using self-supervised speech models," in Interspeech 2024, 2024, pp. 2470-2474.

The MFA-TTS method follows from the paper: N. Shah, N. Sahipjohn, V. Tambrahalli, R. Subramanian, and V. Gandhi, "StethoSpeech: Speech generation through a clinical stethoscope attached to the skin," Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., vol. 8, no. 3, Sep. 2024. [Online]. Available: https://doi.org/10.1145/3678515

Ground-truth Text | Input Whisper Speech (WER: 1.07) | HuBERT-Hifi (WER: 23.77) | MFA-TTS (modalities: Whisper & text) (WER: 41.07)
but it was for charity just a bit of fun.
if the red of the 2nd bow falls upon the green of the 1st the result is to give a bow with an abnormally wide yellow band since red and green light when mixed form yellow.

R2: Simulated ground-truth speech using Whisper speech and text as input for our proposed corpus

Ground-truth Text | Input Whisper Speech (WER: 5.54-12.97) | HuBERT-Hifi (WER: 84.53-100.14) | MFA-TTS (modalities: Whisper & text) (WER: 5.84-11.23)
This hurtful practice might be discouraged.
And you will perceive that the number of people, of whose industry a part.
If, for the sake of equality, it was thought necessary to lay a tax upon this liquor.
These are in some countries called transit duties.
Is that levied by the king of Denmark upon all merchant ships which pass through the Sound.
America has produced a good many showy books.
And in dimensions varying from thirty feet by fifteen to fifteen feet by ten.
Near it a grating through which the debtors receive their beer from the neighboring public houses.
Three hundred for the kingdom, and seven thousand twenty for Middlesex.

R3: Simulated ground-truth speech using NAM/lip and text as input for our proposed corpus

Ground-truth Text | Input NAM vibrations (WER: 192.77-200.50) | MFA-TTS (modalities: NAM & text; both speakers' samples in training) (WER: 12.37-19.64) | DiffNAM (modalities: lip, NAM & text) (WER: 17.23-21.73) | MFA-TTS (modalities: NAM & text; speaker-specific training) (WER: 23.81-33.62) | LipVoicer, ground-truth text at inference (modalities: lip & text) (WER: 27.01-33.94) | LipVoicer, predicted text at inference (modalities: lip & text) (WER: 39.04-49.99) | Cross-Attention (modalities: lip & text) (WER: 62.13-67.54)
this hurtful practice might be discouraged.
and you will perceive that the number of people, of whose industry a part.
If, for the sake of equality, it was thought necessary to lay a tax upon this liquor.
These are in some countries called transit duties.
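WER values above 100, as reported for the raw NAM vibrations, are not errors: word error rate counts substitutions, deletions, and insertions against the reference length, so a hypothesis with many inserted words can exceed 100%. A minimal word-level Levenshtein implementation makes this concrete; the example sentences below are invented for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length, in %."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return 100.0 * d[-1][-1] / len(ref)

# Five inserted words against a four-word reference push WER past 100%.
print(wer("taxes fall on rent", "the taxes may fall down on the rent here"))  # → 125.0
```

This is why near-unintelligible inputs, which drive the recognizer to hallucinate extra words, can score around 200% WER.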

R4: Converted speech samples obtained by training a Seq2Seq model on the simulated ground-truth speech from R2 and R3 for our proposed corpus

Ground-truth Text | Input NAM vibrations (WER: 192.77-200.50) | MFA-TTS (Whisper & text aligned) (WER: 15.21-25.69) | MFA-TTS (NAM & text aligned) (WER: 26.38-32.61) | DiffNAM (WER: 32.39-38.94) | LipVoicer (ground-truth text) (WER: 54.11-59.66) | LipVoicer (predicted text) (WER: 74.51-79.46) | Cross-Attention (WER: 87.98-93.04) | Mspec-Net (WER: 141.23-156.19)
or may be assessed, so as to leave no doubt concerning either what ought to be paid.
for the maintenance of the road or of the navigation.
and in manufacturing art and industry.
especially as regards the lower case letters.
what are the taxes which fall finally upon the rent of the land.
the highest orders of people are rated according to their rank.
I have no right to interfere with his work.
For two more years these high figures were steadily maintained.
he had loaded with heavy irons.
ten pence more, and the turn key two shillings.
had it not been appropriated by the turnkey who winked at this evasion of the rules.