Time-Accurate Speech Rich Transcription with Non-Fluencies

Abstract. Speech is a hierarchical collection of text, prosody, emotions, dysfluencies, etc. Automatic transcription of speech that goes beyond text (words) is an underexplored problem. We focus on transcribing speech along with non-fluencies (dysfluencies). The current state-of-the-art pipeline suffers from a complex architecture design, high training complexity, and significant shortcomings in its local sequence aligner, and it does not explore in-context learning capacity. In this work, we propose SSDM 2.0, which tackles these shortcomings via four main contributions: (1) We propose a novel neural articulatory flow to derive highly scalable speech representations. (2) We develop a full-stack connectionist subsequence aligner that captures all types of dysfluencies. (3) We introduce a mispronunciation prompt pipeline and a consistency learning module into the LLM to leverage its in-context learning of dysfluent pronunciation. (4) We curate Libri-Dys and open-source Libri-Co-Dys, currently the largest-scale co-dysfluency corpus, for future research. Overall, SSDM 2.0 outperforms SSDM and all other dysfluency transcription models by a large margin.
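
To make the target output concrete, the following is a minimal sketch of how a time-stamped rich transcription could be represented. The `DysfluencyEvent` and `RichTranscript` names and fields are hypothetical and chosen only for illustration; they are not the SSDM 2.0 output format. The example values are taken from the SSDM 2.0 response to the first nfvPPA sample in the demo below.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DysfluencyEvent:
    """One time-stamped non-fluency (hypothetical schema, for illustration only)."""
    kind: str                       # e.g. "stutter", "pause", "prolongation"
    start_s: float                  # reported onset time in seconds
    word: Optional[str] = None      # affected word, if any
    phoneme: Optional[str] = None   # affected sound, if any


@dataclass
class RichTranscript:
    """Reference text plus the non-fluencies detected in the audio."""
    text: str
    events: List[DysfluencyEvent]

    def timeline(self) -> List[str]:
        """Return one human-readable line per event, ordered by onset time."""
        lines = []
        for ev in sorted(self.events, key=lambda e: e.start_s):
            if ev.word and ev.phoneme:
                target = f" on '{ev.phoneme}' in '{ev.word}'"
            elif ev.word:
                target = f" in '{ev.word}'"
            else:
                target = ""
            lines.append(f"{ev.start_s:.2f}s {ev.kind}{target}")
        return lines


# Values copied from the SSDM 2.0 response to the first nfvPPA sample.
sample = RichTranscript(
    text="Well, he is nearly 93 years old.",
    events=[
        DysfluencyEvent("stutter", 1.48, word="nearly", phoneme="n"),
        DysfluencyEvent("pause", 0.36),
    ],
)
print("\n".join(sample.timeline()))
```

Sorting by onset time keeps the events aligned with the audio regardless of the order in which a model mentions them.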


Demo

nfvPPA Speech

People with nfvPPA (nonfluent variant primary progressive aphasia) gradually have more trouble expressing themselves, even though they still understand the meaning of words. They might begin speaking in shorter phrases or pausing mid-sentence or mid-word. They may have increasing difficulty pronouncing familiar words and may mix up the sounds in them.

Columns: Audio | SALMONN 13B | GPT-4o (Real-Time) | SSDM | SSDM 2.0 (Ours)

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Well, he is nearly 93 years old.
SALMONN 13B: The articulation in this sentence is clear and easy to understand. The speaker is pronouncing each word clearly and distinctly, which makes it easier for the listener to follow along and understand the message being conveyed. Overall, the articulation in this sentence is excellent.
GPT-4o (Real-Time): From the audio, it seems like the speaker might be struggling with clarity in pronunciation. The words are a bit slurred, which could be due to a variety of factors, such as age or a speech impairment. In this case, it sounds like there might be some difficulty with articulation, making it a bit challenging to understand each word clearly.
SSDM: The person is saying, "Well, he is nearly 93 years old." There is a stutter on the word "nearly" at 1.20 seconds.
SSDM 2.0 (Ours): "For the word 'nearly,' there is a stutter on the 'n' sound at 1.48 seconds, which affects the smoothness of articulation and may draw attention to the pronunciation. This type of stutter can briefly interrupt the word’s natural flow, altering listener perception. Additionally, there is an irregular pause at 0.36 seconds, which introduces a minor break in the rhythm of the speech. Despite these slight irregularities, the other pronunciations sound clear and fluent."

Instruction: What do you think of the pronunciation?
Reference Text (Optional): He dresses himself in an old black frock coat
SALMONN 13B: The pronunciation of the words in the sentence is clear and easy to understand. The articulation of the consonants and vowels is accurate, and there are no mispronunciations. Overall, the pronunciation and articulation of the words in the sentence are good.
GPT-4o (Real-Time): It seems like the speaker is having difficulty with fluency, resulting in a pause in the middle of the sentence. The pronunciation of "dresses" is also a bit unclear. This could be due to a speech disorder or simply a moment of hesitation.
SSDM: The pronunciation is almost ok. There is repetition of phoneme "b" on word "black" at 3.30 seconds.
SSDM 2.0 (Ours): There is a repetition of the phoneme 'b' in the word 'black' at 3.50 seconds. It seems the speaker is pronouncing it as 'p,' which indicates a voicing error. The other articulations are clear.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Usually several buttons missing
SALMONN 13B: I think the pronunciation/articulation is a bit difficult to understand, especially for non-native speakers. The speaker seems to be emphasizing certain syllables and words, which can be distracting and take away from the overall clarity of the message. Additionally, the speaker seems to be speaking at a fast pace, which can make it difficult for listeners to follow along. Overall, I would say that the pronunciation/articulation could be improved to make the message easier to understand.
GPT-4o (Real-Time): In this clip, the speaker's pronunciation is not very clear, especially with the word "multiple," which sounded more like "muh-button." The rhythm of the speech is also a bit uneven, which might make it harder to understand. This could be due to issues with articulation or perhaps nervousness affecting the flow of speech.
SSDM: There are multiple problems. The speaker should say "usually several buttons missing". There is an insertion of phoneme on the word "usually" at 1.00 seconds and there is a stutter on the word "buttons" at 3.48 seconds.
SSDM 2.0 (Ours): "For the word 'usually,' there is an insertion of the phoneme 's' at 1.24 seconds, which adds an unexpected sound and may momentarily interrupt the natural pronunciation. Such additions can slightly impact the fluency and listener’s focus on the intended word. Additionally, in the word 'missing,' the speaker stutters on 'buttons' with a repeated sound of 'muh' at 3.64 seconds, which influences the clarity and rhythm. Despite these small deviations, the other pronunciations are clear and well-articulated."

Instruction: What do you think of the pronunciation?
Reference Text (Optional): A long beard clings to his chin
SALMONN 13B: As an AI language model, I don't have personal opinions or preferences. However, I can tell you that the pronunciation/articulation of the sentence appears to be correct based on the rules of the International Phonetic Alphabet (IPA).
GPT-4o (Real-Time): The speech here seems quite disjointed. The phrase "clings to his tent" is unclear, particularly the word "tent," which is pronounced with some distortion. It sounds like the speaker may have difficulty with certain consonants or with the overall articulation, leading to a lack of clarity in the pronunciation.
SSDM: The man is saying "A long beard clings to his chin". There is a stutter of phoneme "ch" on the word "chin" at 3.60 seconds.
SSDM 2.0 (Ours): There is an irregular pause at 1.00 seconds, lasting around 0.2 seconds, which may influence the natural rhythm of the sentence, briefly interrupting the expected flow. This pause might not greatly impact comprehension, yet it introduces a minor disruption. At 3.82 seconds, the phoneme 'ch' is stuttered on the word 'chin,' which affects fluency slightly but remains understandable. Overall, this is a commendable effort with largely clear and precise pronunciation

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Giving those who observe him a pronounced feeling of the utmost respect
SALMONN 13B: I think the pronunciation/articulation is clear and easy to understand.
GPT-4o (Real-Time): The phrase "feeling of them... of respect" is quite jumbled. It sounds like the speaker is struggling with fluency and clear articulation, making it challenging to follow the words. This could be due to a speech disorder or another factor affecting speech production.
SSDM: There are several pronunciation problems. First, there is a repetition of the word "those" at 1.36 seconds. Second, there is a repetition of the word "pronounced" at 6.10 seconds. Try to speak smoothly.
SSDM 2.0 (Ours): There is a repetition of the sound 'th' in the word 'those,' which may draw attention to the articulation and slightly disrupt the natural flow. Such repetitions can momentarily affect listener comprehension, though they don’t typically obscure the intended meaning. This occurs at 1.44 seconds. Additionally, the sound 'pro' is also repeated in the word 'pronounced,' affecting the fluency at 6.02 seconds. Overall, the pronunciation remains clear and understandable.

Artificial Simulated Speech 1

Samples from Libri-Dys test set.

Columns: Audio | SALMONN 13B | GPT-4o (Real-Time) | SSDM | SSDM 2.0 (Ours)

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Also, a popular contrivance whereby love making may be suspended but not stopped during the picnic season.
SALMONN 13B: I think that the problems for pronunciation and articulation for each phoneme and word are significant. It can be challenging for learners to master the sounds and rhythms of the language, especially if they are not native speakers. However, with practice and exposure to the language, learners can improve their pronunciation and articulation over time.
GPT-4o (Real-Time): In the audio clip, the speaker's pronunciation is generally clear. However, the word "contrivance" might be slightly challenging due to its uncommon usage. It could be emphasized a bit more to ensure clarity. Other than that, the articulation and pacing seem well done.
SSDM: You are attempting to say, 'Also, a popular contrivance whereby lovemaking may be suspended but not stopped during the picnic season.' There is a missing phoneme in the word 'picnic' at 7.8s to 8.0s.
SSDM 2.0 (Ours): At 8.04 seconds, the phoneme 'c' is missing in the word 'picnic,' which alters the clarity of the word slightly. This omission may affect listener comprehension, especially in cases where precise articulation is critical. Additionally, in the word 'season,' there is a slight prolongation of the 'ea' sound, which elongates the vowel and impacts the overall flow of the pronunciation. While these are minor deviations, they contribute to the distinct speech pattern observed in the recording

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Also, a popular contrivance whereby love making may be suspended but not stopped during the picnic season.
SALMONN 13B: The pronunciation is clear and easy to understand.
GPT-4o (Real-Time): In this example, the speaker's pronunciation of "contrivance" is clear and more emphasized, which improves the clarity. Overall, the articulation and pacing are good. It sounds well-enunciated and easy to understand.
SSDM: Your articulation is almost perfect. Note that there is a prolongation at the word 'season' at 8.0s.
SSDM 2.0 (Ours): At 7.88 seconds, there is a noticeable prolongation of the 'sea' sound in the word 'season,' which lengthens the vowel and affects the rhythm of the word. This extended sound may draw attention to the pronunciation, potentially disrupting the natural flow of the sentence.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Also, a draft on futurity, sometimes honored, but generally extended.
SALMONN 13B: The pronunciation is perfect.
GPT-4o (Real-Time): In this clip, the phrase "draft on tourney" might be a bit unclear, possibly due to the uncommon combination of words. It could help to articulate "draft" and "tourney" more distinctly to improve clarity. The rest of the sentence is well enunciated, making it easy to understand.
SSDM: Your pronunciation is clear. There is a missing syllable in the word 'futurity' at 1.9s to 2.1s.
SSDM 2.0 (Ours): At 2.04 seconds, the initial 'fu' sound in the word 'futurity' is missing, which affects the intelligibility of the word. This omission alters the listener’s perception, as it may lead to initial confusion about the intended term. While the rest of the word is articulated correctly, the absence of the initial sound impacts the overall clarity and completeness of the pronunciation.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): This is a LibriVox recording.
SALMONN 13B: As an AI language model, I don't have personal opinions, but I can tell you that pronunciation and articulation can be challenging for learners of any language.
GPT-4o (Real-Time): In this example, the speaker's pronunciation is clear and well-paced. The phrase "LibriVox recording" is articulated clearly, making it easy to understand. Overall, it's a well-pronounced.
SSDM: There is a significant prolongation of the word 'LibriVox' at 0.8s to 1.4s.
SSDM 2.0 (Ours): At 0.82 seconds, there is a significant prolongation of the 'lib' sound in the word 'LibriVox,' which results in an extended initial syllable. This elongation can affect the rhythm and pacing of the pronunciation, potentially drawing undue attention to the word and altering the natural flow of the sentence.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): You took the thing down?
SALMONN 13B: The pronunciation is clear and easy to understand.
GPT-4o (Real-Time): In this clip, the phrase "took the thing down" is articulated clearly. However, there might be a slight hesitation or pause between the words, which could affect the flow of the sentence. Overall, the pronunciation is clear, but smoothing out the delivery could enhance clarity.
SSDM: It seems for the word 'you,' you are saying 'who' at 0.5s to 0.8s.
SSDM 2.0 (Ours): The word 'you' is substituted with 'who,' which changes the intended meaning and may create confusion in the context. This substitution shifts the grammatical focus of the sentence and can affect clarity for listeners, especially when precision is crucial. This occurs at 0.56 seconds.

Artificial Simulated Speech 2

Samples from Libri-Co-Dys test set.

Columns: Audio | SALMONN 13B | GPT-4o (Real-Time) | SSDM | SSDM 2.0 (Ours)

Instruction: What do you think of the pronunciation?
Reference Text (Optional): He took no notice of her; he looked at me, but as if, instead of me, he saw what he spoke of.
SALMONN 13B: Your pronunciation is flawless, with each word articulated clearly and precisely. The natural flow and clarity make it easy to follow, and every sound is well-defined.
GPT-4o (Real-Time): It sounds like the speaker might be having trouble with clear enunciation, particularly with the "th" sound in words like "he," which could come across as "e." Also, the "t" sound in "at" seemed slightly unclear.
SSDM: Your pronunciation maintains clarity for the majority of the recording, with only a brief stutter from 5.20 to 5.80 seconds.
SSDM 2.0 (Ours): A one-second block occurs within the word 'saw' at the 5-second mark, which introduces a break that impacts the fluidity of pronunciation. This kind of block may temporarily interrupt the speech pattern, creating a slight disruption for the listener. Furthermore, at 8 seconds, there is a phonetic error on the word 'of' after 'spoken,' which could briefly hinder clarity. Beyond these instances, the overall articulation is clear.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Yes, yes, continued Morrel; "recall the scene, for the words you thought were only given to silence and solitude fell into my ears."
SALMONN 13B: Your pronunciation is excellent, exhibiting precise articulation and clarity.
GPT-4o (Real-Time): Unable to transcribe the full sentence due to a block, which introduces an interruption.
SSDM: There appears to be a significant block in the word 'continued' from 1.20 to 4.88 seconds. All other pronunciations sound good.
SSDM 2.0 (Ours): There is a noticeable block after the phoneme 'c' in the word 'continued' at 1.30 seconds, lasting approximately 3 seconds. This interruption affects the flow of speech, potentially impacting listener comprehension by creating a significant pause. Additionally, there is a replacement error for the word 'for' at 8.00 seconds, which may briefly alter the intended meaning. Despite these issues, the rest of the pronunciation remains clear and fluent.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): We have before stated that all the servants had fled.
SALMONN 13B: The pronunciation is perfect.
GPT-4o (Real-Time): In this example, it sounds like the "s" in "stated" is a bit softer, making the word less distinct. There's also a slight pause between "have" and "been," which can disrupt the flow of the sentence.
SSDM: There is a prolongation in the word 'we' from 0.44 to 1.00 seconds, and there is also a stutter in the word 'stated' at 1.00 seconds.
SSDM 2.0 (Ours): In the word 'we,' there is a prolongation of the 'IY' sound at 0.20 seconds, lasting approximately 0.60 seconds. This elongation adds an extra emphasis to the vowel, which may slightly affect the natural rhythm of the speech. Similarly, in the word 'stated,' the 'EY' sound is extended for about 0.60 seconds at 0.88 seconds. These prolonged vowel sounds contribute to a slower pace, which can influence the overall fluency. However, the rest of the pronunciation is consistent and clear.

Instruction: What do you think of the pronunciation?
Reference Text (Optional): Sir, he said, "are you disposed to confer a great obligation on an unhappy father who has just lost his daughter?"
SALMONN 13B: Your pronunciation is truly remarkable, with a natural flow that enhances the clarity of each word.
GPT-4o (Real-Time): In this example, the pronunciation is generally clear
SSDM: There is a significant prolongation in the word 'disposed' from 2.00 to 3.06 seconds. The other pronunciations sound good.
SSDM 2.0 (Ours): There is a prolongation of the 'OW' sound in the word 'disposed' from 2.06 to 2.86 seconds, which introduces an extended emphasis on the vowel. This prolongation may affect the pacing of the word, adding a slight irregularity to the flow. Additionally, in the word 'who,' the 'UW' sound is prolonged for approximately 0.60 seconds at 7.00 seconds. These elongations can subtly impact the natural rhythm of the speech, yet the overall pronunciation remains clear and intelligible.