A SEGMENT-BASED PROBABILISTIC GENERATIVE MODEL OF SPEECH

Kannan Achan, Sam T. Roweis, Aaron Hertzmann, Brendan J. Frey  

University of Toronto


We have presented a simple segmental Hidden Markov Model for analyzing speech waveforms directly in the time domain and derived an efficient algorithm for MAP inference in this model.  The proposed method directly analyzes the speech wave in an unsupervised fashion and decomposes it into fundamental atomic blocks by identifying waveform samples at the boundaries between glottal pulse periods (in voiced speech) or at the boundaries between unvoiced segments.

Figure: results of our algorithm on an utterance from the Wall Street Journal dataset. Bars above the signal indicate voiced/unvoiced decisions, and upward arrows mark the inferred segment boundaries.

After this segmentation, many disparate speech processing tasks can be performed quite naturally, indicating that we have managed to extract some fundamental structure from the signal. The appeal of our model is that it enables a wide range of applications within a single framework.

..............................................................................................................................

Publication:

A Segment-Based Probabilistic Generative Model of Speech
Kannan Achan, Sam T. Roweis, Aaron Hertzmann, Brendan J. Frey
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), to appear. [ps.gz] [pdf]
A Segmental HMM for Speech Waveforms
Kannan Achan, Sam T. Roweis, Aaron Hertzmann, Brendan J. Frey
Technical Report UTML-TR-2004-001, University of Toronto, (Revised May 2004). [ps.gz] [pdf]

..............................................................................................................................

Applications:
Time scale modification
Time scale modification allows a speech signal to be played at a slower or faster rate without altering the important features in the source. An illustration of how time scale modification is performed using our approach is shown below.

[Figure: illustration of time scale modification using our approach]
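The page does not spell out the mechanism behind the illustration, so the following is only a minimal sketch under one assumption: that duration is changed by dropping or repeating whole inferred segments (for example, glottal pulse periods), which leaves the samples inside each segment untouched. The function name `time_scale` and its parameters are hypothetical, and the sketch does no smoothing at the joins.

```python
def time_scale(samples, boundaries, rate):
    """Change duration by dropping or repeating whole segments.

    samples    -- list of waveform samples
    boundaries -- sorted sample indices of segment boundaries,
                  starting at 0 and ending at len(samples)
    rate       -- playback speed factor (2.0 = twice as fast,
                  0.5 = twice as slow)

    Assumption (not stated in the source): segments are copied whole,
    so the waveform shape inside each segment is preserved.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Cut the waveform at the inferred boundaries.
    segments = [samples[a:b] for a, b in zip(boundaries, boundaries[1:])]
    n_out = int(round(len(segments) / rate))
    out = []
    for k in range(n_out):
        # Map output segment k back to the input segment it copies.
        src = min(int(k * rate), len(segments) - 1)
        out.extend(segments[src])
    return out
```

With `rate` above 1 some segments are skipped, and with `rate` below 1 segments are repeated, so the local period (and therefore the perceived pitch) stays the same while the overall duration changes.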
The table below compares our time scale modification results with SOLA FS, a state-of-the-art time domain technique, and with simple sub-sampling in the time domain. Our results are comparable to those of SOLA FS; in fact, SOLA FS breaks down substantially at higher rates.


Rate         Our algorithm   SOLA FS (time domain variant of SOLA)   Sub-sampling in time domain
2 x faster   [wav]
3 x faster   [wav]
2 x slower   [wav]
3 x slower   [wav]
5 x slower   [wav]

Pitch tracking / Voicing detection
Pitch tracking is trivially achieved by taking the reciprocal of the segment lengths in the voiced regions. Results for an utterance spoken by a female speaker are shown below.
[Figure: pitch track for an utterance spoken by a female speaker]
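The reciprocal-of-segment-length rule can be sketched in a few lines. The function name `pitch_track`, the boundary list, and the per-segment voicing flags are hypothetical stand-ins for the output of the segmentation; the sketch assumes boundaries are given as sample indices.

```python
def pitch_track(boundaries, voiced, sample_rate):
    """Return (time_in_seconds, f0_in_hz) for each voiced segment.

    boundaries  -- sorted sample indices of segment boundaries
    voiced      -- one boolean per segment (len(boundaries) - 1 entries)
    sample_rate -- sampling rate in Hz
    """
    track = []
    spans = zip(boundaries, boundaries[1:])
    for (start, end), is_voiced in zip(spans, voiced):
        if not is_voiced:
            continue  # pitch is undefined for unvoiced segments
        # A segment of (end - start) samples lasts (end - start) / sample_rate
        # seconds; its reciprocal is the fundamental frequency.
        f0 = sample_rate / (end - start)
        track.append((start / sample_rate, f0))
    return track
```

For example, at a 16 kHz sampling rate a voiced segment of 80 samples lasts 5 ms and corresponds to a pitch of 200 Hz.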

Excitation for voiced speech manifests as sharp bursts at integer multiples of the fundamental frequency. Using the pitch estimates obtained with our approach, we have marked the spectrogram of the signal with a few integer multiples of the fundamental frequency; please follow this link.
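As a small illustration of which frequencies would be marked, the sketch below lists the first few integer multiples of a pitch estimate, keeping only those at or below the Nyquist frequency. The function name `harmonic_frequencies` and the default of five harmonics are assumptions made for this example; the source only says "a few".

```python
def harmonic_frequencies(f0_hz, sample_rate, max_harmonics=5):
    """Return k * f0 for k = 1..max_harmonics, limited to the Nyquist band."""
    nyquist = sample_rate / 2.0
    return [k * f0_hz
            for k in range(1, max_harmonics + 1)
            if k * f0_hz <= nyquist]
```

Evaluating this once per voiced segment gives one set of harmonic lines per pitch estimate, which can then be overlaid on a spectrogram.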

Filling in missing/corrupted regions of speech
In this section we present a preliminary experiment on cleaning severely corrupted signals. We created a noisy signal by adding severe noise every 30 ms to the original waveform. A section of this corrupted signal (top) and its reconstruction (bottom) obtained using our method are shown below.



Estimated glottal pulse boundaries are marked by vertical arrows. Our segmentation algorithm treats the corrupted region as unvoiced. We filled in the corrupted region by generating new segments whose periods lie between those of the two bounding voiced regions. The scale factor for the filled-in regions was computed by matching the two bounding segments and interpolating.
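One possible reading of this fill-in step is sketched below. It rests on several assumptions that the text does not confirm: the period is interpolated linearly across the gap, each new segment is a resampled copy of the left bounding segment, and the scale factor is the ratio of peak amplitudes of the two bounding segments, also interpolated linearly. The names `resample` and `fill_gap` are hypothetical.

```python
def resample(segment, new_len):
    """Linearly interpolate a segment to new_len samples."""
    if new_len == 1 or len(segment) == 1:
        return [segment[0]] * new_len
    step = (len(segment) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(segment) - 1)
        frac = pos - lo
        out.append((1 - frac) * segment[lo] + frac * segment[hi])
    return out


def fill_gap(left, right, gap_len):
    """Synthesize gap_len samples between two bounding voiced segments.

    left, right -- sample lists for the voiced segments on either side
    gap_len     -- number of corrupted samples to replace
    """
    peak_l = max(abs(x) for x in left)
    peak_r = max(abs(x) for x in right)
    # Assumed scale factor: ratio of peak amplitudes (right over left).
    gain_r = peak_r / peak_l if peak_l > 0 else 1.0
    filled = []
    while len(filled) < gap_len:
        t = len(filled) / gap_len  # 0 at the left edge, near 1 at the right
        # Period and gain both move linearly from the left to the right value.
        period = max(1, int(round((1 - t) * len(left) + t * len(right))))
        gain = (1 - t) + t * gain_r
        filled.extend(gain * x for x in resample(left, period))
    return filled[:gap_len]
```

The last generated segment is truncated so that the output has exactly `gap_len` samples; a fuller implementation would likely also align the final segment with the right-hand boundary.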
Input signal (corrupted speech)

Denoising using a low pass filter

Restored signal (by filling in using our method)

The low pass filter was applied only to the corrupted regions. Our method offers a significant improvement over denoising using a low pass filter. We are currently investigating more effective schemes for segmentation to further improve our restoration algorithm.
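For comparison, a baseline that filters only the corrupted regions might look like the sketch below. The source does not say which low pass filter was used, so a simple moving average stands in for it here; the function name `lowpass_regions`, the region list, and the default window width are all assumptions.

```python
def lowpass_regions(samples, regions, width=5):
    """Smooth only the samples inside the given (start, end) regions.

    A moving average of `width` samples is used as a stand-in low pass
    filter; samples outside the regions are copied unchanged.
    """
    out = list(samples)
    half = width // 2
    for start, end in regions:
        for i in range(start, end):
            lo = max(0, i - half)
            hi = min(len(samples), i + half + 1)
            window = samples[lo:hi]  # always read from the unfiltered input
            out[i] = sum(window) / len(window)
    return out
```

Such smoothing attenuates the noise but cannot restore the periodic structure that the noise replaced, which is the gap the segment-based fill-in is meant to address.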