Phoneme Segmentation of Tamil Speech Signals Using Spectral Transition Measure

The process of identifying the end points of the acoustic units of a speech signal is called speech segmentation. Speech recognition systems can be designed using sub-word units such as phonemes. A phoneme is the smallest unit of a language. It is context dependent, and its boundaries are tedious to find. Automated phoneme segmentation has been carried out in research using short-term energy, convex hull, formants, the Spectral Transition Measure (STM), group delay functions, the Bayesian Information Criterion, etc. In this research work, STM is used to find the phoneme boundaries of Tamil speech utterances. A Tamil spoken-word dataset was prepared with 30 words uttered by 4 native speakers and recorded with a high-quality microphone. The performance of the segmentation is analysed and the results are presented.


INTRODUCTION
In natural languages, speech is considered a sequential chain of phonemes. The automatic segmentation of speech using only the phoneme sequence is an important task, especially if manually pre-segmented sentences are not available for training. Segmented speech databases are useful for many purposes, mainly for training phoneme-based speech recognizers [1]. Such an automatic segmentation can be used as the primary input data to train more powerful systems such as those based on Hidden Markov Models or Artificial Neural Networks [2].
In linguistics, a phoneme is defined as the minimal distinct information-bearing unit [3]. In the acoustic realization, phoneme boundaries are difficult to define with certainty. In the short-time representation, the speech signal is considered stationary and the voiced segments quasi-periodic. In statistical phoneme segmentation, it is assumed that the properties of the speech signal change when a transition occurs from one phoneme to the next. In reality, however, the transitions occur smoothly because of the influence of the adjacent phonemes.
Automatic Speech Segmentation (ASS) methods can be classified into two categories. In the first category, a transcription of the speech, expressed in the basic acoustic units that the Automatic Speech Recognition (ASR) system can handle, is available. Segmentation algorithms can use this transcription to find the sub-word boundaries and the number of sub-units. In the second category, no prior linguistic knowledge is available. Segmentation algorithms of this type are also called blind segmentation algorithms; the number of sub-word units and their boundaries are found from acoustic cues only [4].
In modern ASR approaches, the concatenation principle is used to represent words by their successive phonemes. Since phonemes are context dependent, context-dependent models such as triphones and demiphones have also been proposed, in which the fundamental unit is the phoneme and words are represented in the pronunciation lexicon as concatenations of phonemes [5].

Related Work
Sharma and Mammone [6] designed a Level Building Dynamic Programming (LBDP) based speech segmentation, a dynamic-programming algorithm that optimally locates the sub-word boundaries by minimizing a distortion metric. They proposed a novel blind speech segmentation procedure that determines the optimal number of sub-word units present in a given speech sample, as well as the boundary locations, from acoustic cues without any linguistic knowledge. Odette Scharenborg et al. [7] investigated the fundamental problems in unsupervised segmentation algorithms. The authors compared phoneme segments obtained using only the acoustic information derived from the signal with reference segments created by human annotators. From the experiments, it is concluded that acoustic change is a fairly good indicator of segment boundaries, and the errors were shown to be related to segment duration, sequences of similar segments, and inherently dynamic phones.
To improve unsupervised automatic speech segmentation, it is suggested to replace one-stage bottom-up segmentation with two-stage segmentation methods that use both bottom-up information extracted from the speech signal and automatically derived top-down information.
Ziółko, Bartosz, et al. [8] proposed a new phoneme segmentation method based on the analysis of Discrete Wavelet Transform (DWT) spectra. The values of power envelopes and their first derivatives for six frequency sub-bands were used for segmentation of the Polish language [9]. Specific scenarios that are typical of phoneme boundaries are searched for first; the discrete times at which such events occur are then recorded and graded using a distribution-like event function. This event function represents the change of the energy distribution in the frequency domain. Finally, the decision on the localization of boundaries is taken from the analysis of the event function. Boundaries are extracted using information from all sub-bands. The method was tested on a small set of hand-segmented Polish words and on a larger corpus containing 16,425 utterances; recall and precision were used to measure the quality of speech segmentation, with an F-score of 72.49%.
Kuo et al. [10] presented an improved HMM/SVM method for a two-stage phoneme segmentation framework. The first stage performs hidden Markov model (HMM) forced alignment according to the Minimum Boundary Error (MBE) criterion, and the second stage uses the support vector machine (SVM) method to refine the hypothesized phoneme boundaries derived by HMM-based forced alignment. The first stage is designed to align phoneme boundaries based on MBE-trained HMMs and explicit phoneme duration models. They tested their method on the TIMIT database and the MATBN Mandarin Chinese database.
Mousmita Sarma et al. [11] described an Artificial Neural Network (ANN) based algorithm for the segmentation and recognition of the vowel phonemes of the Assamese language from words containing vowels. A Self-Organizing Map (SOM) was trained and used to segment each word into its constituent phonemes. A Probabilistic Neural Network (PNN) trained with clean vowel phonemes was used to recognize the vowel segment. The experimental speech samples were recorded from five female speakers and five male speakers. In the authors' proposed method, the first formant frequency of all the Assamese vowels was predetermined by estimating the pole or formant location from the linear prediction (LP) model of the vocal tract. The proposed algorithm showed high recognition performance in comparison with conventional Discrete Wavelet Transform (DWT) based segmentation.
Because segmentation methods based on spectral distortion measures are computationally inexpensive [12,13], an attempt is made to segment Tamil speech utterances into phonemes using the spectral transition measure; the outline of the method is given in Fig. 2.

Data Set
Tamil speech utterances consisting of 30 unique Tamil words, constituting 172 phonemes, uttered by 4 native speakers were recorded with a unidirectional microphone and used as the data set. The data were recorded with the Audacity recording tool in a normal room.

Mathematical Formulation of Segmentation
The problem of speech segmentation is described in Fig. 1. Let $X = \{x_1, x_2, \dots, x_N\}$ denote the sequence of mel cepstrum vectors calculated for each frame of every uttered Tamil word in the data set, where $N$ is the number of speech frames and $x_i$ is the $p$-dimensional parameter vector at frame $i$. The objective of the segmentation problem is to divide the sequence $X$ into $M$ non-overlapping consecutive segments, where each sub-sequence corresponds to a phoneme. Let the boundaries of the segments be denoted by the sequence of integers $\{B_0, B_1, \dots, B_M\}$. The $i$-th segment starts at frame $B_{i-1}+1$ and ends at frame $B_i$, where $B_0 = 0$ and $B_M = N$.
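The boundary convention above can be illustrated with a short Python sketch. The function name and the example boundary values are hypothetical; the code only assumes the stated convention that segment i covers frames B(i-1)+1 to B(i), with the first boundary equal to 0 and the last equal to N.

```python
def segments_from_boundaries(boundaries):
    """Turn [B_0, ..., B_M] into 1-based inclusive (start, end) frame ranges."""
    if len(boundaries) < 2 or boundaries[0] != 0:
        raise ValueError("need at least B_0 and B_M, with B_0 == 0")
    if any(b >= c for b, c in zip(boundaries, boundaries[1:])):
        raise ValueError("boundaries must be strictly increasing")
    # Segment i starts one frame after the previous boundary and ends at B_i.
    return [(prev + 1, cur) for prev, cur in zip(boundaries, boundaries[1:])]


# Hypothetical utterance with N = 12 frames split into M = 3 segments.
example = segments_from_boundaries([0, 4, 9, 12])
print(example)
```

Every frame from 1 to N belongs to exactly one segment, which is the non-overlapping, consecutive property required of the segmentation.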

METHOD
Compared to generative methods based on HMMs, phonemic segmentation methods based on spectral distortion measures are independent of linguistic constraints. The recordings were made with minimum external noise at a sampling rate of 16 kHz; a description of the data used is given in Table 1.

Preprocessing and Feature Extraction
The signal, sampled at 16 kHz, is decomposed into a sequence of overlapping frames. A frame size of 25 ms and a frame shift of 10 ms were used for the segmentation approach considered. The input speech data are pre-emphasized with a coefficient of 0.97 using a first-order digital filter. The samples are weighted by a Hamming window to avoid spectral distortions.
The windowed frame obtained after Hamming weighting is used to extract twelve Mel Frequency Cepstral Coefficients (MFCC). The zeroth-order coefficient usually represents the total energy; in our experiment, the vector excluding the total energy is used for further processing. Short-time Fourier transform analysis is performed to compute the magnitude spectrum. A filter bank of triangular filters uniformly spaced on the mel scale, with 300 Hz and 3400 Hz as the lower and upper frequency limits, is used. The filter bank is applied to the magnitude spectrum values to produce 20 Filter Bank Energies (FBEs) per frame. The log-compressed FBEs are then de-correlated using the Discrete Cosine Transform (DCT) to produce cepstral coefficients. The coefficients obtained are then rescaled to similar magnitudes through liftering with the parameter L = 22. The steps involved in the MFCC feature extraction are shown in Fig. 3.
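The preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, pre-emphasis is applied to the whole signal before framing, and the first sample is passed through unchanged. The text fixes only the 0.97 coefficient, the 25 ms frame, the 10 ms shift and the Hamming window.

```python
import math


def pre_emphasize(signal, coeff=0.97):
    """First-order filter y[n] = x[n] - coeff * x[n-1]; y[0] = x[0] (assumption)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming(length):
    """Standard Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping Hamming-weighted frames."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    window = hamming(frame_len)
    frames = []
    # Trailing samples that do not fill a whole frame are dropped (assumption).
    for start in range(0, len(signal) - frame_len + 1, shift):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# One second of a synthetic 440 Hz tone stands in for a recorded utterance.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = frame_signal(pre_emphasize(tone))
print(len(frames), len(frames[0]))
```

With a 10 ms shift, frame index m corresponds to a time offset of roughly 10·m ms, which is what later allows boundaries in frames to be compared with manual boundaries in milliseconds.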
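The last stages of the MFCC pipeline can be sketched as below. Several details are assumptions that the text does not specify: the mel formula 2595·log10(1 + f/700), an unscaled DCT-II, and the sinusoidal lifter 1 + (L/2)·sin(πn/L) commonly used in HTK-style front ends. The function names are hypothetical. Only the 300 Hz to 3400 Hz range, the 20 filters, the 12 coefficients without the zeroth term, and L = 22 come from the description.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)   # assumed mel formula


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(n_filters=20, f_low=300.0, f_high=3400.0):
    """Return n_filters + 2 edge frequencies (Hz), uniformly spaced in mel.

    Triangular filter k rises from edge k to edge k+1 and falls to edge k+2.
    """
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + k * step) for k in range(n_filters + 2)]


def cepstra_from_log_fbe(log_fbe, n_ceps=12, lifter=22):
    """DCT-II of log filter bank energies, skipping c0, then liftering."""
    n_filters = len(log_fbe)
    ceps = []
    for n in range(1, n_ceps + 1):               # n = 0 (energy term) is excluded
        c = sum(e * math.cos(math.pi * n * (j + 0.5) / n_filters)
                for j, e in enumerate(log_fbe))
        weight = 1.0 + (lifter / 2.0) * math.sin(math.pi * n / lifter)
        ceps.append(c * weight)
    return ceps


edges = mel_filter_edges()
flat = cepstra_from_log_fbe([1.0] * 20)   # a flat spectrum has no cepstral detail
print(len(edges), len(flat))
```

A flat log spectrum yields cepstral coefficients that are numerically zero for n ≥ 1, which is a quick sanity check that the DCT stage removes the constant (energy-like) component.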

Spectral Transition Measure (STM)
Mitchell et al. [13] introduced the Spectral Variation Function (SVF), calculated as the angle between two normalized cepstral vectors, for phoneme segmentation. The spectral transition measure employed in this study is the same as that proposed in [14]. Dusan and Rabiner [14] detected the frames of maximum spectral transition as phoneme boundaries, where the spectral transition represents the magnitude of the spectral rate of change. This spectral transition measure (STM) at frame m can be computed as a mean squared value, as in Eq. 1.
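$$STM(m) = \frac{1}{D} \sum_{i=1}^{D} a_i^2(m) \qquad (1)$$

where $D$ is the dimension of the spectral feature vector, which is 12 coefficients in this experiment (without the gain term), and $a_i(m)$ is the regression coefficient, or rate of change, of the $i$-th spectral feature at frame $m$, computed using Eq. 2.

$$a_i(m) = \frac{\sum_{n=-I}^{I} n \cdot MFCC_i(n+m)}{\sum_{n=-I}^{I} n^2} \qquad (2)$$

where $MFCC_i(n+m)$ is the $i$-th cepstral coefficient at frame $n+m$, $n$ represents the frame index, and $I$ defines the number of frames included on both sides of the current frame to form a symmetric window for computing the regression coefficients. In this experiment, $I$ = 1, 2 and 3 are used.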
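The STM computation can be sketched in plain Python as below. The regression coefficient of each cepstral coefficient is taken as the least-squares slope over the symmetric window of I frames on either side, and the STM is the mean of the squared slopes. The function names are hypothetical, and repeating the first and last frames at the utterance edges is an assumption, since the text does not say how edges are handled.

```python
def regression_coefficients(features, m, half_width):
    """Slope of each coefficient over frames m - I .. m + I (I = half_width)."""
    denom = sum(n * n for n in range(-half_width, half_width + 1))
    last = len(features) - 1
    slopes = []
    for i in range(len(features[0])):
        num = 0.0
        for n in range(-half_width, half_width + 1):
            idx = min(max(m + n, 0), last)   # assumption: repeat edge frames
            num += n * features[idx][i]
        slopes.append(num / denom)
    return slopes


def spectral_transition_measure(features, half_width=2):
    """Mean squared regression coefficient for every frame."""
    stm = []
    for m in range(len(features)):
        slopes = regression_coefficients(features, m, half_width)
        stm.append(sum(a * a for a in slopes) / len(slopes))
    return stm


# Synthetic 2-dimensional features rising linearly with slopes 1 and 2 per frame.
ramp = [[float(t), 2.0 * t] for t in range(10)]
stm_ramp = spectral_transition_measure(ramp, half_width=2)
print(stm_ramp[5])   # interior frame: (1**2 + 2**2) / 2
```

For real data, `features` would hold one 12-dimensional MFCC vector per frame, and the peaks of the returned sequence are the candidate phoneme boundaries.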

Phoneme boundary Detection
Phoneme boundary detection can be defined as a two-step process: a peak-picking method and a post-processing method for removing local boundaries. In the proposed method, the spectral transition measure is computed for every frame. Then, from the STM values of all frames, the locations of all peaks that are preceded by a negative region are identified; these regions are referred to as valleys.

Experiment and Discussions
The segmentation method using the spectral transition measure was evaluated with a frame tolerance of ±(10 ms to 30 ms). The dataset developed contains 688 boundaries, excluding $B_M$. The optimal tolerance is taken as the value for which the peak-picking procedure gives the same number of segments as the manual segmentation of the Tamil word considered. The window length parameter I was assigned values of 1, 2 and 3 and optimized for the maximum segmentation match with the manually segmented data and for the number of boundaries detected within a tolerance window of ±(10 ms to 30 ms). The automatic segment boundaries obtained with I = 2 and a tolerance of ±(10 ms to 30 ms) are shown in Table 2. The performance of phoneme segmentation with respect to the manually segmented Tamil utterances is reported using three measures: percentage of match (%M), percentage of insertion (%I) and percentage of deletion (%D), within a tolerance window of ±(10 ms to 30 ms).
Percentage of Insertion (%I) gives the percentage of segments obtained by automatic segmentation that have no corresponding manual segment within the tolerance window. Percentage of Deletion (%D) gives the percentage of segments obtained by the manual method that have no corresponding automatically segmented boundary within the tolerance window.
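One way to compute the three measures is sketched below. The greedy one-to-one matching, the use of the number of manual boundaries as the denominator for all three percentages, and the function name are assumptions made for illustration; the text only defines the measures relative to a tolerance window. The boundary times in the example are made up.

```python
def score_boundaries(manual_ms, auto_ms, tolerance_ms=20):
    """Return %M, %D and %I for automatic boundaries against manual ones."""
    if not manual_ms:
        raise ValueError("at least one manual boundary is required")
    unused = sorted(auto_ms)
    matched = 0
    for ref in sorted(manual_ms):
        candidates = [a for a in unused if abs(a - ref) <= tolerance_ms]
        if candidates:
            # Each automatic boundary may match at most one manual boundary.
            unused.remove(min(candidates, key=lambda a: abs(a - ref)))
            matched += 1
    total = len(manual_ms)
    return {
        "%M": 100.0 * matched / total,             # manual boundaries found
        "%D": 100.0 * (total - matched) / total,   # manual boundaries missed
        "%I": 100.0 * len(unused) / total,         # extra automatic boundaries
    }


# Hypothetical boundary times in milliseconds for one word.
scores = score_boundaries([100, 250, 400, 520], [110, 240, 330, 410, 600],
                          tolerance_ms=20)
print(scores)
```

In this example three of the four manual boundaries are matched, one is missed, and two automatic boundaries have no manual counterpart.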

CONCLUSION
In this paper, the phoneme boundaries of Tamil speech utterances are found using the spectral transition measure. The performance of the segmentation is analysed in terms of the percentage of matches with the manual boundaries. Better alignment techniques are suggested as future work to obtain better results. Furthermore, linguistic units larger than the phoneme may also be considered.

Fig. 3: MFCC feature extraction

The peak-picking method is applied to select prominent peaks with deep valleys. Peaks that lie within 40 ms to 60 ms of another peak are considered for elimination. After the peaks are chosen, the frames in which they occur are identified and used to perform phoneme segmentation. Let $M$ be the number of segments and let the boundaries be $\{B_0, B_1, \dots, B_M\}$. Then the $M-1$ most significant peaks are to be obtained, i.e. $B_1, \dots, B_{M-1}$, with $B_0 = 0$ and $B_M = N$.
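A simplified sketch of this peak-picking step is given below. It ranks local maxima of the STM by height, which is only a stand-in for the "prominent peaks with deep valleys" criterion, and it discards any peak closer than a minimum gap to a stronger accepted peak. The function name and the default gap of 5 frames (50 ms at a 10 ms frame shift, inside the 40 ms to 60 ms range mentioned) are assumptions.

```python
def pick_boundaries(stm, n_segments, min_gap_frames=5):
    """Return sorted frame indices of the n_segments - 1 strongest STM peaks."""
    # Step 1: local maxima of the STM contour.
    peaks = [m for m in range(1, len(stm) - 1)
             if stm[m - 1] < stm[m] >= stm[m + 1]]
    # Step 2: accept peaks from strongest to weakest, skipping any peak that
    # lies too close to an already accepted (stronger) one.
    kept = []
    for m in sorted(peaks, key=lambda p: stm[p], reverse=True):
        if len(kept) >= n_segments - 1:
            break
        if all(abs(m - k) >= min_gap_frames for k in kept):
            kept.append(m)
    return sorted(kept)


# Synthetic STM contour with four local maxima at frames 1, 4, 10 and 12.
contour = [0.0, 1.0, 0.0, 0.2, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.5, 0.0]
picked = pick_boundaries(contour, n_segments=3)
print(picked)
```

The peak at frame 4 is dropped because it is within 5 frames of the stronger peak at frame 1, so the two interior boundaries for the three segments are frames 1 and 10.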

Fig. 4: (a) Speech signal of sample Tamil word 'PERIDHU' with manual boundary (b) Spectral Transition Measure of the word (c) boundaries found using automated STM (d) boundary found as Insertion and Deletion.