To develop an automatic continuous speech recognition system for a language, an acoustic model and a language model have to be developed for that particular language. At present, acoustic and language models for continuous speech recognition are not available for Malayalam.
CMU Sphinx is an open source toolkit for speech recognition developed by Carnegie Mellon University. It contains a series of speech recognizers, of which the latest is Sphinx4, an acoustic model trainer (SphinxTrain) and a statistical language model builder (CMUCLMTK). Developing a continuous speech recognition system requires a well-trained acoustic model and language model. An acoustic model is created by processing audio recordings together with their transcriptions to form statistical representations of the sounds that make up each word. A language model describes the likelihood, probability, or penalty taken when a sequence or collection of words is seen. The CMUSphinx project comes with several high-quality acoustic models and language models for languages such as English, French and Spanish.
The aim of this project as a whole is to develop a high-quality acoustic model and language model for Malayalam.
The entire project can be subdivided into four parts:
#Collecting voice data and preparing transcriptions for the acoustic model
#Collecting text corpora for the language model
:*Optimising the database
::Careful selection of voice and text data that better represent the language can be performed at this stage, so that the quality of the resulting acoustic model and language model is improved. Text data can be optimised using grapheme-to-phoneme converters and optimal text selection algorithms. Selecting appropriate speakers and analysing data statistics can lead to better acoustic data.
:*Training the acoustic model
::Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic model. We use SphinxTrain, which is based on HMMs, to train the acoustic model. The quality of the model can be increased significantly by adjusting the parameters of the trainer. The tedious task of finding appropriate language-specific parameter values and configuring the trainer is done during this stage.
:*Building language model
::A language model gives the probabilities of sequences of words. Here, for continuous speech recognition, we use statistical language modelling with CMUCLMTK. Estimating the probability of word sequences is difficult because phrases and sentences can be arbitrarily long, so some sequences are never observed while training the language model (the data sparseness problem, which leads to overfitting). Hence, building a good-quality language model is a challenge.
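The data sparseness problem described above can be illustrated with a toy bigram model. The following is a minimal sketch in plain Python, not the CMUCLMTK pipeline: the function names, the English placeholder corpus and the choice of additive (add-one) smoothing are all assumptions made for illustration. It shows that an unseen word pair gets zero probability under raw counts and a small non-zero probability once smoothing is applied.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Sentence boundary markers let the model score first and last words.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams, smoothing=1.0):
    """Estimate P(word | prev) with additive smoothing.

    smoothing=0.0 gives the raw maximum-likelihood estimate, which is
    only defined when `prev` was seen in the training data.
    """
    vocab_size = len(unigrams)
    numerator = bigrams[(prev, word)] + smoothing
    denominator = unigrams[prev] + smoothing * vocab_size
    return numerator / denominator


# Placeholder corpus (hypothetical; real input would be Malayalam text).
corpus = ["the cat sat", "the dog sat", "the cat ran"]
unigrams, bigrams = train_bigram_counts(corpus)

seen = bigram_probability("the", "cat", unigrams, bigrams)
unseen = bigram_probability("dog", "ran", unigrams, bigrams)
unseen_mle = bigram_probability("dog", "ran", unigrams, bigrams, smoothing=0.0)

print(seen)        # 0.3
print(unseen)      # 0.125
print(unseen_mle)  # 0.0
```

The pair "dog ran" never occurs in the corpus, so the unsmoothed estimate is zero and any sentence containing it would be ruled out entirely. Language modelling toolkits generally use more refined discounting schemes than add-one smoothing, but the motivation is the same.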
#Languages: C, C++, Python, Java, Bash
#Embedded Platforms: Arduino, Atmel AVR, SiliconLabs CIP-
I am passionate about technology and free and open source systems. I am actively involved in the free software community and have volunteered for various free and open source projects in the past. I have participated in several national-level conferences that promote FOSS, such as FOSS.IN and PyCon India.
*Intern as Linux Distribution Developer [ Winter 2010 ]
===Unavailable - May 6th to June 2nd===
University tests and other academic responsibilities.
===June 2nd - June 15th===
I am familiar with the usage of SphinxTrain and CMUCLMTK, so I will use this time to understand and learn to configure the internal parameters of the Sphinx engine to improve the performance of the models produced.
===June 15th - June 30th===
During this period I will collect all the voice data and text corpora required for the acoustic model and language model respectively.
===July 1st - July 15th===
Training the initial acoustic model and building the language model.
===July 16th - July 28th===
Handling any unexpected issues regarding the data collected and finally retraining the models.
The mid-term milestone should provide the community with a reasonably good acoustic model and language model for Malayalam.
Applying optimisations, including grapheme-to-phoneme conversion and optimal text selection algorithms, to the text corpora. Appropriate speakers are also chosen during this period, based on data statistics. Finally, the models are retrained on the optimised data to form the high-quality acoustic model and language model.
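One possible approach to the text selection step is a greedy coverage heuristic: repeatedly pick the sentence that adds the most sound units not yet covered. The sketch below is an assumption about how this could be done, not a committed design. The function name `select_sentences` is hypothetical, and character bigrams stand in for the phoneme sequences that a grapheme-to-phoneme converter would supply in practice.

```python
def select_sentences(sentences, budget, unit_size=2):
    """Greedily choose up to `budget` sentences maximising unit coverage.

    Units are character n-grams of length `unit_size` (a stand-in for
    phoneme n-grams). Returns the chosen sentences and the covered units.
    """
    def units(sentence):
        text = sentence.replace(" ", "")
        return {text[i:i + unit_size] for i in range(len(text) - unit_size + 1)}

    remaining = {s: units(s) for s in sentences}
    covered, selected = set(), []
    while remaining and len(selected) < budget:
        # Pick the sentence contributing the most units not covered so far.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # No remaining sentence adds anything new.
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


# Placeholder strings (hypothetical; real input would be Malayalam sentences).
candidates = ["abab", "abcd", "cdcd", "efgh"]
chosen, covered_units = select_sentences(candidates, budget=2)
print(chosen)  # ['abcd', 'efgh']
```

A greedy choice is not guaranteed to be optimal, but it is simple and keeps the recording script short while still covering a wide range of units.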
===September 1st - September 15th===
This period can be used for general bug fixing and detailed documentation.
I expect to complete a high-quality acoustic model and language model for Malayalam with a low word error rate (WER).
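For reference, WER is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output (substitutions, deletions and insertions), divided by the number of words in the reference. The following is a minimal sketch of that calculation; the function name and the example sentences are illustrative and not part of any Sphinx tool.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.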