Developing Acoustic and Language Model for Malayalam Recognition

Personal Information

#Name : A.R.Rahul
#Email Address : 2ar.rahul@gmail.com
#Telephone : +919446048820
# University and Education : BTech in Computer Science , College of Engineering Trivandrum ( University of Kerala )

Mailing Address :

Arackal 
Thirunelliyoor Lane , Pallimukku  
Peyad P.O 
Thiruvananthapuram - 695573
Kerala

My name is A.R.Rahul. I hail from Kerala, one of the beautiful southern states of India, widely known as 'God's Own Country'. I am 22 years old and I am majoring in computer science. I am passionate about science and technology, and I am a free software enthusiast.

Why do you want to work with the Swathanthra Malayalam Computing ?

I think most of the technological advancements in the field of computer science are inaccessible to the majority of the general public due to the lack of local language support. SMC, with its slogan "എന്റെ കമ്പ്യൂട്ടറിനു് എന്റെ ഭാഷ" (my language for my computer), has always been at the forefront of work to change this. Malayalam being my mother tongue, I believe I can contribute to the SMC community.

Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor?

I have participated in a localisation camp organised by SMC. Other than that, I have also actively participated in developing a GNU/Linux distribution, based on Debian, aimed at students of technical courses (www.rithos.org).

Did you participate with the past GSoC programs, if so which years, which organizations?

No. I am applying to GSoC for the first time.

Do you have other obligations between May and August ?

No. I am confident that I can finish this project in time. I can devote 40 hours a week to this project.

Will you continue contributing/ supporting the Swathanthra Malayalam Computing after the GSoC 2013 program, if yes, which area(s), you are interested in?

Yes. Speech recognition and artificial intelligence are my areas of interest. The scope of this project goes well beyond a single SoC. My dream is to improve the speech recognition engine currently available for Malayalam to a level better than, or at least on par with, what is available for English.

Why should we choose you over other applicants?

For the past four months I have been working on a project that involved building a closed-vocabulary acoustic model for Malayalam. I have good experience working with the Sphinx engine, the speech recognition system that I am going to use to create the acoustic and language models. I have used SphinxTrain and CMUCLMTK, which train the acoustic model and build the language model respectively. I also have experience writing Python scripts that automate the creation of database description files such as the dictionary and the transcription. With this experience, I am confident of creating an acoustic model and a language model for Malayalam with an acceptable WER (word error rate) in time.

Proposal Description

Problem Statement

To develop an automatic continuous speech recognition system for a language, an acoustic model and a language model have to be developed for that language. At present, acoustic and language models for continuous speech recognition are not available for Malayalam.

Synopsis

The project aims at building an acoustic model and a language model for Malayalam using the CMU Sphinx toolkit, which will be very useful for research and development in Malayalam speech recognition and processing. The project also aims at optimising the acoustic data and text corpora used for training in order to improve the quality of the models. The quality of the models will be measured as WER (word error rate).
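As an illustration of the metric, WER is conventionally computed as (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words needed to turn the reference transcript into the recogniser output, and N is the number of words in the reference. The following is a minimal sketch in Python; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example and are not part of the Sphinx tools.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()   # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions are counted, the value can exceed 1.0 when the recogniser outputs many more words than the reference contains.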

Project Proposal

CMU Sphinx is an open source toolkit for speech recognition developed by Carnegie Mellon University. It contains a series of speech recognizers, of which the latest is Sphinx4, an acoustic model trainer (SphinxTrain) and a statistical language model builder (CMUCLMTK). Developing a continuous speech recognition system requires a well-trained acoustic model and language model. An acoustic model is built by processing audio recordings together with their transcriptions to form statistical representations of the sounds that make up each word. A language model describes the likelihood, probability, or penalty taken when a sequence or collection of words is seen. The CMU Sphinx project comes with several high-quality acoustic models and language models for languages such as English, French and Spanish.

The aim of this project as a whole is to develop a high-quality acoustic model and language model for Malayalam.

The entire project can be subdivided into four parts:

Collection of data

Collecting voice data and preparing transcriptions for the acoustic model

Collecting text corpora for language model

Optimising the database

At this stage, voice and text data that better represent the language are carefully selected, so that the quality of the resulting acoustic model and language model improves. Text data can be optimised using grapheme-to-phoneme converters and optimal text selection algorithms. Selecting appropriate speakers and analysing data statistics can lead to better acoustic data.
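One common way to approach optimal text selection is a greedy coverage search: repeatedly pick the sentence that adds the most sound units not yet covered by the sentences already chosen. The sketch below illustrates the idea only. It uses character bigrams as a rough stand-in for phone pairs, which is an assumption made for this example; a real pipeline would first run the text through a grapheme-to-phoneme converter. The names `units` and `select_sentences` are hypothetical.

```python
def units(sentence):
    """Return the set of character bigrams in a sentence.

    Assumption: character bigrams approximate phone pairs. For Malayalam
    the text would normally be converted to phonemes first.
    """
    text = sentence.replace(" ", "")
    return {text[i:i + 2] for i in range(len(text) - 1)}


def select_sentences(corpus, max_sentences):
    """Greedily choose sentences that add the most uncovered units."""
    covered = set()
    selected = []
    remaining = list(corpus)
    while remaining and len(selected) < max_sentences:
        # Pick the sentence contributing the largest number of new units.
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break  # nothing left in the corpus adds new coverage
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    corpus = ["abab", "abcd", "cdcd"]
    print(select_sentences(corpus, 2))
```

The greedy choice does not guarantee the smallest possible recording script, but it is simple and keeps the number of sentences each speaker has to read low.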

Training the acoustic model

Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic model. We use SphinxTrain, which is based on HMMs, to train the acoustic model. The quality of the model can be increased significantly by adjusting the parameters of the trainer. The tedious task of finding appropriate language-specific parameter values and configuring the trainer is done during this stage.
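To make the HMM idea concrete, the sketch below computes the likelihood of an observation sequence under a small HMM using the forward algorithm. It is a toy illustration under stated assumptions: observations are discrete symbol indices standing in for real feature vectors, and all the probabilities are made up for the example. It does not reproduce how SphinxTrain represents or trains its models.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Return P(observations | model) for a discrete-emission HMM.

    initial[s]        probability of starting in state s
    transition[p][s]  probability of moving from state p to state s
    emission[s][o]    probability of state s emitting symbol o
    """
    if not observations:
        raise ValueError("need at least one observation")
    n_states = len(initial)

    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs]
            * sum(alpha[p] * transition[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Hypothetical two-state model with two possible observation symbols.
    initial = [0.6, 0.4]
    transition = [[0.7, 0.3], [0.4, 0.6]]
    emission = [[0.5, 0.5], [0.1, 0.9]]
    print(forward_likelihood([0, 1], initial, transition, emission))
```

Training adjusts the transition and emission parameters so that this likelihood is high for the recorded utterances of each sound unit.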

Building language model

A language model gives the probabilities of sequences of words. For continuous speech recognition we use statistical language modelling with CMUCLMTK. Estimating the probability of word sequences is difficult because phrases and sentences can be arbitrarily long, so many valid sequences are never observed in the training corpus (the data sparseness problem). Hence building a good-quality language model is a challenge.
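The data sparseness problem can be seen in a small bigram model. Without smoothing, any word pair that never occurred in the training text would receive probability zero. The sketch below uses add-one (Laplace) smoothing purely because it is the simplest remedy to show; it is an assumption of this example and not the discounting that CMUCLMTK applies. The function names are hypothetical.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their contexts over whitespace-tokenised sentences."""
    contexts = Counter()
    bigrams = Counter()
    vocabulary = {"</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(words[1:])          # every word that can be predicted
        contexts.update(words[:-1])           # every word used as a history
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, vocabulary


def bigram_probability(model, previous, word):
    """Return the add-one smoothed estimate of P(word | previous)."""
    contexts, bigrams, vocabulary = model
    return (bigrams[(previous, word)] + 1) / (contexts[previous] + len(vocabulary))


if __name__ == "__main__":
    model = train_bigram_model(["a b", "a c"])
    print(bigram_probability(model, "a", "b"))  # pair seen in training
    print(bigram_probability(model, "b", "a"))  # unseen pair, still non-zero
```

Smoothing moves some probability mass from observed pairs to unobserved ones, which is why the choice of discounting method affects the final WER.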

Experience

Technical Skills

Languages : C,C++,Python,Java,Bash
Software Packages : GDB , Emacs , Eclipse
Embedded Platforms : Arduino , Atmel AVR , SiliconLabs CIP-

Free Software

I am passionate about technology and free and open source systems. I am actively involved in the free software community and have volunteered for various free/open projects in the past. I have participated in various national-level conferences that promote FOSS, such as FOSS.IN and PyCon India.

Developer Experience

Intern as Linux Distribution Developer [ Winter 2010 ]

Zyxware Technologies,

KD Road, Marappalam, Pattom P.O, Trivandrum

Developed a Linux distribution aimed at students of technical courses. ( http://www.rithuos.org )

Jukebox Software [ June 2011 ]

The project involved developing a user-friendly Java application designed to operate a partially automated music-playing device that plays a patron's selection from self-contained media.

Automation of Equatorial Platform of a Telescope [ January 2012 ]

The project involved developing hardware to automate the movement of the equatorial platform of a telescope. The chip used was an ATmega16. Programming was done using micro-c.

Linux From Scratch [June 2013 ]

Built a compact Linux system entirely from source code to understand the internal workings of a Linux distribution.

Swaram-Malayalam Speech Recognition System [Jan - June 2013]

Swaram is a free software initiative aimed at recognising Malayalam speech. The initial goal of the project was to extend the language support of the CMU Sphinx engine and use it to recognise Malayalam.

Link : https://github.com/jerrin001/swaram.git

Timeline

Unavailable - May 6th to June 2nd

University tests and other academic responsibilities.

June 2nd - June 15th

I am already familiar with the usage of SphinxTrain and CMUCLMTK, so I will use this time to understand and learn to configure the internal parameters of the Sphinx engine in order to improve the performance of the resulting models.

June 15th - June 30th

During this period I will collect all the voice data and text corpora required for the acoustic model and language model respectively.

July 1st - July 15th

Training the initial acoustic model and building the language model.

July 16th - July 28th

Handling any unexpected issues with the collected data and then retraining the models.

Midterm Evaluation

By the midterm evaluation, the community should have a reasonably good acoustic model and language model for Malayalam.

August

Applying optimisations, including grapheme-to-phoneme conversion and optimal text selection algorithms, to the text corpora. Appropriate speakers are also chosen during this period, based on data statistics. Finally, the models are retrained on the optimised data to produce a high-quality acoustic model and language model.

September 1st - September 15th

This period is reserved for general bug fixing and detailed documentation.

Final Evaluation

I expect to complete a high-quality acoustic model and language model for Malayalam with a low WER (word error rate).

Pens Down

Improving documentation and final touch-ups.

Mentor

My proposal is based on my discussion with Deepa P Gopinath, SMC mentor. We have discussed the various challenges I might face during this project, and I am confident that I can complete it in time under her mentoring.
