From SMC Wiki

User:Ar rahul/GSoC2013/

7,414 bytes added, 05:33, 26 January 2017
Reverted edits by Sperminator (talk) to last revision by Ar rahul
== Personal Information ==
#Name : A.R.Rahul
#Email Address : 2ar.rahul@gmail.com
#Telephone : +919446048820
# University and Education : BTech in Computer Science , College of Engineering Trivandrum ( University of Kerala )
#Mailing Address : Arackal Thirunelliyoor Lane, Pallimukku Peyad P.O, Thiruvananthapuram - 695573
My name is A.R.Rahul. I hail from Kerala, one of the beautiful southern states of India, widely known as 'God's Own Country'. I am 22 years old and I am majoring in computer science. I am passionate about science and technology, and I am a free software enthusiast.
=== Why do you want to work with the Swathanthra Malayalam Computing ? ===
I think most technological advancements in computer science are inaccessible to the majority of the general public because of the lack of local language support. SMC, with its slogan "എന്റെ കമ്പ്യൂട്ടറിനു് എന്റെ ഭാഷ" (my language for my computer), has always been at the forefront of work on this problem. Malayalam being my mother tongue, I believe I can contribute to the SMC community.
=== Do you have any past involvement with the Swathanthra Malayalam Computing or another open source project as a contributor? ===
== Proposal Description ==
 
===Problem Statement===
 
Malayalam is one of the 22 scheduled languages of India, with about 38 million speakers. Development of Malayalam speech recognition systems is in its infancy, although much work has been done for other Indian languages. To develop an automatic continuous speech recognition system for a language, an acoustic model and a language model have to be developed for that particular language. At present, acoustic and language models for continuous speech recognition are not available for Malayalam.
 
===Synopsis===
The project aims to build an acoustic model and a language model for Malayalam using the CMUSphinx toolkit, which will be useful for research and development in Malayalam speech recognition and processing. The project also aims to apply optimisations to the acoustic data and text corpora used for training, in order to improve the models. The performance of the models will be measured as WER (Word Error Rate).
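WER is usually defined as the number of word substitutions, deletions and insertions needed to turn the recognised text into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that calculation, assuming whitespace-separated words; the function name <code>word_error_rate</code> is illustrative and is not part of CMUSphinx.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because insertions are counted, WER can exceed 1 when the recogniser outputs many more words than the reference contains.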
===Project Proposal===
CMU Sphinx is an open source toolkit for speech recognition developed by Carnegie Mellon University. It contains a series of speech recognizers, of which the latest is Sphinx4, an acoustic model trainer (SphinxTrain) and a statistical language model builder (CMUCLMTK). To develop a continuous speech recognition system we need a well-trained acoustic model and language model. An acoustic model is built from audio recordings and their transcriptions, from which statistical representations of the sounds that make up each word are formed. A language model describes the likelihood, probability, or penalty taken when a sequence or collection of words is seen. The CMUSphinx project comes with several high-quality acoustic models and language models for languages like English, French, Spanish etc. The aim of this project as a whole is to develop an acoustic model and a language model for Malayalam with a reasonable WER (Word Error Rate).

The entire project can be subdivided into four parts:
:*Collection of data
::1. Collecting voice data and making transcriptions for the acoustic model
::2. Collecting text corpora for the language model
:*Optimising the database
:::Careful selection of voice and text data that better represent the language can be performed at this stage, so that the quality of the acoustic model and language model can be improved. Text data can be optimised using grapheme-to-phoneme converters and optimal text selection algorithms. Selecting appropriate speakers and analysing data statistics can lead to better acoustic data.
:*Training the acoustic model
:::Acoustic modelling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic model. We use SphinxTrain, which is based on HMMs, to train the acoustic model. The quality of the model can be increased significantly by adjusting the parameters of the trainer (SphinxTrain). The tedious task of finding the appropriate language-specific parameter values and configuring the trainer is done during this stage.
:*Building the language model
:::A language model gives the probabilities of sequences of words. Here, for continuous speech recognition, we use statistical modelling of language using CMUCLMTK. Estimating the probability of sequences is difficult because phrases or sentences in a corpus can be arbitrarily long, so some sequences are never observed while training the language model (the data sparseness problem, which leads to overfitting). Hence forming a good quality language model is a challenge.

'''Examples of language specific challenges'''

Malayalam has 37 consonants and 16 vowels. It is a syllable-based language written with a syllabic alphabet in which all consonants have an inherent vowel /a/. There are different spoken forms of Malayalam, although the literary dialect is almost uniform throughout Kerala.
:*Speakers have a hard time pronouncing breathy-voiced plosives and tend to substitute them with voiceless aspirated ones at the same place of articulation.
:*ഫ (ph’a) is pronounced differently in ഫലം and ഫാന്‍. ന (na) is pronounced either as a dental nasal or as an alveolar nasal even though the grapheme is the same (e.g. നനക്കുക (nan’naykkuka)). For such cases, phonological rules were applied manually and the dictionary was edited.
:*In continuous speech, word boundaries are also challenging. For instance, the word "thalasthanam" (തലസ്ഥാനം) can be misconstrued as "thala sthanam" (തല സ്ഥാനം).
:*The articulation of certain phonemes is context dependent. For example, the words ബലം and ജലം are pronounced as ബെലം and ജെലം respectively.
:*The prosody of spoken Malayalam makes it difficult to correctly identify the sound units (phonemes).

In order to address these language-specific issues of Malayalam speech recognition we need a working acoustic model and language model, which are unfortunately either unavailable or in a rudimentary state for Malayalam. Our aim is to develop a high-quality working acoustic and language model and thereafter address the language-specific issues one by one, as far as possible within the limited time.
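To make the data sparseness problem concrete, the sketch below estimates bigram probabilities from a toy corpus and uses add-one (Laplace) smoothing so that word pairs never seen in training still receive a non-zero probability. This is an illustration only: the class name <code>BigramModel</code> and the transliterated toy sentences are assumptions, and CMUCLMTK offers its own discounting strategies, which are not reproduced here.

```python
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.history_counts = Counter()
        self.bigram_counts = Counter()
        self.vocabulary = {"</s>"}  # every token that can be predicted
        for sentence in sentences:
            words = sentence.split()
            self.vocabulary.update(words)
            tokens = ["<s>"] + words + ["</s>"]
            # Count each token as a history and each adjacent pair as a bigram.
            self.history_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def probability(self, previous, word):
        """P(word | previous); unseen pairs keep a small non-zero share."""
        seen = self.bigram_counts[(previous, word)]
        total = self.history_counts[previous]
        return (seen + 1) / (total + len(self.vocabulary))


# Toy corpus of transliterated phrases, for illustration only.
model = BigramModel(["thala sthanam", "thala vedana"])
print(model.probability("thala", "sthanam"))   # pair seen in training
print(model.probability("sthanam", "vedana"))  # unseen pair, still non-zero
```

Without smoothing, the second query would return zero and any sentence containing that pair would be ruled out entirely, which is why the choice of discounting matters for a small corpus.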
===Benefits===
# Language data is a key ingredient for research and development in the area of language technology. The data (speech corpora and text corpora) collected for the project will be made publicly available for future work.
# A high-quality acoustic model and language model for Malayalam with low WER (Word Error Rate) will be developed, which can be used for research and development in Malayalam speech recognition and processing.
# The acoustic and language models developed can be used directly by programmers/developers to create solutions to many existing problems that need speech recognition in the local language.
===Challenges===
# Lack of appropriate annotated speech databases for Malayalam.
# Careful selection of voice and text data that better represent the language.
# Understanding the CMU Sphinx engine and its tools in order to make language-specific improvements and increase efficiency.
# Finding the appropriate language-specific parameter values and configuring the trainer.
===References===
#[http://cmusphinx.sourceforge.net/wiki/ CMU Sphinx Website]
#[http://www.speech.cs.cmu.edu/sphinx/tutorial.html Learning to use the CMU SPHINX Automatic Speech Recognition system]
#[http://cmusphinx.sourceforge.net/wiki/tutoriallm Building Language Model]
#[http://www.cambridge.org/gb/knowledge/isbn/item1150358/?site_locale=en_GB Introducing Speech and Language Processing - John Coleman]
==Experience==
===Technical Skills===
#Languages : C,C++,Python,Java,Bash
#Embedded Platforms : Arduino , Atmel AVR , SiliconLabs CIP-
===Free Software===
I am passionate about technology and free and open source systems. I am actively involved in the free software community and have volunteered for various free/open projects in the past. I have participated in various national-level conferences that promote FOSS (like FOSS.IN and PyCon India).
===Developer Experience===
#*Intern as Linux Distribution Developer [ Winter 2010 ]
::Zyxware Technologies,
::KD Road,Marappalam Pattom P.O ,Trivandrum
:::Developed a Linux Distribution aimed at technical courses . ( http://www.rithuos.org )
#*Jukebox Software [ June 2011 ]
::The project involved developing a Java-based, user-friendly application designed to operate a partially automated music-playing device that plays a patron's selection from self-contained media.
#*Automation of the Equatorial Platform of a Telescope [ January 2012 ]
::The project involved developing hardware to automate the movement of the equatorial platform of a telescope. The chip used was an ATmega16. Programming was done using micro-c.
#*Linux From Scratch [June 2013 ]
::Built a compact Linux system entirely from source code to understand the internal workings of a Linux distribution.
#*Swaram-Malayalam Speech Recognition System [Jan - June 2013]
::Swaram is a free software initiative aimed at recognising Malayalam speech. The initial goal of the project was to extend the language support of the CMU Sphinx engine and use it to recognise Malayalam.
::Link : https://github.com/jerrin001/swaram.git
 
==Timeline==
 
===Unavailable - May 6th to June 2nd===
 
University tests and other academic responsibilities .
 
===June 2nd - June 15th===
 
I am familiar with the usage of SphinxTrain and CMUCLMTK, so I will use this time to understand and learn to configure the internal parameters of the Sphinx engine in order to improve the performance of the models.
 
===June 15th - June 30th===
 
During this period I will collect all the voice data and text corpora required for the acoustic model and language model respectively. Optimisations, including grapheme-to-phoneme conversion and optimal text selection algorithms, will be applied to the text corpora. Appropriate speakers will also be chosen during this period, based on data statistics.
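One common way to approach text selection is a greedy pass that repeatedly picks the sentence adding the most sound units not yet covered. The sketch below is only an illustration of that idea, not the specific algorithm chosen for this project. As a loud simplification it uses adjacent character pairs as a stand-in for phonetic units; in practice the units would come from a grapheme-to-phoneme converter. The function names are hypothetical.

```python
def units(sentence):
    """Stand-in for phonetic units: the adjacent character pairs of each word."""
    found = set()
    for word in sentence.split():
        found.update(word[i:i + 2] for i in range(len(word) - 1))
    return found


def select_sentences(corpus, max_sentences):
    """Greedily pick sentences that add the most units not covered so far."""
    covered = set()
    selected = []
    remaining = list(corpus)
    while remaining and len(selected) < max_sentences:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break  # no remaining sentence adds anything new
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected


# "abcd" already covers every pair found in the other two sentences.
print(select_sentences(["ab ab", "abcd", "cd"], max_sentences=2))  # ['abcd']
```

A greedy pass does not guarantee the smallest possible sentence set, but it is simple and keeps the recording script short, which matters when speakers have to read every selected sentence.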
 
===July 1st - July 15th===
 
Configuring the trainer parameters, such as the number of states per HMM (Hidden Markov Model), the number of Gaussian mixtures and the number of tied states, based on language-specific features. Stressed consonants need to be treated as separate units, not as variants of the parent consonant, for acoustic modelling. Finally, training the acoustic model and building the language model.
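To illustrate what the number of states per HMM controls, the sketch below builds a left-to-right transition matrix for a configurable number of states and scores a short sequence of frames with the forward algorithm. It is a simplified illustration, not SphinxTrain's implementation: the per-frame emission likelihoods are supplied directly as numbers, whereas a real trainer computes them from Gaussian mixtures, and the function names and the <code>stay_prob</code> parameter are assumptions.

```python
def left_to_right_transitions(num_states, stay_prob=0.5):
    """Each state either repeats or moves on to the next state."""
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for state in range(num_states):
        if state == num_states - 1:
            matrix[state][state] = 1.0  # last state only loops in this sketch
        else:
            matrix[state][state] = stay_prob
            matrix[state][state + 1] = 1.0 - stay_prob
    return matrix


def forward_likelihood(transitions, emissions):
    """Likelihood of the frames, starting in state 0 and ending in the last state.

    emissions[t][s] is the likelihood of frame t under state s.
    """
    num_states = len(transitions)
    alpha = [0.0] * num_states
    alpha[0] = emissions[0][0]
    for frame in emissions[1:]:
        alpha = [
            sum(alpha[p] * transitions[p][s] for p in range(num_states)) * frame[s]
            for s in range(num_states)
        ]
    return alpha[-1]


transitions = left_to_right_transitions(num_states=3)
flat_frames = [[1.0, 1.0, 1.0] for _ in range(3)]
# With three frames the only valid path is 0 -> 1 -> 2.
print(forward_likelihood(transitions, flat_frames))  # 0.25
```

More states per unit let the model follow finer changes within a sound, at the cost of more parameters to estimate from the same amount of training audio.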
 
===July 16th - July 28th===
 
Once a working acoustic and language model has been formed, further language-specific improvements can be made. Consulting linguists to incorporate Malayalam grammar rules that improve the recognition accuracy of the system is one such method.
 
===Midterm Evaluation===
 
Mid-Term should provide the community with a reasonably good acoustic model and language model for Malayalam.
 
===August===
 
This time will be used to address and solve as many as possible of the language-specific problems mentioned in the project proposal. Multiple forced alignment iterations could be done to further improve the model.
 
===September 1st - September 15th===
 
This period can be used for general bug fixing and detailed documentation.
 
===Final Evaluation===
 
An acoustic model and a language model for Malayalam with low WER (Word Error Rate) will be delivered.
 
===Pens Down===
Improve documentation and make final touch-ups.
 
==Mentor==
My proposal is based on my discussion with Deepa P Gopinath, SMC mentor. I have discussed and understood the various challenges I might face during this project, and I am confident that I can complete it in time under her mentoring.