
'''Project Title:''' Language model and acoustic model for Malayalam for a speech recognition system in CMU Sphinx

'''What is the work about?'''

The main objective of a speech recognition system is to transcribe speech into text. Malayalam is spoken by approximately 33 million people. Several of the revolutionary technologies that computer science offers can be brought into people's lives by making them closer and more native to users. One prominent way of doing so is to make communication with computing systems easier and more natural, and speech recognition contributes directly to that goal. Speech recognition systems can be broadly classified into two categories based on how they are built. The first type has a limited vocabulary in which the basic unit is a single word; such systems are also known as word recognition systems. The second category performs recognition with a large vocabulary. The main difficulty is that the system has no prior knowledge of word boundaries. Even when such knowledge is provided, there are several candidate hypotheses from which the algorithm has to choose the appropriate one, based either on context or on statistical likelihood. In effect, the system must check at every instant whether that point is a word boundary.

== SYNOPSIS ==

The main aim of the project is to build acoustic and language models for Malayalam. As per the proposal, the CMU Sphinx toolkit is used. If the data modeled is sufficiently representative of the language, efficiency improves; hence rigorous processing and modeling of the acoustic data is also a goal. Language models sophisticated enough to provide adequate context or word semantics are required to disambiguate among the shortlisted hypotheses. Another issue that needs to be addressed is co-articulation: a particular sound in a word is affected by its predecessor and/or successor. In natural conversational speech such effects are strong, and Indic languages provide several instances of how deep this problem runs. The effectiveness of speech recognition rests on four primary pillars, and how I plan to address each of them is explained in the implementation details. The first is how the data is handled, i.e. the vocabulary size and the speaker-independence criterion, which must be decided based on the application. The second is acoustic modeling, which involves the choice of features for each frame or set of frames. Next is the modeling of the language. Finally, there is the search problem. Broadly, to date there are two schools of speech recognition technology: HMM-based statistical models and neural network models. I am interested in using the statistical model and implementing the recognition search with the Viterbi algorithm.

== IMPLEMENTATION DETAILS ==

The aim of this project as a whole is to develop an acoustic model and a language model for Malayalam with a reasonable WER (Word Error Rate). The entire project can be subdivided into five parts:

1) Data compilation

2) Data handling

3) Acoustic modeling

4) Language modeling

5) Search using Viterbi

'''1) Data Compilation:''' Annotated speech data is collected. If transcriptions are not already available, we need to produce accurate transcriptions of the audio files. Text corpora of Malayalam also need to be collected from a reliable source. Repetition of the same words across all the training files does not contribute much, so speech and text data should be carefully selected such that the collection is sufficiently representative of Malayalam. Once this is done to the best extent we can, acoustic modeling improves. Data is to be collected from multiple speakers so that dependence on a particular style, dialect or speaker is avoided.
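
To make this step concrete, here is a minimal, hypothetical sketch in Python (the implementation language proposed later in this proposal) of writing the control files that SphinxTrain training expects. The corpus contents, utterance ids and file names are illustrative assumptions; the "<s> text </s> (utterance_id)" transcription line format follows CMU Sphinx training conventions.

<syntaxhighlight lang="python">
# Hypothetical sketch: writing SphinxTrain-style control files for a small
# Malayalam corpus. Paths, ids and sentences are illustrative assumptions;
# the "<s> text </s> (utterance_id)" line format follows CMU Sphinx
# training conventions.
import os

# (utterance_id, transcription) pairs gathered during data compilation;
# each id is assumed to correspond to wav/<id>.wav on disk
utterances = [
    ("spk1_utt001", "ഒരു ഉദാഹരണ വാക്യം"),
    ("spk2_utt001", "മറ്റൊരു വാക്യം"),
]

os.makedirs("etc", exist_ok=True)

# fileids: one utterance id per line
with open("etc/malayalam_train.fileids", "w", encoding="utf-8") as f:
    for utt_id, _ in utterances:
        f.write(utt_id + "\n")

# transcription: sentence wrapped in silence markers, id in parentheses
with open("etc/malayalam_train.transcription", "w", encoding="utf-8") as f:
    for utt_id, text in utterances:
        f.write(f"<s> {text} </s> ({utt_id})\n")
</syntaxhighlight>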

'''2) Data Handling and Acoustic Model:'''

CMU Sphinx includes a series of speech recognizers, the latest of which is Sphinx4, along with an acoustic model trainer (SphinxTrain). A properly trained and well-designed acoustic model is essential for improving the efficiency of a speech recognizer. A database of annotated audio files (i.e. with transcriptions available) is taken and acoustically modeled to give the respective words along with their statistics. The choice of an appropriate unit of speech is a very important parameter that determines the quality of the recognition system. Since Indian languages are syllabic, the unit could be the syllable; words are then sequences of syllables. Another issue the algorithm should address is co-articulatory effects: as the vocabulary grows, confusion from these effects increases. An appropriate n-gram context can be chosen in the HMM model to address this, and contexts can further be clustered into equivalence classes; such a context-dependent model is to be developed (most systems take trigrams). First, the speech input is sampled and pre-processed to obtain feature vectors for each frame. Sphinx characterizes each frame by four feature types: cepstra, del(cepstra), del(del(cepstra)), and power, where del denotes the temporal differential.
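
To illustrate the feature layout (not Sphinx's exact internal formula, which uses a regression window over neighbouring frames), here is a minimal numpy sketch that appends first and second temporal differences to a matrix of cepstral frames:

<syntaxhighlight lang="python">
# Minimal sketch of the del / del(del) idea: given a (frames x coefficients)
# cepstra matrix, append its first and second temporal differences. A simple
# finite difference stands in for Sphinx's internal regression-based deltas.
import numpy as np

def add_deltas(cepstra: np.ndarray) -> np.ndarray:
    delta = np.gradient(cepstra, axis=0)    # del(cepstra)
    delta2 = np.gradient(delta, axis=0)     # del(del(cepstra))
    return np.hstack([cepstra, delta, delta2])

frames = np.random.randn(100, 13)  # e.g. 13 cepstral coefficients per frame
features = add_deltas(frames)      # shape (100, 39); power is appended similarly
</syntaxhighlight>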

By acoustic modeling, we obtain statistical representations of the feature vectors extracted from the given speech files. SphinxTrain, which is based on HMMs, is used to train the acoustic model, and efficiency can be improved by adjusting its parameters; selecting suitable parameters to tune would be done here. The training data includes multiple samples of each word from different speakers so that the resulting system is speaker-independent. From an implementation point of view, this can be achieved by modeling the constituent syllables that make up each word.

In the training phase, random searches show better results than exhaustive search. I am also willing to learn about and work on incorporating genetic algorithms to optimize this part.

'''3) Language model:''' Since explicit word boundaries are not present, the machine must select its output from a large number of word-sequence hypotheses, and an alternate hypothesis may well be syntactically correct. The idea of the language model is to select the most likely sequence from all the options the system has. An n-gram model can be applied here, but memory is an issue since the number of possible n-grams is large. Language modeling can be done in three primary ways:

1) Context-free grammars: this model is highly restrictive and must follow its prescribed rules, so it is not a good idea for large-vocabulary systems.

2) N-gram models: an n-gram model need not contain the probabilities of all possible n-word sequences. Instead, a back-off technique that assigns weights can be applied when the required n-gram is not present (see the sketch below).

3) Class n-gram models: similar to n-gram models, with the difference that the tokens are word classes such as months, named entities, digits, etc. Experiments in the past have shown effective results using trigrams.
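
As a toy illustration of the back-off idea in option 2, the sketch below scores bigrams with a simplified "stupid backoff": unseen bigrams fall back to a weighted unigram probability. Real toolkits such as cmuclmtk use properly discounted estimates (e.g. Katz back-off); the weight alpha and the two-sentence corpus are illustrative assumptions.

<syntaxhighlight lang="python">
# Toy "stupid backoff" bigram score: if the bigram was never seen, back off
# to the unigram probability scaled by a fixed weight alpha. Illustrative
# only; real language models use discounted back-off estimates.
from collections import Counter

def train(corpus):
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
    return unigrams, bigrams, sum(unigrams.values())

def score(prev, word, unigrams, bigrams, total, alpha=0.4):
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigrams[word] / total   # back off with weight alpha

corpus = [["ഒരു", "വാക്യം"], ["ഒരു", "ഉദാഹരണം"]]
u, b, t = train(corpus)
print(score("ഒരു", "വാക്യം", u, b, t))        # seen bigram: 0.5
print(score("വാക്യം", "ഉദാഹരണം", u, b, t))   # unseen: backs off to unigram
</syntaxhighlight>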

Finally, everything hinges on search: finding the best word sequence given the complete speech input. Viterbi decoding or the A* algorithm can be used for this task. It can be implemented by processing each frame, or a fixed set of frames, together and making the required updates up to that point, i.e. time-synchronous processing. As discussed earlier, the implementation could proceed with stack decoding or a dynamic programming approach. A brief working of stack decoding is explained here.

'''4) Stack decoding:''' The possible hypotheses and their respective probabilities are stored, and the best hypothesis is chosen. If it is complete, it is the output; otherwise it is expanded by each candidate word and the expansions are placed on the stack to be checked further. But this takes a lot of time given its complexity, so it is better to proceed with the dynamic programming based Viterbi approach.
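
A minimal sketch of that store-and-expand loop, using a priority queue over partial hypotheses. The expand and is_complete callbacks are placeholders (assumptions, not Sphinx APIs), and a real stack decoder would additionally prune the stack and length-normalize scores.

<syntaxhighlight lang="python">
# Toy stack (best-first) decoding: pop the best-scoring partial hypothesis,
# return it if complete, otherwise expand it by candidate words and push the
# extensions back. Placeholder callbacks; real decoders prune aggressively.
import heapq
import itertools

def stack_decode(initial, expand, is_complete):
    tie = itertools.count()               # tie-breaker so hypotheses never compare
    heap = [(0.0, next(tie), initial)]    # store negated log-prob: best pops first
    while heap:
        neg_logp, _, hyp = heapq.heappop(heap)
        if is_complete(hyp):
            return hyp, -neg_logp
        for next_hyp, word_logp in expand(hyp):   # word_logp: log-prob of extension
            heapq.heappush(heap, (neg_logp - word_logp, next(tie), next_hyp))
    return None, float("-inf")
</syntaxhighlight>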

'''5) Viterbi algorithm:''' It is based on the Hidden Markov Model. The HMM states are traversed in a dynamic programming fashion: in a list of dictionaries, the value at the 't'-th element of the 's'-th dictionary stores the probability of the best state sequence leading from the initial state at time 0 to state s at time t. In the training phase, the classification task could be improved using SVMs (Support Vector Machines). To optimize Viterbi, I also propose to use the elitist model to decide the direction during the search phase. I am planning to implement the solution in Python. I have also had the experience of building my own POS tagger using the Viterbi algorithm, and some of those modules can be adapted here as well.

Let N be the total number of states and T the total duration (the number of time steps being checked). Then the complexity of Viterbi decoding is O(N^2 T).
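
A minimal Python sketch of this scheme on a toy discrete HMM (assuming all probabilities are nonzero, so logs are safe): best[t][s] holds the best log-probability of any state sequence ending in state s at time t, and the nested loops over states make the O(N^2 T) cost just stated explicit.

<syntaxhighlight lang="python">
# Minimal Viterbi decoding over a toy discrete HMM, following the proposal's
# list-of-dictionaries scheme. Assumes all probabilities are nonzero.
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):                    # T time steps ...
        best.append({})
        back.append({})
        for s in states:                            # ... times N states ...
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s]))
                 for r in states),                  # ... times N predecessors
                key=lambda x: x[1])
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])   # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):            # follow back-pointers
        path.append(back[t][path[-1]])
    return list(reversed(path)), best[-1][last]
</syntaxhighlight>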

'''Tree-structured lexicon:''' The direct implementation of Viterbi decoding still remains expensive. The search space can be optimized by using lexical trees, based on the fact that from each root many words share the same prefix, so the models for those shared prefixes can also be shared.
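
A small sketch of the shared-prefix idea: a trie over lexical units (characters here for brevity; in this project the units would be syllable models) lets all words with a common prefix share the nodes, and hence the acoustic evaluation, of that prefix. The class and function names are illustrative.

<syntaxhighlight lang="python">
# Tree-structured lexicon as a trie: words sharing a prefix share the nodes
# (and hence the models) for that prefix. Units are characters for brevity.
class TrieNode:
    def __init__(self):
        self.children = {}   # unit -> TrieNode
        self.word = None     # set when a complete word ends at this node

def build_lexical_tree(words):
    root = TrieNode()
    for w in words:
        node = root
        for unit in w:
            node = node.children.setdefault(unit, TrieNode())
        node.word = w
    return root

# e.g. these two words share the trie nodes for their common prefix
tree = build_lexical_tree(["മരം", "മറ്റ്"])
</syntaxhighlight>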

The efficiency of the system can be measured by the Word Error Rate (WER), and based on these scores the system can be improved further.
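
For reference, WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between the reference and hypothesis transcripts, normalized by the reference length; a minimal sketch:

<syntaxhighlight lang="python">
# WER as normalized word-level edit distance. Assumes a non-empty reference.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                     # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                     # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
</syntaxhighlight>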


== Expected Challenges ==

1) Properly studying the CMU Sphinx engine to adapt it to the current problem and the specifics of the language.

2) Most Indian languages, and in particular the Dravidian languages (including Malayalam), do not have proper speech databases with reliable annotations (transcriptions).

3) Data should be collected such that it is sufficiently representative of the language.


== PROJECT EXECUTION TIMELINE ==

'''INITIALIZING WORK AND COMMUNITY BONDING WITH MENTORS:''' (Attain knowledge of some specifics of the Malayalam language. I have my end-semester exams from 19th to 30th April, so it would be preferable to start work from 2nd May.)

May 02 - May 06 : Take suggestions from the mentors and incorporate them into the implementation. I would also like to study SVMs to improve pattern recognition accuracy and, if promising, discuss with the mentors how to incorporate the changes into the algorithm.

May 07 - May 16 : Collect speech data and text corpora from multiple speakers, sufficiently representative of the language. Choose good speakers based on feedback and statistics.

May 17 - May 18 : Get feedback from the mentoring community to initiate the implementation.

'''CODING PERIOD:'''

May 19 - May 29 : Prepare the CMU Sphinx dataset to be worked on and extract feature vectors for each frame; cepstra, del(cepstra), del(del(cepstra)), and power are to be used as the features. Understand the Sphinx engine along with efficient use of SphinxTrain and cmuclmtk.

May 30 - June 10 : Select the parameters to train the language model and work out an optimum number of states to use in Viterbi. Settle on the exact language model to adopt based on feedback from the mentoring community.

June 10 - June 24 : Implement the Viterbi algorithm to select the best possible word hypotheses for a given input.

June 25 - June 26 : Test the consistency and efficiency of the models used. Discuss the progress of the work with the mentors to get their feedback and incorporate the suggested changes. Plan further error corrections to improve accuracy.

'''MID TERM DELIVERABLES:'''

June 27 : Mid term evaluation. Present a reasonably good working acoustic and language model with speech recognition.

June 29 - July 31 : General bug fixing. Develop the application-specific details that are required, in consultation with the mentors. Do proactive further reading on whether the application or the system could be improved in any way. Get suggestions from the mentors regarding possible errors and correct them to improve performance and consistency.

August 1 - August 9 : Document the details of the code. Test the model repeatedly and correct errors. Incorporate language-specific touch-ups in the code.

August 10 - August 21 : Buffer time for unanticipated delays.

'''END DELIVERABLE:''' A comprehensive, working speech recognition system for the Malayalam language in CMU Sphinx.

'''Post GSOC:''' Stay in touch with the mentors and the linguistics community to be part of further projects and actively contribute to them.

== About me ==

I am a student at IIIT-H (International Institute of Information Technology, Hyderabad), an integrated dual-degree student currently pursuing a B.Tech in Computer Science and an MS by Research in Computational Linguistics. I am working under the guidance of Dr. Kishore Prahallad for my research and MS. I have recently begun studying Deep Neural Network based speech segmentation for my research.

Previous experience in the fields of Speech technologies and Computational Linguistics:

1) I have previously worked on text-to-speech conversion on mobile platforms for Indian languages, specifically Telugu and Kannada, and am in the process of developing an Android application for the same. For improvement in accuracy and consistency, back-off techniques were implemented.

2) I had the opportunity to learn from Professor Lakshmi Bhai and worked on a project on the etymological reconstruction of the proto-forms of Dravidian languages by comparing Malayalam and Telugu.

3) I have previously implemented my own POS tagger based on the Viterbi algorithm, and also developed an unsupervised working model of tagging as part of a course project.

4) Developed a plugin for domain-specific morph analysis.

5) Built an Intra-Chunk Expansion tool for English: a tool to mark intra-chunk dependencies of words in English with their expansions from the Shakti Standard Format (SSF).

I have submitted a paper on “Domain Adaptation in Morphological Analysis” to ICON-2013, the 10th International Conference on Natural Language Processing organized by CDAC. I have also submitted a paper titled “A dynamic programming based approach for generating syllable level templates in statistical parametric synthesis” to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2014.

'''Why do you want to work with Swathanthra Malayalam Computing?'''

I have always admired and appreciated the contributions of open source development. Moreover, owing to my course of study, I have developed a deep interest in linguistically motivated speech technologies. I want to work with SMC because it combines both of my passions, and I would be thrilled to be able to contribute to Indian languages.

'''Do you have any past involvement with Swathanthra Malayalam Computing or another open source project as a contributor?'''

No, but I am very interested and would love to be a part of it.

'''Did you participate in past GSoC programs? If so, which years and which organizations?'''

No, this is my first time.

'''Do you have other obligations between May and August?'''

I recognize that GSoC expects a full-time commitment. My vacation starts at the beginning of May and classes reopen in the first week of August, and we have no examinations before GSoC coding officially ends this year. So I will definitely be able to dedicate 40 hours per week, and I am ready to put in my best effort to make a substantial contribution to speech technologies for Indian languages. I am confident that I can finish the targeted work.

'''Will you continue contributing to/supporting Swathanthra Malayalam Computing after the GSoC 2014 program? If yes, which area(s) are you interested in?'''

Yes, I would definitely love to be a part of the coding and development community of SMC even after GSoC. I am interested in contributing to speech-related and NLP-related problems. Moreover, there is always scope for improvement in this field by fusing the computational linguistics aspect into it.

'''Why should we choose you over other applicants?'''

The idea of open source has always motivated me, and I would passionately work towards making significant contributions. I am very interested in this project and am proficient in Python, the language I intend to use for it; I am also good at C and Matlab. Most of my previous project experience is related to computational linguistics, NLP and speech synthesis, and I intend to do my masters and further studies in speech technologies. So I am sure this work would help me grow, deepen my understanding, and motivate me further, along with the contribution I could make.

(A list of my previous projects in this area is mentioned earlier.) I have been in contact with Deepa P Gopinath and am confident that I can achieve the desired task under her mentorship.