User:Ragha
Revision as of 14:16, 18 March 2014
What is this work about?
The main objective of a speech recognition system is to transcribe speech into text. Speech recognition systems can be broadly classified into two categories based on how they are built. The first type has a limited vocabulary, where the unit of recognition is a single word; these are also known as word recognition systems. The second category performs recognition over a large vocabulary. The main issue is that word boundaries are not known to the system in advance. And even if this knowledge is provided, there are several choices from which the algorithm has to pick the appropriate one, based on context or statistical likelihood. The idea is that the system needs to check at every instant whether that instant is a word boundary.
SYNOPSIS:
Language models that are sophisticated enough to provide adequate context or word semantics are required to disambiguate among all the shortlisted hypotheses. Another issue that needs to be addressed is co-articulation: the sound of a word is influenced by the sounds preceding and following it. In natural conversational speech such effects are strong, and Indic languages provide several situations that show how deep this problem runs. The effectiveness of speech recognition relies on four primary points, which are addressed in detail in the implementation plan below. The first is how the data is handled, i.e. the size of the vocabulary and the speaker-independence criterion, based on the application to be developed. The second is acoustic modelling, which involves the choice of features for each frame or set of frames. Next is language modelling. Finally, there is the search problem. Broadly, to date there are two schools of speech recognition technology: HMM-based statistical models and neural network models. I am interested in using the statistical model and implementing the speech recognition problem with the Viterbi algorithm.
IMPLEMENTATION DETAILS:
1) Data handling
2) Acoustic modelling
3) Language modelling
4) Search using Viterbi
1) Data Handling and Acoustic Model:
The criterion for selecting an appropriate unit of speech is a very important parameter that determines the quality of a speech recognition system. Since Indian languages are syllabic, the syllable is a natural choice of unit: words are made up of sequences of syllables. Some languages like Hindi have words that do not end in a vowel, but other languages like Malayalam have several words that do. Another issue the algorithm should address is co-articulation; confusability from these effects increases with vocabulary size. An appropriate n-gram context can be chosen in the HMM model to address this, and the contexts can further be clustered into equivalence classes. Such context-dependent models are to be developed (most systems take trigrams). First, the speech input is sampled and pre-processed to obtain a feature vector for each frame. Sphinx characterizes a frame by four feature types: cepstra, Δcepstra, ΔΔcepstra, and power. The output probability for state i at time t is usually denoted by bi(t). Let aij be the static transition probability, which is independent of the given speech input.
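As a rough illustration of the pre-processing step, the sampled signal can be cut into overlapping frames before a feature vector is computed per frame. This is only a sketch: the frame length and shift below are illustrative values (25 ms windows with a 10 ms shift at 16 kHz), not Sphinx defaults.

```python
def frames(signal, frame_len=400, shift=160):
    """Split a sampled signal into overlapping frames.

    frame_len=400 and shift=160 are assumed, illustrative values
    (25 ms window, 10 ms shift at 16 kHz), not Sphinx's front end.
    """
    out = []
    start = 0
    while start + frame_len <= len(signal):
        out.append(signal[start:start + frame_len])
        start += shift
    return out

# Each frame would then be mapped to its feature vector
# (cepstra, delta cepstra, double-delta cepstra, power).
```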
T* = arg maxT P(T | W)
   = arg maxT P(W | T) P(T) / P(W)      (Bayes' theorem)
   = arg maxT P(W | T) P(T)             (P(W) is constant over T)
   = arg maxT Π a(ti-1, ti) b(wi | ti)
   = arg maxT Σ log[ a(ti-1, ti) b(wi | ti) ]
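The last step moves to log space because a product of many small probabilities underflows in floating point. A minimal sketch of scoring one hypothesis this way, where `trans` and `emit` are toy dict-based stand-ins for trained HMM parameters (placeholder names, not Sphinx structures):

```python
import math

def sequence_log_score(tags, words, trans, emit, start="<s>"):
    """log P(W, T) = sum over i of
    log a(t_{i-1}, t_i) + log b(w_i | t_i)."""
    score = 0.0
    prev = start
    for t, w in zip(tags, words):
        score += math.log(trans[(prev, t)]) + math.log(emit[(w, t)])
        prev = t
    return score
```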
The training data includes multiple samples of each word from different speakers, so that the resulting system is speaker independent. From an implementation point of view, this can be achieved by modelling the constituent syllables that make up each word.
2) Language model: Since explicit word boundaries are not present, the machine must select an output from a large number of word-sequence hypotheses, and an alternate hypothesis may well be syntactically correct. The idea of the language model is to select the most likely sequence from all the options the system has. An n-gram model can be applied here, but memory is an issue since a large number of sequences is possible. Language modelling can be done in three primary ways:
1) Context-free grammars. This model is highly restrictive and the input has to follow the prescription, so it is not a good choice for large-vocabulary systems.
2) N-gram models: An n-gram model need not contain the probabilities of all possible n-word sequences. Instead, a back-off technique that assigns weights can be applied when the required n-gram is not present.
3) Class n-gram models: These are similar to n-gram models, with the difference that the tokens are word classes such as months, named entities, digits, etc. Experiments in the past have shown effective results using triphones.
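The back-off idea in point 2 can be sketched in a few lines. This is a simplified, stupid-backoff-style estimate with assumed probability tables and an assumed back-off weight `alpha`; real back-off models (e.g. Katz) compute the weights from discounted counts.

```python
def backoff_bigram_prob(w_prev, w, bigram, unigram, alpha=0.4):
    """Return the bigram probability if it was seen in training,
    otherwise back off to a down-weighted unigram probability.
    `bigram`, `unigram`, and `alpha` are illustrative toys."""
    if (w_prev, w) in bigram:
        return bigram[(w_prev, w)]
    return alpha * unigram.get(w, 0.0)
```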
Finally, everything hinges on search: finding the best word sequence given the complete speech input. Viterbi decoding or the A* algorithm can be used for this task. It can be implemented by processing each frame, or a fixed set of frames together, and making the required updates up to that point; this is time-synchronous processing. As discussed earlier, the implementation could proceed with stack decoding or a dynamic-programming approach. A brief outline of stack decoding follows.
3) Stack decoding: The possible hypotheses and their respective probabilities are stored in a stack. The best hypothesis is chosen; if it is complete, it is the output, otherwise it is expanded by each candidate word and the expansions are placed back in the stack for further checking. But this takes a lot of time given its complexity, so it is better to proceed with the dynamic-programming-based Viterbi approach. I am planning to implement the solution in Python. I also have experience building my own POS tagger using the Viterbi algorithm, and some of those modules can be adapted here.
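The stack-decoding loop above can be sketched as a best-first search over partial hypotheses. The `expand` and `is_complete` callbacks are assumed stand-ins for the word-expansion and end-of-utterance checks; costs are negative log probabilities, so lower is better.

```python
import heapq

def stack_decode(start, expand, is_complete, max_steps=10000):
    """Best-first search: repeatedly pop the best partial hypothesis,
    return it if complete, otherwise push its expansions."""
    heap = [(0.0, start)]
    for _ in range(max_steps):
        if not heap:
            return None
        cost, hyp = heapq.heappop(heap)
        if is_complete(hyp):
            return cost, hyp
        for step_cost, new_hyp in expand(hyp):
            heapq.heappush(heap, (cost + step_cost, new_hyp))
    return None
```

A toy run with two words and a two-word target length returns the cheapest complete sequence; the unbounded growth of the heap is exactly why the proposal prefers the Viterbi approach.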
4) Viterbi algorithm: It is based on the Hidden Markov Model. The HMM states obtained are traversed in a dynamic-programming fashion. In a list of dictionaries, the value at the t-th element for state s stores the probability of the best state sequence leading from the initial state at time 0 to state s at time t.
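A minimal sketch of this recurrence, using the list-of-dictionaries layout described above. The dict-based parameters are toy placeholders, not trained Sphinx models.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """V[t][s] holds the probability of the best state sequence
    ending in state s after observing obs[0..t]; back[t][s] holds
    the predecessor state on that best path."""
    V = [{}]
    back = [{}]
    for s in states:
        V[0][s] = start_p[s] * emit_p[s][obs[0]]
        back[0][s] = None
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state to recover the path.
    prob, last = max((V[-1][s], s) for s in states)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return prob, path[::-1]
```

Each time step examines every (predecessor, state) pair, which is where the O(N^2 T) complexity noted below comes from.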
Let N be the total number of states and T be total duration or the states that are being checked for. Then the complexity of Viterbi decoding is O(N^2T).
Tree-structured lexicon: A direct implementation of Viterbi decoding still remains expensive. The search space can be optimized by using lexical trees, based on the fact that many words share the same prefix, so the model for the shared prefix can also be shared.
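The prefix-sharing idea can be illustrated with a nested-dict trie (a sketch only; the "$" end-of-word marker is an assumed convention, and a real lexical tree would attach acoustic models to the nodes rather than characters):

```python
def build_lexical_tree(words):
    """Nested-dict trie: words sharing a prefix share nodes, so
    whatever model is attached to that prefix is evaluated once."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word  # assumed end-of-word marker
    return root
```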
PROJECT EXECUTION TIMELINE:
INITIALIZING WORK AND COMMUNITY BONDING WITH MENTORS: (Attain knowledge of some peculiarities of the Malayalam language.) (I have my end-semester exams from 19th to 30th April, so it would be preferable to start the work from 2nd May.) 20th April - 27th April : Take suggestions from the mentors and incorporate them into the implementation plan.
May 02 - May 06 : Design the exact details of the data structures to be used to optimally implement the algorithm.
May 07 - May 10 : Plan out the exact input and output details and the intermediate deliverables that are expected.
May 11 - May 18 : Get feedback from the mentoring community to initiate the implementation.
CODING PERIOD:
May 19 - May 24 : Prepare the CMU Sphinx dataset to be worked on and extract feature vectors for each frame; cepstra, Δcepstra, ΔΔcepstra, and power are to be considered as features.
May 25 - May 30 : Decide the exact language model to be adopted based on feedback from the mentoring community.
May 31 - June 19 : Implement the Viterbi algorithm to select the best possible word hypotheses for a given input.
June 20 - June 23 : Test for consistency and efficiency of the models used.
June 24 - June 27 : Discuss the progress of the work with the mentors to get their feedback and incorporate the suggested changes. Plan further error corrections to improve accuracy.
MID TERM DELIVERABLES:
June 28 - July 11 : Develop application-specific components (if any) that are required, in consultation with the mentors.
July 12 - July 20 : Do some proactive further reading on whether the application or the system could be improved in any way.
July 21 - August 1 : Get suggestions from mentors regarding possible errors and correct them to improve performance and consistency.
August 2 - August 5 : Document the details of the code. Test the model repeatedly and correct errors.
August 5 - August 22 : Backup time for delays not anticipated.
END DELIVERABLE: A comprehensive, working speech recognition system for the Malayalam language in CMU Sphinx.
Post GSoC ----------> Maintain the web interface, and be the system admin in general, handling day-to-day issues. Become a part of the telepathy community.
About me:
I am a student of IIIT-H (International Institute of Information Technology, Hyderabad), an integrated dual-degree student currently pursuing a B.Tech in Computer Science and an MS by Research in Computational Linguistics. I am working under the guidance of Dr. Kishore Prahallad for my research and MS. I have currently begun studying deep neural network based speech segmentation for my research.
Previous experience in the fields of Speech technologies and Computational Linguistics:
1) Earlier I worked on text-to-speech conversion on mobile platforms for Indian languages, specifically Telugu and Kannada, and I am also in the process of developing an Android application for the same. For improvement in accuracy and consistency, backoff techniques were implemented.
2) I had the opportunity to learn from Professor Lakshmi Bhai and have worked on the project, Etymological reconstruction of Proto form of Dravidian languages by comparing Malayalam and Telugu.
3) I have previously implemented my own POS tagger based on the Viterbi algorithm, and also developed an unsupervised working model of tagging as part of a course project.
4) Developed a plugin for Domain specific morph analysis.
5) Built an Intra Chunk Expansion tool for English: a tool to mark intra-chunk dependencies of words in English with their expansions from the Shakti Standard Format (SSF).
I have submitted a paper on "Domain Adaptation in Morphological Analysis" to ICON-2013, the 10th International Conference on Natural Language Processing, organized by CDAC. I have also submitted a paper titled "A dynamic programming based approach for generating syllable level templates in statistical parametric synthesis" to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2014. I am very excited by the project you are offering in GSoC this year, "Language model and Acoustic model for Malayalam language for speech recognition system in CMU Sphinx", and I am keenly interested in pursuing this work under your guidance. I will be glad to be considered for it. Natural language processing and speech synthesis have been my areas of work till now, and I definitely want to improve my worth in the subject and contribute to this field under able leadership.